CN102163222A - Information search sequencing method based on index association relation - Google Patents
- Publication number
- CN102163222A
- Authority
- CN
- China
- Prior art keywords
- index
- document
- model
- retrieval
- literature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a method for ranking information search results based on association relations between indexes, belonging to the field of information analysis and decision support. The method comprises the following steps: creating a document database and an index library, and creating the association between them; taking the search terms relevant to each document as the indexes of that document, which together form its document model; before searching, forming a search model from the set of all search terms, namely indexes, provided by the user; and calculating the similarity between the search model and the document model of each document in the document database, sorting the documents in descending order of similarity, and providing the sorted documents to the user as the final search result. The benefit of the method is that, when mistaken indexes are supplied as reasoning conditions, the interference of those errors with the result is weakened: by using the existing association relations between indexes, the correct indexes can take part in the reasoning even though the stated conditions are wrong, which makes the reasoning more correct and more resistant to interference.
Description
Technical field
The present invention relates to a method for ranking retrieval results, and belongs to the field of information analysis and decision support.
Background art
When searching for documents, and especially when searching the literature of a particular subject or field, the results are often inaccurate because the user chooses search terms improperly or describes the need in a subjectively biased way. In fact, search terms are related to one another, for example by hypernym/hyponym (superordinate/subordinate) or coordinate relations. Using these relations, the similarity between the set of search terms proposed by the user and the set of indexes of a candidate document can be judged, which indicates whether the document is what the user needs or is relevant to the user's intent. The search results can then be ranked accordingly to improve retrieval accuracy.
Summary of the invention
The object of the present invention is to provide an information retrieval ranking method based on index association relations, so that members of the public (people who lack professional knowledge, or whose wording habits and expressions vary) can retrieve quickly and accurately even though wrong or imprecise search terms would otherwise introduce errors into the search results. By mapping the associations from indexes to documents, the search results are ranked and the series of results with the highest relevance is provided to the user. The method is particularly suited to relevance ranking when retrieving close or similar information within a professional domain.
The information retrieval ranking method based on index association relations of the present invention comprises the following steps:
Step 1: starting from the canonical names of the search terms, supplement these basic search terms with vocabulary that has association relations with them, such as hypernym/hyponym and coordinate relations; take the basic search terms and the supplementary vocabulary as the elements of the index library; and establish and store the association relation between every two indexes in the index library. The association relations between two indexes include hypernym/hyponym relations and coordinate relations, where the hypernym/hyponym relation covers subordination between indexes and the coordinate relation covers synonymy, near-synonymy and similarity.
Step 2: take the search terms relevant to each document as the indexes of that document, and form the document model of the document from the set of indexes it has, $\alpha = (a_1, a_2, \ldots, a_k, \ldots, a_m)$, where $m$ is the number of indexes the document has; the document models are the elements of the document database;
Step 3: construct a document vector from each document model, as follows:
The weights of all indexes contained in the document model form the document vector, whose $k$-th component is the weight of index $a_k$ in the document model. This value expresses the degree of association between index $a_k$ and document A: the larger the weight, the higher the degree of association. Preferably, the weight of index $a_k$ in the document model is preset according to the frequency with which $a_k$ occurs in the document and/or the position at which it occurs.
Step 4: before retrieval, form a retrieval model B from the set of all search terms, namely indexes, provided by the user; that is, retrieval model B is $\beta = (b_1, b_2, \ldots, b_j, \ldots, b_n)$, comprising $n$ indexes in total;
Step 5: construct a retrieval vector from the current retrieval model B, as follows:
The weights of all indexes contained in the current retrieval model B form the retrieval vector, whose $j$-th component is the weight of index $b_j$ in retrieval model B. The weight is assigned by one of the following two methods:
(1) according to the order in which the user entered the index, or according to how important the user subjectively considers it for the search result: the more important the index, or the earlier it was entered, the larger the weight;
(2) every index in retrieval model B takes the same weight, that is, input order and importance are not distinguished;
Step 6: calculate the similarity between the current retrieval model B and the document model of each document; the larger the similarity, the more relevant the document is considered to be to the result the user needs. The similarity Sim(A, B) between document model A and retrieval model B is calculated by a formula (the formula image is not reproduced in this text) defined in terms of the following quantities: the weight of index $a_k$ in the document model; $T_{kj}$, the distance between index $b_j$ in retrieval model B and index $a_k$ in the document model A being compared, obtained from the association relations between indexes defined in the index library established in step 1; the distance between indexes $a_k$ and $a_j$ within document model A; and the distance between indexes $b_j$ and $b_k$ within retrieval model B. The latter two distances are likewise obtained from the association relations defined in the index library established in step 1;
Step 7: according to the similarities provided by step 6 between the current retrieval model B and each document model in the document database, sort the documents from high to low and provide the sorted documents to the user as the final search result.
Preferably, before step 6 is carried out, the documents are first coarsely screened, and step 6 is then applied to the resulting document set; that is, the similarity is calculated between the current retrieval model B and the document model of each document in the coarsely screened set.
Compared with the prior art, the beneficial effect of the present invention is that, when the user mistakenly supplies wrong indexes as reasoning conditions, the interference of the errors with the result is weakened: starting from the wrong index conditions and using the existing association relations between indexes, the correct indexes are allowed to take part in the reasoning, so that the reasoning is both correct and resistant to interference.
Description of the drawings
Fig. 1 is a schematic diagram of the method of the invention;
Fig. 2 is a schematic diagram of the spatial distribution of indexes.
Embodiment
The content of the present invention is explained below with reference to the accompanying drawings.
In this technical solution (see Fig. 1), a document database and an index library are built first. The document database is the set of document models and the index library is the set of index entries, and an association is established between the two. Indexes and documents are in a many-to-many relationship: each index corresponds to multiple documents, each document corresponds to multiple indexes, and different combinations of indexes correspond to different documents. Each document has one or more indexes.
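The many-to-many relationship between documents and indexes can be illustrated with a minimal sketch. The document IDs, the toy indexes and the helper name `build_inverted` are hypothetical and chosen only for illustration; the patent does not prescribe a storage layout.

```python
from collections import defaultdict

# Hypothetical toy data: each document ID maps to the set of indexes it has.
doc_indexes = {
    "doc1": {"nausea", "fever"},
    "doc2": {"vomiting", "fever"},
    "doc3": {"cyanosis"},
}


def build_inverted(doc_to_indexes):
    """Invert a document -> indexes mapping into an index -> documents mapping."""
    inverted = defaultdict(set)
    for doc_id, indexes in doc_to_indexes.items():
        for index in indexes:
            inverted[index].add(doc_id)
    return dict(inverted)


# One index can point to several documents, and one document has several indexes.
index_docs = build_inverted(doc_indexes)
```

Keeping both directions makes it cheap to go from a user's index to candidate documents and from a document back to its model.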
Indexes express the principal features of a document. In this technical solution, the combination of search terms provided by the user is treated as index information, and the user's current retrieval is modelled from it. Modelling means recording and organising the index information entered by the user so that it can be used to infer documents. In one retrieval operation, the set of all indexes provided by the user forms a retrieval model; this index information consists of the canonical names of search terms. The set of indexes relevant to each document forms a document model, and the document database consists of multiple document models.
We regard the set of indexes that each document has as a vector, and likewise regard the set of search terms entered by the user as a vector. Each vector element is an index, and indexes are associated with one another by hypernym/hyponym, coordinate and similar relations. The hypernym/hyponym relation refers to subordination between indexes; the coordinate relation includes synonymy, near-synonymy, similarity and easily confused terms. The spatial distribution of indexes is shown in Fig. 2.
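As an illustration of how pairwise relations between indexes might be stored, here is a minimal sketch. The class `IndexLibrary`, the enum `Relation` and the example terms are assumptions made for this example; the source only states that the relation between every two indexes is established and stored.

```python
from enum import Enum


class Relation(Enum):
    """Relation types named in the description (labels are illustrative)."""
    IDENTICAL = "identical"
    SUBORDINATE = "subordinate"  # hypernym/hyponym
    SIBLING = "sibling"
    SYNONYM = "synonym"
    SIMILAR = "similar"
    UNRELATED = "unrelated"


class IndexLibrary:
    """Stores indexes and the symmetric relation between each pair."""

    def __init__(self):
        self.terms = set()
        self._relations = {}

    def add_relation(self, a, b, relation):
        # frozenset makes the key order-independent: (a, b) == (b, a).
        self.terms.update((a, b))
        self._relations[frozenset((a, b))] = relation

    def relation(self, a, b):
        if a == b:
            return Relation.IDENTICAL
        # Pairs that were never registered are treated as unrelated.
        return self._relations.get(frozenset((a, b)), Relation.UNRELATED)


library = IndexLibrary()
library.add_relation("nausea", "vomiting", Relation.SIMILAR)
library.add_relation("route of transmission", "transmission channel", Relation.SYNONYM)
```

Looking up `library.relation("vomiting", "nausea")` returns the same relation as the reverse order, which matches the undirected way the description talks about relations between two indexes.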
Therefore, in the present invention, an index library with hypernym/hyponym, coordinate and similar associations is first built from the basic search term information, which consists of canonical names. The hypernym/hyponym relation refers to subordination between indexes, and the coordinate relation includes synonymy, near-synonymy, similarity and easily confused terms. The same index may be expressed differently by users in subjective description, such as "route of transmission" and "transmission channel", or two different wordings for "patient". Closely related indexes may be described with deviation or even error, such as "nausea" and "vomiting", or "redness" and "cyanosis"; these are coordinate relations. Other indexes are in a hypernym/hyponym relation, and when retrieving the user may enter only the subordinate index or only the superordinate index. An embodiment is given below:
Step 1: starting from the canonical names of the search terms, supplement these basic search terms with vocabulary that has association relations with them, such as hypernym/hyponym and coordinate relations; take the basic search terms and the supplementary vocabulary as the elements of the index library; and establish and store the association relation between every two indexes in the index library. The association relations between two indexes include hypernym/hyponym relations and coordinate relations, where the hypernym/hyponym relation covers subordination between indexes and the coordinate relation covers synonymy, near-synonymy and similarity;
Step 2: take the search terms relevant to each document as the indexes of that document, form the document model of the document from the set of indexes it has (the set elements are the indexes of the document), and take the document models as the elements of the document database. For example, the document model of document A is $\alpha = (a_1, a_2, \ldots, a_k, \ldots, a_m)$, comprising $m$ indexes in total;
Step 3: construct a document vector from each document model, as follows:
The weights of all indexes contained in document model A form the document vector, whose $k$-th component is the weight of index $a_k$ in document model A. The weight is preset according to, among other things, the frequency with which the index occurs in the document and the position at which it occurs. Its value expresses the degree of association between index $a_k$ and document A: the larger the weight, the higher the degree of association. For example, if index $a_k$ appears in the title, abstract or descriptive information of document A, or occurs very frequently in document A, a large weight is preset;
Step 4: before retrieval, form a retrieval model B from the set of all search terms, namely indexes, provided by the user; that is, retrieval model B is $\beta = (b_1, b_2, \ldots, b_j, \ldots, b_n)$, comprising $n$ indexes in total;
Step 5: construct a retrieval vector from the current retrieval model B, as follows:
The weights of all indexes contained in the current retrieval model B form the retrieval vector, whose $j$-th component is the weight of index $b_j$ in retrieval model B. The weight is assigned by one of the following two methods:
(1) according to the order in which the user entered the index, or according to how important the user subjectively considers it for the search result: the more important the index, or the earlier it was entered, the larger the weight;
(2) every index in retrieval model B takes the same weight, that is, input order and importance are not distinguished; for example, whenever an index appears in the retrieval model entered by the user, its weight is set to 1;
Step 6: calculate the similarity between the current retrieval model B and the document model of each document in the document database; the larger the similarity, the more relevant the document is considered to be to the result the user needs. The similarity Sim(A, B) between document model A and retrieval model B is calculated by a formula (the formula image is not reproduced in this text) defined in terms of the quantities below.
$T_{kj}$ denotes the distance between two indexes $a_k$ and $b_j$. Given the association relation between every two indexes in the index library of step 1, and because different indexes differ in how similar they are, the distance between index $b_j$ in retrieval model B and index $a_k$ in the document model A being compared must be specified; this distance is obtained from the association relations defined in the index library established in step 1. For example, the distance $T_{kj}$ between $b_j$ and $a_k$ preferably takes the following values: if $b_j$ and $a_k$ are identical, the distance is 1; if they are in a subordinate relation, 0.5; if they are in a sibling relation, 0.25; if they are synonyms, 1; if they are similar, 0.6; if they are unrelated, 0. The distance values are set empirically and can be adjusted gradually in use.
The formula also uses the distance between indexes $a_k$ and $a_j$ within document model A and the distance between indexes $b_j$ and $b_k$ within retrieval model B. These two distances are likewise obtained from the association relations defined in the index library established in step 1. Preferably, they take values on the same principle: identical indexes, 1; subordinate relation, 0.5; sibling relation, 0.25; synonymous relation, 1; similar relation, 0.6; unrelated, 0. These values are also set empirically and can be adjusted gradually in use.
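The preferred distance values listed above can be expressed as a small lookup table. In this sketch the relation labels, the `relations` dictionary and the function names are hypothetical; only the six numeric values come from the description. Note that the "distance" used here grows as two indexes become closer, so it behaves like a similarity score.

```python
# Preferred values from the description; the string labels are illustrative.
DISTANCE = {
    "identical": 1.0,
    "subordinate": 0.5,
    "sibling": 0.25,
    "synonym": 1.0,
    "similar": 0.6,
    "unrelated": 0.0,
}


def distance(a, b, relations):
    """Distance between two indexes, derived from their stored relation."""
    if a == b:
        return DISTANCE["identical"]
    # Unregistered pairs fall back to "unrelated" (distance 0).
    return DISTANCE[relations.get(frozenset((a, b)), "unrelated")]


def distance_matrix(doc_terms, query_terms, relations):
    """T[k][j]: distance between document index a_k and query index b_j."""
    return [[distance(a, b, relations) for b in query_terms] for a in doc_terms]


# Hypothetical relations that an index library might hold.
relations = {
    frozenset(("nausea", "vomiting")): "similar",
    frozenset(("symptom", "nausea")): "subordinate",
}
T = distance_matrix(["nausea", "fever"], ["vomiting", "fever"], relations)
```

Because the values are described as empirical, keeping them in one table makes later adjustment a one-line change.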
Preferably, before step 6 is carried out, the documents are first coarsely screened, and step 6 is then applied to the resulting document set; that is, the similarity is calculated between the current retrieval model B and the document model of each document in the coarsely screened set;
Step 7: according to the similarities provided by step 6 between the current retrieval model B and the document model of each document in the document database, sort the documents from high to low and provide them to the user as the final search result.
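Steps 6 and 7 can be sketched end to end as follows. The patent's similarity formula is not reproduced in this text, so the sketch substitutes a generalized (soft) cosine as an assumption: a cross term weighted by the document-to-query distances, normalised by the within-document and within-query terms. It should not be read as the patent's exact formula. The function names, the `prefilter` hook for coarse screening and the toy data are likewise hypothetical.

```python
import math


def soft_dot(u_w, u_terms, v_w, v_terms, dist):
    """Sum of w_u * w_v * dist(term_u, term_v) over all pairs of indexes."""
    return sum(wu * wv * dist(tu, tv)
               for wu, tu in zip(u_w, u_terms)
               for wv, tv in zip(v_w, v_terms))


def similarity(doc_w, doc_terms, qry_w, qry_terms, dist):
    """ASSUMED form (soft cosine); the source formula image is missing."""
    cross = soft_dot(doc_w, doc_terms, qry_w, qry_terms, dist)
    norm_a = math.sqrt(soft_dot(doc_w, doc_terms, doc_w, doc_terms, dist))
    norm_b = math.sqrt(soft_dot(qry_w, qry_terms, qry_w, qry_terms, dist))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return cross / (norm_a * norm_b)


def rank(documents, qry_w, qry_terms, dist, prefilter=None):
    """Optionally coarse-screen, score every candidate, sort high to low."""
    candidates = [d for d in documents if prefilter is None or prefilter(d)]
    scored = [(similarity(d["weights"], d["terms"], qry_w, qry_terms, dist), d["id"])
              for d in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical relation data: "nausea" and "vomiting" are similar (0.6).
RELATED = {frozenset(("nausea", "vomiting")): 0.6}


def dist(a, b):
    return 1.0 if a == b else RELATED.get(frozenset((a, b)), 0.0)


docs = [
    {"id": "doc1", "terms": ["vomiting", "fever"], "weights": [1.0, 1.0]},
    {"id": "doc2", "terms": ["cyanosis"], "weights": [1.0]},
    {"id": "doc3", "terms": ["nausea", "fever"], "weights": [1.0, 1.0]},
]
# Uniform query weights, i.e. weighting method (2) of step 5.
ranking = rank(docs, [1.0, 1.0], ["nausea", "fever"], dist)
```

In this toy run a document indexed with "vomiting" still scores well for a query that says "nausea", which is the behaviour the description is after: a related but imprecise index does not push the relevant document out of the ranking.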
The specific description above further explains the purpose, technical solution and beneficial effects of the invention. It should be understood that the above is only a specific embodiment of the invention and is not intended to limit its scope of protection; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (6)
1. An information retrieval ranking method based on index association relations, characterized in that it comprises the following steps:
Step 1: starting from the canonical names of the search terms, supplement these basic search terms with vocabulary that has association relations with them, such as hypernym/hyponym and coordinate relations; take the basic search terms and the supplementary vocabulary as the elements of the index library; and establish and store the association relation between every two indexes in the index library;
Step 2: take the search terms relevant to each document as the indexes of that document, and form the document model of the document from the set of indexes it has, $\alpha = (a_1, a_2, \ldots, a_k, \ldots, a_m)$, where $m$ is the number of indexes the document has; the document models are the elements of the document database;
Step 3: construct a document vector from each document model, as follows:
The weights of all indexes contained in the document model form the document vector, whose $k$-th component is the weight of index $a_k$ in the document model; this value expresses the degree of association between index $a_k$ and document A, and the larger the weight, the higher the degree of association;
Step 4: before retrieval, form a retrieval model B from the set of all search terms, namely indexes, provided by the user; that is, retrieval model B is $\beta = (b_1, b_2, \ldots, b_j, \ldots, b_n)$, comprising $n$ indexes in total;
Step 5: construct a retrieval vector from the current retrieval model B, as follows:
The weights of all indexes contained in the current retrieval model B form the retrieval vector, whose $j$-th component is the weight of index $b_j$ in retrieval model B; the weight is assigned by one of the following two methods:
(1) according to the order in which the user entered the index, or according to how important the user subjectively considers it for the search result: the more important the index, or the earlier it was entered, the larger the weight;
(2) every index in retrieval model B takes the same weight, that is, input order and importance are not distinguished;
Step 6: calculate the similarity between the current retrieval model B and the document model of each document; the larger the similarity, the more relevant the document is considered to be to the result the user needs. The similarity Sim(A, B) between document model A and retrieval model B is calculated by a formula (the formula image is not reproduced in this text) defined in terms of the following quantities: the weight of index $a_k$ in the document model; $T_{kj}$, the distance between index $b_j$ in retrieval model B and index $a_k$ in the document model A being compared, obtained from the association relations between indexes defined in the index library established in step 1; the distance between indexes $a_k$ and $a_j$ within document model A; and the distance between indexes $b_j$ and $b_k$ within retrieval model B. The latter two distances are likewise obtained from the association relations defined in the index library established in step 1;
Step 7: according to the similarities provided by step 6 between the current retrieval model B and each document model in the document database, sort the documents from high to low and provide the sorted documents to the user as the final search result.
2. The information retrieval ranking method based on index association relations according to claim 1, characterized in that the association relations between every two indexes in step 1 include hypernym/hyponym relations and coordinate relations, where the hypernym/hyponym relation covers subordination between indexes and the coordinate relation covers synonymy, near-synonymy and similarity.
3. The information retrieval ranking method based on index association relations according to claim 1, characterized in that the weight of index $a_k$ in the document model in step 3 is preset according to the frequency with which $a_k$ occurs in the document and/or the position at which it occurs.
4. The information retrieval ranking method based on index association relations according to any one of claims 1 to 3, characterized in that, in step 6, the distance $T_{kj}$ between index $b_j$ in retrieval model B and index $a_k$ in the document model A being compared takes the following values: if $b_j$ and $a_k$ are identical, the distance is 1; if they are in a subordinate relation, 0.5; if they are in a sibling relation, 0.25; if they are synonyms, 1; if they are similar, 0.6; if they are unrelated, 0.
5. The information retrieval ranking method based on index association relations according to any one of claims 1 to 3, characterized in that, in step 6, the distance between indexes within document model A and the distance between indexes within retrieval model B take values on the following principle: identical indexes, 1; subordinate relation, 0.5; sibling relation, 0.25; synonymous relation, 1; similar relation, 0.6; unrelated, 0.
6. The information retrieval ranking method based on index association relations according to any one of claims 1 to 3, characterized in that, before step 6 is carried out, the documents are first coarsely screened, and step 6 is then applied to the resulting document set; that is, the similarity is calculated between the current retrieval model B and the document model of each document in the coarsely screened set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110083624A CN102163222B (en) | 2011-04-02 | 2011-04-02 | Information search sequencing method based on index association relation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110083624A CN102163222B (en) | 2011-04-02 | 2011-04-02 | Information search sequencing method based on index association relation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102163222A true CN102163222A (en) | 2011-08-24 |
CN102163222B CN102163222B (en) | 2012-09-05 |
Family
ID=44464449
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110083624A Expired - Fee Related CN102163222B (en) | 2011-04-02 | 2011-04-02 | Information search sequencing method based on index association relation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102163222B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102831134A (en) * | 2011-12-16 | 2012-12-19 | 中国科学技术信息研究所 | Novel semi-automatic indexing method of Chinese scientific and technical documents |
CN106845345A (en) * | 2016-12-15 | 2017-06-13 | 重庆凯泽科技股份有限公司 | Biopsy method and device |
CN115203598A (en) * | 2022-07-20 | 2022-10-18 | 贝壳找房(北京)科技有限公司 | Information sorting method, electronic device and storage medium in real estate field |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060026152A1 (en) * | 2004-07-13 | 2006-02-02 | Microsoft Corporation | Query-based snippet clustering for search result grouping |
CN101477536A (en) * | 2008-12-30 | 2009-07-08 | 华中科技大学 | Scientific and technical literature entity integrated ranking method based on associating network |
CN101556603A (en) * | 2009-05-06 | 2009-10-14 | 北京航空航天大学 | Coordinate search method used for reordering search results |
Non-Patent Citations (1)
Title |
---|
《计算机工程与应用》 20071231 赵正文等 "Web信息检索结构化排序函数与标引词加权技术" 181-184 1-6 , * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102831134A (en) * | 2011-12-16 | 2012-12-19 | 中国科学技术信息研究所 | Novel semi-automatic indexing method of Chinese scientific and technical documents |
CN102831134B (en) * | 2011-12-16 | 2015-02-25 | 中国科学技术信息研究所 | Novel semi-automatic indexing method of Chinese scientific and technical documents |
CN106845345A (en) * | 2016-12-15 | 2017-06-13 | 重庆凯泽科技股份有限公司 | Biopsy method and device |
CN115203598A (en) * | 2022-07-20 | 2022-10-18 | 贝壳找房(北京)科技有限公司 | Information sorting method, electronic device and storage medium in real estate field |
CN115203598B (en) * | 2022-07-20 | 2023-09-19 | 贝壳找房(北京)科技有限公司 | Information ordering method in real estate field, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN102163222B (en) | 2012-09-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103246670B (en) | Microblogging sequence, search, methods of exhibiting and system | |
CN105653706A (en) | Multilayer quotation recommendation method based on literature content mapping knowledge domain | |
CN108038240A (en) | Based on content, the social networks rumour detection method of user's multiplicity | |
CN106372072A (en) | Location-based recognition method for user relations in mobile social network | |
CN104239496B (en) | A kind of method of combination fuzzy weighted values similarity measurement and cluster collaborative filtering | |
CN104899273A (en) | Personalized webpage recommendation method based on topic and relative entropy | |
CN107169873A (en) | A kind of microblog users authority evaluation method of multiple features fusion | |
CN103136289B (en) | Resource recommendation method and system | |
CN106202294A (en) | The related news computational methods merged based on key word and topic model and device | |
CN104133868B (en) | A kind of strategy integrated for the classification of vertical reptile data | |
CN105069122A (en) | Personalized recommendation method and recommendation apparatus based on user behaviors | |
CN103631862B (en) | Event characteristic evolution excavation method and system based on microblogs | |
CN104866558A (en) | Training method of social networking account mapping model, mapping method and system | |
CN105975609A (en) | Industrial design product intelligent recommendation method and system | |
CN104881472A (en) | Combined recommendation method of traveling scenic spots based on network data collection | |
CN106407482B (en) | A kind of network academic report category method based on multi-feature fusion | |
CN107239512A (en) | The microblogging comment spam recognition methods of relational network figure is commented in a kind of combination | |
CN103927339B (en) | Knowledge Reorganizing system and method for knowledge realignment | |
CN102163222B (en) | Information search sequencing method based on index association relation | |
US8639695B1 (en) | System, method and computer program for analysing and visualising data | |
CN106339459A (en) | Method for pre-classifying Chinese webpages based on keyword matching | |
CN105447633A (en) | Scientific research institution integration evaluation method and system thereof | |
CN103823809A (en) | Query phrase classification method and device, and classification optimization method and device | |
CN104156431A (en) | RDF keyword research method based on stereogram community structure | |
CN106528614A (en) | Method for predicting geographical location of user in mobile social network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20120905; Termination date: 20150402 |
EXPY | Termination of patent right or utility model | ||