CN102163222A

CN102163222A - Information search sequencing method based on index association relation

Info

Publication number: CN102163222A
Application number: CN 201110083624
Authority: CN
Inventors: 池慧; 高东平; 方安; 洪娜
Original assignee: Institute of Medical Information CAMS
Current assignee: Institute of Medical Information CAMS
Priority date: 2011-04-02
Filing date: 2011-04-02
Publication date: 2011-08-24
Anticipated expiration: 2031-04-02
Also published as: CN102163222B

Abstract

The invention discloses an information retrieval ranking method based on index association relations, which belongs to the field of information analysis and decision support. The method comprises the following steps: creating a document database and an index library, and establishing the association between them; taking the search terms related to each document as the indexes of that document to form its document model; before searching, forming a retrieval model from the set of all search terms, namely indexes, provided by the user; and calculating the similarity between the retrieval model and the document model of each document in the document database, sorting the documents in descending order of similarity, and providing the sorted documents to the user as the final search result. The benefit of the method is that, when mistaken indexes are used as reasoning conditions, the interference of the errors with the result is weakened: by incorporating the prior association relations between indexes, correct indexes can take part in the reasoning even though the supplied index conditions are wrong, which makes the reasoning more correct and more resistant to interference.

Description

Information retrieval ranking method based on index association relations

Technical field

The present invention relates to a method for ranking retrieval results and belongs to the field of information analysis and decision support.

Background technology

When searching for documents, and especially when searching the literature of a particular subject or field, retrieval results are often inaccurate because search terms are used improperly or because the user's subjective description deviates from the intended meaning. In fact, search terms are related to one another, for example through hypernym-hyponym or coordinate relations. Based on these association relations, the similarity between the set of search terms proposed by the user and the set of indexes of a document to be retrieved can be judged, in order to determine whether the document is what the user needs or is relevant to the user's intent. The retrieval results can then be ranked accordingly to improve retrieval accuracy.

Summary of the invention

The object of the present invention is to provide an information retrieval ranking method based on index association relations that quickly and accurately resolves the errors in retrieval results caused by mistaken or imprecise search terms supplied by the general public (that is, people who lack professional knowledge or whose speech habits and phrasing vary). By mapping the relations between indexes onto documents, the method ranks the retrieval results and provides the user with a series of retrieval results of greatest relevance. It is particularly suitable for relevance ranking when retrieving close or similar information within a professional domain.

An information retrieval ranking method based on index association relations according to the present invention comprises the following steps:

Step 1: starting from the canonical names of the search terms, supplement these basic terms with vocabulary that has hypernym-hyponym, coordinate, or other association relations with them; use the basic terms and the supplementary vocabulary as the elements of the index library; and establish and store the association relation between every two indexes in the index library. The association relations between two indexes include hypernym-hyponym and coordinate relations, where the hypernym-hyponym relation covers subordination between indexes and the coordinate relation covers synonymy, near-synonymy, and similarity.
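The index library of Step 1 could be held in memory roughly as follows. This is a minimal sketch, not the structure used by the invention: the names `Relation` and `IndexLibrary`, the example terms, and the choice to store each pair symmetrically (which drops the direction of a subordinate relation) are all assumptions made for illustration.

```python
from enum import Enum


class Relation(Enum):
    """Relation kinds named in Step 1 (labels are illustrative)."""
    IDENTICAL = "identical"
    SUBORDINATE = "subordinate"  # hypernym-hyponym relation
    SYNONYM = "synonym"          # coordinate relation
    SIMILAR = "similar"          # coordinate relation
    UNRELATED = "unrelated"


class IndexLibrary:
    """Stores indexes and the association relation between each pair."""

    def __init__(self):
        self.indexes = set()
        self._relations = {}  # frozenset({a, b}) -> Relation

    def add_relation(self, a, b, relation):
        # Register both indexes and record the relation for the unordered pair.
        self.indexes.update((a, b))
        self._relations[frozenset((a, b))] = relation

    def relation(self, a, b):
        # An index is identical to itself; unknown pairs count as unrelated.
        if a == b:
            return Relation.IDENTICAL
        return self._relations.get(frozenset((a, b)), Relation.UNRELATED)


library = IndexLibrary()
library.add_relation("nausea", "vomiting", Relation.SIMILAR)
library.add_relation("symptom", "nausea", Relation.SUBORDINATE)
```

Because pairs are keyed by an unordered set, `library.relation("vomiting", "nausea")` returns the same relation as the reverse lookup.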

Step 2: take the search terms related to each document as the indexes of that document, and form the document model of the document from the set of indexes it has: α = (a_1, a_2, ..., a_k, ..., a_m), where m is the number of indexes the document has; use each document model as an element of the document database;

Step 3: construct a document vector from each document model, as follows:

The weights of all indexes contained in the document model constitute the document vector (W_{a_1}, W_{a_2}, ..., W_{a_k}, ..., W_{a_m}),

where W_{a_k} is the weight of index a_k in the document model. Its value represents the degree of association between index a_k and document A: the larger the weight, the higher the degree of association. Preferably, the weight W_{a_k} of index a_k in the document model is preset according to the frequency with which and/or the position at which index a_k appears in the document.
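The text only states that the weight is preset from frequency and/or position, so the following is one possible sketch under loud assumptions: the weight is the relative frequency of the index in the body plus a fixed bonus when the index appears in the title. The functions `index_weight` and `document_vector`, the `title_bonus` parameter, and the restriction to single-word indexes are hypothetical.

```python
def index_weight(index, title, body, title_bonus=1.0):
    """Hypothetical weighting: relative body frequency plus a title bonus."""
    words = body.lower().split()
    # Frequency component: share of body words equal to the index.
    freq = words.count(index.lower()) / len(words) if words else 0.0
    # Position component: fixed bonus if the index occurs in the title.
    bonus = title_bonus if index.lower() in title.lower().split() else 0.0
    return freq + bonus


def document_vector(indexes, title, body):
    """Return the weights (W_{a_1}, ..., W_{a_m}) in the order of `indexes`."""
    return [index_weight(a, title, body) for a in indexes]


vector = document_vector(
    ["fever", "cough"],
    title="fever in children",
    body="fever and cough fever",
)
```

Any other monotone combination of frequency and position would fit the description equally well; the additive form is chosen only because it is easy to inspect.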

Step 4: before retrieval, form a retrieval model B from the set of all search terms, namely indexes, provided by the user; that is, retrieval model B is β = (b_1, b_2, ..., b_j, ..., b_n), containing n indexes in total;

Step 5: construct a retrieval vector from the current retrieval model B, as follows:

The weights of all indexes contained in the current retrieval model B constitute the retrieval vector (W_{b_1}, W_{b_2}, ..., W_{b_j}, ..., W_{b_n}),

where W_{b_j} denotes the weight of index b_j in retrieval model B. It is assigned by one of the following two methods:

(1) the weight is assigned according to the order in which the user enters the index, or according to how important the user subjectively considers the index to be for the retrieval result: the more important the index or the earlier it is entered, the larger its weight;

(2) every index in retrieval model B takes the same weight, that is, index order and importance are not distinguished;
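The two assignment methods of Step 5 can be sketched as below. The function name `retrieval_vector` and the linear decay used for method (1) are assumptions; the text only requires that earlier or more important indexes receive larger weights.

```python
def retrieval_vector(terms, by_order=False):
    """Return the weights (W_{b_1}, ..., W_{b_n}) for the user's terms."""
    n = len(terms)
    if by_order:
        # Method (1): earlier terms get larger weights.
        # The linear decay (n - j) / n is an illustrative assumption.
        return [(n - j) / n for j in range(n)]
    # Method (2): every index gets the same weight.
    return [1.0] * n


ordered = retrieval_vector(["fever", "cough", "rash", "nausea"], by_order=True)
uniform = retrieval_vector(["fever", "cough", "rash", "nausea"])
```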

Step 6: calculate the similarity between the current retrieval model B and the document model of each document; the greater the similarity, the more relevant the document is considered to be to the retrieval result the user needs. The similarity Sim(A, B) between document model A and retrieval model B is calculated by the following formula:

Sim(A, B) = \frac{\sum_{k=1}^{m} \sum_{j=1}^{n} W_{a_k} W_{b_j} T_{kj}}{\left( \left( \sum_{k=1}^{m} \left[ \sum_{j=1}^{m} (W_{a_k} T_{a_{kj}})^{2} + W_{a_k}^{2} \right] \right) \left( \sum_{j=1}^{n} \left[ \sum_{k=1}^{n} (W_{b_j} T_{b_{jk}})^{2} + W_{b_j}^{2} \right] \right) \right)^{\frac{1}{2}}}

where W_{a_k} is the weight of index a_k in the document model; T_{kj} denotes the distance between index b_j in retrieval model B and index a_k in the document model A being compared, and is obtained from the association relations between indexes defined by the index library established in Step 1;

T_{a_{kj}} denotes the distance between indexes a_k and a_j in document model A, and T_{b_{jk}} denotes the distance between indexes b_j and b_k in retrieval model B; these two distances are likewise obtained from the association relations between indexes defined by the index library established in Step 1;
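A direct transcription of the Step 6 formula might look like the sketch below. It assumes each model is a dictionary from index to weight and that a caller-supplied function `dist(x, y)` returns the distance between two indexes (1 for identical indexes); both are illustrative choices. The inner sums are taken over all pairs, including an index with itself, exactly as the formula is written.

```python
import math


def similarity(doc, query, dist):
    """Sim(A, B) for doc/query given as {index: weight} dictionaries."""
    # Numerator: sum over k, j of W_{a_k} * W_{b_j} * T_{kj}.
    num = sum(
        wa * wb * dist(a, b)
        for a, wa in doc.items()
        for b, wb in query.items()
    )
    # Document part of the denominator: intra-document distances plus W^2.
    den_a = sum(
        sum((wa * dist(a, a2)) ** 2 for a2 in doc) + wa ** 2
        for a, wa in doc.items()
    )
    # Query part of the denominator: intra-query distances plus W^2.
    den_b = sum(
        sum((wb * dist(b, b2)) ** 2 for b2 in query) + wb ** 2
        for b, wb in query.items()
    )
    den = math.sqrt(den_a * den_b)
    return num / den if den else 0.0


def exact_match(x, y):
    """Toy distance used only for this example: 1 if identical, else 0."""
    return 1.0 if x == y else 0.0


score = similarity({"fever": 1.0}, {"fever": 1.0}, exact_match)
```

Note that, under this literal reading, a document identical to the query scores 0.5 rather than 1, because each bracket in the denominator adds W^2 on top of the self-distance term.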

Step 7: according to the similarity, provided by Step 6, between the current retrieval model B and each document model in the document database, sort the documents from high to low and provide the sorted documents to the user as the final retrieval result.

Preferably, before Step 6 is carried out, the documents are first coarsely screened, and Step 6 is then carried out on the resulting document set; that is, the similarity is calculated between the current retrieval model B and the document model of each document in the document set obtained by coarse screening.
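Coarse screening followed by the descending sort of Step 7 could be sketched as follows. The screening criterion is not specified in the text; keeping only documents with at least one index related to a query term is an assumption, as are the names `rank`, `related`, and `score`. The overlap count used as `score` here is a stand-in for the Step 6 similarity.

```python
def rank(documents, query_terms, score, related):
    """Screen documents, then sort the survivors by descending score.

    documents: {doc_id: set of indexes}; score and related are callables.
    """
    # Coarse screening (assumed criterion): keep documents that share at
    # least one related index with the query.
    candidates = {
        doc_id: indexes
        for doc_id, indexes in documents.items()
        if any(related(a, b) for a in indexes for b in query_terms)
    }
    # Step 7: sort from high to low similarity.
    return sorted(
        candidates,
        key=lambda doc_id: score(candidates[doc_id], query_terms),
        reverse=True,
    )


documents = {
    "d1": {"fever"},
    "d2": {"fever", "cough"},
    "d3": {"fracture"},
}
query = {"fever", "cough"}
ranking = rank(
    documents,
    query,
    score=lambda indexes, terms: len(indexes & terms),  # stand-in score
    related=lambda a, b: a == b,
)
```

Screening first means the comparatively expensive similarity computation runs only on plausible candidates.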

Compared with the prior art, the beneficial effect of the present invention is that, when indexes mistakenly chosen by the user are used as reasoning conditions, the interference of the errors with the result is weakened; starting from the wrong index conditions and drawing on the existing association relations between indexes, correct indexes are allowed to take part in the reasoning, thereby achieving correct and interference-resistant reasoning.

Description of drawings

Fig. 1 is a schematic diagram of the method of the invention;

Fig. 2 is a schematic diagram of the index space distribution.

Embodiment

The content of the present invention is explained below with reference to the accompanying drawings.

In this technical solution (see Fig. 1), a document database and an index library are built first. The document database and the index library are, respectively, the set of document models and the set of index entries, and an association is established between them. Indexes and documents are in a many-to-many relation: each index corresponds to multiple documents, each document corresponds to multiple indexes, and different combinations of indexes correspond to different documents. Each document has one or more indexes.
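The many-to-many association between documents and indexes can be illustrated with a pair of mappings: one from each document to its indexes, and an inverted one from each index to the documents that have it. This is only a sketch; the function name `build_inverted` and the sample data are assumptions.

```python
from collections import defaultdict


def build_inverted(doc_models):
    """Invert {doc_id: set of indexes} into {index: set of doc_ids}."""
    inverted = defaultdict(set)
    for doc_id, indexes in doc_models.items():
        for index in indexes:
            inverted[index].add(doc_id)
    return dict(inverted)


doc_models = {
    "A": {"fever", "cough"},
    "B": {"fever"},
}
inverted = build_inverted(doc_models)
```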

Indexes express the principal features of a document. In this technical solution, the combination of search terms provided by each user is treated as index information, and the user's retrieval is modeled from it. Modeling means recording and organizing the index information entered by the user so that it can be used to infer documents. In a single retrieval operation, the set of all indexes provided by the user constitutes a retrieval model; this index information consists entirely of the canonical names of search terms. The set of indexes related to each document constitutes a document model, and the document database is made up of multiple document models.

The set of indexes that each document has (whose elements are the indexes of that document) is regarded as a vector, and the set of search terms entered by the user is likewise regarded as a vector. Each vector element is an index, and indexes are associated through hypernym-hyponym, coordinate, and other relations. The hypernym-hyponym relation refers to subordination between indexes; the coordinate relation covers synonymy, near-synonymy, similarity, easily confused terms, and the like. The index space distribution is shown in Fig. 2.

Therefore, in the present invention, an index library with hypernym-hyponym, coordinate, and other association relations is first built from the basic search term information. The hypernym-hyponym relation refers to subordination between indexes; the coordinate relation covers synonymy, near-synonymy, similarity, easily confused terms, and the like. The basic search term information consists of canonical names. Users may express the same index differently in a subjective description, for example "transmission route" and "transmission channel", or two different words for "patient". Close indexes may be described with deviations or even errors, for example "nausea" and "vomiting", or "redness" and "cyanosis"; these are coordinate relations. Some indexes also stand in a hypernym-hyponym relation, and when retrieving, the user may enter only the subordinate index or only the superordinate index. An embodiment is given below:

Step 1: starting from the canonical names of the search terms, supplement these basic terms with vocabulary that has hypernym-hyponym, coordinate, or other association relations with them; use the basic terms and the supplementary vocabulary as the elements of the index library; and establish and store the association relation between every two indexes in the index library. The association relations between two indexes include hypernym-hyponym and coordinate relations, where the hypernym-hyponym relation covers subordination between indexes and the coordinate relation covers synonymy, near-synonymy, and similarity;

Step 2: take the search terms related to each document as the indexes of that document, form the document model of the document from the set of indexes it has (the elements of the set are the indexes of that document), and use each document model as an element of the document database; for example, the document model of document A is α = (a_1, a_2, ..., a_k, ..., a_m), containing m indexes in total;

The weights of all indexes contained in document model A constitute the document vector (W_{a_1}, W_{a_2}, ..., W_{a_k}, ..., W_{a_m}),

where W_{a_k} is the weight of index a_k in document model A, preset according to factors such as the frequency with which and the position at which the index appears in the document. Its value represents the degree of association between index a_k and document A: the larger the weight, the higher the degree of association. For example, if index a_k appears in the title, the abstract, or the descriptive information of document A, or appears very frequently in document A, a large weight is preset;

(2) every index in retrieval model B takes the same weight, that is, index order and importance are not distinguished; for example, whenever an index appears in the retrieval model entered by the user, its weight is set to 1;

Step 6: calculate the similarity between the current retrieval model B and the document model of each document in the document database; the greater the similarity, the more relevant the document is considered to be to the retrieval result the user needs. The similarity Sim(A, B) between document model A and retrieval model B is calculated by the following formula:

Sim(A, B) = \frac{\sum_{k=1}^{m} \sum_{j=1}^{n} W_{a_k} W_{b_j} T_{kj}}{\left( \left( \sum_{k=1}^{m} \left[ \sum_{j=1}^{m} (W_{a_k} T_{a_{kj}})^{2} + W_{a_k}^{2} \right] \right) \left( \sum_{j=1}^{n} \left[ \sum_{k=1}^{n} (W_{b_j} T_{b_{jk}})^{2} + W_{b_j}^{2} \right] \right) \right)^{\frac{1}{2}}}

Where T_{kj} denotes the distance between the two indexes a_k and b_j. Because different indexes may be similar to one another, the distance between index b_j in retrieval model B and index a_k in the document model A being compared must be given on the basis of the association relations between every two indexes in the index library established in Step 1; this distance is obtained from the association relations between indexes defined by that index library. For example, the distance between index b_j in retrieval model B and index a_k in the document model A being compared preferably takes the following values: if b_j and a_k are identical, the distance T_{kj} is 1; if they are in a subordinate relation, the distance is 0.5; if they are in a sibling relation, the distance is 0.25; if they are synonymous, the distance is 1; if they are similar, the distance is 0.6; if they are unrelated, the distance is 0. The distance values are set empirically and can be adjusted progressively in use.
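The preferred distance values listed above can be captured in a small lookup table. In this sketch the relation labels, the `DISTANCE` table name, the `distance` function, and the representation of the index library as a dictionary keyed by unordered pairs are all assumptions; only the numeric values come from the text.

```python
# Preferred distance values from the description, keyed by relation label.
DISTANCE = {
    "identical": 1.0,
    "subordinate": 0.5,
    "sibling": 0.25,
    "synonym": 1.0,
    "similar": 0.6,
    "unrelated": 0.0,
}


def distance(a, b, relations):
    """Look up T for a pair of indexes; unknown pairs count as unrelated."""
    if a == b:
        return DISTANCE["identical"]
    kind = relations.get(frozenset((a, b)), "unrelated")
    return DISTANCE[kind]


# Hypothetical relations for illustration only.
relations = {
    frozenset(("nausea", "vomiting")): "similar",
    frozenset(("symptom", "nausea")): "subordinate",
    frozenset(("transmission route", "transmission channel")): "synonym",
}
```

Keeping the values in one table makes the progressive adjustment mentioned in the text a matter of editing a single mapping.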

T_{a_{kj}} denotes the distance between indexes a_k and a_j in document model A, and T_{b_{jk}} denotes the distance between indexes b_j and b_k in retrieval model B. These two distances are likewise obtained from the association relations between indexes defined by the index library established in Step 1. Preferably, T_{a_{kj}} and T_{b_{jk}} take values as follows: if the two indexes are identical, the distance is 1; for a subordinate relation, the distance is 0.5; for a sibling relation, 0.25; for a synonymous relation, 1; for a similar relation, 0.6; if they are unrelated, 0. These distance values are also set empirically and can be adjusted progressively in use.
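As a worked check of the Step 6 formula with these values, consider the smallest possible case, chosen purely for illustration: document model A has the single index a_1, retrieval model B has the single index b_1, both weights are 1, and the two indexes are in a similar relation, so T_{11} = 0.6. Each index is identical to itself, so T_{a_{11}} = T_{b_{11}} = 1.

```latex
Sim(A, B)
  = \frac{W_{a_1} W_{b_1} T_{11}}
         {\left( \left[ (W_{a_1} T_{a_{11}})^{2} + W_{a_1}^{2} \right]
                 \left[ (W_{b_1} T_{b_{11}})^{2} + W_{b_1}^{2} \right] \right)^{\frac{1}{2}}}
  = \frac{1 \cdot 1 \cdot 0.6}{\left( (1 + 1)(1 + 1) \right)^{\frac{1}{2}}}
  = \frac{0.6}{2}
  = 0.3
```

With identical indexes (T_{11} = 1) the same calculation gives 0.5, so under this literal reading a similar match scores 60% of an exact match.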

Preferably, before Step 6 is carried out, the documents are first coarsely screened, and Step 6 is then carried out on the resulting document set; that is, the similarity is calculated between the current retrieval model B and the document model of each document in the document set obtained by coarse screening;

Step 7: according to the similarity, provided by Step 6, between the current retrieval model B and the document model of each document in the document database, sort the documents from high to low and provide them to the user as the final retrieval result.

The specific description above further explains the purpose, technical solution, and beneficial effects of the invention. It should be understood that the foregoing is only a specific embodiment of the present invention and is not intended to limit its scope of protection; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. An information retrieval ranking method based on index association relations, characterized in that it comprises the following steps:

Step 1: starting from the canonical names of the search terms, supplement these basic terms with vocabulary that has hypernym-hyponym, coordinate, or other association relations with them; use the basic terms and the supplementary vocabulary as the elements of the index library; and establish and store the association relation between every two indexes in the index library;

where W_{a_k} is the weight of index a_k in the document model; its value represents the degree of association between index a_k and document A, and the larger the weight, the higher the degree of association;

Wherein

Sim(A, B) = \frac{\sum_{k=1}^{m} \sum_{j=1}^{n} W_{a_k} W_{b_j} T_{kj}}{\left( \left( \sum_{k=1}^{m} \left[ \sum_{j=1}^{m} (W_{a_k} T_{a_{kj}})^{2} + W_{a_k}^{2} \right] \right) \left( \sum_{j=1}^{n} \left[ \sum_{k=1}^{n} (W_{b_j} T_{b_{jk}})^{2} + W_{b_j}^{2} \right] \right) \right)^{\frac{1}{2}}}

where W_{a_k} is the weight of index a_k in the document model; T_{kj} denotes the distance between index b_j in retrieval model B and index a_k in the document model A being compared, and is obtained from the association relations between indexes defined by the index library established in Step 1;

T_{a_{kj}} denotes the distance between indexes a_k and a_j in document model A, and T_{b_{jk}} denotes the distance between indexes b_j and b_k in retrieval model B; these two distances are likewise obtained from the association relations between indexes defined by the index library established in Step 1;

2. The information retrieval ranking method based on index association relations according to claim 1, characterized in that the association relations between every two indexes described in Step 1 include hypernym-hyponym and coordinate relations, where the hypernym-hyponym relation covers subordination between indexes and the coordinate relation covers synonymy, near-synonymy, and similarity.

3. The information retrieval ranking method based on index association relations according to claim 1, characterized in that the weight W_{a_k} of index a_k in the document model described in Step 3

4. The information retrieval ranking method based on index association relations according to any one of claims 1 to 3, characterized in that, in Step 6, the distance between index b_j in retrieval model B and index a_k in the document model A being compared takes the following values: if b_j and a_k are identical, the distance T_{kj} is 1; if they are in a subordinate relation, the distance is 0.5; if they are in a sibling relation, the distance is 0.25; if they are synonymous, the distance is 1; if they are similar, the distance is 0.6; if they are unrelated, the distance is 0.

5. The information retrieval ranking method based on index association relations according to any one of claims 1 to 3, characterized in that, in Step 6, T_{a_{kj}} and T_{b_{jk}} take values as follows: if the two indexes are identical, the distance is 1; for a subordinate relation, the distance is 0.5; for a sibling relation, 0.25; for a synonymous relation, 1; for a similar relation, 0.6; if they are unrelated, 0.

6. The information retrieval ranking method based on index association relations according to any one of claims 1 to 3, characterized in that, before Step 6 is carried out, the documents are first coarsely screened, and Step 6 is then carried out on the resulting document set; that is, the similarity is calculated between the current retrieval model B and the document model of each document in the document set obtained by coarse screening.