WO2003107141A2 - Para-linguistic expansion - Google Patents

Publication number
WO2003107141A2
Authority
WO
WIPO (PCT)
Prior art keywords
keytuple
keytuples
data
search data
text
Application number
PCT/US2003/019338
Other languages
French (fr)
Other versions
WO2003107141A3 (en
Inventor
Kenneth Haase
Original Assignee
Beingmeta, Inc.
Application filed by Beingmeta, Inc.
Priority to AU2003253663A
Publication of WO2003107141A2
Publication of WO2003107141A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Definitions

  • the present invention relates to systems and methods for improving the precision and recall of free text natural language queries against textual databases.
  • Retrieval of textual information for human beings or their intelligent agents is a hit-or-miss process attempting to match the information needs of a human user with the knowledge content of information items in a database.
  • the chief complicating factor in this matchmaking is that information needs and knowledge content are based on concepts, meanings, and relations while the information items themselves and typically the descriptions of individual information needs are based on sequences of ambiguous words in a particular natural language.
  • Most algorithms for textual information retrieval work by using statistical or probabilistic properties of large ensembles of text to attempt to extract meaning of words. In addition to the inherent errors of such approximations, these approaches suffer from their reliance on the actual word forms in the text.
  • While nearly all of them generalize word forms through stemming (so that "chips" becomes "chip"), they do not typically expand "chip" to other base word forms that may have the same meaning. So "chip" is not extended to "integrated circuit" (in an electronics domain) or (ambiguously) to "crisp" or "french fry" (in the food domain), and "sample" (in the paint domain).
  • query expansion where textual queries are expanded by a thesaurus, so that a search for "chip" will find documents referring to french fries, crisps, integrated circuits, and samples.
  • This assortment illustrates the problem with straightforward query expansion: it retrieves too many unrelated documents because it does not reflect the meaning of the word in its context in the original query.
  • recall is a measure of how many of the relevant documents were actually found by the algorithm
  • precision is a measure of how many of the documents found were actually relevant.
  • query expansion increases the recall rate of the algorithm while decreasing the precision rate.
  • lowered precision has a serious cost because a human expert has to sift through the erroneous results to filter out the actually relevant articles.
  • Embodiments of the invention include: a background linguistic database capable of generating synonyms and other related terms; and a textual pre-processor capable of extracting para-linguistic word associations, e.g. pairs and triples, of words from natural language text.
  • Embodiments of the invention process the text to produce a "para-linguistic" representation where associations, e.g. pairs or triplets, of words (“keytuples”) represent probable linguistic relationships between words in the text.
  • this processing is applied to both the texts of the document base and to a query entered by a user or their agent. Texts or text fragments are then indexed using the keytuples in much the same way that traditional information retrieval systems index text via single keywords.
  • Embodiments of the invention expand elements of these para-linguistic compounds using the linguistic database.
  • query expansion is applied to the individual terms of generated keytuples to generate "extended keytuples".
  • Traditional information retrieval techniques are then applied to find documents whose keytuples match the extended keytuples derived from the query.
  • These expansions are able to identify documents or fragments using synonymous words, but the combination of words into keytuples encodes part of the context, reducing the erroneous retrieval of documents based on other meanings of the expanded word.
  • embodiments of the invention use para-linguistic keytuples, rather than keywords, to provide for increased precision and contextualization of individual keyword appearances.
  • embodiments of the invention combine query expansion with a keytuple representation to increase recall without decreasing precision.
  • FIG. 1 is a schematic illustration of a system for achieving information retrieval according to one embodiment of the invention.
  • FIG. 2 is a schematic illustration of an indexing embodiment of the system of FIG. 1.
  • FIG. 3 is a schematic illustration of a search embodiment of the system of FIG. 1.
  • FIG. 4 is a schematic illustration of a similarity embodiment of the system of FIG. 1.
  • FIG. 5 is an illustration of part of the analysis performed by one embodiment of the para-linguistic analyzer of FIG. 1.
  • FIG. 6 is an illustration of additional analysis performed by one embodiment of the para-linguistic analyzer of FIG. 1.
  • FIG. 7 is an illustration of an analysis performed by one embodiment of the para-linguistic analyzer of FIG. 1 in the context of an English sentence written in the passive voice.
  • FIG. 8 is an illustration of operation of one embodiment of the keytuple expander of FIG. 1.
  • FIG. 9 is an illustration of operation of another embodiment of the keytuple expander of FIG. 1.
  • FIG. 10 is an illustration of an expansion of the word "meeting" that could be performed by one embodiment of the keytuple expander of FIG. 1.
  • the present invention relates to systems and methods for improving the precision and/or recall of free text natural language queries against textual databases. More generally, the present invention relates to systems and methods for managing at least one data item.
  • a data item includes a document, a text fragment, and a query.
  • search data includes a text fragment and a query.
  • One embodiment of a system according to the invention includes: a fragmenter 102 for breaking compound documents into fragments (typically paragraphs or their equivalent); a para-linguistic analyzer 104 for generating keytuples; a keytuple expander 106 which produces expanded keytuples using a thesaurus for both individual words and (optionally) keytuples; and an information retrieval engine 108 using any of a number of techniques for finding documents based on the similarity of textual keys.
  • Such techniques, including Boolean retrieval, vector-space approaches, and probabilistic models, typically rely on the extraction of index terms from documents. These techniques can be adapted for this application by the use of keytuples as index terms.
  • Figure 2 depicts an indexing embodiment.
  • a fragmenter 102 fragments documents 110 into fragments 112 and a para-linguistic analyzer 104 analyzes the fragments to extract keytuples 114.
  • a keytuple expander 106 expands the keytuples and the system then feeds these fragments and keytuples to the information retrieval engine 108 to associate the generated keytuples with the fragments that produced them.
  • Figure 3 depicts a search embodiment.
  • a para-linguistic analyzer 104 receives a query as search data and analyzes the query 116 to extract keytuples 114 and a keytuple expander 106 then expands the keytuples 114 through use of a thesaurus.
  • the system then passes the expanded keytuples 116 to the information retrieval engine 108 to find data items, e.g., documents whose analysis produced the specified keytuples, based on the expanded keytuples.
  • Figure 4 depicts a similarity embodiment.
  • a para-linguistic analyzer 104 receives a text fragment as search data and analyzes the text fragment 112, coming from either a user or (more likely) an application and perhaps derived from a larger document 110, to extract keytuples which a keytuple expander then expands, as in the search embodiment.
  • the system then passes the expanded keytuples 118 onto an information retrieval engine 108.
  • the information retrieval engine finds data items using the expanded keytuples and passes the results to the user 120 or application that provided the original sample text.
  • the similarity embodiment may provide better results than the search embodiment since meaningful para-linguistic relations are more likely to be found in coherent texts than in short user-created queries.
  • One embodiment of the fragmenter takes large compound documents and divides them into smaller chunks for analysis and indexing. These smaller chunks can correspond to paragraphs.
  • the fragmenter determines the fragments by either word processor codes or conventional separations (such as the blank lines separating paragraphs). The system uses a fragmenter because keytuples are typically more precise approximations of meaning than individual keywords and are less likely to appear outside of the discursive contexts for which they are sought.
  • the fragmenter can include original positional information with generated fragments; this information indicates the original position of the fragment within the document and can be used to associate particular document locations with particular extracted keytuples.
  • Such information, when carried along with extracted keytuples and appropriately handled by the information retrieval engine, allows the invention to identify "virtual document fragments" based on proximity and crossing the fragmentation boundaries selected by the fragmenter. For example, embodiments of the invention could identify locations where a system according to the invention extracted the keytuples <analyze,text> and <retrieve,information> within 100 characters of each other. This 100-character fragment (or a small expansion of it) is a virtual fragment created by the search process from an underlying positional representation.
  • Embodiments of the present invention can use a range of methods, which are aimed at extracting possible word relations within a text but not extracting meanings. These methods use very simple syntactic rules and perform morphological analysis on individual word forms.
  • Ken W. Church provided one of the earliest applications of such methods in a 1980 paper entitled “On Memory Limitations in Natural Language Processing,” published in MIT Laboratory of Computer Science Technical Report MIT/LCS/TR-245, and incorporated herein by reference in its entirety.
  • automatic analysis heuristically extracted possible triples from text.
  • a major component of such algorithms is part-of-speech determination, using methods such as hidden Markov models as described by D. Cutting, J.
  • the purpose of the para-linguistic analyzer is not to extract unambiguous meaning or logical form, but to identify significant combinations of words from the source documents that indicate relationships and relevant context.
  • a PLA begins by tagging individual words in a text with parts of speech and determining root forms.
  • the PLA analyzes the sentence "Tomorrow's meetings with Kodak will be in the Rainsford Room" by determining the root forms of the words (i.e., Tomorrow, meeting, with, Kodak, will, be, in, the, Rainsford Room) and tags the root forms with parts of speech data (i.e., modifier, noun, preposition, name, auxiliary, verb, preposition, article, and name, respectively).
  • the PLA then produces keytuples which connect adjectives to subsequent nouns and nouns to subsequent verbs.
  • the PLA couples a modifier (tomorrow) with a neighboring noun (meeting).
  • the PLA couples a noun (meeting) with a neighboring preposition (with) and name (Kodak) and couples a name (Kodak) with a preposition (in) and another name (Rainsford Room).
  • the PLA also connects nouns and verbs to prepositional arguments and their objects. In different languages, these simple rules would be different and special cases might apply.
  • the general procedure for para-linguistic analysis has the following structure:
  • Figure 7 depicts a special case for English where the PLA handles the passive construction to produce a keytuple that reflects the object-verb-by-subject structure of the English passive. More specifically, the PLA analyzes the sentence "the public presentations will be followed by cocktails" to determine root forms and to tag the root forms (i.e., the, public, presentation, will, be, followed, by, cocktail) with parts of speech data (i.e., article, modifier, noun, auxiliary, be, verb, preposition, and noun, respectively).
  • the PLA applied in an English context uses conventional techniques to determine that the sentence is in the passive voice. Given that the sentence is in the passive voice, the PLA constructs a keytuple, i.e., <cocktail, follow, presentation>, which reflects the object-verb-by-subject structure of the English passive.
  • the PLA treats certain compound nouns and proper names as single lexical tokens, so that the PLA analyzes "Rainsford Room" as a single word and relates the phrase to other words as a unit.
  • PLA is one embodiment
  • the PLA may use the optional positional information provided by the fragmenter to associate a text document position with each keytuple. This text document position would be passed through the keytuple expander and then on to the information retrieval engine.
  • the KE expands an arbitrary keytuple by expanding the individual words within the keytuple to synonymous words looked up in a thesaurus to create sets of synonyms and then creates a set of keytuples that typically include at most one synonym from each of the sets of synonyms.
  • Figure 8 illustrates the lexical mode.
  • the keytuple is <meeting, with, Kodak>.
  • the KE expands the word "meeting" to include synonyms "conference" and "discussion" and the word "Kodak" to include synonym "Eastman Kodak."
  • the KE then creates a set of keytuples that typically include at most one synonym from each of the sets of synonyms, e.g., <discussion, with, Eastman Kodak>, <conference, with, Eastman Kodak>, and <conference, with, Kodak>.
  • the KE can use a keytuple thesaurus to expand particular keytuples in particular ways.
  • Figure 9 illustrates the inference mode, combined with the lexical mode of Figure 8.
  • the KE expands the keytuple <meeting, with, Kodak> using a tuple thesaurus to obtain a first set of keytuples: <talk, with, Kodak>, <see, Kodak>, <meeting, with, Kodak>, and <meet, with, Kodak>.
  • the KE applies the lexical mode to at least part of the first set of keytuples to obtain a second set of keytuples.
  • the KE expands the keytuple by expanding the individual words within the keytuple to synonymous words looked up in a thesaurus to create sets of synonyms and then creates a second set of keytuples in which each keytuple in the second set of keytuples typically includes at most one synonym from each of the sets of synonyms.
  • a word like 'meeting' may have no direct synonyms but may have near synonyms which have a more precise meaning, e.g., conference, sales call, or board meeting, or a more general meaning, e.g., interaction, gathering, or discussion. For example, in a trip report from a sales representative, meeting can often be taken as meaning "sales call". Likewise, searches on databases of business transactions are unlikely to include the sense of "meeting" which includes religious revivals or services. For different situations or applications, or when analyzing documents from different sources, different thesauri may be applied in the expansion.
  • the rules for expanding keytuples may reflect the structure or tagging information, i.e., the parts-of-speech data, attached to the keytuple.
  • the KE, when expanding the triple <'meet','in','Paris'>, might not expand the preposition 'in'.
  • the rules used by the KE may also include language-specific semantic preferences, so that the KE might expand <'shot','at',?> into <'fire','at',?> but not <'photograph','at',?>.
  • Such rules are necessarily language specific and may also be part of the specific application of the invention to a language and domain. For example, applied to a domain of surgical reports, the term
  • the inference mode of the KE provides for two related functions.
  • the inference mode of the keytuple expander provides a certain inferential component to the search process.
  • a tuple like <'fire','gun'> can expand into a tuple like <'pull','trigger'>, indicating a relationship which is not an equivalence of meaning but an inference of circumstance.
  • Such an inference of circumstance is related to the use of inference networks in information retrieval to expand from particular keywords or keyword combinations to other keywords.
  • the table used in the inference mode of the keytuple expander can be constructed either by hand or by statistical methods over a corpus of texts. Commonly co-occurring keytuples can be entered into this table, as can keytuples that do not co-occur with each other but repeatedly co-occur with other similar keytuples.
  • Various methods of textual data mining, statistical analysis, and automated thesaurus creation, normally applied to individual keywords or co-occurring keyword pairs, can be applied to keytuples in order to create this table.
  • One approach to such generation is discussed by Gregory Grefenstette in his 1993 University of Pittsburgh PhD thesis entitled “Automatic Thesaurus Discovery Via Selective Natural Language Processing: A Corpus Based Approach” and incorporated herein by reference in its entirety.
  • the retrieval engine may use a variety of different algorithms and methods developed over decades of research on information retrieval systems.
  • the key function of the retrieval engine is to take a set of "search keys" and return a set of documents based on those keys. This function may be implemented in numerous ways to reflect the varying degrees of importance of particular search keys either in general or with respect to a particular document.
  • One embodiment of the retrieval engine is an engine that employs the vector space method (discussed above).
  • documents, fragments, and/or queries are represented by large sparse vectors in which each component corresponds to a particular keytuple. A component is zero if the document or query does not contain the keytuple (thus the vectors are always sparse) and is 1 or a weight (possibly based on other criteria) if the keytuple has been generated by analysis and expansion from the document, fragment, or query.
  • a common criterion for determining the weight is the frequency of a term, either within a document or within the entire corpus, or their combination.
  • one standard metric is to weight the term by the product of the term frequency (typically normalized to account for document size) and the inverse document frequency (the inverse of how many documents contain the term, again typically normalized with respect to the number of terms and scaled logarithmically). This takes into account the greater prominence of common terms within a document and the typically lower discriminative utility of terms that occur frequently across documents.
  • the cosine method compares queries and documents by measuring the angle (in a very high dimensional space) between their vectors.
  • the cosine method has been used extensively in information retrieval where the vector elements correspond to keywords.
  • the sparse vectors for keytuples are much larger and sparser than for keywords.
  • weighting calculations can sometimes avoid tracking term frequency within a document.
  • Embodiments of the information retrieval engine associate a text fragment or text fragment identifier with a set of key terms, which are keytuples extracted and expanded from the fragment.
  • One embodiment of the information retrieval engine, given a set of terms, returns and ranks documents based on similarity of the sets of analyzed key terms. In the case where the embodiment uses original positional information, it should also be able to associate such terms with documents and positions and return derived terms occurring near one another in particular documents.
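The vector space method, frequency-based weighting, and cosine comparison described in the bullets above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `tf_idf_vectors` and `cosine`, the sample keytuples, and the specific weighting formula (term frequency divided by document length, times the logarithm of the document count over the document frequency) are choices made for the example, not details given in the specification.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs maps a document id to its list of keytuples.
    Returns {doc_id: {keytuple: weight}} as sparse dict-based vectors.
    The weighting formula is an assumed, common TF-IDF variant."""
    n_docs = len(docs)
    # Number of documents in which each keytuple appears at least once.
    doc_freq = Counter(kt for kts in docs.values() for kt in set(kts))
    vectors = {}
    for doc_id, kts in docs.items():
        counts = Counter(kts)
        vectors[doc_id] = {
            kt: (count / len(kts)) * math.log(n_docs / doc_freq[kt])
            for kt, count in counts.items()
        }
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(kt, 0.0) for kt, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical keytuples for three small documents.
docs = {
    "a": [("meeting", "with", "Kodak"), ("tomorrow", "meeting")],
    "b": [("meeting", "with", "Kodak"), ("chip", "in", "computer")],
    "c": [("paint", "chip")],
}
vectors = tf_idf_vectors(docs)
sim_ab = cosine(vectors["a"], vectors["b"])  # share one keytuple
sim_ac = cosine(vectors["a"], vectors["c"])  # share none
print(sim_ab > sim_ac)
```

Storing only the non-zero components in a dictionary is one way to cope with the point made above that keytuple vectors are much larger and sparser than keyword vectors.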

Abstract

The present invention relates to systems and methods for databases. One embodiment of the invention provides a system for managing at least one data item. The system includes: a para-linguistic analyzer operative to receive search data and to identify a first keytuple included in the search data; a keytuple expander in communication with the para-linguistic analyzer and operative to generate a set of keytuples associated with the first keytuple; and an information retrieval engine in communication with the keytuple expander and operative to manage at least one data item based at least in part on the set of keytuples.

Description

PARA-LINGUISTIC EXPANSION
Background of the Invention
The present invention relates to systems and methods for improving the precision and recall of free text natural language queries against textual databases.
Retrieval of textual information for human beings or their intelligent agents is a hit-or-miss process attempting to match the information needs of a human user with the knowledge content of information items in a database. The chief complicating factor in this matchmaking is that information needs and knowledge content are based on concepts, meanings, and relations while the information items themselves and typically the descriptions of individual information needs are based on sequences of ambiguous words in a particular natural language. Most algorithms for textual information retrieval work by using statistical or probabilistic properties of large ensembles of text to attempt to extract meaning of words. In addition to the inherent errors of such approximations, these approaches suffer from their reliance on the actual word forms in the text. While nearly all of them generalize word forms through stemming (so that "chips" becomes "chip"), they do not typically expand "chip" to other base word forms that may have the same meaning. So "chip" is not extended to "integrated circuit" (in an electronics domain) or (ambiguously) to "crisp" or "french fry" (in the food domain), and "sample" (in the paint domain).
Some work has been done in this area, called query expansion, where textual queries are expanded by a thesaurus, so that a search for "chip" will find documents referring to french fries, crisps, integrated circuits, and samples. This assortment illustrates the problem with straightforward query expansion: it retrieves too many unrelated documents because it does not reflect the meaning of the word in its context in the original query.
The problem can be understood more formally in terms of two metrics commonly used to describe information retrieval performance: recall and precision. Recall is a measure of how many of the relevant documents were actually found by the algorithm; precision is a measure of how many of the documents found were actually relevant. Suppose we have a hundred documents of which 20 are relevant to a particular query. If an algorithm finds 15 of these 20 documents, it has a recall rate of 75%; if the algorithm also finds 10 irrelevant documents, it has a precision rate of 60%.
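The arithmetic in this example can be checked with a few lines of Python. The function and variable names are illustrative only.

```python
def recall(relevant_found: int, relevant_total: int) -> float:
    """Fraction of all relevant documents that were retrieved."""
    return relevant_found / relevant_total

def precision(relevant_found: int, retrieved_total: int) -> float:
    """Fraction of retrieved documents that are relevant."""
    return relevant_found / retrieved_total

relevant_total = 20     # relevant documents in the collection
relevant_found = 15     # relevant documents the algorithm returned
irrelevant_found = 10   # irrelevant documents the algorithm also returned
retrieved_total = relevant_found + irrelevant_found

r = recall(relevant_found, relevant_total)
p = precision(relevant_found, retrieved_total)
print(f"recall={r:.0%} precision={p:.0%}")
```

Recall is 15/20 and precision is 15/25, which matches the 75% and 60% figures in the text.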
In these terms, query expansion increases the recall rate of the algorithm while decreasing the precision rate. In practical information retrieval contexts, lowered precision has a serious cost because a human expert has to sift through the erroneous results to filter out the actually relevant articles.
Summary of the Invention
The present invention relates to systems and methods for improving the precision and/or recall of free text natural language queries against textual databases. Embodiments of the invention include: a background linguistic database capable of generating synonyms and other related terms; and a textual pre-processor capable of extracting para-linguistic word associations, e.g. pairs and triples, of words from natural language text.
Embodiments of the invention process the text to produce a "para-linguistic" representation where associations, e.g. pairs or triplets, of words ("keytuples") represent probable linguistic relationships between words in the text. In one embodiment, this processing is applied to both the texts of the document base and to a query entered by a user or their agent. Texts or text fragments are then indexed using the keytuples in much the same way that traditional information retrieval systems index text via single keywords. Embodiments of the invention expand elements of these para-linguistic compounds using the linguistic database.
When a query is processed for searching, query expansion is applied to the individual terms of generated keytuples to generate "extended keytuples". Traditional information retrieval techniques are then applied to find documents whose keytuples match the extended keytuples derived from the query. These expansions are able to identify documents or fragments using synonymous words, but the combination of words into keytuples encodes part of the context, reducing the erroneous retrieval of documents based on other meanings of the expanded word.
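As a rough sketch of how expansion of individual terms might produce extended keytuples, the following assumes a toy dictionary thesaurus and forms every combination of each word with its synonyms. The name `expand_keytuple`, the `THESAURUS` table, and the choice to keep the original word alongside its synonyms are assumptions for illustration; the synonym entries mirror the <meeting, with, Kodak> example the specification gives for Figure 8.

```python
from itertools import product

# Hypothetical toy thesaurus; a real linguistic database would be far larger.
THESAURUS = {
    "meeting": ["conference", "discussion"],
    "Kodak": ["Eastman Kodak"],
}

def expand_keytuple(keytuple, thesaurus):
    """Return every keytuple formed by choosing, for each position,
    either the original word or one of its synonyms."""
    choices = [[word] + thesaurus.get(word, []) for word in keytuple]
    return [tuple(combo) for combo in product(*choices)]

expanded = expand_keytuple(("meeting", "with", "Kodak"), THESAURUS)
for kt in expanded:
    print(kt)
```

Because "with" has no thesaurus entry it is left untouched, so the three choices for "meeting" and two for "Kodak" yield six keytuples, each of which still carries the "with Kodak" context that plain keyword expansion would lose.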
Thus, embodiments of the invention use para-linguistic keytuples, rather than keywords, to provide for increased precision and contextualization of individual keyword appearances. In addition, embodiments of the invention combine query expansion with a keytuple representation to increase recall without decreasing precision.
Brief Description of the Illustrated Embodiments
FIG. 1 is a schematic illustration of a system for achieving information retrieval according to one embodiment of the invention.
FIG. 2 is a schematic illustration of an indexing embodiment of the system of FIG. 1.
FIG. 3 is a schematic illustration of a search embodiment of the system of FIG. 1.
FIG. 4 is a schematic illustration of a similarity embodiment of the system of FIG. 1.
FIG. 5 is an illustration of part of the analysis performed by one embodiment of the para-linguistic analyzer of FIG. 1.
FIG. 6 is an illustration of additional analysis performed by one embodiment of the para-linguistic analyzer of FIG. 1.
FIG. 7 is an illustration of an analysis performed by one embodiment of the para-linguistic analyzer of FIG. 1 in the context of an English sentence written in the passive voice.
FIG. 8 is an illustration of operation of one embodiment of the keytuple expander of FIG. 1.
FIG. 9 is an illustration of operation of another embodiment of the keytuple expander of FIG. 1.
FIG. 10 is an illustration of an expansion of the word "meeting" that could be performed by one embodiment of the keytuple expander of FIG. 1.
Detailed Description of the Invention
The present invention relates to systems and methods for improving the precision and/or recall of free text natural language queries against textual databases. More generally, the present invention relates to systems and methods for managing at least one data item. For present purposes, a data item includes a document, a text fragment, and a query. Similarly, for present purposes, search data includes a text fragment and a query. One embodiment of a system according to the invention, as depicted in Figure 1, includes: a fragmenter 102 for breaking compound documents into fragments (typically paragraphs or their equivalent); a para-linguistic analyzer 104 for generating keytuples; a keytuple expander 106 which produces expanded keytuples using a thesaurus for both individual words and (optionally) keytuples; and an information retrieval engine 108 using any of a number of techniques for finding documents based on the similarity of textual keys. Such techniques, including Boolean retrieval, vector-space approaches, and probabilistic models, typically rely on the extraction of index terms from documents. These techniques can be adapted for this application by the use of keytuples as index terms. Ricardo Baeza-Yates, in Modern Information Retrieval, published by Addison Wesley, 1999, and incorporated herein by reference, provides a survey of such techniques. Embodiments of this invention will also work with approaches such as Latent Semantic Indexing, where synthetic index terms are derived based on analysis of a document corpus and its actual index terms.
Three embodiments of the invention are illustrated in Figures 2, 3, and 4.
Figure 2 depicts an indexing embodiment. In the indexing embodiment, a fragmenter 102 fragments documents 110 into fragments 112 and a para-linguistic analyzer 104 analyzes the fragments to extract keytuples 114. A keytuple expander 106 expands the keytuples and the system then feeds these fragments and keytuples to the information retrieval engine 108 to associate the generated keytuples with the fragments that produced them.
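One simple way to picture the association between keytuples and the fragments that produced them is an inverted index. The sketch below is an assumption about how such an association could be stored, not the retrieval engine the specification describes; `build_index`, `lookup`, the fragment identifiers, and the sample keytuples are all hypothetical.

```python
from collections import defaultdict

def build_index(fragment_keytuples):
    """Map each keytuple to the set of fragment ids that produced it."""
    index = defaultdict(set)
    for fragment_id, keytuples in fragment_keytuples.items():
        for kt in keytuples:
            index[kt].add(fragment_id)
    return index

def lookup(index, keytuples):
    """Return fragment ids ordered by how many of the given keytuples they match."""
    hits = defaultdict(int)
    for kt in keytuples:
        for fragment_id in index.get(kt, ()):
            hits[fragment_id] += 1
    return sorted(hits, key=lambda fid: (-hits[fid], fid))

# Hypothetical, already-expanded keytuples for two fragments.
fragment_keytuples = {
    "doc1#p1": {("meeting", "with", "Kodak"), ("conference", "with", "Kodak")},
    "doc2#p3": {("chip", "in", "computer")},
}
index = build_index(fragment_keytuples)
result = lookup(index, [("conference", "with", "Kodak")])
print(result)
```

Using whole tuples as dictionary keys means a fragment is only matched when the full word combination agrees, which is the property that distinguishes keytuple indexing from single-keyword indexing.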
Figure 3 depicts a search embodiment. In the search embodiment, a para-linguistic analyzer 104 receives a query as search data and analyzes the query 116 to extract keytuples 114 and a keytuple expander 106 then expands the keytuples 114 through use of a thesaurus. The system then passes the expanded keytuples 116 to the information retrieval engine 108 to find data items, e.g., documents whose analysis produced the specified keytuples, based on the expanded keytuples.
Figure 4 depicts a similarity embodiment. In the similarity embodiment, a para-linguistic analyzer 104 receives a text fragment as search data and analyzes the text fragment 112, coming from either a user or (more likely) an application and perhaps derived from a larger document 110, to extract keytuples which a keytuple expander then expands, as in the search embodiment. The system then passes the expanded keytuples 118 onto an information retrieval engine 108. The information retrieval engine finds data items using the expanded keytuples and passes the results to the user 120 or application that provided the original sample text.
The similarity embodiment may provide better results than the search embodiment since meaningful para-linguistic relations are more likely to be found in coherent texts than in short user-created queries.
A description of the fragmenter, the para-linguistic analyzer, the keytuple expander and the information retrieval engine now follows. Note that one may implement each of the above components in software or hardware or a combination of both.
THE FRAGMENTER
One embodiment of the fragmenter takes large compound documents and divides them into smaller chunks for analysis and indexing. These smaller chunks can correspond to paragraphs. In one embodiment the fragmenter determines the fragments by either word processor codes or conventional separations (such as the blank lines separating paragraphs). The system uses a fragmenter because keytuples are typically more precise approximations of meaning than individual keywords and are less likely to appear outside of the discursive contexts for which they are sought.
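The blank-line convention described above can be illustrated with a minimal sketch. The function name `fragment` is hypothetical, and the sketch assumes plain-text input in which paragraphs are separated by one or more blank lines; it does not handle word processor codes.

```python
import re


def fragment(document: str) -> list[str]:
    """Split a plain-text document into paragraph fragments.

    Assumption: paragraphs are separated by at least one blank line.
    """
    parts = re.split(r"\n\s*\n", document)
    # Drop empty pieces and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    text = "First paragraph.\n\nSecond paragraph.\n\n\nThird."
    print(fragment(text))
```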
Optionally, the fragmenter can include original positional information with generated fragments; this information indicates the original position of the fragment within the document and can be used to associate particular document locations with particular extracted keytuples. Such information, when carried along with extracted keytuples and appropriately handled by the information retrieval engine, allows the invention to identify "virtual document fragments" based on proximity and crossing the fragmentation boundaries selected by the fragmenter. For example, embodiments of the invention could identify locations where a system according to the invention extracted the keytuples <analyze,text> and <retrieve,information> within 100 characters of each other. This 100-character fragment (or a small expansion of it) is a virtual fragment created by the search process from an underlying positional representation.
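One way to picture the proximity lookup is the following sketch. It assumes that positional information has already been collected into a mapping from each keytuple to the character offsets at which it was extracted; the function name `virtual_fragments` and that mapping are illustrative assumptions, not part of the described system.

```python
from itertools import product

Keytuple = tuple[str, ...]


def virtual_fragments(
    positions: dict[Keytuple, list[int]],
    first: Keytuple,
    second: Keytuple,
    window: int = 100,
) -> list[tuple[int, int]]:
    """Return (start, end) character spans where both keytuples were
    extracted within `window` characters of each other."""
    spans = []
    for p, q in product(positions.get(first, []), positions.get(second, [])):
        if abs(p - q) <= window:
            spans.append((min(p, q), max(p, q)))
    return sorted(spans)


if __name__ == "__main__":
    # Hypothetical offsets for two keytuples within one document.
    offsets = {
        ("analyze", "text"): [10, 500],
        ("retrieve", "information"): [80, 900],
    }
    print(virtual_fragments(offsets, ("analyze", "text"), ("retrieve", "information")))
```

Each returned span could then be widened slightly to produce the virtual fragment that is shown to the user.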
THE PARA-LINGUISTIC ANALYZER (PLA)
Robust and efficient extraction of meaning from unrestricted natural language text remains a challenge. Embodiments of the present invention can use a range of methods, which are aimed at extracting possible word relations within a text but not extracting meanings. These methods use very simple syntactic rules and perform morphological analysis on individual word forms. Ken W. Church provided one of the earliest applications of such methods in a 1980 paper entitled "On Memory Limitations in Natural Language Processing," published in MIT Laboratory of Computer Science Technical Report MIT/LCS/TR-245, and incorporated herein by reference in its entirety. In this application, automatic analysis heuristically extracted possible triples from text. A major component of such algorithms is part-of-speech determination, using methods such as hidden Markov models as described by D. Cutting, J. Kupiec, J. Pedersen and P. Sibun in a 1992 paper entitled "A Practical Part-of-Speech Tagger" published in Proc. 3rd ANLP, Trento, Italy, pages 133-140, and incorporated by reference herein in its entirety. Alternatively one can use hand coded methods such as those described by Eric Brill in a 1995 paper entitled "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging," published in Computational Linguistics, and incorporated herein by reference in its entirety.
The purpose of the para-linguistic analyzer is not to extract unambiguous meaning or logical form, but to identify significant combinations of words from the source documents that indicate relationships and relevant context.
Para-linguistic methods explicitly over-generate possible word relations to compensate for their relative lack of precision in analysis. For example, a text such as "John saw the woman in the mirror" might generate relationships <saw,in,mirror> and <woman,in,mirror> even though common sense tells the uninitiated reader that it is unlikely that the woman was "in the mirror". However, para-linguistic analysis does not identify such subtleties and so prefers to over-generate relations, such as <woman,in,mirror>, to make up for the deficient understanding. In one embodiment a PLA begins by tagging individual words in a text with parts of speech and determining root forms. For example, as shown in Figure 5, the PLA analyzes the sentence "Tomorrow's meetings with Kodak will be in the Rainsford Room" by determining the root forms of the words (i.e., Tomorrow, meeting, with, Kodak, will, be, in, the, Rainsford Room) and tags the root forms with parts of speech data (i.e., modifier, noun, preposition, name, auxiliary, verb, preposition, article, and name, respectively). The PLA then produces keytuples that connect adjectives to subsequent nouns and nouns to subsequent verbs.
For example, as shown in Figure 6, the PLA couples a modifier (tomorrow) with a neighboring noun (meeting). Similarly, the PLA couples a noun (meeting) with a neighboring preposition (with) and name (Kodak) and couples a name (Kodak) with a preposition (in) and another name (Rainsford Room). The PLA also connects nouns and verbs to prepositional arguments and their objects. In different languages, these simple rules would be different and special cases might apply.
In one implemented embodiment, the general procedure for para-linguistic analysis has the following structure:
1. Break the input fragment K into a vector of words W[i].
2. Determine the likely parts of speech P[i] and linguistic root forms R[i] for each W[i].
3. For each W[i]:
   a. If P[i] is 'adjective', find the next W[j] (j>i) such that P[j] is 'noun' and record the tuple <R[i],R[j]>.
   b. If P[i] is 'verb', find the closest preceding W[j] (j<i) for which P[j] is 'noun' and record the tuple <R[j],R[i]>.
   c. If P[i] is 'preposition', find the next W[j] such that P[j] is 'noun', record the tuple <W[i],R[j]>, and iterate over the preceding words W[k] (k<i):
      i. If P[k] is 'noun', record the tuple <R[k],W[i],R[j]>.
      ii. If P[k] is 'verb', record the tuple <R[k],W[i],R[j]> and exit the iteration (c).
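The procedure above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implemented embodiment: the input is assumed to be already tagged (steps 1 and 2 are not shown), the `Token` type and the function names are hypothetical, and step 3(c) is assumed to walk backward from the word nearest the preposition, since the text does not fix the direction.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    word: str  # surface form W[i]
    pos: str   # part of speech P[i]
    root: str  # root form R[i]


def next_noun(tokens: list[Token], i: int) -> Optional[int]:
    """Index of the first noun after position i, or None."""
    for j in range(i + 1, len(tokens)):
        if tokens[j].pos == "noun":
            return j
    return None


def extract_keytuples(tokens: list[Token]) -> list[tuple[str, ...]]:
    out: list[tuple[str, ...]] = []
    for i, tok in enumerate(tokens):
        if tok.pos == "adjective":  # step 3(a)
            j = next_noun(tokens, i)
            if j is not None:
                out.append((tok.root, tokens[j].root))
        elif tok.pos == "verb":  # step 3(b)
            for j in range(i - 1, -1, -1):
                if tokens[j].pos == "noun":
                    out.append((tokens[j].root, tok.root))
                    break
        elif tok.pos == "preposition":  # step 3(c)
            j = next_noun(tokens, i)
            if j is None:
                continue
            out.append((tok.word, tokens[j].root))
            # Assumption: scan preceding words nearest-first.
            for k in range(i - 1, -1, -1):
                if tokens[k].pos == "noun":
                    out.append((tokens[k].root, tok.word, tokens[j].root))
                elif tokens[k].pos == "verb":
                    out.append((tokens[k].root, tok.word, tokens[j].root))
                    break
    return out


if __name__ == "__main__":
    sentence = [
        Token("John", "noun", "John"), Token("saw", "verb", "see"),
        Token("the", "article", "the"), Token("woman", "noun", "woman"),
        Token("in", "preposition", "in"), Token("the", "article", "the"),
        Token("mirror", "noun", "mirror"),
    ]
    for keytuple in extract_keytuples(sentence):
        print(keytuple)
```

On the "John saw the woman in the mirror" example, the sketch over-generates both <woman,in,mirror> and the verb-attached triple, here rendered with the root form "see" because step 3(c) records R[k].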
Many other implementations are possible with the same general logical structure. For example, Figure 7 depicts a special case for English where the PLA handles the passive construction to produce a keytuple that reflects the object-verb-by-subject structure of the English passive. More specifically, the PLA analyzes the sentence "the public presentations will be followed by cocktails" to determine root forms and to tag the root forms (i.e., the, public, presentation, will, be, followed, by, cocktail) with parts of speech data (i.e., article, modifier, noun, auxiliary, be, verb, preposition, and noun, respectively). Next, the PLA applied in an English context uses conventional techniques to determine that the sentence is in the passive voice. Given that the sentence is in the passive voice, the PLA constructs a keytuple, i.e., <cocktail, follow, presentation>, which reflects the object-verb-by-subject structure of the English passive.
Note that in this embodiment, the PLA treats certain compound nouns and proper names as single lexical tokens, so that the PLA analyzes "Rainsford Room" as a single word and relates the phrase to other words as a unit.
While the above-described PLA is one embodiment, one can construct another embodiment of a PLA by analysis of a large corpus and by the extraction of word pairs that commonly co-occur within some distance of each other; simple para-linguistic analysis would then consist of filtering a text for such common word pairs.
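A minimal sketch of this corpus-driven alternative is given below. The function names, the whitespace tokenization, the distance of three words, and the minimum count of two are all illustrative assumptions; a practical system would use a much larger corpus and a better-founded threshold.

```python
from collections import Counter


def cooccurring_pairs(words: list[str], distance: int) -> list[tuple[str, str]]:
    """All ordered pairs (w, v) where v follows w within `distance` words."""
    pairs = []
    for i, w in enumerate(words):
        for v in words[i + 1 : i + 1 + distance]:
            pairs.append((w, v))
    return pairs


def learn_common_pairs(corpus: list[str], distance: int = 3, min_count: int = 2) -> set[tuple[str, str]]:
    """Collect word pairs that co-occur at least `min_count` times in the corpus."""
    counts: Counter = Counter()
    for text in corpus:
        counts.update(cooccurring_pairs(text.lower().split(), distance))
    return {pair for pair, n in counts.items() if n >= min_count}


def filter_pairs(text: str, common: set[tuple[str, str]], distance: int = 3) -> list[tuple[str, str]]:
    """Para-linguistic analysis by filtering a text for known common pairs."""
    return [p for p in cooccurring_pairs(text.lower().split(), distance) if p in common]


if __name__ == "__main__":
    corpus = ["retrieve information quickly", "retrieve useful information", "store data"]
    common = learn_common_pairs(corpus)
    print(filter_pairs("we retrieve information", common))
```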
Finally, the PLA may use the optional positional information provided by the fragmenter to associate a text document position with each keytuple. This text document position would be passed through the keytuple expander and then on to the information retrieval engine.
THE KEYTUPLE EXPANDER (KE)
There are different embodiments of the KE. In a lexical mode, the KE expands an arbitrary keytuple by expanding the individual words within the keytuple to synonymous words looked up in a thesaurus to create sets of synonyms and then creates a set of keytuples that typically include at most one synonym from each of the sets of synonyms. Figure 8 illustrates the lexical mode. The keytuple is <meeting, with, Kodak>. The KE expands the word "meeting" to include synonyms "conference" and "discussion" and the word "Kodak" to include synonym "Eastman Kodak." The KE then creates a set of keytuples that typically include at most one synonym from each of the sets of synonyms, e.g., <discussion, with, Eastman Kodak>, <conference, with, Eastman Kodak>, and <conference, with, Kodak>.
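The lexical mode can be sketched as a Cartesian product over per-word choices, where each choice is either the original word or one of its synonyms. The function name `expand_lexical` and the dictionary-based thesaurus are assumptions made for illustration; the thesaurus entries are taken from the Figure 8 example.

```python
from itertools import product


def expand_lexical(keytuple: tuple[str, ...], thesaurus: dict[str, list[str]]) -> list[tuple[str, ...]]:
    """Expand a keytuple by substituting at most one synonym per word position.

    The original keytuple is always the first element of the result.
    """
    choices = [[word] + thesaurus.get(word, []) for word in keytuple]
    return list(product(*choices))


if __name__ == "__main__":
    thesaurus = {
        "meeting": ["conference", "discussion"],
        "Kodak": ["Eastman Kodak"],
    }
    for keytuple in expand_lexical(("meeting", "with", "Kodak"), thesaurus):
        print(keytuple)
```

With three choices for "meeting", one for "with", and two for "Kodak", the sketch yields six keytuples, including the original.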
In an inference mode, the KE can use a keytuple thesaurus to expand particular keytuples in particular ways. Figure 9 illustrates the inference mode, combined with the lexical mode of Figure 8. First the KE expands the keytuple <meeting, with, Kodak> using a tuple thesaurus to obtain a first set of keytuples: <talk, with, Kodak>, <see, Kodak>, <meeting, with, Kodak>, and <meet, with, Kodak>. Then the KE applies the lexical mode to at least part of the first set of keytuples to obtain a second set of keytuples. In other words, for each subject keytuple from the first set of keytuples, the KE expands the keytuple by expanding the individual words within the keytuple to synonymous words looked up in a thesaurus to create sets of synonyms and then creates a second set of keytuples in which each keytuple in the second set of keytuples typically includes at most one synonym from each of the sets of synonyms.
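The two-stage expansion can be sketched as follows. The names `expand_lexical` and `expand_inference` and the dictionary-based thesauri are hypothetical; the tuple thesaurus entries follow the Figure 9 example, and the sketch applies the lexical mode to every keytuple in the first set, although the text allows applying it to only part of that set.

```python
from itertools import product

Keytuple = tuple[str, ...]


def expand_lexical(keytuple: Keytuple, word_thesaurus: dict[str, list[str]]) -> list[Keytuple]:
    """Substitute at most one synonym per word position."""
    choices = [[w] + word_thesaurus.get(w, []) for w in keytuple]
    return list(product(*choices))


def expand_inference(
    keytuple: Keytuple,
    tuple_thesaurus: dict[Keytuple, list[Keytuple]],
    word_thesaurus: dict[str, list[str]],
) -> list[Keytuple]:
    """First expand through the tuple thesaurus, then lexically."""
    first_set = [keytuple] + tuple_thesaurus.get(keytuple, [])
    second_set: list[Keytuple] = []
    seen: set[Keytuple] = set()
    for kt in first_set:
        for expanded in expand_lexical(kt, word_thesaurus):
            if expanded not in seen:  # avoid duplicates across expansions
                seen.add(expanded)
                second_set.append(expanded)
    return second_set


if __name__ == "__main__":
    tuple_thesaurus = {
        ("meeting", "with", "Kodak"): [
            ("talk", "with", "Kodak"),
            ("see", "Kodak"),
            ("meet", "with", "Kodak"),
        ]
    }
    word_thesaurus = {"Kodak": ["Eastman Kodak"]}
    for kt in expand_inference(("meeting", "with", "Kodak"), tuple_thesaurus, word_thesaurus):
        print(kt)
```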
One design choice in lexical mode involves the character of the thesaurus and how it is used. As shown in Figure 10, a word like 'meeting' may have no direct synonyms but may have near synonyms which have a more precise meaning, e.g., conference, sales call, or board meeting, or a more general meaning, e.g., interaction, gathering, or discussion. For example, in a trip report from a sales representative, meeting can often be taken as meaning "sales call". Likewise, searches on databases of business transactions are unlikely to include the sense of "meeting" which includes religious revivals or services. For different situations or applications, or when analyzing documents from different sources, different thesauri may be applied in the expansion.
The latter case demonstrates how the range in which the expansion occurs may also be subject to the genre and character of the database being searched. For example, some synonyms may only apply to word meanings outside of the scope of the database and the use of these synonyms in expansions will be either irrelevant (the expansions are not found) or erroneous (the expansions get the wrong meaning despite the contextual information of the tuple). Considerations of genre and character can directly affect search results. For example, when searching a collection of databases for the compound noun "sales calls," searches of sales representative trip reports could expand to the word "meetings". Likewise, searches on the same databases for meetings would not be expanded to the term "services" (as it might in a pastoral religious genre).
The rules for expanding keytuples may reflect the structure or tagging information, i.e., the parts-of-speech data, attached to the keytuple. For instance, the KE, when expanding the triple <'meet','in','Paris'> might not expand the preposition 'in'. The rules used by the KE may also include language-specific semantic preferences, so that the KE might expand <'shot','at',?> into <'fire','at',?> but not <'photograph','at',?>. Such rules are necessarily language specific and may also be part of the specific application of the invention to a language and domain. For example, applied to a domain of surgical reports, the term 'separate' might be expanded to include 'cut,' but this expansion would be inappropriate in most everyday domains.
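A sketch of such tag-aware, domain-aware expansion follows. The table `DOMAIN_THESAURI`, the function `expand_tagged`, and the entry mapping 'from' to 'off' are hypothetical and exist only to show that a preposition is left unexpanded even when the thesaurus has an entry for it; the 'separate' to 'cut' entry comes from the surgical example above.

```python
from itertools import product

# Hypothetical per-domain thesauri; only 'separate' -> 'cut' is from the text.
DOMAIN_THESAURI: dict[str, dict[str, list[str]]] = {
    "surgical": {"separate": ["cut"], "from": ["off"]},
    "everyday": {},
}


def expand_tagged(
    tagged: list[tuple[str, str]],
    domain: str,
    frozen: tuple[str, ...] = ("preposition",),
) -> list[tuple[str, ...]]:
    """Expand (word, part-of-speech) pairs using the thesaurus for `domain`.

    Words whose part of speech is listed in `frozen` are never expanded.
    """
    thesaurus = DOMAIN_THESAURI.get(domain, {})
    choices = []
    for word, pos in tagged:
        if pos in frozen:
            choices.append([word])
        else:
            choices.append([word] + thesaurus.get(word, []))
    return list(product(*choices))


if __name__ == "__main__":
    triple = [("separate", "verb"), ("from", "preposition"), ("tissue", "noun")]
    print(expand_tagged(triple, "surgical"))
    print(expand_tagged(triple, "everyday"))
```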
The inference mode of the KE provides for two related functions.
First, certain keytuples strongly indicate meanings of words that might license wider expansion than would otherwise be wise. There are no general criteria for identifying such keytuples but the class of criteria can often be organized around particular patterns of verb or noun usage. One can look at how a verb like "light" determines that its argument can be expanded more aggressively than typical for any verb. This license for expansion could be limited, at the same time, to certain categories and kinds of relations among synonyms, near-synonyms, or otherwise associated terms. For example, a tuple such as <'light','fire'> might be readily expanded into <'light','flame'> even if the default expansion rules of lexical mode might rule out the expansion.
Second, the inference mode of the keytuple expander provides a certain inferential component to the search process. For example, a tuple like <'fire','gun'> can expand into a tuple like <'pull','trigger'>, indicating a relationship which is not an equivalence of meaning but an inference of circumstance. Such an inference of circumstance is related to the use of inference networks in information retrieval to expand from particular keywords or keyword combinations to other keywords. H. Turtle and W. Croft, in a 1991 paper entitled "Evaluation of inference network-based retrieval methods" published in ACM Transactions on Information Systems, 9(3):187-222 and incorporated herein by reference in its entirety, discuss the use of inference networks. In the case of embodiments of the present invention, however, keytuples provide for a more reliable and robust expansion than do keywords.
The table used in inference mode of the keytuple expander can be constructed either by hand or by statistical methods over a corpus of texts. Commonly co-occurring keytuples can be entered into this table, as can keytuples that do not co-occur with each other but repeatedly co-occur with the other similar keytuples. Various methods of textual data mining, statistical analysis, and automated thesaurus creation, normally applied to individual keywords or co-occurring keyword pairs, can be applied to keytuples in order to create this table. One approach to such generation is discussed by Gregory Grefenstette in his 1993 University of Pittsburgh PhD thesis entitled "Automatic Thesaurus Discovery Via Selective Natural Language Processing: A Corpus Based Approach" and incorporated herein by reference in its entirety.
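The simplest of these statistical constructions, entering commonly co-occurring keytuples into the table, can be sketched as below. The function name `build_tuple_table`, the per-fragment input format, and the minimum count of two are assumptions for illustration; the second-order case (keytuples that share co-occurring neighbors) is not covered.

```python
from collections import Counter
from itertools import combinations

Keytuple = tuple[str, ...]


def build_tuple_table(
    fragment_keytuples: list[list[Keytuple]], min_count: int = 2
) -> dict[Keytuple, list[Keytuple]]:
    """Map each keytuple to the keytuples it co-occurs with in at least
    `min_count` fragments."""
    pair_counts: Counter = Counter()
    for keytuples in fragment_keytuples:
        # Count each unordered pair once per fragment.
        for a, b in combinations(sorted(set(keytuples)), 2):
            pair_counts[(a, b)] += 1
    table: dict[Keytuple, list[Keytuple]] = {}
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            table.setdefault(a, []).append(b)
            table.setdefault(b, []).append(a)
    return table


if __name__ == "__main__":
    fragments = [
        [("fire", "gun"), ("pull", "trigger")],
        [("fire", "gun"), ("pull", "trigger"), ("load", "gun")],
        [("load", "gun")],
    ]
    print(build_tuple_table(fragments))
```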
One embodiment of the structure of a process employed by the keytuple expander is shown in Figures 8 and 9.
THE RETRIEVAL ENGINE
The retrieval engine may use a variety of different algorithms and methods developed over decades of research on information retrieval systems. The key function of the retrieval engine is to take a set of "search keys" and return a set of documents based on those keys. This function may be implemented in numerous ways to reflect the varying degrees of importance of particular search keys either in general or with respect to a particular document.
One embodiment of the retrieval engine is an engine that employs the vector space method (discussed above). In this method, documents, fragments, and/or queries are represented by large sparse vectors where each component in the vector corresponds to a particular keytuple. A component contains zero if the document or query does not contain the keytuple (thus the vectors are always sparse) and contains 1 or a weight (possibly based on other criteria) if the keytuple has been generated by analysis and expansion from the document, fragment, or query. A common criterion for determining the weight is the frequency of a term, either within a document or within the entire corpus, or their combination. For instance, one standard metric is to weigh the term by the product of the term frequency (typically normalized to account for document size) and the inverse document frequency (derived from how many documents contain the term, again typically normalized with respect to the number of terms and scaled logarithmically). This takes into account the greater prominence of common terms within a document and the typically lower discriminative utility of terms that occur frequently across documents.
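The weighting just described can be sketched with dictionaries standing in for sparse vectors. The sketch assumes one common variant, term frequency divided by document length multiplied by log(N / document frequency); the function name `tfidf_vectors` is hypothetical and other normalizations are equally possible.

```python
import math
from collections import Counter

Keytuple = tuple[str, ...]


def tfidf_vectors(docs: list[list[Keytuple]]) -> list[dict[Keytuple, float]]:
    """Return one sparse weight vector (a dict) per document.

    Assumed weighting: (count / document length) * log(N / document frequency).
    """
    n = len(docs)
    df: Counter = Counter()
    for keytuples in docs:
        df.update(set(keytuples))  # each document counts once per keytuple
    vectors = []
    for keytuples in docs:
        tf = Counter(keytuples)
        total = len(keytuples) or 1
        vectors.append(
            {kt: (count / total) * math.log(n / df[kt]) for kt, count in tf.items()}
        )
    return vectors


if __name__ == "__main__":
    docs = [
        [("fire", "gun"), ("pull", "trigger")],
        [("fire", "gun")],
    ]
    for vector in tfidf_vectors(docs):
        print(vector)
```

A keytuple that appears in every document receives a weight of zero under this variant, which reflects its low discriminative utility.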
One version of the vector space method that can be used is the "cosine method" which compares queries and documents by measuring the angle (in a very high dimensional space) between their vectors. The cosine method has been used extensively in information retrieval where the vector elements correspond to keywords. The sparse vectors for keytuples are much larger and sparser than for keywords. On the other hand, because keytuples are much less likely to occur multiple times in a document, weighting calculations can sometimes avoid tracking term frequency within a document.
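For sparse keytuple vectors stored as dictionaries, the cosine comparison can be sketched as follows. The function name `cosine` and the dictionary representation are illustrative assumptions; the weights in the usage example are binary, in line with the remark that within-document term frequency can sometimes be ignored for keytuples.

```python
import math

Keytuple = tuple[str, ...]


def cosine(u: dict[Keytuple, float], v: dict[Keytuple, float]) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    query = {("meeting", "with", "Kodak"): 1.0}
    fragment = {("meeting", "with", "Kodak"): 1.0, ("tomorrow", "meeting"): 1.0}
    print(cosine(query, fragment))
```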
There are numerous other methods and metrics which can be applied in the information retrieval engine. Nearly any retrieval method that functions for keywords can be extended to apply to keytuples. However, the implementation of these methods typically requires additional optimizations and modified data structures to deal with the fact that the space of possible keytuples is much larger than the space of possible keywords. For example, many modern implementations of vector space methods rely on manipulating compact vectors in physical memory where terms are associated with a particular vector position or index. Thus, a word like "fire" may be associated with the index 373 (for instance) in a number of tables describing documents and the corpus as a whole. This is feasible where the number of terms may only run into the tens of thousands, but is infeasible with keytuples, where the number of terms may run into the millions. Alternative optimizations, such as using hash tables or tree structures, must then replace the position-indexed tables of keyword-based approaches. This is indicative of the kinds of adaptations that must be made to conventional keyword-driven information retrieval algorithms in order to function efficiently and effectively with keytuples. Embodiments of the information retrieval engine associate a text fragment or text fragment identifier with a set of key terms, which are keytuples extracted and expanded from the fragment. One embodiment of the information retrieval engine, given a set of terms, returns and ranks documents based on similarity of the sets of analyzed key terms. In the case where the embodiment uses original positional information, it should also be able to associate such terms with documents and positions and return derived terms occurring near one another in particular documents.
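A hash-table index of the kind mentioned above can be sketched with a Python dictionary keyed directly by keytuple, so that no dense position table is needed. The class name `KeytupleIndex`, its methods, and the overlap-count ranking are assumptions for illustration; positional information and weighting are omitted.

```python
from collections import defaultdict

Keytuple = tuple[str, ...]


class KeytupleIndex:
    """Inverted index from keytuples to fragment identifiers."""

    def __init__(self) -> None:
        # Hash table keyed by keytuple rather than by vector position.
        self.postings: dict[Keytuple, set[str]] = defaultdict(set)

    def add(self, fragment_id: str, keytuples: list[Keytuple]) -> None:
        for kt in keytuples:
            self.postings[kt].add(fragment_id)

    def search(self, query_keytuples: list[Keytuple]) -> list[str]:
        """Rank fragments by the number of query keytuples they share."""
        scores: dict[str, int] = defaultdict(int)
        for kt in set(query_keytuples):
            for fragment_id in self.postings.get(kt, ()):
                scores[fragment_id] += 1
        # Highest overlap first; ties broken by identifier for determinism.
        return sorted(scores, key=lambda f: (-scores[f], f))


if __name__ == "__main__":
    index = KeytupleIndex()
    index.add("f1", [("meeting", "with", "Kodak"), ("tomorrow", "meeting")])
    index.add("f2", [("meeting", "with", "Kodak")])
    print(index.search([("meeting", "with", "Kodak"), ("tomorrow", "meeting")]))
```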
Having thus described at least one illustrative embodiment of the invention, various alterations, modifications and improvements are contemplated by the invention including the following: the addition of keytuple expansion rules by dynamic learning and user instruction; the analysis of the statistical inter-dependency of keytuples in comparing keytuple descriptions; and the expansion of keytuples across natural languages to support inter-lingual text searching. Such alterations, modifications and improvements are intended to be within the scope and spirit of the invention. Accordingly, the foregoing description is by way of example only and is not intended as limiting. The invention's limit is defined only in the following claims and the equivalents thereto.

Claims

What is claimed is:
1. A system for managing at least one data item, the system comprising: a para-linguistic analyzer operative to receive search data and to identify a first keytuple included in the search data; a keytuple expander in communication with the para-linguistic analyzer and operative to generate a set of keytuples associated with the first keytuple; and an information retrieval engine in communication with the keytuple expander and operative to manage at least one data item based at least in part on the set of keytuples.
2. The system of claim 1 wherein the system further comprises: a fragmenter in communication with the para-linguistic analyzer and operative to receive documents and to separate the documents into a plurality of fragments, wherein the para-linguistic analyzer is operative to use a first fragment from the plurality of fragments as search data, and wherein the information retrieval engine associates the set of keytuples with the first fragment.
3. The system of claim 2 wherein the plurality of fragments are paragraphs.
4. The system of claim 1 wherein the search data is a query and wherein the information retrieval engine is operative to rank data items by comparing keytuple sets associated with data items with keytuple sets associated with the query.
5. The system of claim 1 wherein the search data is a text fragment and wherein the information retrieval engine is operative to rank data items by comparing keytuple sets associated with data items with keytuple sets associated with the query.
6. The system of claim 1 wherein the first keytuple is a pair of words with linguistic significance to the search data.
7. The system of claim 1 wherein the first keytuple is three words with linguistic significance to the search data.
8. A method for managing at least one data item, the method comprising: receiving search data; identifying a first keytuple included in the search data; generating a set of keytuples associated with the first keytuple; and managing at least one data item based at least in part on the set of keytuples.
9. The method of claim 8 wherein receiving search data comprises: receiving a document; separating the document into a plurality of text fragments; and using a first text fragment from the plurality of text fragments as the search data.
10. The method of claim 9 wherein the text fragments are paragraphs.
11. The method of claim 9 wherein managing at least one data item comprises associating the generated set of keytuples with the first text fragment.
12. The method of claim 8 wherein the search data is a text fragment.
13. The method of claim 8 wherein identifying a first keytuple comprises identifying a plurality of first keytuples and wherein generating a set of keytuples comprises generating a set of keytuples for each of the plurality of first keytuples.
14. The method of claim 8 wherein the search data is text and wherein identifying the first keytuple comprises: associating words in the text with parts-of-speech data; determining root forms of words in the text; and connecting the root forms of words based on the parts-of-speech data associated with the words.
15. The method of claim 8 wherein a keytuple is a plurality of words with linguistic significance to the search data.
16. The method of claim 15 wherein generating a set of keytuples comprises expanding the words of the keytuple to natural language synonyms.
17. The method of claim 15 wherein generating a set of keytuples comprises: expanding the first keytuple to keytuple synonyms to create a first set of keytuples; and expanding the words of each of the keytuples in the first set of keytuples to natural language synonyms to create a second set of keytuples.
18. The method of claim 8 wherein the search data is a query and wherein managing a data item comprises: ranking a set of data items by comparing keytuple sets associated with data items with keytuple sets associated with the query.
19. The method of claim 8 wherein the search data is a fragment and wherein managing a data item comprises: ranking a set of data items by comparing keytuple sets associated with data items with keytuple sets associated with the fragment.
20. A system for managing at least one data item, the system comprising: para-linguistic analyzer means for receiving search data and for identifying a first keytuple included in the search data; keytuple expander means in communication with the para-linguistic analyzer means, the keytuple expander means for generating a set of keytuples associated with the first keytuple; and information retrieval means in communication with the keytuple expander means, the information retrieval means for managing at least one data item based at least in part on the set of keytuples.
PCT/US2003/019338 2002-06-17 2003-06-17 Para-linguistic expansion WO2003107141A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003253663A AU2003253663A1 (en) 2002-06-17 2003-06-17 Para-linguistic expansion

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US38918802P 2002-06-17 2002-06-17
US60/389,188 2002-06-17

Publications (2)

Publication Number Publication Date
WO2003107141A2 true WO2003107141A2 (en) 2003-12-24
WO2003107141A3 WO2003107141A3 (en) 2004-05-06

Family

ID=29736601

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2003/019338 WO2003107141A2 (en) 2002-06-17 2003-06-17 Para-linguistic expansion

Country Status (2)

Country Link
AU (1) AU2003253663A1 (en)
WO (1) WO2003107141A2 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5963940A (en) * 1995-08-16 1999-10-05 Syracuse University Natural language information retrieval system and method
US6076088A (en) * 1996-02-09 2000-06-13 Paik; Woojin Information extraction system and method using concept relation concept (CRC) triples
US6263335B1 (en) * 1996-02-09 2001-07-17 Textwise Llc Information extraction system and method using concept-relation-concept (CRC) triples
US5933822A (en) * 1997-07-22 1999-08-03 Microsoft Corporation Apparatus and methods for an information retrieval system that employs natural language processing of search results to improve overall precision

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005111860A1 (en) * 2004-05-13 2005-11-24 Robert John Rogers A system and method for retrieving information and a system and method for storing information
GB2430058A (en) * 2004-05-13 2007-03-14 Robert John Rogers A system and method for retrieving information and a system and method for storing information
US7752196B2 (en) 2004-05-13 2010-07-06 Robert John Rogers Information retrieving and storing system and method

Also Published As

Publication number Publication date
AU2003253663A1 (en) 2003-12-31
AU2003253663A8 (en) 2003-12-31
WO2003107141A3 (en) 2004-05-06

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP