United States Patent [19]
Wyard et al.
US006167398A
[11] Patent Number: 6,167,398
[45] Date of Patent:
 INFORMATION RETRIEVAL SYSTEM AND METHOD THAT GENERATES WEIGHTED COMPARISON RESULTS TO ANALYZE THE DEGREE OF DISSIMILARITY BETWEEN A REFERENCE CORPUS AND A CANDIDATE DOCUMENT
 Inventors: Peter J Wyard, Woodbridge; Tony G Rose, Guildford, both of United Kingdom
 Assignee: British Telecommunications public limited company, London, United Kingdom
 Appl. No.: 09/068,452
 PCT Filed: Jan. 30, 1998
 PCT No.: PCT/GB98/00294
§ 371 Date: May 13, 1998
§ 102(e) Date: May 13, 1998
PCT Pub. No.: WO98/34180
PCT Pub. Date: Aug. 6, 1998
Foreign Application Priority Data
Jan. 30, 1997 [GB] United Kingdom 9701866
 Int. Cl.7 G06F 17/30
 U.S. Cl. 707/5; 707/2; 707/3; 707/4
 Field of Search 707/5, 10, 2, 3,
 References Cited
U.S. PATENT DOCUMENTS
5,625,767 4/1997 Bartell et al 345/440
5,724,571 3/1998 Woods 707/5
5,873,076 2/1999 Barr et al 707/3
5,907,839 5/1999 Roth 707/5
5,937,422 8/1999 Nelson et al 707/531
FOREIGN PATENT DOCUMENTS
0687987 A1 12/1995 European Pat. Off.
WO 92/04681 3/1992 WIPO
WO 96/32686 10/1996 WIPO
W. Bruce Croft, Intelligent Internet Services Effective Text Retrieval Based on Combining Evidence from the Corpus and Users, vol. 10, issue 6, IEEE electronic library online, pp. 59-63, Dec. 1995.
Besancon et al., Textual Similarities Based on a Distributional Approach, IEEE electronic library online, pp. 180-184, Sep. 1999.
Chapter 4 of the book "Introduction to Modern Information Retrieval" by G. Salton, published by McGraw Hill, 1983.
Dunning, "Accurate Methods for the Statistics of Surprise and Coincidence", Computational Linguistics, vol. 19, No. 1, 1993.
Katz, "Estimation of Probabilities from Sparse Data", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-35, 1987.
(List continued on next page.)
Primary Examiner—John Breene
Assistant Examiner—Greta L. Robinson
Attorney, Agent, or Firm—Nixon & Vanderhye PC.
An internet information agent accepts a reference document and analyzes it in accordance with the metrics defined by its analysis algorithm, obtaining a respective list for each metric (word, character-level n-gram, word-level n-gram). The agent derives weights corresponding to the metrics, applies the metrics to a candidate document to obtain respective returned values, applies the weights to the returned values, and sums the results to obtain a Document Dissimilarity (DD) value. This DD is compared with a Dissimilarity Threshold (DT), and the candidate document is stored if the DD is less than the DT. A user can apply relevance values to the search results, and the agent modifies the weights accordingly. The agent can be used to improve a language model for use in speech recognition applications and the like.
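The weighted-sum step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the per-metric measure used here (the fraction of candidate items absent from the reference list), the equal weights, the n-gram sizes, and all function names are assumptions chosen only for illustration. The abstract itself specifies only the three list types, the weighted sum that yields DD, and the comparison of DD against DT.

```python
from collections import Counter

def word_counts(text):
    # Word list: lower-cased, whitespace-separated tokens.
    return Counter(text.lower().split())

def char_ngrams(text, n=3):
    # Character-level n-grams over whitespace-normalised text (n=3 is an assumption).
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def word_ngrams(text, n=2):
    # Word-level n-grams (n=2 is an assumption).
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def dissimilarity(ref_list, cand_list):
    # Hypothetical per-metric value: share of candidate items not found in the
    # reference list. 0.0 means every item is shared, 1.0 means none is.
    total = sum(cand_list.values())
    if total == 0:
        return 1.0
    unseen = sum(c for item, c in cand_list.items() if item not in ref_list)
    return unseen / total

METRICS = {
    "word": word_counts,
    "char_ngram": char_ngrams,
    "word_ngram": word_ngrams,
}

def document_dissimilarity(reference, candidate, weights):
    # DD = sum over metrics of (weight * value returned by that metric).
    return sum(
        weights[name] * dissimilarity(extract(reference), extract(candidate))
        for name, extract in METRICS.items()
    )

def should_retain(reference, candidate, weights, threshold):
    # The candidate document is stored only if DD is less than DT.
    return document_dissimilarity(reference, candidate, weights) < threshold

# Equal weights are an assumption; the agent derives and later adjusts them.
weights = {"word": 1 / 3, "char_ngram": 1 / 3, "word_ngram": 1 / 3}
reference = "the cat sat on the mat"
print(should_retain(reference, "the cat sat", weights, threshold=0.5))
print(should_retain(reference, "quantum flux", weights, threshold=0.5))
```

In this sketch a lower DD means the candidate is closer to the reference, so the threshold test keeps similar documents and discards dissimilar ones. User relevance feedback would then adjust the entries of `weights`; the abstract does not say how, so that step is left out.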
18 Claims, 4 Drawing Sheets
[Flowchart: user clicks on agent button; user enters URL of reference document; agent retrieves reference document; agent derives document dissimilarity (reference numeral 40); agent compares document dissimilarity with dissimilarity threshold; agent writes candidate document to retained text store if DD less than DT.]
Jelinek, "Self-Organised Language Modelling for Speech Recognition", Readings in Speech Recognition, edited by A. Waibel and K. Lee, published by Morgan Kaufmann, 1990.
Pearce et al, Generating a Dynamic Hypertext Environment with n-gram Analysis, Proceedings of the International Conference on Information and Knowledge Management CIKM, Nov. 1, 1993, pp. 148-153, XP000577412.
Wong et al, "Implementations of Partial Document Ranking Using Inverted Files", Information Processing & Management (Incorporating Information Technology), vol. 29, No. 5, Sep. 1993, pp. 647-669, XP002035616.