WO1997010537A2 - Method and apparatus for identifying textual documents and multi-media files corresponding to a search topic - Google Patents


Info

Publication number
WO1997010537A2
WO1997010537A2 PCT/US1996/015664 US9615664W WO9710537A2 WO 1997010537 A2 WO1997010537 A2 WO 1997010537A2 US 9615664 W US9615664 W US 9615664W WO 9710537 A2 WO9710537 A2 WO 9710537A2
Authority
WO
WIPO (PCT)
Prior art keywords
information
text
document
image
Prior art date
Application number
PCT/US1996/015664
Other languages
French (fr)
Other versions
WO1997010537A3 (en
Inventor
Thomas Barr
James T. Beattie
Lawrence A. Husick
Joshua Kopelman
Michael S. Krupit
Howard Morgan
Edwin Howell Watkeys, III
Marvin I. Weinberger
Original Assignee
Infonautics Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US08/529,250 external-priority patent/US5742816A/en
Priority claimed from US08/529,233 external-priority patent/US5675788A/en
Priority claimed from US08/528,683 external-priority patent/US5659742A/en
Application filed by Infonautics Corporation filed Critical Infonautics Corporation
Priority to AU72026/96A priority Critical patent/AU7202696A/en
Publication of WO1997010537A2 publication Critical patent/WO1997010537A2/en
Publication of WO1997010537A3 publication Critical patent/WO1997010537A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data

Definitions

  • the present invention is directed to systems for identifying documents corresponding to a search topic or query. More particularly, the present invention is directed to an automated multi-user system for identifying and retrieving text and multi-media files related to a search topic from a database library composed of information from many various publisher sources.
  • Information retrieval systems are designed to store and retrieve information provided by publishers covering different subjects. Both static information, such as works of literature and reference books, and dynamic information, such as newspapers and periodicals, are stored in these systems.
  • Information retrieval engines are provided within prior art information retrieval systems in order to receive search queries from users and perform searches through the stored information. It is an object of most information retrieval systems to provide the user with all stored information relevant to the query.
  • searching/retrieval systems are not adapted to identify the best or most relevant information yielded by the query search.
  • Such systems typically return query results to the user in such a way that the user must retrieve and view every document returned by the query in order to determine which document(s) is/are most relevant.
  • the searching system could return to the user not only a single list having both text and multi-media information relevant to the query search, but also a list which enabled the user to readily identify which of the text and multi-media files were most relevant to the query topic.
  • Each different publisher providing documents that may be retrieved by information retrieval systems typically uses its own information format to store and transmit its information files.
  • an information searching/retrieval system which has a library database based upon information from many various publishers must be compatible with many different publisher formats. This compatibility requirement can serve to slow the performance of an information searching/retrieval system.
  • Although semantic networks may be useful for finding additional relevant documents responsive to a query, it is believed that use of such networks also tends to increase the number of irrelevant documents located by the search. In fact, it is generally believed that the number of additional relevant documents identified through the use of semantic networks is roughly equal to the number of irrelevant documents which are also brought into the search results list as a result of the semantic network. It would be desirable to have a system for implementing a semantic network which maximized the number of relevant documents identified during the search, without substantially increasing the number of irrelevant documents found by the search.
  • the present invention is directed to a method and apparatus for identifying textual documents and multi-media files corresponding to a search topic.
  • a plurality of document records, each of which is representative of at least one textual document, are stored, and a plurality of multi-media records, each of which is representative of at least one multi-media file, are also stored.
  • the document records have text information fields associated therewith, each of the text information fields representing text from one of the plurality of textual documents.
  • the multi-media records have multi-media information fields for representing only digital video (i.e., still images or motion video image sequences), digital audio or graphics information, and associated text fields, each of the associated text fields representing text associated with one of the multi-media information fields.
  • a single search query corresponding to the search topic is received.
  • the single search query is preferably in a natural language format.
  • An index database is searched in accordance with the single search query to simultaneously identify document records and multi-media records related to the single search query.
  • the index database has a plurality of search terms corresponding to terms represented by the text information fields and the associated text fields.
  • the index database also includes a table for associating each of the document and multi-media records with one or more of the search terms.
  • a search result list having entries representative of both textual documents and multi-media files related to the single search query is generated in accordance with the document records and the multi-media records identified by the index database search. Text corresponding to the search topic is retrieved by selecting entries from the search result list representing document records to be retrieved, and then retrieving text represented by the text information fields associated with the selected document records.
  • Digital video, audio or graphics information corresponding to the search topic is retrieved by selecting entries from the search result list representing selected multi-media records to be retrieved, and then retrieving digital video, audio or graphics information represented by multi-media information fields associated with the selected multi-media records.
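The single-query search over both record types described above can be sketched as follows. This is a minimal illustration and not the patent's implementation: the `Record` class, the whitespace tokenizer, and the ranking by number of matching terms are all assumptions made for the example. The point it shows is that multi-media records become searchable through their associated text, so one index serves both kinds of record.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    record_id: int
    kind: str   # "document" or "multimedia" (hypothetical labels)
    text: str   # document text, or the text associated with a media file


def build_index(records):
    """Map each lower-cased term to the set of record ids whose text contains it."""
    index = defaultdict(set)
    for rec in records:
        for term in rec.text.lower().split():
            index[term].add(rec.record_id)
    return index


def search(index, records_by_id, query):
    """Return records of either kind matching any query term, most matches first."""
    hits = defaultdict(int)
    for term in query.lower().split():
        for rid in index.get(term, ()):
            hits[rid] += 1
    ranked = sorted(hits, key=lambda rid: (-hits[rid], rid))
    return [records_by_id[rid] for rid in ranked]


records = [
    Record(1, "document", "Apollo moon landing report"),
    Record(2, "multimedia", "photo of the moon surface"),
    Record(3, "document", "history of jazz"),
]
index = build_index(records)
by_id = {r.record_id: r for r in records}
results = search(index, by_id, "moon landing")
```

Here `results` holds one textual document and one multi-media record in a single list, which is the behaviour the summary describes.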
  • the present invention is directed to a computer-implemented method and apparatus for composing a composite document on a selected topic from a plurality of information sources by searching the plurality of information sources and identifying, displaying and copying files corresponding to the selected topic
  • a plurality of records, each of which is representative of at least one information file, are stored in a database
  • a single search query corresponding to the search topic is received
  • the database is searched in accordance with the single search query to identify records related to the single search query
  • a search result list is then generated having entries representative of information files identified during the database search, and the search result list is displayed in a first display window open on a user display
  • Signals representative of at least first and second selected entries from the search result list are received from the user, the first and second selected entries respectively corresponding to first and second information files
  • a second display window for displaying at least a portion of the first information file is opened on the user display
  • a third display window for displaying at least a portion of the second information file is opened
  • the present invention is directed to a split-server architecture for processing a search query provided by a user, and identifying and retrieving documents from a database corresponding to the search query
  • a session server is provided for receiving the search query from the user
  • the session server has at least a first processor coupled to the user over a communications channel
  • a query server is coupled to the session server
  • the query server has at least a second processor coupled to a first database having records representative of the documents to be searched
  • the query server includes means for receiving the search query from the session server, searching means for searching the first database to identify documents responsive to the search query, and means for sending search results information representative of the documents identified by the searching means from the query server to the session server
  • the session server includes means for sending the search query to the query server, means for receiving the search results information from the query server, means for sending a search results list representative of the search results information across the communications channel to the user, means for receiving a document retrieval request transmitted from the user over the communications channel, means for retrieving a document in response to the retrieval
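The division of labour between the two servers can be sketched in a few lines. This is an assumption-laden illustration only: the class and method names are hypothetical, both "servers" run in one process, and the index and library are plain dictionaries. It shows the session server handling user traffic and document retrieval while delegating all searching to the query server.

```python
class QueryServer:
    """Owns the index database and answers search requests."""

    def __init__(self, index):
        self.index = index  # term -> list of document ids

    def search(self, query):
        ids = []
        for term in query.lower().split():
            for doc_id in self.index.get(term, []):
                if doc_id not in ids:
                    ids.append(doc_id)
        return ids


class SessionServer:
    """Talks to the user, forwards queries, and retrieves documents."""

    def __init__(self, query_server, library):
        self.query_server = query_server
        self.library = library  # document id -> full content

    def handle_query(self, query):
        # Forward the query and relay the results list back to the user.
        return self.query_server.search(query)

    def handle_retrieval(self, doc_id):
        # Retrieval never touches the query server.
        return self.library[doc_id]


query_server = QueryServer({"vietnam": [7, 9], "casualties": [9]})
session = SessionServer(query_server, {7: "Text of document 7", 9: "Text of document 9"})
result_list = session.handle_query("Vietnam casualties")
document = session.handle_retrieval(result_list[0])
```

In a real deployment the two objects would sit on separate machines and exchange messages over a network; the in-process call here stands in for that link.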
  • the present invention is directed to a method for determining a part of speech of words in a sentence or sentence fragment
  • a hidden Markov model for determining the most likely part of speech for the words in the sentence or sentence fragment is provided, wherein the hidden Markov model has an initial transition matrix and a subsequent transition matrix for storing the probabilities of transitions from one part of speech to another
  • the initial matrix of the hidden Markov model is effectively removed by making the probabilities therein equal to each other to provide a modified hidden Markov model
  • the modified hidden Markov model is applied to the sequence of words to determine the most likely part of speech of words within a sentence fragment with increased accuracy
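The effect of equalizing the initial probabilities can be seen with a small Viterbi decoder. This is a sketch under stated assumptions: the two-tag tag set and every probability below are invented for illustration, and the Viterbi algorithm is used here as a standard way to decode a hidden Markov model, not as a method the source names.

```python
def viterbi(words, tags, initial, transition, emission):
    """Return the most likely tag sequence for words under an HMM."""
    # best[i][t] = (probability, previous tag) of the best path ending in tag t at word i
    best = [{t: (initial[t] * emission[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        row = {}
        for t in tags:
            row[t] = max(
                (best[-1][p][0] * transition[p][t] * emission[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(row)
    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for row in reversed(best[1:]):
        tag = row[tag][1]
        path.append(tag)
    return path[::-1]


def flatten_initial(tags):
    """Make every starting tag equally likely, removing the initial matrix's influence."""
    return {t: 1.0 / len(tags) for t in tags}


# Toy model; all numbers are made up for the example.
tags = ["NOUN", "VERB"]
sentence_initial = {"NOUN": 0.9, "VERB": 0.1}  # full sentences rarely start with a verb
transition = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emission = {"NOUN": {"runs": 0.2, "home": 0.6}, "VERB": {"runs": 0.5, "home": 0.1}}

fragment = ["runs", "home"]
with_sentence_prior = viterbi(fragment, tags, sentence_initial, transition, emission)
with_uniform_prior = viterbi(fragment, tags, flatten_initial(tags), transition, emission)
```

With the sentence-start prior the fragment is tagged NOUN NOUN; with equal initial probabilities the first word is tagged as a verb. A fragment need not begin where a sentence would, so the full-sentence prior can mislead the tagger.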
  • the present invention is directed to a method for storing input information in an information retrieval system database wherein a plurality of information subject categories are provided. A plurality of subject lexicons are provided, each subject lexicon of the plurality of subject lexicons corresponding to an information subject category of the plurality of information subject categories. Each subject lexicon contains information representative of its corresponding information subject category. The input information is compared with the subject lexicons and the input information is stored in a selected information subject category according to the comparing of the input information with the subject lexicons.
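A minimal sketch of lexicon-based subject assignment follows. The comparison rule used here (count the words shared between the input and each lexicon, then pick the largest overlap) is an assumption; the excerpt does not say how the comparison is scored. The lexicon contents are likewise invented.

```python
def assign_subject(text, lexicons):
    """Pick the subject whose lexicon shares the most words with the text."""
    words = set(text.lower().split())
    scores = {subject: len(words & terms) for subject, terms in lexicons.items()}
    return max(scores, key=scores.get)


# Hypothetical lexicons, one per subject category.
lexicons = {
    "sports": {"team", "score", "league", "coach"},
    "science": {"experiment", "theory", "cell", "energy"},
}
category = assign_subject("The coach praised the team after the final score", lexicons)
```

The input is then stored under the returned category, here `"sports"`.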
  • the present invention is directed to a method for storing information in an information retrieval system having a database for retrieval of the input information in response to a query
  • Text information representative of text is received for storing in the system
  • Image information representative of an image is also received for storing in the system
  • image text information representative of text associated with the image information is received
  • the image information is stored in an image information format
  • the text information and the image text information are stored in a common text information format whereby the format of the stored text information is identical to the format of the stored image text information
  • the text information and image text information are searched in the common text information format and the text information and image text information are identified in response to a single query
  • the image information associated with the retrieved image text information is selected and the selected image information is retrieved whereby the text information and the image information are retrieved in accordance with the same query
  • the present invention is directed to a method for searching a database of an information retrieval system in response to a query having at least one query word with a part of speech, for applying the query word to the database and selecting information from the database according to the query word
  • a semantic network is provided for determining expansion words to expand the search of the database in response to the query word.
  • the part of speech of the selected query word is determined
  • the selected query word is applied to the semantic network to provide one or more query expansion words in response to the selected query word
  • the part of speech of the query expansion word is determined
  • the query expansion word is applied to the database in accordance with the part of speech of the selected query word and the part of speech of the query expansion word
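One plausible reading of part-of-speech-aware expansion is sketched below. The excerpt does not state the rule for combining the two parts of speech, so the rule used here (keep an expansion word only when its part of speech matches that of the query word) is an assumption. The network contents are also hypothetical.

```python
# Hypothetical semantic network: word -> related (word, part of speech) pairs.
SEMANTIC_NETWORK = {
    "plant": [("flower", "NOUN"), ("factory", "NOUN"), ("sow", "VERB")],
}


def expand(word, pos, network=SEMANTIC_NETWORK):
    """Return expansion words whose part of speech matches the query word's."""
    return [w for w, w_pos in network.get(word, []) if w_pos == pos]


noun_expansions = expand("plant", "NOUN")
verb_expansions = expand("plant", "VERB")
```

Filtering this way keeps "sow" out of a search for the noun "plant", which is the kind of irrelevant match the background section attributes to unrestricted semantic networks.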
  • the present invention is directed to a method for performing a search of a database in an information retrieval system in response to a query having at least one query word with a query word weight and for applying the query word to the database and selecting information from the information retrieval system in accordance with the query word.
  • a query word is selected and assigned a weight.
  • the weight is adjusted depending on whether the query word is a proper noun or slow word.
  • the adjusting can be an increase or a decrease in the weight.
  • Information is selected from the information retrieval system in accordance with the adjusted weight.
  • the present invention is directed to a method for searching a database of an information retrieval system in response to a query having a query length of at least one word, for applying the query word to the database and selecting information from the database according to the query word.
  • the query is received and the length of the query is determined.
  • Information is selected from the database according to the query.
  • the relevance of the selected information is determined according to matches between the query and the information.
  • the determined relevance of the selected information is adjusted according to the length of the query.
  • the present invention is directed to a method for searching an information retrieval system having a database containing a plurality of documents from a plurality of document sources in response to a query from a user.
  • a document log table is provided for tabulating document information of documents selected by the user in response to a query from the user.
  • the query is received from the user and a document is selected by the user in response to the received query.
  • the document log table is adjusted in response to the selecting of the document.
  • the adjusted log table can be used to determine royalties.
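The log-table idea can be sketched as a per-document counter that is later rolled up by publisher. Everything concrete here is an assumption for illustration: the `DocumentLog` class, the flat per-retrieval rate, and the use of integer cents are not taken from the source, which only says the adjusted table can be used to determine royalties.

```python
from collections import Counter


class DocumentLog:
    """Counts user selections per document and totals them per publisher."""

    def __init__(self, publisher_of):
        self.publisher_of = publisher_of  # document id -> publisher name
        self.retrievals = Counter()       # document id -> number of selections

    def record_selection(self, doc_id):
        self.retrievals[doc_id] += 1

    def royalties(self, cents_per_retrieval):
        owed = Counter()
        for doc_id, count in self.retrievals.items():
            owed[self.publisher_of[doc_id]] += count * cents_per_retrieval
        return dict(owed)


log = DocumentLog({1: "Publisher A", 2: "Publisher B", 3: "Publisher A"})
for selected in (1, 1, 3, 2):
    log.record_selection(selected)
owed = log.royalties(cents_per_retrieval=5)
```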
  • Figure 1 is a simplified block diagram showing an information retrieval system in accordance with a preferred embodiment of the present invention.
  • Figure 2 is a simplified process flow diagram illustrating a user session which may be performed with the information retrieval system shown in Figure 1, in accordance with a preferred embodiment of the present invention.
  • Figure 3 is a more detailed block diagram showing an information retrieval system in accordance with a preferred embodiment of the present invention.
  • Figure 4 is a more detailed process flow diagram illustrating a user session which may be performed with the information retrieval system shown in Figure 3, in accordance with a preferred embodiment of the present invention.
  • Figure 4A is a diagram illustrating an exemplary search results list displayed in an open window on a user's personal computer, in accordance with a preferred embodiment of the present invention.
  • Figure 4B is an exemplary diagram illustrating first and second open windows on a user's personal computer which respectively display text and video information corresponding to document and multi-media files selected by the user for retrieval, in accordance with a preferred embodiment of the present invention.
  • Figure 4C is an exemplary diagram illustrating first and second open windows on a user's personal computer which respectively display text and video information corresponding to document and multi-media files selected by the user for retrieval, and a composite document window in which the user has built a composite document based on the text and video information in the first and second windows, in accordance with a preferred embodiment of the present invention.
  • Figure 5 is a diagram illustrating preferred data structures for storing a document information directory table, a dependent image table, and publisher information table, in accordance with a preferred embodiment of the present invention.
  • Figure 5A is a diagram illustrating a preferred data structure for implementing a document index database, in accordance with a preferred embodiment of the present invention.
  • Figure 5B is a diagram illustrating a preferred data storage format for implementing an image/text database, in accordance with a preferred embodiment of the present invention.
  • Figure 6 is a block diagram illustrating the operation of software systems for implementing the session and query managers shown in Figure 4, in accordance with a preferred embodiment of the present invention.
  • Figure 6A is a state flow diagram showing the operation of a session manager software system, in accordance with a preferred embodiment of the present invention.
  • Figure 6B is a flow diagram showing the operation of a search engine software system, in accordance with a preferred embodiment of the present invention.
  • Figure 7A is a block diagram of a hidden Markov model suitable for parsing full sentences.
  • Figure 7B is a block diagram of a hidden Markov model for parsing sentence fragments, in accordance with a preferred embodiment of the present invention.
  • Figure 8A is a table of relevance normalization values for normalizing relevance scores output by a search engine, in accordance with a preferred embodiment of the present invention.
  • Figure 8B is a graph illustrating a system for normalizing relevance scores output by a search engine, in accordance with a preferred embodiment of the present invention.
  • Figure 9 is a block diagram representation of the data preparation component of the information retrieval system of Figure 3, in accordance with a preferred embodiment of the present invention.
  • Figure 9A is a block diagram representation of data flows within the data preparation component of Figure 9, in accordance with a preferred embodiment of the present invention.
  • Figure 10 is a block diagram representation of an automatic subjecting system for automatically determining the subject category of input documents, in accordance with a preferred embodiment of the present invention.
  • Figure 11 is a process flow representation of a method for generating subject lexicons for use in the automatic subjecting system of Figure 10, in accordance with a preferred embodiment of the present invention.
  • Figure 12 is a block diagram of a system for generating subject lexicons for use in the automatic subjecting system of Figure 10, in accordance with a preferred embodiment of the present invention.
  • Figure 13 is a representation of data structures within an accounting database, in accordance with a preferred embodiment of the present invention.
  • the information retrieval system 100 includes a user station 102 for searching information files which have been collected from various publisher sources 112 and stored in data center 110.
  • the user station 102 includes a personal computer (PC) 104 and user software 106 which resides on PC 104.
  • User software 106 includes a graphical user interface (shown generally in Figures 4A, 4B and 4C).
  • the user station 102 provides search queries by way of a communications channel 108 (such as, for example, a large volume public network or the Internet) coupled to the data center 110.
  • the data center 110 includes session server 114 which includes means for receiving a search query from user station 102, means for sending the search query to a query server 116, means for receiving search results information from the query server 116, means for sending a search results list representative of the search results information across communications channel 108 to the user station 102, means for receiving a document retrieval request transmitted from user station 102 over communications channel 108 to session server 114, and means for retrieving a document from database 118 in response to the retrieval request and transmitting a file representative of the document to user station 102 over communications channel 108
  • the query server 116 at data center 110 includes means for receiving a search query from the session server 114, searching means for searching a document index database 117 (shown in Fig.
  • Data center 110 also includes a library database 118 for storing text, image, audio or other multi-media information representative of files provided by a plurality of publishers 112.
  • session server 114 retrieves (from library 118) documents identified by a search query and selected by a user of user station 102 for retrieval, and then transmits the selected documents to the user station 102 over channel 108.
  • a simplified process flow diagram illustrating a user session 200 which may be performed with information retrieval system 100 shown in Figure 1, in accordance with a preferred embodiment of the present invention.
  • the user station 102 communicates to data center 110 (via channel 108) a description of the information that a user of user station 102 would like to identify at data center 110. More specifically, in step 202 a user of user station 102 sends a "natural language search query" to data center 110.
  • the term "natural language search query" is used to refer to a question, sentence, sentence fragment, single word or term which describes (in natural language form) a particular topic or issue for which a user of user station 102 seeks to identify information.
  • the query server 116 in data center 110 searches a document index database 117 (shown in Figs. 3 and 5A) coupled to the query server, and a list of files responsive to the search query is returned to user station 102, as shown in step 204.
  • the user of user station 102 may select for retrieval one of the listed files identified by data center 110.
  • session server 114 in data center 110 retrieves the full text, image, audio or other multi-media information associated with a selected file from the library database 118.
  • the user of user station 102 is given the option of keeping or discarding portions of the retrieved file. As explained more fully below, the user of user station 102 may use some or all of any retrieved file to compose a document directed to the search query.
  • FIG. 3 there is shown a more detailed block diagram showing the overall operation of information retrieval system 100, in accordance with a preferred embodiment ofthe present invention.
  • the information retrieval system shown in Figure 3 is a more detailed representation of the information retrieval system 100 shown in Figure 1.
  • data center 110 includes image/text database 118 for storing document files representative of each of the publisher documents 112 received in data center 110.
  • the image/text database 118 is coupled to session server 114 which, as explained more fully below, uses image/text database 118 to retrieve the full text/image of individual documents selected by user station 102 during a user session.
  • An accounting database 119 is also coupled to session server 114.
  • the accounting database 119 is used to store royalty and marketing information generated during operation of system 100.
  • a document index database 117 is coupled to query server 116. As explained more fully below in connection with Figure 4, document index database 117 contains a list of search terms corresponding to potential search terms which may appear in a search query.
  • For each search term listed in the document index database 117, document index database 117 stores the document identification number corresponding to each document file (stored in image/text database 118) that includes that search term, along with location information corresponding to the location of the search term in each such document file
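The index layout described here (term, then document identification number, then locations within that document) corresponds to what is commonly called a positional inverted index. The sketch below is a minimal illustration: the nested-dictionary representation, the whitespace tokenizer, and the use of word offsets as "location information" are assumptions, since the preferred data structure is defined in a figure not reproduced in this excerpt.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each term to {document id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


documents = {101: "war in Vietnam", 102: "Vietnam war casualties in Vietnam"}
index = build_positional_index(documents)
vietnam_postings = {doc_id: list(pos) for doc_id, pos in index["vietnam"].items()}
```

Looking up `"vietnam"` yields both document numbers together with the word offsets at which the term occurs, which is the information a proximity search needs.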
  • Although database 118 stores document files representative of textual documents, still digital video images, motion digital video sequences, and digital audio sequences, it will be understood by those skilled in the art that any other type of multi-media object file, such as, for example, multi-media object files containing graphical images, may also be stored on database 118.
  • server when used in conjunction with "session" or "query", is used to refer to a physical machine formed from at least one computer processor having associated memory and software installed thereon for executing the functions to be performed by the server.
  • the hardware platform used for implementing session server 114 consists of a Tandem® 4412 computer having 2 processors, 256 MB of memory, a 2 GB system disk, and a 2 GB RAID disk; a flow diagram illustrating the operation of a preferred software system 114a for implementing session server 114 on this hardware platform is shown in Figure 6 and discussed in connection with Figures 4 and 6A below
  • the hardware platform used for implementing query server 116 consists of a Tandem® 4412 computer having 2 processors, 1 GB of memory, a 2 GB system disk, and an 8 GB RAID disk
  • a flow diagram illustrating the operation of a preferred software system 116a for implementing query server 116 on this hardware platform is shown in Figure 6 and discussed in connection with Figures 4 and 6B below
  • specific hardware is disclosed herein for implementing session server 114 and query server 116, it will be understood by those skilled in the art that other suitable hardware platforms may alternatively be used to implement servers
  • This "hardware scalability" allows data center 110 to handle an increasing number of user stations 102 simply by adding further processors to the existing hardware used for servers 114, 116, without modification of the software running on such hardware.
  • the session manager 114 attends to various accounting and subscriber services.
  • information required for performing the accounting and subscriber services is stored in a separate accounting database 119.
  • session manager 114 functions in conjunction with enrollment means 120 and authentication means 122 which are also coupled to accounting database 119.
  • enrollment means 120 functions to initiate new (first time) users into data center 110.
  • Enrollment means 120 "enrolls" each new user into data center 110 by causing subscriber information representing the new user to be entered into accounting database 119. On subsequent accesses to data center 110 by a previously enrolled user, authentication means 122 determines that the user is already enrolled in the data center 110 by accessing records stored within the accounting database 119.
  • Data preparation within the document preparation/accounting means 126 of data center 110 includes means for document filtering, document loading and indexing of new documents received by the data center 110 from a plurality of document publishers 112.
  • Documents received from a publisher 112 are in a publisher-specific format.
  • a document received from a publisher 112 is converted to a universal publisher processing format.
  • the universal publisher processing format is a single format used by data center 110 for storing and retrieving all documents received from various publishers 112.
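Conversion into a single internal format can be sketched as one converter per publisher feeding a common record shape. Both publisher formats below, the converter names, and the `title`/`body` fields of the common record are hypothetical; the excerpt does not describe the actual publisher formats or the fields of the universal publisher processing format.

```python
import json


def from_pipe_format(raw):
    """Hypothetical publisher A: 'title|body' on one line."""
    title, body = raw.split("|", 1)
    return {"title": title.strip(), "body": body.strip()}


def from_json_format(raw):
    """Hypothetical publisher B: JSON with 'headline' and 'text' keys."""
    data = json.loads(raw)
    return {"title": data["headline"], "body": data["text"]}


# One converter per publisher; everything downstream sees only the common shape.
CONVERTERS = {"publisher_a": from_pipe_format, "publisher_b": from_json_format}


def to_universal(publisher, raw):
    return CONVERTERS[publisher](raw)


doc_a = to_universal("publisher_a", "Moon Landing | Apollo 11 touched down in 1969.")
doc_b = to_universal("publisher_b", '{"headline": "Jazz", "text": "A short history."}')
```

Because storage, indexing and retrieval then operate on one record shape, the per-publisher compatibility cost is paid once at load time and not on every search.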
  • a user station 102 accesses the data center 110 by establishing a communications link over communications channel 108 between PC 104 and FDDI 124 within data center 110.
  • communications channel 108 may consist of a communications link formed over a public network such as the Internet.
  • communications channel 108 may consist of a communications link formed between PC 104 and FDDI 124 over a commercial network.
  • commercial networks such as, for example, the Prodigy® network, the CompuServe® network, or the Microsoft® network, may be used to establish a communications channel 108 for linking PC 104 and FDDI 124.
  • processing proceeds to step 310, where the authentication means 122 attempts to authenticate the user of user station 102. More particularly, based on information input by the user, authentication means 122 determines whether the user is authentic, i.e., whether the user is authorized to access data center 110. If the user is authentic, but has not accessed data center 110 previously, processing proceeds to step 315, where enrollment means 120 optionally downloads a copy of software system 106 onto PC 104, and optionally executes software 106.
  • the enrollment means 120 requests the user to provide personal "enrollment information" about him/herself such as, for example, the user's full name, address, grade level, gender, occupation, the occupations of the user's parents, the number of brothers and/or sisters of the user, the type of computer used by the user, the user's modem speed, the display capabilities of the user's display, the size of the memory of the user's PC, and the identity of the communications link (e.g., the Internet, the Prodigy® network, the CompuServe® network, or the Microsoft® network) used for accessing the system.
  • Each new user enrolled by enrollment means 120 is assigned a unique user identification number.
  • enrollment means 120 and authentication means 122 are implemented in software on a separate machine which is physically distinct from the processor(s) used for implementing session server 114
  • enrollment means 120 and authentication means 122 may be implemented in software as part of session server 114.
  • enrollment means 120 and authentication means 122 may be implemented on the same processor(s) used for implementing session server 114.
  • step 320 software 106 on PC 104 waits for the user to input a search query.
  • the search query represents a topic, issue or question about which the user desires to locate information.
  • the search query is provided by the user in a natural language format.
  • the query may be entered in the form of a complete sentence, or as a phrase or sentence fragment, such as, "Vietnam casualties".
  • the query may be entered by the user in a non-natural language format wherein the user identifies one or more key words to be searched, whether the key words should be searched conjunctively or disjunctively, and, for key words to be searched conjunctively, the distance between which the key words must fall relative to one another to come within the scope of the search.
  • processing proceeds to step 325 where software system 106 spell checks the search query.
  • the purpose of spell checking the search query prior to sending the query to data center 110 is to prevent the wasting of resources at data center 110 that would likely result if data center 110 attempted to process a search query with a spelling error.
  • processing proceeds to step 330, where the spell checked search query is transmitted by PC 104 over communications channel 108 to session server 114.
  • step 335 Upon receipt of the query by session server 114, processing proceeds to step 335, during which the search query is transmitted from the session server 114 to the query server 116 which processes the search query. More particularly, in this step, query server 116 scans its document index database 117 based on the user's search query in order to identify document identification numbers representing textual documents and multi-media files relevant to the search query.
  • the document index database 117 searched by query server 116 contains a list of search terms corresponding to potential search terms which may appear in or be related to words or terms in a search query. For each search term listed in the document index database 117, document index database 117 stores the document identification number corresponding to each document file (stored in database 118) that includes that search term, along with location information corresponding to the location of the search term in each such document file. Further details regarding the structure and operation of document index database 117 are shown in Figure 5B and discussed later in this specification.
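
The index structure described above (search term, document identification number, term location) can be sketched as a small positional inverted index. This is only an illustration under stated assumptions: the function names `build_index` and `lookup`, whitespace tokenization, and the sample document numbers are hypothetical and are not taken from the specification.

```python
from collections import defaultdict


def build_index(documents):
    """Map each search term to {document ID: [word positions]}.

    `documents` maps a document identification number to the text that is
    indexed for it: the body of a textual document, or the associated text
    field of a multi-media file.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Assumption: simple lowercase whitespace tokenization.
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def lookup(index, term):
    """Return {document ID: [positions]} for a term, or an empty dict."""
    return index.get(term.lower(), {})


documents = {
    1001: "Vietnam war casualties by year",        # textual document
    2002: "Photograph of Vietnam memorial wall",   # text associated with an image
}
index = build_index(documents)
```

Because both kinds of document file are indexed through text, a single lookup such as `lookup(index, "vietnam")` returns entries for the textual document and the image alike.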
  • document file is used in the present application to mean a file in database 118 which stores either a textual document or, alternatively, a multi-media record such as, for example, a still image, a sequence of motion digital video frames, or a sequence of digital audio frames.
  • each multi-media file in database 118 is stored along with a separate portion of text related to the multi-media record (referred to below as the "associated text field" or "image text-type information field" corresponding to each such multi-media file).
  • This associated text field is used as the basis for generating document index information (for storage on document index database 117) corresponding to each multi-media file stored in database 118.
  • document index database 117 stores the document identification number corresponding to each multi-media file having an associated text field which includes that search term, along with location information corresponding to the location of the search term in the associated text field.
  • document index database 117 stores index information for both textual document files and multi-media files
  • the scan of document index database 117 by query server 116 in step 335 simultaneously identifies document identification numbers representing both textual documents and multi-media records relevant to the search query.
  • the associated text or image text-type information field associated with a multi-media record such as, for example, a still image
  • the identification numbers of the textual documents and multi-media files identified during the search are then used by the query server 116 to form a search results list which contains entries representing the document identification number of each textual document and multi-media file identified in the search.
  • query server 116 performs a relevance ranking on each of the textual documents and multi-media files identified by the search by generating a relevance score corresponding to each of the entries on the search result list.
  • This relevance scoring operation is performed by query server 116 based on, among other things, the term location information contained in index database 117. More particularly, the relevance score for each document file is based in part on the relative proximity within the document file of terms forming the search query, i.e., the closer various search terms are to one another in the document file, the higher the relevance score. Of course, for textual documents this proximity evaluation is performed based on the location of the search terms within the text document itself, whereas for multi-media files, this proximity evaluation is performed based on the location of the search terms within the associated text fields of such multi-media files.
  • the query server 116 forms a relevance ordered search result list by ordering the document identification numbers (or entries) in the search result list in accordance with their respective relevance rankings such that an entry having a highest relevance ranking represents a first entry on the relevance ordered search result list.
  • entries corresponding to textual documents and entries corresponding to multi-media files will typically be interspersed within the relevance ordered search result list.
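
The proximity-based ranking described above can be illustrated with a minimal sketch. The specification does not give a scoring formula, so the `1 / (1 + span)` score, the rule that every query term must be present, and the names `proximity_score` and `rank` are all assumptions made for illustration.

```python
from itertools import product


def proximity_score(postings, query_terms):
    """Score one document file: higher when the query terms sit closer together.

    `postings` maps term -> list of word positions within the document body
    (or within the associated text field of a multi-media file).
    """
    if not query_terms:
        return 0.0
    position_lists = [postings.get(term, []) for term in query_terms]
    if any(not positions for positions in position_lists):
        return 0.0  # assumption: every query term must occur in the file
    # Smallest window (in words) covering one occurrence of each term.
    span = min(max(combo) - min(combo) for combo in product(*position_lists))
    return 1.0 / (1 + span)  # assumed formula, not from the specification


def rank(doc_postings, query_terms):
    """Return [(doc_id, score)] ordered from highest to lowest score."""
    scored = [(doc_id, proximity_score(postings, query_terms))
              for doc_id, postings in doc_postings.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


doc_postings = {
    1001: {"vietnam": [0], "casualties": [2]},   # textual document
    2002: {"vietnam": [2], "casualties": [9]},   # multi-media file's associated text
    3003: {"vietnam": [4]},                      # lacks one query term
}
ranking = rank(doc_postings, ["vietnam", "casualties"])
```

Textual documents and multi-media files are scored by the same function, which is why the two kinds of entry end up interspersed in the ordered list.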
  • the document index searching and relevance scoring operations performed by query server 116 are preferably implemented in part using commercially available searching software such as the Conquest™ search engine program marketed by Excalibur™ Technologies.
  • step 340 following the generation of the relevance ordered search result list by query server 116, the relevance ordered search result list is transmitted from query server 116 to session server 114.
  • the search query initially provided by the user in step 320 was a recursive search query (i.e..
  • session server 114 transmits to user station 102 via channel 108; otherwise, if the query initially provided in step 320 was not a recursive search query, session server 114 simply transmits the relevance ordered search results list from query server 116 directly to user station 102. Following receipt of the relevance ordered search results list by user station 102, session server 114 retrieves bibliographical information (from database 118) corresponding to the document identification numbers in the search results list, and this bibliographical information is then transmitted to user station 102 over channel 108.
  • the bibliographical information retrieved and transmitted in this step includes, for example, the document title, author, publisher name, publication date, publication name and edition.
  • a preferred data structure used for retrieving this bibliographical information from database 118 is discussed below in conjunction with Figure 5.
  • user PC 104 together with software 106 display the search results list (or a portion thereof), together with (i) the relevance scores corresponding to each document identification number on the search results list, (ii) bibliographical information corresponding to each document identification number on the search results list, and (iii) an indication whether the document file corresponds to a textual document or a multi-media file such as, for example, a file representing a still image.
  • This information is displayed in a display window on PC 104 in relevance ranked order such that the document file(s) with a highest relevance score is/are displayed at the top ofthe list.
  • Figure 4A shows an exemplary search results list displayed in an open window 341 on PC 104.
  • the image displayed in window 341 includes relevance scores 342, bibliographical information 343, readability information 343a, size information 343b, and a file type indicator 344 (for indicating whether the document file corresponds to a textual document or a multi-media file).
  • Relevance scores 342 are listed in descending order from highest to lowest, and files 345 representing textual documents and files 346 representing multi-media documents are interspersed in the list displayed in window 341.
  • the user may either select one of the documents listed in window 341 for retrieval, or may instead perform a recursive search or a new search.
  • step 345 After which the process is repeated from step 325. However, as described above, during this recursive searching operation a logical AND operation is performed between the document identification numbers on the current search results list and the document identification numbers on the previous search results list to form a new current search results list in step 340. If the user desires to perform a new (non-recursive) search, processing proceeds to step 325 and the process is again repeated.
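
The logical AND performed during recursive searching can be sketched as an intersection of document identification numbers. The function name `refine` and the choice to keep the scores and order of the current list are assumptions for illustration; the specification states only that the two lists are ANDed to form the new current list.

```python
def refine(previous_results, current_results):
    """AND the current search results with the previous results.

    Both arguments are relevance ordered lists of (doc_id, score) pairs.
    Assumption: the surviving entries keep the order and scores of the
    current list.
    """
    previous_ids = {doc_id for doc_id, _ in previous_results}
    return [(doc_id, score)
            for doc_id, score in current_results
            if doc_id in previous_ids]


previous = [(1001, 0.9), (2002, 0.7), (3003, 0.4)]
current = [(4004, 0.95), (3003, 0.8), (1001, 0.6)]
new_current = refine(previous, current)
```

Document 4004 is dropped because it was not on the previous list, so each recursive query can only narrow the result set.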
  • step 340 if, after step 340, the user wishes to retrieve a document file for display on PC 104, processing proceeds to step 350 where the user selects one of the documents in the search list for display, preferably by "clicking" with a computer mouse on the portion of window 341 wherein information corresponding to the selected document file is displayed.
  • PC 104 transmits a signal representative of the selected document file to session server 114 over channel 108.
  • session server Upon receipt of this signal in step 355, session server 114 retrieves from database 118 and transmits to PC 104 the text (if the selected document file corresponds to a textual document), image bitmap (if the selected document file corresponds to a still image), sequence of video frames (if the selected document file corresponds to a motion video sequence) or sequence of digital audio frames (if the selected document file corresponds to a digital audio sequence) associated with the selected document file.
  • step 360 upon receipt of such text or multi-media information from session server 114, PC 104 together with software 106 display the text or multi-media information retrieved by session server 114 in a second display window open on PC 104.
  • FIG. 4B is an exemplary drawing showing first and second windows 347, 348 which respectively display text information corresponding to a text document file 345 and video image information corresponding to multi-media file 346.
  • the information displayed in windows 347 and 348 may be displayed simultaneously with the search results list shown in window 341.
  • software 106 is adapted to permit the user to also simultaneously open a document composition window 348a (shown in Figure 4C) on PC 104.
  • the user may cut/copy/paste both text and video information from windows 347, 348 (or from further display windows not shown but which represent other selected document files) into the document composition window 348a in order to form a composite document that pertains to the search query input by the user in step 320.
  • the composite document may, for example, include text information from one or more of the document files selected by the user in step 350, as well as multi-media information such as, for example, still images.
  • using the save and print buttons on the graphical user interface shown in Figure 4C, the user may selectively save (on PC 104) and/or print text or image information from windows 347, 348 or 348a.
  • session server 114 retrieves various document/image information from database 118.
  • a document information directory table is stored on database 118.
  • the document information directory table includes an individual file header record 400 formed of approximately 25 separate information fields.
  • Record 400 includes a document identification number field 401 for storing a unique document identification number that is associated with each textual document or multi-media file at the time the textual document or multi-media file is prepared for input into data center 110. A discussion of a preferred system for preparing textual documents and multi-media files for input into data center 110 is set forth later in this specification.
  • Record 400 also includes a publisher identification number field 402 for storing a unique publisher identification number representing the publisher of the textual document or multi-media file associated with record 400.
  • the publisher identification number stored in field 402 corresponds to an entry point into a publisher information table which includes a plurality of publisher information records 440.
  • the publisher information table includes an individual record 440 formed of approximately 10 separate information fields for storing information about the publisher associated with the document or file.
  • Record 400 also includes several fields for storing bibliographical information about the textual document or multi-media file associated with record 400.
  • the bibliographical information fields include a title field 403 for storing the title of the textual document or multi-media file associated with record 400, an author last name field 404 and an author first name field 405 for storing the name of the author of the textual document or multi-media file associated with record 400, a publisher name field 422 for storing the name of the publisher of the textual document or multi-media file associated with record 400, a publication date field 423 for storing the publication date of the textual document or multi-media file associated with record 400, a publication name field 424 for storing a publication name in which the textual document or multi-media file associated with record 400 appeared, and an edition field 425 for storing a number representing the edition of the publication in which the textual document or multi-media file associated with record 400 appeared.
  • session server 114 retrieves (from database 118) bibliographical information
  • Record 400 also includes several fields for storing copyright information about the textual document or multi-media file associated with record 400.
  • the copyright information fields include a copyright date field 406 for storing a copyright date assigned to the textual document or multi-media file associated with record 400, a copyright holder field 407 for storing the copyright owner of the textual document or multi-media file associated with record 400, and a copyright message field 408 for storing a copyright message for the textual document or multi-media file associated with record 400.
  • the information stored in the copyright message field 408 associated with the selected document is retrieved (from database 118) by session server 114 and transmitted to PC 104.
  • the copyright message stored in field 408 is then displayed on user PC 104 (in step 360) together with the text, image bitmap, or sequence of video frames associated with the selected document.
  • Record 400 also includes a source type field 409 for storing information classifying the textual document or multi-media file associated with record 400 into one of a plurality of "source types."
  • the source types reflect not only the source from which the document may have originated, but also whether the document is a text file or, alternatively, a multi-media file.
  • a list of exemplary source types is shown generally in box 349 of Figure 4A, and includes newspapers, maps, pictures (e.g., still or motion video images), magazines, books and TV/radio (transcripts).
  • the system 126 (discussed more fully below) for preparing textual documents and multi-media files for input into the data center 110 assigns a source type to each textual document or multi-media file at the time it is input into data center 110.
  • the source types shown in box 349 may be selected by the user in order to focus a query search on a subset of documents in database 118 associated with one or more source types selected by the user.
  • Record 400 also includes a document readability field 410 for storing a readability score (or reading grade level) for the textual document or multi-media file associated with record 400.
  • this readability score is determined (by system 126) for each document to be stored in database 118 using a standard readability algorithm such as, for example, the Flesch-Kincaid algorithm.
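
As an illustration of the kind of readability score stored in field 410, the standard Flesch-Kincaid grade level formula is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below takes the three counts as inputs; how system 126 counts words, sentences and syllables is not described in the specification, and the function name is hypothetical.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Return the Flesch-Kincaid reading grade level for the given counts."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# A 100-word passage with 5 sentences and 150 syllables:
# 0.39 * 20 + 11.8 * 1.5 - 15.59 = 9.91, roughly a tenth-grade level.
grade = flesch_kincaid_grade(100, 5, 150)
```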
  • session server 114 retrieves (from database 118) a readability score (from field 410) for each document identification number included on the search results list transmitted from query server 116, and then transmits this readability information (along with bibliographical information for each such document identification number) to PC 104 where it is displayed as readability information 343a.
  • Record 400 further includes a subject information field 411 for storing information classifying the textual document or multi-media file associated with record 400 into one or more of a plurality of "subjects."
  • the possible subjects into which a document may be classified include, for example, sports, politics, literature, etc.
  • the system 126 (discussed more fully below) for preparing textual documents and multi-media files for input into the data center 110 assigns one or more subject types to each textual document or multi-media file at the time it is input into data center 110.
  • system 126 assigns a subject confidence score indicating the degree to which the document relates to the subject.
  • search engine 142 may be adapted to use the search subjects selected by the user, together with the subject confidence scores associated with documents identified during a query search, to calculate relevance scores for the documents identified during the query search.
  • Record 400 also includes a document type field 412 for storing a "document type" assigned to the textual document or multi-media file associated with record 400.
  • the document type field 412 reflects whether the document is a text file, a still image, a sequence of motion video images, or a sequence of digital audio images.
  • the system 126 (discussed more fully below) for preparing textual documents and multi-media files for input into the data center 110 assigns a document type to each textual document or multi-media file at the time it is input into data center 110.
  • session server 114 retrieves (from database 118) a document type (from field 412) for each document identification number included on the search results list transmitted from query server 116, and then transmits this document type information (along with bibliographical information for each such document identification number) to PC 104 where it is displayed in the form of a file type indicator 344.
  • Record 400 also includes a last retrieval date field 413 for storing the date on which the textual document or multi-media file associated with record 400 was last retrieved for a user by document retrieval system 136 (discussed below in conjunction with Figure 6).
  • the information stored in the last retrieval date field is updated each time document retrieval system 136 retrieves the textual document or multi-media file associated with record 400 in response to a user request.
  • Record 400 further includes three fields for storing information about when, and how often, the textual document or multi-media file associated with record 400 has been installed on database 118.
  • an original install date field 414 is provided for storing the date the textual document or multi-media file associated with record 400 was first stored on database 118
  • a last install date field 415 for storing the most recent date on which the textual document or multi-media file associated with record 400 was installed on database 118, and
  • an install count field 416 for storing the number of times the textual document or multi-media file associated with record 400 has been installed on database 118.
  • System 126 for preparing textual documents and multi-media files for input into the data center 110 updates fields 414, 415 and 416 for each textual document or multi-media file at the time it is input into data center 110 and loaded onto database 118.
  • Record 400 further includes a filename field 417 for storing a filename in database 118 wherein either (i) the full body of the textual document represented by record 400 (if record 400 represents a textual document), or (ii) a separate body of text associated with the digital video or audio file represented by record 400 (if record 400 represents a multi-media file) is stored.
  • the full bodies of the textual documents are stored in the text-type information fields 470 (shown in Figure 5B).
  • the separate bodies of text associated with such records are stored in image text-type information fields 460 (shown in Figure 5B).
  • Offset field 418 stores a location pointer indicating the starting location of either the text-type information or image text-type information field corresponding to the filename stored in field 417.
  • offset field 418 points to the beginning of a text-type information field 470 wherein the full body of the textual document associated with record 400 is stored;
  • offset pointer 418 points to the beginning of an image text-type information field 460 wherein the associated text field corresponding to the record 400 is stored.
  • Record 400 also includes a document size field 419 for storing a file size (in bytes) of the document associated with record 400.
  • session server 114 retrieves (from database 118) a size (from field 419) for each document identification number included on the search results list transmitted from query server 116 corresponding to a textual document, and then transmits this size information (along with bibliographical information for each such document identification number) to PC 104 where it is displayed as size information 343b.
  • record 400 includes Dewey decimal call number field 420 and Library of Congress call number field 421, for respectively storing any Dewey decimal or Library of Congress call numbers corresponding to the textual document or multi-media file represented by record 400.
  • the dependent image table includes a separate dependent image record 430 for each record 400 in the document information directory table which represents a multi-media file.
  • each record 400 having a document type field 412 corresponding to either a still image, a sequence of motion video images, or a sequence of digital audio images has a corresponding dependent image record 430 stored in the dependent image table.
  • Record 430 includes a document identification number field 431 for storing the document identification number corresponding to the multi-media file represented by the record 430.
  • session server 114 For a given multi-media file stored in database 118, the same document identification number used in field 401 of record 400 is repeated in field 431 of record 430. Thus, once session server 114 has a document identification number and type information (from document type field 412) that a particular document identification number corresponds to a multi-media file, session server 114 can access further information about the multi-media file simply by using the document identification from record 400 as an entry pointer to access a further record 430 in the dependent image table.
  • Each record 430 includes a filename field 432 for pointing to a file in database 118 wherein only digital video or audio information represented by record 430 is stored. For each record 430, this digital video or digital audio information is stored in an image-type field 490 (shown in Figure 5B) on database 118. Offset field 433 stores a location pointer indicating the starting location of the image-type field 490 in the file identified by field 432. Record 430 also includes a document size field 434 for storing a file size (in bytes) of the multi-media file associated with record 430.
  • session server 114 retrieves (from database 118) a size (from field 434) for each document identification number included on the search results list transmitted from query server 116 which corresponds to a multi-media file, and then transmits this size information (along with bibliographical information for each such document identification number) to PC 104 where it is displayed as size information 343b.
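
The relationship between file header record 400 and dependent image record 430 can be sketched with two tables keyed by document identification number. Only a handful of the fields are modeled, and the class names, type strings, file names and sample values are hypothetical; the field numbers in the comments are the ones given in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeaderRecord:            # stands in for record 400
    doc_id: int                # field 401
    title: str                 # field 403
    doc_type: str              # field 412: "text", "still_image", "video" or "audio"
    filename: str              # field 417
    offset: int                # field 418
    size: int                  # field 419 (bytes)


@dataclass
class DependentImageRecord:    # stands in for record 430
    doc_id: int                # field 431, same value as field 401
    filename: str              # field 432
    offset: int                # field 433
    size: int                  # field 434 (bytes)


directory_table = {
    1001: HeaderRecord(1001, "Casualty report", "text", "texts.dat", 0, 5200),
    2002: HeaderRecord(2002, "Memorial wall", "still_image", "captions.dat", 128, 64),
}
dependent_image_table = {
    2002: DependentImageRecord(2002, "images.dat", 4096, 310000),
}


def media_record(doc_id) -> Optional[DependentImageRecord]:
    """Use the document ID from record 400 as the entry pointer to record 430."""
    if directory_table[doc_id].doc_type == "text":
        return None
    return dependent_image_table[doc_id]


def display_size(doc_id) -> int:
    """Size shown to the user: field 419 for text, field 434 for multi-media."""
    media = media_record(doc_id)
    return media.size if media else directory_table[doc_id].size
```

For the image, the header record still points at its associated text (here a hypothetical `captions.dat`), while the dependent image record points at the image data itself.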
  • the publisher identification number stored in field 402 of each record 400 corresponds to an entry point into a publisher information table which includes a plurality of publisher information records 440.
  • Each publisher information record 440 includes several information fields for storing information about a publisher having textual documents or multi-media files stored on database 118.
  • each publisher information record 440 includes a publisher identification number field 441 for storing a unique identification number associated with the publisher represented by the record 440, fields 442-448 for storing name and address information associated with the publisher represented by the record 440, and fields 449-451 for storing information identifying a contact person at the publisher represented by the record 440.
  • Data structure 460 contains a list of search terms 461 corresponding to potential search terms which may appear in a search query. For each search term 461 listed in data structure 460, data structure 460 stores a document identification number 462 representing each document file (stored in database 118) that includes that search term, along with location information 463 corresponding to the location(s) of the search term in each such document file.
  • location information 463 represents the location of the search term 461 in either the text-type information field 470 or the image text-type information field 460 pointed to by the offset field 418 associated with the document identification number 462.
  • document identification number 462 represents a textual document
  • location information 463 will represent the location of the search term 461 in the actual body of the text document;
  • location information 463 will represent the location of the search term 461 in the associated text field corresponding to the multi-media file.
  • Session/Query Server Architecture. Referring now to Figure 6, there is shown a block diagram illustrating the operation of a session server software system 114a and a query server software system 116a for respectively implementing the session and query servers 114, 116 shown in Figure 4, in accordance with a preferred embodiment of the present invention.
  • session and query servers 114, 116 in data center 110 are adapted to simultaneously receive and process different search queries from multiple user stations 102.
  • session server software 114a includes a plurality of session managers 130, each of which is responsible for monitoring and directing a single active user session 300 corresponding to an individual user station 102
  • the corresponding session manager (i) sends queries received from the user station 102 to a query engine interface 134, (ii) retrieves a search results list (described above) from the query engine interface 134, (iii) retrieves bibliographical information corresponding to the documents identified in the search results list from database 118a and transmits such information to user station 102, (iv) retrieves text and multi-media files identified by user station 102 from database 118b using document retrieval system 136 and transmits such files to user station 102, (v) transmits information representing each document retrieved from database 118b and user identification information corresponding to the user station 102 that requested retrieval of that document to accounting manager 138, and (vi) updates a query log of all search queries handled through the session manager 130. An active
  • records 400 (and 430, if applicable) corresponding to the document file and representing information such as, for example, a document identification number, bibliographical data and publisher data are stored in database 118a together with one or more pointers (417, 418, 432, 433) corresponding to the location of the full text/image corresponding to the document file in database 118b.
  • document retrieval system 136 initially retrieves pointer information corresponding to the full text/image of the file from the Document Information Directory Table and Dependent Image Table on database 118a, based on the document identification number of the document file to be retrieved. Based on this pointer information, document retrieval system 136 then retrieves the full text/image corresponding to the file from database 118b.
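
The two-step retrieval described above (pointer lookup on database 118a, then content lookup on database 118b) can be sketched as follows. The in-memory dictionaries standing in for the two databases, the `(filename, offset, size)` pointer layout, and the function name `retrieve` are assumptions made for illustration.

```python
def retrieve(doc_id, pointer_table, storage):
    """Fetch the full content of a document file in two steps."""
    # Step 1: pointer lookup, standing in for the tables on database 118a.
    filename, offset, size = pointer_table[doc_id]
    # Step 2: content lookup, standing in for database 118b.
    blob = storage[filename]
    return blob[offset:offset + size]


# Hypothetical contents: one storage file holding a document at byte offset 4.
storage = {"texts.dat": b"AAAAVietnam reportBBBB"}
pointer_table = {1001: ("texts.dat", 4, 14)}
content = retrieve(1001, pointer_table, storage)
```

Keeping the pointers separate from the content means the directory tables stay small, and only the documents a user actually selects are read from the bulk store.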
  • Query server software 116a includes a plurality of search engines 142, each of which is responsible for (i) executing a search query passed to the search engine 142 by session server 114a, and (ii) returning a relevance ordered search results list corresponding to the query to session server 114a.
  • the search engines 142 are each coupled to document index database 117, and schedulers 144 monitor and queue the searches performed by search engines 142 on document index database 117. Parameters representing search queries and search results are transmitted between a search engine 142 and session server 114a using a search engine application program interface (API) 140.
  • Session manager software system 130 maintains the state of each ongoing user session in system 100. Entry into session manager software system 130 begins with initial/profile state 130a, which is invoked when a user login request is passed to data center 110 through communications channel 108. In the initial/profile state 130a, session manager software system 130 functions with enrollment means 120 and authentication means 122 to verify user authenticity and/or enroll new users into data center 110. The tasks performed by enrollment means 120 and authentication means 122 in state 130a correspond substantially to the enrollment and authentication functions described above in connection with steps 310 and 315 of user session 300.
  • session manager software 130 authenticates a user and verifies that there are no other user sessions corresponding to the user's identification number which are active on data center 110, session manager software system 130 enters a wait for user action state 130b, which waits for a user station 102 to request services from session manager 130. As each user request is received, session manager software system 130 transitions the session server 114 into an appropriate state, and then returns back to the wait for user action state 130b after the request has been serviced. In the preferred embodiment, entry into the document retrieval state 130d and the recursive query state 130e may occur only after query server 116 has processed a search query for the user station 102, and returned a valid search results list for the search query to session server 114.
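
The session states 130a through 130e and the transitions described above can be sketched as a small state machine. The class, method and request names are hypothetical, and the sketch assumes that every query returns a valid search results list; it is not the session manager software itself.

```python
from enum import Enum


class State(Enum):
    INITIAL_PROFILE = "130a"
    WAIT_FOR_USER_ACTION = "130b"
    QUERY = "130c"
    DOCUMENT_RETRIEVAL = "130d"
    RECURSIVE_QUERY = "130e"


# Hypothetical request names mapped to the state that services them.
REQUESTS = {
    "query": State.QUERY,
    "retrieve": State.DOCUMENT_RETRIEVAL,
    "recursive_query": State.RECURSIVE_QUERY,
}


class SessionManager:
    def __init__(self):
        self.state = State.INITIAL_PROFILE
        self.has_results = False          # set once a valid results list exists
        self.history = [self.state]

    def _enter(self, state):
        self.state = state
        self.history.append(state)

    def login(self, authenticated):
        """Leave the initial/profile state once the user is authenticated."""
        if self.state is State.INITIAL_PROFILE and authenticated:
            self._enter(State.WAIT_FOR_USER_ACTION)

    def handle(self, request):
        """Service one request, then return to the wait state."""
        if self.state is not State.WAIT_FOR_USER_ACTION:
            raise RuntimeError("session is not waiting for user action")
        target = REQUESTS[request]
        needs_results = target in (State.DOCUMENT_RETRIEVAL, State.RECURSIVE_QUERY)
        if needs_results and not self.has_results:
            raise RuntimeError("no valid search results list yet")
        self._enter(target)
        if target is State.QUERY:
            self.has_results = True       # assumption: the query succeeded
        self._enter(State.WAIT_FOR_USER_ACTION)


session = SessionManager()
session.login(authenticated=True)
try:
    session.handle("retrieve")            # rejected: no search has run yet
    blocked = False
except RuntimeError:
    blocked = True
session.handle("query")
session.handle("retrieve")
```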
  • the query state 130c is responsible for collecting search parameters input to the user station 102, sending the search parameters to the query server 116, retrieving a search results list from the query server 116, and then passing search results back to the user station 102.
  • the search parameters collected from the user station 102 and forwarded to query server 116 include, at a minimum, a query string representing a natural language query entered by the user.
  • the search parameters may also include one or more document source selections which the user of user station 102 may optionally make using box 349 (of Figure 4A).
  • Such document source selection(s) permit the user of user station 102 to focus the user's search only on document records having source type fields 409 corresponding to one or more selected document sources.
  • the search parameters may also include one or more document subject selections which the user of user station 102 may optionally make using box 349a (of Figure 4A).
  • Such document subject selection(s) permit the user of user station 102 to focus the user's search only on document records having subject information fields 411 corresponding to one or more selected subjects.
  • Upon successful completion of the search, the query state 130c is responsible for receiving a relevance ordered search result list from the query engine 116, and then passing a relevance score 342, bibliographical information 343, readability information 343a, size information 343b and a file type indicator 344 corresponding to each document identification number on the search result list to user station 102 for display in window 341.
  • Query state 130c is responsible for forming a "current search results list" by saving the search results list generated by query engine 116.
  • The "current search results list" may be used in recursive query state 130e to allow the user of user station 102 to perform recursive searching.
  • Query state 130c returns control to the wait for user action state 130b.
  • session manager software 130 enters document retrieval state 130d.
  • Session server 114 receives a signal from user station 102 representing the document identification number of a particular document from window 341 that the user wishes to retrieve into a window 347, 348.
  • State 130d is responsible for retrieving (using retrieval system 136) the full text, digital video or digital audio information corresponding to the selected document identification number from database 118, and then transmitting this document information back to user station 102. Once the appropriate document information is passed to user station 102, control is passed back to the wait for user action state 130b.
  • session manager software system 130 enters recursive query state 130e.
  • The recursive query state 130e is responsible for collecting a further set of search parameters input by the user of user station 102, sending these search parameters to the query server 116, and retrieving a search results list from the query server 116.
  • A logical AND operation is then performed on the document identification numbers in this search results list and the document identification numbers in the search results list previously saved (either during state 130c or 130e) in order to form a new current search results list which is then passed back to the user station 102.
  • The recursive query state 130e is responsible for passing a relevance score 342, bibliographical information 343, readability information 343a, size information 343b and a file type indicator 344 corresponding to each document identification number on the new current search result list to user station 102 for display in window 341.
  • Recursive query state 130e returns control to the wait for user action state 130b.
  • When the user of user station 102 desires to terminate a user session, session manager software 130 enters an exit state 130f, wherein session manager software 130 causes the active session manager component 132 to remove the user identification number corresponding to the user of user station 102 from the list of active user sessions.
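By way of illustration only, the logical AND performed in recursive query state 130e can be sketched as an intersection on document identification numbers. The function name, the list-of-pairs representation, and the choice to keep the ordering and relevance scores of the newer search are assumptions made for this sketch; the description above does not specify them.

```python
def recursive_query(current_results, new_results):
    """Intersect two result lists on document identification number.

    Each list holds (document_id, relevance_score) pairs. Keeping the
    order and scores of the newer search is an assumption of this sketch.
    """
    current_ids = {doc_id for doc_id, _score in current_results}
    return [(doc_id, score) for doc_id, score in new_results
            if doc_id in current_ids]


# Saved "current search results list" from an earlier query state.
current = [(101, 92), (205, 80), (317, 64)]
# Results of the further query entered in the recursive query state.
further = [(317, 88), (440, 75), (101, 51)]

current = recursive_query(current, further)
print(current)  # [(317, 88), (101, 51)]
```

Because the result of each recursive query replaces the saved list, repeated recursive queries can only narrow the set of documents.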
  • Referring now to Fig. 6B, there is shown a flow diagram illustrating the operation of a software system for implementing search engine 142, in accordance with a preferred embodiment of the present invention.
  • A query is applied to the query server 116.
  • All the individual terms in the query have an equal baseline weight.
  • All terms in the query are equally likely to contribute to the selection of a document by the query server 116.
  • The weights of the terms in the query can be adjusted relative to each other in accordance with the query term weight adjustment method shown in Figure 6B.
  • In step 142a, the query fields, which include a string of terms corresponding to a natural language search, are received from the search engine API 140 by the search engine 142.
  • In step 142b, the part of speech of each term in the natural language query is determined.
  • The method for determining the parts of speech of the natural language query terms in step 142b includes the use of an improved Hidden Markov Model discussed below.
  • An adjustment of the weight of each term in the query relative to the weight of the other terms of the query is determined. More particularly, in step 142c, a determination is made whether a particular term in the natural language query (or query word) is a proper noun. If the query term is not a proper noun, as determined in step 142c, a determination is made in step 142f whether the query term is a slow word. Slow words are words that have been determined, through experience, to usually be of little help in improving the results of searches.
  • the weight of slow words is reduced, but the slow words are not eliminated from the query.
  • The weight of slow words can be adjusted, for example, to about half of the baseline value. Some examples of slow words are best, better, say, really and now.
  • the term "semantic network” means a database in which each term carries with it semantic information such as, for example, part of speech and synonym information.
  • step 142j For a given query word, only those expansion words from the semantic network which are of the same part of speech as the query term are added to the terms in the natural language query in step 142f.
  • In step 142j, the weight adjusted terms from the original natural language query, together with any additional search words obtained by the semantic network expansion, are applied to index database 117 in order to (i) identify documents (stored in database 118) that are relevant to the query search, and (ii) determine a relevance score for each document identified.
  • Step 142j is implemented using the Conquest™ search engine marketed by Excalibur Technologies (hereinafter, the Conquest™ searching software).
  • If a query word is a proper noun, as determined in step 142c, its weight is increased in step 142g, and this increased searching weight is then used by the Conquest™ searching software in calculating relevance scores for documents identified by the search engine.
  • The weight of proper nouns may be adjusted, for example, to about one and a half times the baseline value. Additionally, proper nouns are not applied to the semantic network for expansion. It has been determined that by-passing the semantic network in this manner for proper nouns decreases the amount of noise in the results obtained by the database search.
  • step 142d The output of the search performed in step 142j includes a relevance score for each document identified during the search.
  • In step 142h (described more fully below under the heading "Relevance Normalization"), the output relevance scores are normalized.
  • step 142L the document identification numbers (identified in step 142j) and normalized relevance scores corresponding to such document identification numbers are transmitted to search engine API 140.
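The query term weight adjustment described above can be sketched as follows, purely as an illustration. The 1.5 and 0.5 multipliers and the slow word examples come from the description; the tiny `SEMANTIC_NETWORK` dictionary, the part of speech labels, the baseline weight given to expansion words, and the decision not to expand slow words are assumptions of this sketch and are not stated in the description.

```python
BASELINE = 1.0
SLOW_WORDS = {"best", "better", "say", "really", "now"}  # examples from the text

# Hypothetical stand-in for the semantic network:
# term -> list of (related word, part of speech of the related word).
SEMANTIC_NETWORK = {
    "car": [("automobile", "noun"), ("drive", "verb")],
}


def weight_query(tagged_terms):
    """Return {term: weight} for a list of (term, part_of_speech) pairs."""
    weights = {}
    for term, pos in tagged_terms:
        if pos == "proper_noun":
            # Proper nouns are boosted and bypass the semantic network.
            weights[term] = BASELINE * 1.5
            continue
        if term.lower() in SLOW_WORDS:
            # Slow words are down-weighted but kept in the query.
            weights[term] = BASELINE * 0.5
            continue  # assumption: slow words are not expanded
        weights[term] = BASELINE
        for related, related_pos in SEMANTIC_NETWORK.get(term.lower(), []):
            if related_pos == pos:  # same part of speech only
                weights.setdefault(related, BASELINE)
    return weights


query = [("Lincoln", "proper_noun"), ("best", "adjective"), ("car", "noun")]
print(weight_query(query))
# {'Lincoln': 1.5, 'best': 0.5, 'car': 1.0, 'automobile': 1.0}
```

The verb "drive" is not added for the noun "car", which mirrors the rule that only expansion words of the same part of speech are added to the query.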
Hidden Markov Model
  • Referring now to Fig. 7a, there is shown a diagram illustrating hidden Markov model 700.
  • The hidden Markov model 700 is a conventional hidden Markov model used for determining parts of speech of words in natural language queries in the form of complete sentences.
  • The hidden Markov model 700 includes an initial transition matrix 710 and a subsequent transition matrix 720.
  • The initial transition matrix 710, sometimes referred to as a Pi matrix, contains a plurality of different probabilities p_a, p_b, p_c that indicate the likelihood that the first word of a query is a selected part of speech, as indicated on the vertical edge of the initial transition matrix 710.
  • The probability that the first word of the query is a determiner is p_a.
  • The probability that the first word of the query is a noun is p_b.
  • The probabilities of the Pi matrix are obtained by counting the occurrences of each part of speech at the beginning of the sentences of a sample set of text documents, wherein the part of speech tagging can be performed by a human.
  • The subsequent transition matrix 720 includes a probability that indicates the likelihood that the next word of a query will be a particular part
  • The modified hidden Markov model 750 is used for determining parts of speech in natural language queries formed from sentence fragments, in accordance with a preferred embodiment of the present invention.
  • All of the probabilities in the initial transition matrix 710 are made equal to each other.
  • The probabilities in the subsequent transition matrix 720 are unchanged.
  • The modified Markov model 750 is applied to the words in the same manner as the hidden Markov model 700 is applied.
  • The use of the modified hidden Markov model in this manner provides better parsing results, and therefore better searching results, when the query applied to the search engine is a sentence fragment rather than a full sentence.
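As a minimal sketch of the difference between the two models, the first function below estimates a conventional Pi matrix by counting sentence-initial parts of speech in hand-tagged sample text, and the second builds the flattened Pi matrix of the modified model, in which every row has the same value 1/n for n parts of speech. The function names and the dictionary representation are assumptions of this sketch; the subsequent transition matrix 720 is left untouched and is not modelled here.

```python
from collections import Counter


def estimate_initial_matrix(tagged_sentences):
    """Conventional Pi matrix: relative frequency of each part of speech
    at the beginning of the sample sentences."""
    firsts = Counter(sentence[0][1] for sentence in tagged_sentences if sentence)
    total = sum(firsts.values())
    return {pos: count / total for pos, count in firsts.items()}


def modified_initial_matrix(parts_of_speech):
    """Modified Pi matrix: every part of speech is equally likely to start
    a query, which suits sentence fragments."""
    n = len(parts_of_speech)
    return {pos: 1.0 / n for pos in parts_of_speech}


sample = [
    [("the", "det"), ("dog", "noun"), ("barks", "verb")],
    [("dogs", "noun"), ("bark", "verb")],
    [("a", "det"), ("cat", "noun")],
    [("the", "det"), ("bird", "noun")],
]
print(estimate_initial_matrix(sample))                  # {'det': 0.75, 'noun': 0.25}
print(modified_initial_matrix(["det", "noun", "verb"]))
```

With the conventional matrix, a fragment such as "bark loudly" would be penalized for starting with a verb, which never begins a sentence in the sample; the uniform matrix removes that bias.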
  • The value p is preferably set at 1/n, where n is the number of rows in the initial transition matrix 710.
Relevance Normalization
  • Referring now to Figs.
  • the relevance normalization table 800 and the relevance normalization curve 850 for normalizing the relevance scores output in step 142j by the Conquest™ searching software.
  • The relevance score is an indication of how relevant the document is in accordance with the words in the query. For a given document identified during a query search, the relevance score is determined according to, for example, the proximity of query search words located in either the image-text information field 460 or text-type information field 470 associated with the document.
  • The relevance scores determined by the Conquest™ searching software in step 142j are influenced by the number of words in the query. If the query is short, the relevance scores returned are likely to be higher because it is easier for documents to contain all of the words of the query within a fixed size search window. If the query is long, it is less likely that a document will have all the words of the query within the fixed size search window, and when the probability of including all of the query words within the fixed size search window decreases, the computed relevance of the document decreases.
  • The relevance scores output by the Conquest™ searching software in step 142j are normalized to make the relevance scores independent of query length.
  • The graphical representation 800 is composed of a number of range fields such as range fields 804a, b, c.
  • The values within the range fields of the curve 800 correspond to the number of words in a query that were found outside a search window used to search the documents represented in the database 117. For a given document, as the number of words in the query falling outside the search window decreases, the relevance score assigned by the Conquest™ searching software to the document increases. For example, if none of the words are outside the search window, the document receives a high relevance score as indicated by the occurrence of range fields 804a containing zeros on the left hand side of the graphical representation 800.
  • In relevance normalization table 800, the number of words in a query, ranging from one to ten, is shown on the vertical edge. Each division along the horizontal edge of the table 800 corresponds to a window segment n. Each window segment n has a range of relevance scores returned by the searching software, wherein n is the number of query words outside the search window.
  • The window segments range from ws0 to ws9 along the horizontal axis of the table 800.
  • The relevance scores of the relevance normalization table 800 are then curved to span the various relevance score ranges shown in table 800. These ranges are defined by the numbers along the rows of table 850 within the window segments n.
  • The curving of the scores is performed in order to make the normalization provided to the user station 102 independent of query length in accordance with the graphical representation 850 wherein the relevance normalization curve 850 is merely a
  • For example, a query has a length of five words and a returned document has a relevance between 49 and 59.
  • This example corresponds to row 5 of graphical representation 850.
  • A string formed of the number 2, representing the two query words outside the window segment, spans the range of relevance values between approximately 49 and 59.
  • In the relevance normalization table 800, it can be seen that on row 5 the relevance scores below ws2 corresponding to window segment 2 are 79 and 75. This indicates that returned relevance values between 49 and 59 are mapped into relevance scores in the range of 75 to 79 under these circumstances when relevance score normalization is performed.
  • The mapped values in the range of 75 to 79 are the ones provided to the user station 102.
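The worked example above can be sketched as a table lookup, purely for illustration. Only the single entry for a five word query and window segment 2 is taken from the description; the dictionary layout, the function name, and the use of linear interpolation inside a range are assumptions of this sketch, and the full contents of table 800 are not reproduced.

```python
# (query length, window segment) -> ((raw low, raw high), (normalized low, normalized high))
NORMALIZATION = {
    (5, 2): ((49, 59), (75, 79)),  # the one worked example given in the text
}


def normalize(query_length, raw_score):
    """Map a raw relevance score to a normalized score for a given query length.

    Linear interpolation within the range is an assumption of this sketch.
    """
    for (length, _segment), (raw, norm) in NORMALIZATION.items():
        raw_low, raw_high = raw
        norm_low, norm_high = norm
        if length == query_length and raw_low <= raw_score <= raw_high:
            fraction = (raw_score - raw_low) / (raw_high - raw_low)
            return round(norm_low + fraction * (norm_high - norm_low))
    raise KeyError("no normalization range for this query length and score")


print(normalize(5, 49))  # 75
print(normalize(5, 54))  # 77
print(normalize(5, 59))  # 79
```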
  • a data preparation component 900 of the information retrieval system 100 of the present invention and a block diagram 980 of data flows within the data preparation component 900
  • The operation of data preparation component 900 is preferably performed within document preparation/accounting block 126 of data center 110.
  • The data preparation component 900 receives information representative of documents from a plurality of publishers or document sources 112 such as, for example, image-type information 480.
  • The document information received from the sources 112 is converted from the input format of the publisher 112 into the universal publisher processing format within the data preparation component 900.
  • The data preparation component 900 stores the converted document information in the image/text database 118 for retrieval in response to a query from a user station 102.
  • The publisher format conversion block 912 of the data preparation component 900 includes N publisher format conversion units 916.
  • The publisher format conversion units 916 of the format conversion block 912 control the parsing and conversion of the input document information from the various publisher-specific formats into the universal publisher processing format supported by the information retrieval system 100.
  • The publisher format conversion block 912 therefore contains one publisher format conversion unit 916 for each of the N different publishers 112 in order to perform the conversion of the information provided by the publishers 112.
  • Additional publisher format conversion units 916 can be added to the publisher format conversion block 912 as new publishers 112 are added to the list of publishers 112 that provide input information for storage and searching by the information retrieval system 100.
  • The data preparation component 900 has a modularity that permits it to be easily modified to support any number of new publishers 112 as they are added to the information retrieval system 100.
  • The input to the N different publisher format conversion units 916 of the publisher format conversion block 912 is the raw data provided by the individual publishers 112 to data center 110.
  • The output of the publisher format conversion units 916 is a converted document suitable for processing, searching and retrieval within the information retrieval system 100.
  • The publisher format conversion units 916 also provide a document information data structure in response to each input document. A unique document identification number is also provided for each input document.
  • The document information data structure is a data structure containing the document information obtained from the publisher 112, parsed by the format conversion units 916 and made available within data center 110.
  • The elements included in the document information data structure provided by the publisher format conversion units 916 are based upon the processing and retrieval requirements of the information retrieval system 100 and the individual varying publisher formats that are received for conversion.
  • The list of data structure elements is therefore not intended to be exhaustive. Additionally, the data structure element list can change as more publishers 112 and more publisher formats are added to the information retrieval system 100. Not all elements in the document data structure are applicable to every document.
  • The document information data structure provided by a publisher format conversion unit 916 can include, for example, in the case of a novel, such elements as a name of the author of the novel, a parent title, such as the title of the novel, and child titles, for example, the titles of the chapters within the novel.
  • The document information data structure can also include elements for indicating the number of charts or the number of diagrams set forth within the novel. Copyright information can also be available in the document information data structure if the novel has been copyrighted.
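One possible shape for the document information data structure and for the modular set of publisher format conversion units is sketched below. The field names follow the elements listed above for a novel; the class name, the registry, the publisher name "example_publisher", and its pipe-delimited raw format are all hypothetical and are introduced only for this illustration.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, Dict, List, Optional


@dataclass
class DocumentInfo:
    document_id: int                      # unique document identification number
    author: Optional[str] = None
    parent_title: Optional[str] = None    # e.g. the title of a novel
    child_titles: List[str] = field(default_factory=list)  # e.g. chapter titles
    chart_count: int = 0
    diagram_count: int = 0
    copyright: Optional[str] = None       # not every element applies to every document


_next_id = count(1)
CONVERTERS: Dict[str, Callable[[str], DocumentInfo]] = {}


def register(publisher):
    """Add a conversion unit for a publisher without touching existing ones."""
    def decorator(func):
        CONVERTERS[publisher] = func
        return func
    return decorator


@register("example_publisher")
def convert_example(raw):
    # Hypothetical raw format: "author|title|chapter1;chapter2"
    author, title, chapters = raw.split("|")
    return DocumentInfo(next(_next_id), author, title, chapters.split(";"))


def convert(publisher, raw):
    """Route raw publisher data to the conversion unit for that publisher."""
    return CONVERTERS[publisher](raw)


doc = convert("example_publisher", "Jane Doe|A Novel|Chapter One;Chapter Two")
print(doc.parent_title, doc.child_titles)
```

Supporting a new publisher in this sketch means registering one more function, which reflects the modularity attributed to the publisher format conversion block 912.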
  • As documents are received for conversion by the publisher format conversion units 916, they can be saved in temporary storage in order to increase document conversion throughput.
  • The temporary storage of input documents lined up for reformatting by the document preparation component 900 is often best performed on a hard disk drive (not shown) because hard disk drives usually have a higher data transfer rate than the type of peripherals from which the input documents are commonly received from the publishers 112, such as CD-ROM, DAP, tape drives and floppy diskettes.
  • Each of the publisher format conversion units 916 transmits document information to the image reformatting unit 904 and the text reformatting unit 908 within the document preparation component 900.
  • The image reformatting unit 904 is responsible for converting graphics and video such as image-type information 480 in supported publisher image formats into the image format supported by the information retrieval system 100 of the present invention, and providing the converted graphics and video as image-type information 480.
  • The image format supported by the information retrieval system 100 can be any conventional image format.
  • The image format supported by the information retrieval system 100 can be the conventional JPEG file interchange format as described below.
  • The image reformatting unit 904 includes image conversion routines that perform the operations required to reformat the graphical or video input image information received from the publisher format conversion units 916 of the publishers 112.
  • The image conversion routines of the image reformatting unit 904 can be obtained from public domain software packages.
  • The image conversion routines can include, for example, PBM Plus.
  • The PBM Plus image conversion routine is an extended portable bit map tool kit known to those skilled in the art.
  • The output of the publisher format conversion units 916 of the data preparation component 900 can have different formats depending on whether the data type of the document is image-type information or text-type information.
  • Converted image-type information 480 from the format conversion units 916 can be stored in one or more different formats in data center 110, including, for example, any of the conventional JPEG image file interchange formats known to those skilled in the art of information processing.
  • The conventional JPEG image format includes a standard image compression mechanism developed by the Joint Photographic Experts Group.
  • The JPEG standard does not require any particular file format.
  • The JPEG file interchange format used in the information retrieval system 100 is a file format that permits JPEG files to be exchanged with the PC, MAC and UNIX workstation platforms.
  • The JPEG file interchange format of the information retrieval system 100 can conform to the JPEG Draft International Standard (ISO DIS 10918-1).
  • The standard mechanism of the JPEG standard typically compresses image data with a compression ratio between 10:1 and 20:1 without visible loss of image quality. Compression ratios between 30:1 and 50:1 are also possible with small to moderate loss of image quality using the JPEG standard.
  • The JPEG standard is used for graphic data as well as image data within the data retrieval system 100.
  • The image reformatting unit 904 is invoked when a publisher format conversion unit 916 within the publisher format conversion block 912 is provided with an image format converter function.
  • The image format converter function serves as a gateway to the image reformatting unit 904.
  • The format converter function gateway is responsible for routing unconverted images and graphics to the image reformatting unit 904 and for routing converted images and graphics from the image reformatting unit 904.
  • A format converter function gateway applies an unconverted input image to the image reformatting unit 904.
  • The image reformatting unit 904 receives a file pointer.
  • The file pointer points to a file containing the input image to be converted.
  • The image reformatting unit 904 also receives the offset within the file to the image that is to be converted.
  • The image reformatting unit 904 returns a converted image 480 to the appropriate publisher format conversion unit 916.
  • The returned converted image 480 is suitable for storage in database 118.
  • The publisher format conversion units 916 of the publisher format conversion block 912 also communicate with the text reformatting unit 908 within the data preparation component 900.
  • The text reformatting unit 908 converts supported publisher text formats into the text format supported by the information retrieval system 100.
  • The text format supported by the system 100 can be any conventional text format.
  • The text reformatting unit 908 can support ASCII text format.
  • The text reformatting unit 908 can provide support for tables and charts using ASCII characters.
  • Conventional public domain text conversion routines can be used to perform the text reformatting operations of the text reformatting unit 908 of the data preparation component 900.
  • The text reformatting unit 908 can include program blocks designed to receive input information in proprietary publisher data formats and convert the input information from the proprietary formats to the supported format.
  • The text reformatting unit 908 receives a file pointer from a publisher format conversion unit 916 within the publisher format conversion block 912.
  • The pointer received by the text reformatting unit 908 points to the input file containing the text to be reformatted.
  • The publisher format conversion units 916 within the publisher format conversion block 912 are also coupled to a text update unit 928.
  • The text update unit 928 stores converted text information in image/text database 118 in image text-type information fields 460 and text-type information fields 470. Each document or image can be stored in a subject category within the image/text database 118 in the preferred embodiment of the information retrieval system 100.
  • The text update unit 928 receives a pointer to the document information data structure of the text (i.e., the image text-type information field 460 or text-type information field 470), and the name of a file (stored in field 417 of record 400) that contains the information that the information retrieval system 100 requires for indexing the document. Text update unit 928 uses this and other information to build a record 400 for each text document processed by the unit.
  • The text update unit 928 temporarily stores documents as the documents are received. It begins queuing the information that is necessary to index the documents that are added to the system 100. When all documents for a data preparation session are added to the information retrieval system 100, the text update unit 928 invokes the index update unit 932 and passes to the index update unit 932 the necessary information to add the documents to the image/text database 118.
  • The image update unit 924 also receives information from the publisher format conversion units 916 of the publisher format conversion block 912 within the data preparation component 900.
  • The image update unit 924 stores a converted image document received by the publisher format conversion unit 916 from the image reformatting unit 904 in an image-type information field 480 in the image/text database 118. In the preferred embodiment of the information retrieval system 100, the image can be stored in a subject category.
  • The image update unit 924 receives a pointer (stored in fields 432, 433 of a record 430) corresponding to the document information structure output by the publisher format conversion units 916.
  • The image update unit 924 temporarily stores documents as they are received from the conversion units 916.
  • The information retrieval system 100 can receive both (1) unconverted text-type information 908a representative of text, and (2) image-type information 480 representative of images.
  • The information retrieval system 100 can also store both types of information in the image/text database 118.
  • The information retrieval system 100 also receives unconverted image text-type information 908b.
  • Image text-type information 908b is text information that is associated with a specific item of image-type information 480.
  • The image text-type information 908b associated with an item of image-type information 480 can be a human readable caption associated with the image or a string of words associated with an image for the purpose of computer searching.
  • Some image-type information 480 can be received by the information retrieval system 100 without any associated image text-type information 908b, and associated image text-type information 908b can be generated at data center 110 for association with the image-type information 480 prior to storage of the image text-type information field 460 and the image-type information field 480 in the image/text database 118.
  • The image-type information 480 received is stored in the correct image format by the image update unit 924. Additionally, both the text-type information 470 and the image text-type information 460 associated with the image-type information 480 are stored by the text update unit 928 in the image/text database 118.
  • The text-type information 470 and the image text-type information 460 are stored in the image/text database 118 in the same text information format after conversion by the data preparation component 900. Because they are stored in a common text information format, the image text-type information 460 associated with the image and the text-type information 470 not associated with an image are not distinguishable from each other in database searches, and they can be searched by the information retrieval system 100 at the same time in response to a single query. Images located using image text information in this manner can be referred to as dependent images.
  • When image text-type information 460 is located by a search within the information retrieval system 100, the image-type information 480 associated with the located image text-type information 460 can be retrieved and provided to the user station 102 in accordance with the located image text-type information 460. If the user station 102 has windows software, the image and the text can be displayed in multiple windows 347, 348 or copied into a single window 348a.
  • the audio-type information can be digitally stored in a system 100 database in a conventional audio information format
  • the audio-type information can also be provided with associated audio text information
  • the audio text information can be stored by the text update unit 928 in the same manner as previously described with respect to the image text information.
  • The text-type information not associated with either image information or audio information, the image text information associated with image information, and the audio text information associated with audio information are all stored in a single common information format in the image/text database 118. Additionally, they are all searched simultaneously within the information retrieval system 100 in response to a single query from the user station 102.
  • The audio information associated with located audio text information can be selected and retrieved.
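The effect of storing all three kinds of text in one common format can be sketched as a single search over a single list of records, where a hit on caption or transcript text leads to the associated image or audio item. The record layout, the field names, the file names, and the simple all-terms-present match are assumptions of this sketch and stand in for the actual index and search engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextRecord:
    doc_id: int
    text: str
    media_kind: Optional[str] = None  # None for plain text, else "image" or "audio"
    media_ref: Optional[str] = None   # hypothetical reference to the stored media item


def search(records, query):
    """Return every record whose text contains all query terms, regardless
    of whether the text is a document, an image caption, or audio text."""
    terms = query.lower().split()
    hits = []
    for record in records:
        words = set(record.text.lower().split())
        if all(term in words for term in terms):
            hits.append(record)
    return hits


records = [
    TextRecord(1, "history of the apollo launch program"),
    TextRecord(2, "crowd watching the apollo launch", "image", "img_0002.jpg"),
    TextRecord(3, "countdown audio from the apollo launch", "audio", "aud_0003.wav"),
    TextRecord(4, "recipes for summer picnics"),
]

for hit in search(records, "apollo launch"):
    print(hit.doc_id, hit.media_kind, hit.media_ref)
```

A single query returns the article, the image, and the audio item together, since the search itself does not distinguish among the three kinds of text.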
  • the publisher format conversion units 916 are also coupled to a readability analysis unit 920.
  • the readability analysis unit 920 produces an index of readability for documents processed by the data preparation component 900.
  • the readability analysis unit 920 can also provide a value corresponding to the scholastic grade level of an input document from a publisher 112.
  • The readability index is stored permanently as an attribute of the document in field 410 of a record 400. For documents that are updated regularly, for example, newspapers, the readability index is assumed to be the same for all issues. Thus, once calculated by the readability analysis unit 920 for a single issue or volume, the readability index value is propagated to all subsequent issues of the document.
  • the readability index is preferably determined based upon the rate at which new terms are introduced into the text as well as by the distribution of words found in the documents compared to distributions of categorized text.
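One ingredient of such an index, the rate at which new terms are introduced into the text, can be approximated as the fraction of words that have not appeared earlier in the document. The sketch below computes only that fraction. It is an assumed simplification rather than the formula used by the readability analysis unit 920, and it omits the comparison against word distributions of categorized text.

```python
import re


def new_term_rate(text):
    """Fraction of words in the text that are first occurrences."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    seen = set()
    new_terms = 0
    for word in words:
        if word not in seen:
            seen.add(word)
            new_terms += 1
    return new_terms / len(words)


simple = "the cat sat on the mat and the cat sat on the rug"
dense = "quantum chromodynamics describes gluon mediated interactions between quarks"
print(new_term_rate(simple) < new_term_rate(dense))  # True
```

Under this simplification, text that keeps repeating the same small vocabulary scores lower than text that introduces a new term with almost every word.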
  • An index update unit 932 is also coupled to the publisher format conversion block 912 of the data preparation component 900.
  • The index update unit 932 updates the document indexes within the document index 117 when a new document or group of documents is added to the data center 110. Additionally, the index update unit 932 updates the document index database 117 when documents are purged from the data center 110.
  • The index update unit 932 updates the document information directory table. When purging documents from the data center 110, the index update unit 932 deletes the document text or image from the document information directory and/or dependent image tables in database 118, and marks the document as deleted in the index system.
  • Documents stored in the image/text database 118 are divided into a number of different subject categories.
  • The user of the information retrieval system 100 is permitted to specify a subject or subjects for search in response to a user query.
  • The user may wish to search only sports literature or only art literature in the image/text database 118. This avoids searching through documents in the database 118 that are unlikely to be of interest to the user.
  • The input documents applied to the information retrieval system 100 must be categorized or subjected after they arrive from the publishers 112. Categorizing of input documents can be a very laborious and time consuming process when performed by a human viewer because a very large number of incoming documents can be added regularly to the data center 110. Additionally, determinations can be made for each input document whether it should be categorized in more than one subject category.
  • An automatic document subjecting system 1000 receives documents processed by the data preparation component 900 and stored in a document pool 1004.
  • The automatic document subjecting system 1000 automatically determines the subject category of the received documents without any review of the received document by a human viewer. It will be understood that the automatic document subjecting performed within the data center 110 thus eliminates the need for the laborious and time consuming subjecting operations performed in the prior art wherein humans reviewed the input documents received by an information retrieval system in order to determine the subject categories of each received document for storage in a system database such as the image/text database 118.
  • The automatic document subjecting system 1000 includes a plurality of individual subject databases 1016 which can reside within the image/text database 118 of the information retrieval system 100.
  • Each subject database 1016 contains documents relevant to a separate subject category.
  • One subject database 1016 can contain only documents relating to the subject of art.
  • Another subject database 1016 can contain only documents relating to the subject of sports.
  • Subject categories have also been provided, for example, for science, religion, health, education, geography and recipes.
  • the automatic document subjecting system 1000 also includes an individual subject lexicon 1020 coupled to each subject database 1016.
  • Each subject lexicon 1020 includes a list of classifier words whose occurrence in a document has been determined to be an indicator that the document is likely to be classified in the subject category corresponding to the subject lexicon 1020.
  • the behavior of the classifier words included in the subject lexicons 1020 can be determined according to a statistical analysis described hereinbelow.
  • the subject lexicon 1020 includes a discriminator weight for each classifier word in the subject lexicon 1020.
  • the discriminator weight provides a measure of the difference between the behavior of a classifier word in a subject database 1016 and its behavior in the overall document pool 1004 which contains documents from all subject categories 1016.
  • the automatic document subjecting system 1000 includes an automatic subjecting block 1012.
  • the automatic subjecting block 1012 performs a comparison of the input documents from the publishers 112, stored in the document pool 1004, with the various subject lexicons 1020 in order to categorize them into their correct subject category.
  • the words of the input document in the document pool 1004 to be subjected are compared by the automatic subjecting block 1012 with the classifier words of the various subject lexicons 1020 by way of the word comparison line 1022. Based upon this comparison, the input documents stored in the document pool 1004 are stored in selected subject databases 1016 by way of the appropriate subject category line 1018.
  • the determination as to whether to include a document in a subject database 1016 is based on a comparison of the various confidence scores determined for the document. For example, in one embodiment, if a confidence score associated with a particular subject exceeds the confidence score associated with all other subjects by 30% or more, then the document is determined as being associated with the particular subject and is selected for inclusion in the database 1016 associated with the particular subject. It will be understood by those skilled in the art that other thresholds may be used in comparing the various confidence scores to determine whether to select a document for inclusion in a particular subject database 1016.
  • the subject lexicon generation process flow 1100 and the subject lexicon generation system 1200 of the present invention describe the generation of the subject lexicons 1020 of the automatic document subjecting system 1000 in order to permit the automatic subjecting system 1000 to receive input documents from the publishers 112 stored in the document pool 1004, and automatically determine a subject area category for the received input documents.
  • the input document can be sorted into a selected subject database 1016 as previously described.
  • a miniature corpus 1232 including a small number of documents, for example five or ten documents, known to be typical of the subject category is selected by a human viewer as shown in block 1104 of the process flow 1100. This selection can be performed using the query program 1204 to select the small number of subjected documents from the document pool 1004.
  • the documents selected from the document pool 1004 by the query program 1204 thus form the miniature corpus 1232.
  • the miniature corpus 1232 of the selected subject can be referred to as the seed 1232 because a subject lexicon 1020 is developed from it by the system 1200 following the process flow 1100.
  • Statistics on the words in the selected documents of the miniature corpus 1232 are determined as shown in block 1108.
  • the statistics include, for example, the frequency of occurrence of each word in the miniature corpus, and the number of documents in the miniature corpus in which each such word appears.
  • the generation of these statistics is performed by the build_lexs program 1236.
  • a preliminary lexicon 1238 is developed by determining the words most likely to appear in the documents of the miniature corpus 1232 based upon the statistics determined by the build_lexs program 1236.
  • the build_lexs program 1236 can build a preliminary lexicon 1238 for each subject category database 1016 in this manner.
  • the build_lexs program 1236 then compares the behavior of the words in the miniature corpus 1232 of the selected subject area, as determined in block 1108, with the behavior of the same words in the overall document pool 1004.
  • the pool stats determine program 1208 determines the statistics of the words in the document pool 1004.
  • the statistics determined by the pool stats determine program 1208 include the frequency of occurrence of the words and the number of documents in which each word appears. These statistics are stored in pool statistics 1008. Since these statistics are required on a very large number of different words, it may be convenient to perform the tasks in segments and combine the results using the comb_stats program 1212. These statistics in pool_stats 1008 are applied to the build_lexs program 1236 by way of the line 1210.
  • a discriminator weight is determined by the build_lexs program 1236 for each classifier word (or unigram) in the subject lexicon 1020.
  • the discriminator weight of a word is equal to $\log(P_{W_{mc}} / P_{W_c})$, where $P_{W_{mc}}$ is the probability of the word in the miniature corpus 1232, and $P_{W_c}$ is the probability of the word being in the document pool 1004.
  • the discriminator weights calculated in this manner are stored in the subject lexicon 1020 for each classifier word. In this manner the subject lexicon generation process flow 1100 and the subject lexicon generation system 1200 determine classifier words of the various subject lexicons 1020. Additionally, a discriminator weight can be determined for "bigrams" that can act as classifier words in the manner described above with respect to unigrams. Bigrams are word pairs that occur together in documents and can therefore provide useful information about the subject area of the subject database 1016.
  • the discriminator weights for bigrams are calculated using the mutual information formula set forth in equation (1) below:
  • the documents selected by the subj_pool program 1216 are the documents in the document pool 1004 that have the highest confidence of being in the selected subject area, as determined using the classifier words and confidence levels of the preliminary lexicon 1238 built by the build_lexs program 1236 in the manner described with respect to auto-subjector 1012.
  • the confidence threshold used by the subj_pool program is preferably the same as that described above for auto-subjector 1012. Thus, if a confidence score associated with a particular subject exceeds the confidence score associated with all other subjects by 30% or more, then the document is determined as being associated with the particular subject by the subj_pool program.
  • the subj_pool program 1216 is provided with adjustable confidence criteria.
  • the rough corpus 1220 can include, for example, two-hundred documents from the document pool 1004.
  • a human reviewer then manually reviews the documents selected by the subj_pool program 1216 for inclusion in the rough corpus 1220.
  • the human reviewer can use the chooser program 1224 for this review process.
  • the human reviewer eliminates documents selected for inclusion in the rough corpus 1220 using the preliminary lexicon 1238 if the human reviewer determines that the reviewed document does not belong in the selected subject. This purging operation for removing documents incorrectly selected using the preliminary lexicon 1238 is set forth in block 1128 of the subject lexicon generator process flow 1100. Removal of the incorrectly included documents of the rough corpus 1220 provides the refined corpus 1228.
  • the refined corpus 1228 is then operated upon by the build_lexs program 1236 in the manner described previously with respect to the operation of the build_lexs program 1236 upon the miniature corpus 1232.
  • the build_lexs program 1236 can produce a new preliminary lexicon 1238.
  • the new preliminary lexicon 1238 can be used as a subject lexicon 1020 in the automatic document subjecting system 1000 if a determination is made in step 1116 that lexicon generation is complete and path 1122 followed. Alternately, path 1120 can be repeated and the new preliminary lexicon 1238 can be operated upon by the subj_pool program 1216 to provide a new rough corpus 1220.
  • the loop through the subject lexicon generation system 1200 including the path 1120 can be repeated in this manner any number of times. It is believed that the optimum number of repetitions in this manner is two or three. It is also believed that the use of bigrams for selecting documents is most useful in the first pass through path 1120.
  • the accounting manager 138 of the data center 110 provides generalized reporting capabilities with respect to a wide range of accounting database information within the accounting database 119 (shown in Figure 13).
  • the accounting database information within the accounting database 119 is organized, for example, into databases 119a, b, d, e, and g, as described below, in order to facilitate report generating and management functions at the data center 110.
  • the accounting database 119 of the data center 110 includes a subscriber profile database 119a for storing subscriber profile information.
  • the subscriber profile information in the subscriber profile database 119a can include a subscriber identification, a subscriber name, address, telephone number, date of birth, and gender.
  • the subscriber profile information can also include the number of siblings of the subscriber, by gender, the occupation of the mother and father of the subscriber, and any foreign language spoken by the subscriber. Additionally, the subscriber profile information can include the favorite subjects in school, the hobbies, the extracurricular activities and the favorite magazines of the subscriber. The college attendance plans and possible future profession of the subscriber can also be included in the subscriber profile information stored in the database 119a.
  • the subscriber profile information in the subscriber profile database 119a can also include the configuration of the computer used by the subscriber.
  • the accounting database 119 also includes an accounting records database 119b containing a plurality of accounting records 119c.
  • the accounting record 119c within the accounting records database 119b can contain a document identification, a publisher identification and a subscriber identification.
  • the accounting record 119c contains a time stamp indicating the time of access of the document by a user station 102 and a link to the subscriber profile information in the subscriber profile database 119a.
  • the accounting records 119c permit the performance of accounting and marketing functions within the data center 110.
  • One accounting record 119c is stored within the accounting records database 119b for each document that is received by a user station 102 in response to a query. Multiple retrievals of the same document in response to a single query by the user station 102 result in a single accounting record 119c within the accounting records database 119b. Multiple retrievals of a document in response to different queries result in the addition of multiple accounting records 119c to the accounting records database 119b.
  • An accounting record 119c for a document can be added to the accounting records database 119b within the accounting database 119 at the time the retrieved document is transferred to the user station 102.
  • accounting database 119 includes a publisher information database 119d for storing publisher database information.
  • the publisher information database 119d includes a publisher information record 440, as previously described, for storing information on each publisher 112 that provides documents to the information retrieval system 100.
  • the information stored for a publisher 112 in a publisher information record 440 can include fields such as a publisher identification field 441, a name field 442, and address fields 443, 444. A contact name field 449, a contact title, a telephone number field 451 and an address field 450 associated with the publisher 112 can also be stored in the publisher information record 440.
  • the publisher database information record 440 can contain the royalty percentage and the royalty amount by month of the publisher 112. Each time a new publisher 112 is added to the information retrieval system 100, a new publisher information record 440 is added to the publisher information database 119d.
  • the accounting manager 138 can perform various royalty, marketing and auditing operations.
  • data center 110 can provide reports on document usage, reports on queries that result in no hits or in low relevance hits, reports on the categories of users and reports on the demographics of the various users.
  • the data center 110 can also provide time profiles such as time profiles per session, time profiles per week and time profiles on the think time required by the users.
  • the accounting manager 138 of the information retrieval system 100 can determine and report what queries are submitted by user stations 102, a list of documents returned to each user station 102 in response to each query submitted, and the relevance of each document returned for the queries.
  • the document retrieval and the user feedback information can also be determined and reported by the accounting manager 138.
  • the royalty payment computations of royalty database information management require a generalized reporting capability of the accounting database 119.
  • the accounting records 119c within the accounting records database 119b can be summarized and sorted according to a number of different criteria.
  • the criteria include, for example, publisher 119, document identification number, date and frequency of access.
  • a query log table database 119e within the accounting database 119 is used by the data center 110 for maintaining a history of queries made by users.
  • the query log table database 119e contains a query log entry 119f for each query applied to the information retrieval system 100 by a user station 102.
  • Each query log entry 119f identifies the raw natural language and fields, if any, entered by the user to perform a query.
  • the query log entry 119f also includes the date and time of the query as well as the number of matches that resulted from the query.
  • each log entry 119f cross-references the user identification number of the user that entered the query into the information retrieval system 100.
  • a new query log entry 119f is added to the query log table database 119e each time a new query is received.
  • a retrieval account table database 119g is also provided within the accounting database 119 of the data center 110.
  • the retrieval account table database 119g is used by the information retrieval system 100 to log all document retrievals performed in response to queries. When a document retrieval is initiated within the data center 110, a document retrieval entry 119h is inserted in the document retrieval account table database 119g and the time and date of the retrieval are recorded in the document retrieval entry 119h.
  • Each document retrieval entry 119h is a record that stores information identifying the document retrieved by cross-referencing the document information directory table.
  • Each document retrieval entry 119h also identifies the publisher 112 of the document by cross-referencing the publisher information database 119d.
  • the reference to the publisher information database 119d is used to credit the publisher for the retrieval for the purpose of royalty management.
  • Each document retrieval entry 119h also cross-references the query and the user identification number of the user that issued the query for further accounting purposes using the accounting database 119.
  • a plurality of marketing reports can be generated. For example, reports indicating subscriber profile statistics such as the average age of the users or the percentage of users of each gender can be generated.
  • a report indicating the demographical makeup of the users that selected the document for retrieval can be generated.
  • Subscriber computer statistics such as the percentage of subscribers with different kinds of computers and common modem speeds can be provided.
  • Subscriber/query statistics such as the average number of queries per user and the average number of queries per gender can be obtained.
  • System utilization statistics are also available. For example, retrieval daily peaks and valleys and the average number of retrievals per day can be determined.
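
The subject lexicon generation and automatic subjecting steps described earlier in this list (word statistics, the log-ratio discriminator weight, and the 30% confidence margin) can be illustrated with a minimal Python sketch. The function names, the whitespace-free token lists, the `top_n` cutoff, and in particular the confidence score (taken here as the sum of discriminator weights of the classifier words found in a document) are assumptions made for illustration; the text above does not specify how a confidence score is computed.

```python
import math
from collections import Counter

def word_stats(docs):
    """Return (term frequency, document frequency) for tokenized documents."""
    tf, df = Counter(), Counter()
    for doc in docs:
        tf.update(doc)        # total occurrences of each word
        df.update(set(doc))   # number of documents containing each word
    return tf, df

def build_lexicon(corpus_docs, pool_docs, top_n=50):
    """Weight each corpus word by log(P(word in corpus) / P(word in pool))."""
    corpus_tf, _ = word_stats(corpus_docs)
    pool_tf, _ = word_stats(pool_docs)
    corpus_total = sum(corpus_tf.values())
    pool_total = sum(pool_tf.values())
    weights = {}
    for word, count in corpus_tf.items():
        if pool_tf[word] == 0:
            continue  # assumption: ignore words never seen in the pool
        p_corpus = count / corpus_total
        p_pool = pool_tf[word] / pool_total
        weights[word] = math.log(p_corpus / p_pool)
    # Assumption: keep only the top_n positively weighted words as classifiers.
    best = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return {word: weight for word, weight in best if weight > 0}

def confidence(doc, lexicon):
    """Assumed score: sum of weights over classifier-word occurrences."""
    return sum(lexicon.get(word, 0.0) for word in doc)

def assign_subject(doc, lexicons, margin=0.30):
    """Return the subject whose score beats all others by `margin`, else None."""
    scores = {subject: confidence(doc, lex) for subject, lex in lexicons.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_subject, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_score > 0 and best_score >= runner_up * (1 + margin):
        return best_subject
    return None
```

With a lexicon built per subject from a small seed corpus, `assign_subject` returns a subject only when its score exceeds every other subject's score by the chosen margin; documents that fall short are left unassigned in this sketch, and the margin is a parameter because the text notes that other thresholds may be used.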

Abstract

A method and apparatus for identifying textual documents and multi-media files corresponding to a search topic (344-346). A plurality of document records are stored and a plurality of multi-media records are also stored. The document records have text information fields associated therewith. The multi-media records have multi-media information fields for representing only digital video or audio information and associated text fields (343). A single search query corresponding to the search topic is received. The single search query is preferably in a natural language format. An index database is searched to simultaneously identify document records and multi-media records related to the single search query. A search result list having entries representative of both textual documents and multi-media files related to the single search query is generated in accordance with the document records and multi-media records identified by the index database search. Text is retrieved by selecting entries from the search result list representing document records. Digital video or audio information is retrieved by selecting entries from the search result list representing selected multi-media records.

Description

METHOD AND APPARATUS FOR IDENTIFYING TEXTUAL DOCUMENTS AND MULTI-MEDIA FILES CORRESPONDING TO A SEARCH TOPIC
Field of the Invention
The present invention is directed to systems for identifying documents corresponding to a search topic or query. More particularly, the present invention is directed to an automated multi-user system for identifying and retrieving text and multi-media files related to a search topic from a database library composed of information from many various publisher sources.
Background of the Invention
Information retrieval systems are designed to store and retrieve information provided by publishers covering different subjects. Both static information, such as works of literature and reference books, and dynamic information, such as newspapers and periodicals, are stored in these systems. Information retrieval engines are provided within prior art information retrieval systems in order to receive search queries from users and perform searches through the stored information. It is an object of most information retrieval systems to provide the user with all stored information relevant to the query. However, many existing searching/retrieval systems are not adapted to identify the best or most relevant information yielded by the query search. Such systems typically return query results to the user in such a way that the user must retrieve and view every document returned by the query in order to determine which document(s) is/are most relevant. It is therefore desirable to have a document searching system which not only returns a list of relevant information to the user based on a query search, but also returns the list to the user in such a form that the user can readily identify which information returned from the search is most relevant to the query topic.
Existing systems for searching and retrieving files from databases based on user queries are directed primarily to the searching and retrieval of textual documents. However, there is a growing volume of multi-media information being published which is not textual. Such multi-media information corresponds, for example, to still images, motion video sequences and digital audio sequences, which may be stored and retrieved by digital computers. It would be desirable from the point of view of an individual using an information searching/retrieval system to be able to query a library or database and identify not only text documents, but also multi-media files that are relevant to the user's query. Moreover, it would be desirable if the searching system could return to the user not only a single list having both text and multi-media information relevant to the query search, but also a list which enabled the user to readily identify which of the text and multi-media files were most relevant to the query topic.
Each different publisher providing documents that may be retrieved by information retrieval systems typically uses its own information format to store and transmit its information files. Thus, an information searching/retrieval system which has a library database based upon information from many various publishers must be compatible with many different publisher formats. This compatibility requirement can serve to slow the performance of an information searching/retrieval system.
It is well known in the prior art of information retrieval systems to permit a user to specify a single subject of a number of subjects for searching. For example, a user may wish to search only sports literature, medical literature or art literature. This avoids unnecessary searching through database documents that are not relevant to the subject of interest to the user. In order to provide this capability, information retrieval systems must categorize documents received from publishers according to their subject prior to adding them to the database. Subjecting of incoming documents often requires an individual to read each incoming document and make a determination regarding its subject. This process is very time consuming and expensive, as there is often a large number of incoming documents to be processed. The subjecting process may be further complicated if certain documents should properly be categorized in more than one subject. It would be desirable to have an automated system for processing incoming documents which categorized each incoming document into one or more subjects, and which did not require an individual to read each incoming document and make a separate judgment categorizing the subject of such document.
When a user of an information searching/retrieval system enters a search query into the system, the query must be parsed. Based on the parsed query, a listing of stored documents relevant to the query is provided to the user for review. In the prior art, it is known to use semantic networks when parsing a query. Semantic networks make it possible to identify words not appearing in the query, but which correspond to or are associated with the words used in the query. The number of words used to search the database is then expanded by including the corresponding words or associated words identified by the semantic network in the search instructions. This procedure is used to increase the number of relevant documents located by the information searching/retrieval system. Although semantic networks may be useful for finding additional relevant documents responsive to a query, it is believed that use of such networks also tends to increase the number of irrelevant documents located by the search. In fact, it is generally believed that the number of additional relevant documents identified through the use of semantic networks is roughly equal to the number of irrelevant documents which are also brought into the search results list as a result of the semantic network. It would be desirable to have a system for implementing a semantic network which maximized the number of relevant documents identified during the search, without substantially increasing the number of irrelevant documents found by the search.
Many publishers that provide documents to information retrieval systems require record-keeping in order to ensure accurate royalty payments. Record-keeping permits the publishers to determine the interest level in various documents produced by the publisher, and the demographics of users retrieving such documents. Thus, it would be desirable to have a searching/retrieval system that tracked not only how often each document stored in the system database was retrieved by users, but also the demographics of the users retrieving the documents and the query searches used to identify and retrieve such documents.
It is therefore an object of the present invention to provide a searching/retrieval system which can query a library or database and identify not only text documents, but also multi-media files stored on the library or database that are relevant to the query.
It is a further object of the present invention to provide a searching/retrieval system that accepts a query and returns a single search results list having both text and multi-media information, which list is presented in a format that enables the user to readily identify which of the text and multi-media files are most relevant to the query topic.
It is a still further object of the present invention to provide a scalable computer architecture for implementing a searching/retrieval system which can query a database and identify text documents and multi-media files stored on the database that are relevant to the query.
It is a still further object of the present invention to provide an information searching/retrieval system which has a library database based upon information from many various publishers, and which is compatible with many different publisher formats.
It is a still further object of the present invention to provide an information searching/retrieval system which has a library database based upon information from many various publishers, and wherein such information is stored in a central database in one or more common information formats.
It is a still further object of the present invention to provide an automated system for processing incoming documents to be stored on a library or database, which system categorizes each incoming document into one or more subjects, and which does not require an individual to read each incoming document and make a separate judgment categorizing the subject of such document.
It is a still further object of the present invention to provide a system for implementing a semantic network which maximizes the number of relevant documents identified during the query search, without substantially increasing the number of irrelevant documents found by the search.
It is a still further object of the present invention to provide a system for using a semantic network which maximizes the number of relevant documents identified during a query search by semantically expanding the search in response to the part of speech associated with each query term in the search.
It is a still further object of the present invention to provide a searching system that queries a database to determine text documents and multi-media files relevant to the query, wherein weightings associated with proper nouns and slow words are adjusted prior to searching the database.
It is a further object of the present invention to provide a searching/retrieval system that accepts a query and returns a single search results list including document relevance values, wherein the document relevance values are independent of the number of terms in the query.
It is yet a still further object of the present invention to provide a searching/retrieval system that tracks not only how often each document stored in the system database was retrieved by users, but also the demographics of the users retrieving the documents and the query searches used to identify and retrieve such documents.
These and other objects and advantages of the invention will become more fully apparent from the description and claims which follow or may be learned by the practice of the invention.
Summary of the Invention
The present invention is directed to a method and apparatus for identifying textual documents and multi-media files corresponding to a search topic. A plurality of document records, each of which is representative of at least one textual document, are stored, and a plurality of multi-media records, each of which is representative of at least one multi-media file, are also stored. The document records have text information fields associated therewith, each of the text information fields representing text from one of the plurality of textual documents. The multi-media records have multi-media information fields for representing only digital video (i.e., still images or motion video image sequences), digital audio or graphics information, and associated text fields, each of the associated text fields representing text associated with one of the multi-media information fields. A single search query corresponding to the search topic is received. The single search query is preferably in a natural language format. An index database is searched in accordance with the single search query to simultaneously identify document records and multi-media records related to the single search query. The index database has a plurality of search terms corresponding to terms represented by the text information fields and the associated text fields. The index database also includes a table for associating each of the document and multi-media records with one or more of the search terms. A search result list having entries representative of both textual documents and multi-media files related to the single search query is generated in accordance with the document records and the multi-media records identified by the index database search.
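As a rough illustration of the index database just described, the following Python sketch builds one term table over both the text fields of document records and the associated text fields of multi-media records, so that a single query yields one mixed result list. The `Record` type, the whitespace tokenizer, and the ranking by number of matching query terms are assumptions for this sketch only; the summary above does not state how results are tokenized or ordered.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    record_id: str
    kind: str   # "document" or "multimedia"
    text: str   # text field, or the associated text of a multi-media record

def build_index(records):
    """Map each search term to the ids of the records whose text contains it."""
    index = defaultdict(set)
    for rec in records:
        for term in rec.text.lower().split():
            index[term].add(rec.record_id)
    return index

def search(query, index, records_by_id):
    """Return (record_id, kind) pairs for one query, both record kinds mixed."""
    hits = defaultdict(int)
    for term in query.lower().split():
        for record_id in index.get(term, ()):
            hits[record_id] += 1  # assumed ranking: count of matching terms
    ranked = sorted(hits, key=lambda rid: (-hits[rid], rid))
    return [(rid, records_by_id[rid].kind) for rid in ranked]
```

Because both kinds of record share the same term table, a caller can select any entry from the returned list and then fetch either the stored text or the stored multi-media payload, depending on the `kind` of the selected record.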
Text corresponding to the search topic is retrieved by selecting entries from the search result list representing document records to be retrieved, and then retrieving text represented by the text information fields associated with the selected document records.
Digital video, audio or graphics information corresponding to the search topic is retrieved by selecting entries from the search result list representing selected multi-media records to be retrieved, and then retrieving digital video, audio or graphics information represented by multi-media information fields associated with the selected multi-media records.
In accordance with a further aspect, the present invention is directed to a computer-implemented method and apparatus for composing a composite document on a selected topic from a plurality of information sources by searching the plurality of information sources and identifying, displaying and copying files corresponding to the selected topic. A plurality of records, each of which is representative of at least one information file, are stored in a database. A single search query corresponding to the search topic is received. The database is searched in accordance with the single search query to identify records related to the single search query. A search result list is then generated having entries representative of information files identified during the database search, and the search result list is displayed in a first display window open on a user display. Signals representative of at least first and second selected entries from the search result list are received from the user, the first and second selected entries respectively corresponding to first and second information files. A second display window for displaying at least a portion of the first information file is opened on the user display, a third display window for displaying at least a portion of the second information file is opened on the user display, and a document composition window for receiving portions of the first and second information files is opened on the user display. The composite document is then composed by copying portions of the first and second information files from the second and third display windows, respectively, to the document composition window.
In accordance with a still further aspect, the present invention is directed to a split-server architecture for processing a search query provided by a user, and identifying and retrieving documents from a database corresponding to the search query. A session server is provided for receiving the search query from the user. The session server has at least a first processor coupled to the user over a communications channel. A query server is coupled to the session server. The query server has at least a second processor coupled to a first database having records representative of the documents to be searched. The query server includes means for receiving the search query from the session server, searching means for searching the first database to identify documents responsive to the search query, and means for sending search results information representative of the documents identified by the searching means from the query server to the session server. The session server includes means for sending the search query to the query server, means for receiving the search results information from the query server, means for sending a search results list representative of the search results information across the communications channel to the user, means for receiving a document retrieval request transmitted from the user over the communications channel, means for retrieving a document in response to the retrieval request and transmitting a file representative of the document to the user over the communications channel, and means for incrementing an accounting record on an accounting database coupled to the session server, the accounting record representing a number of retrievals of the document by the session server.
In accordance with a still further aspect, the present invention is directed to a method for preparing input information having differing input formats from different information sources for storage in an information retrieval system having a database with a database index for retrieval of the input information from the database. First and second input information having differing input information formats are received. The input information in one format is converted from the input format to an information retrieval system format to provide reformatted information. The information from the other information format is converted into the information retrieval system format to provide further reformatted information, whereby the input information in the differing input formats is converted into a single information retrieval system format. The reformatted information is stored in the database according to the single information retrieval system format and retrieved from the database according to the single information retrieval system format.
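The conversion of differing input formats into one system format can be illustrated with a minimal sketch. Both publisher formats below (a delimited `TI=...|PB=...|TX=...` layout and a header-plus-body layout), the field names, and the function names are hypothetical and chosen only for illustration; the specification does not define them.

```python
def from_delimited(raw):
    """Hypothetical publisher format A: 'TI=<title>|PB=<publisher>|TX=<text>'."""
    fields = dict(part.split("=", 1) for part in raw.split("|"))
    return {"title": fields["TI"], "publisher": fields["PB"], "text": fields["TX"]}


def from_headed(raw):
    """Hypothetical publisher format B: 'Key: value' header lines, a blank line, then the body."""
    header, body = raw.split("\n\n", 1)
    meta = dict(line.split(": ", 1) for line in header.splitlines())
    return {"title": meta["Title"], "publisher": meta["Publisher"], "text": body.strip()}


# One converter per input format; every converter emits the same record layout.
CONVERTERS = {"delimited": from_delimited, "headed": from_headed}


def to_system_format(raw, source_format):
    """Convert input in a known source format into the single system format."""
    return CONVERTERS[source_format](raw)


record_a = to_system_format("TI=Vietnam Casualties|PB=Publisher A|TX=Body text.", "delimited")
record_b = to_system_format("Title: War Memorial\nPublisher: Publisher B\n\nBody text.", "headed")
```

Because both records share one layout, the storage and retrieval code needs to understand only that layout, regardless of the originating source.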
In accordance with a still further aspect, the present invention is directed to a method for determining a part of speech of words in a sentence or sentence fragment. A hidden Markov model for determining the most likely part of speech for the words in the sentence or sentence fragment is provided, wherein the hidden Markov model has an initial transition matrix and a subsequent transition matrix for storing the probabilities of transitions from one part of speech to another. The initial matrix of the hidden Markov model is effectively removed by making the probabilities therein equal to each other to provide a modified hidden Markov model. The modified hidden Markov model is applied to the sequence of words to determine the most likely part of speech of words within a sentence fragment with increased accuracy.
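The effect of equalizing the initial probabilities can be sketched with a small Viterbi decoder. This is a toy illustration only: the tag set, every probability value, and the function names are assumptions made up for the example, not values from the specification.

```python
import math


def viterbi(words, tags, initial, transition, emission):
    """Return the most likely tag sequence for `words` under the given HMM."""
    floor = 1e-12  # stand-in probability for unseen events, avoids log(0)

    def lg(p):
        return math.log(max(p, floor))

    # best[i][t] = (log-probability of best path ending in tag t at word i, previous tag)
    best = [{t: (lg(initial[t]) + lg(emission[t].get(words[0], 0.0)), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + lg(transition[p].get(t, 0.0)), p) for p in tags
            )
            column[t] = (score + lg(emission[t].get(word, 0.0)), prev)
        best.append(column)

    # Trace back from the highest-scoring final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    path.reverse()
    return path


def flatten_initial(tags):
    """Give every tag the same starting probability, effectively removing the initial matrix."""
    return {t: 1.0 / len(tags) for t in tags}


# Invented toy model; a real model would be estimated from tagged text.
tags = ["DET", "NOUN", "VERB"]
sentence_initial = {"DET": 0.6, "NOUN": 0.35, "VERB": 0.05}
transition = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emission = {
    "DET": {"the": 1.0},
    "NOUN": {"soldiers": 0.5, "count": 0.2, "vietnam": 0.3},
    "VERB": {"count": 0.6, "killed": 0.4},
}
fragment = ["count", "soldiers"]

with_sentence_prior = viterbi(fragment, tags, sentence_initial, transition, emission)
with_flat_prior = viterbi(fragment, tags, flatten_initial(tags), transition, emission)
```

In this toy model the sentence-trained start probabilities push the first word of the fragment toward a noun reading, while the flattened start probabilities let the emission and transition evidence decide. Which reading is correct depends on the data; the sketch only shows how much the initial matrix can influence a fragment that does not begin like a full sentence.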
In accordance with yet a further aspect, the present invention is directed to a method for storing input information in an information retrieval system database wherein a plurality of information subject categories are provided. A plurality of subject lexicons are provided, each subject lexicon of the plurality of subject lexicons corresponding to an information subject category of the plurality of information subject categories. Each subject lexicon contains information representative of its corresponding information subject category. The input information is compared with the subject lexicons and the input information is stored in a selected information subject category according to the comparing of the input information with the subject lexicons.
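A minimal sketch of comparing input text against subject lexicons follows. The lexicon contents, the category names, and the scoring rule (a simple count of matching words) are assumptions for illustration; the specification does not state how the comparison is scored.

```python
def assign_subject(text, lexicons):
    """Pick the subject category whose lexicon shares the most words with `text`."""
    words = text.lower().split()
    scores = {
        subject: sum(1 for word in words if word in lexicon)
        for subject, lexicon in lexicons.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical lexicons; real ones would be generated from categorized documents.
lexicons = {
    "history": {"war", "soldiers", "treaty"},
    "biology": {"cell", "enzyme", "species"},
}

subject, scores = assign_subject("Soldiers returned after the war ended", lexicons)
```

A production system would likely weight lexicon terms and handle ties or low scores explicitly, which this sketch does not attempt.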
In accordance with yet a further aspect, the present invention is directed to a method for storing information in an information retrieval system having a database for retrieval of the input information in response to a query. Text information representative of text is received for storing in the system. Image information representative of an image is also received for storing in the system. Additionally, image text information representative of text associated with the image information is received. The image information is stored in an image information format. The text information and the image text information are stored in a common text information format whereby the format of the stored text information is identical to the format of the stored image text information. The text information and image text information are searched in the common text information format and the text information and image text information are identified in response to a single query. The image information associated with the retrieved image text information is selected and the selected image information is retrieved whereby the text information and the image information are retrieved in accordance with the same query.
In accordance with still yet a further aspect, the present invention is directed to a method for searching a database of an information retrieval system in response to a query having at least one query word with a part of speech, for applying the query word to the database and selecting information from the database according to the query word. A semantic network is provided for determining expansion words to expand the search of the database in response to the query word. The part of speech of the selected query word is determined. The selected query word is applied to the semantic network to provide one or more query expansion words in response to the selected query word. The part of speech of the query expansion word is determined. The query expansion word is applied to the database in accordance with the part of speech of the selected query word and the part of speech of the query expansion word.
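One way to picture part-of-speech-aware query expansion is sketched below. The semantic network contents and the matching rule (keep only expansion words whose part of speech equals that of the query word) are assumptions; the specification says only that both parts of speech are taken into account.

```python
# Hypothetical semantic network: (word, part of speech) -> related (word, part of speech) pairs.
SEMANTIC_NETWORK = {
    ("kill", "VERB"): [("slay", "VERB"), ("casualty", "NOUN")],
    ("kill", "NOUN"): [("prey", "NOUN")],
}


def expand(word, pos, network):
    """Return expansion words whose part of speech matches that of the query word."""
    return [w for w, w_pos in network.get((word, pos), []) if w_pos == pos]


def expanded_terms(word, pos, network):
    """Return the original query term followed by its accepted expansion terms."""
    return [(word, pos)] + [(w, pos) for w in expand(word, pos, network)]


verb_expansions = expand("kill", "VERB", SEMANTIC_NETWORK)
noun_terms = expanded_terms("kill", "NOUN", SEMANTIC_NETWORK)
```

Filtering by part of speech keeps an expansion such as a noun sense from being applied to a query word that was used as a verb.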
In accordance with a still further aspect, the present invention is directed to a method for performing a search of a database in an information retrieval system in response to a query having at least one query word with a query word weight and for applying the query word to the database and selecting information from the information retrieval system in accordance with the query word. A query word is selected and assigned a weight. The weight is adjusted depending on whether the query word is a proper noun or slow word. The adjusting can be an increase or a decrease in the weight. Information is selected from the information retrieval system in accordance with the adjusted weight.
In accordance with a still further aspect, the present invention is directed to a method for searching a database of an information retrieval system in response to a query having a query length of at least one word, for applying the query word to the database and selecting information from the database according to the query word. The query is received and the length of the query is determined. Information is selected from the database according to the query. The relevance of the selected information is determined according to matches between the query and the information. The determined relevance of the selected information is adjusted according to the length of the query.
In accordance with a further aspect, the present invention is directed to a method for searching an information retrieval system having a database containing a plurality of documents from a plurality of document sources in response to a query from a user. A document log table is provided for tabulating document information of documents selected by the user in response to a query from the user. The query is received from the user and a document is selected by the user in response to the received query. The document log table is adjusted in response to the selecting of the document. The adjusted log table can be used to determine royalties.
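A document log table of this kind can be sketched as a counter keyed by publisher and document. The class name, the flat per-retrieval rate, and the use of whole cents are assumptions for illustration; the specification does not describe how royalties are computed from the log.

```python
from collections import Counter


class DocumentLog:
    """Tabulates document selections so that royalties can be derived later."""

    def __init__(self):
        self.retrievals = Counter()  # (publisher, document id) -> number of selections

    def record(self, doc_id, publisher):
        """Adjust the log when a user selects a document."""
        self.retrievals[(publisher, doc_id)] += 1

    def royalties(self, cents_per_retrieval):
        """Total owed per publisher at a hypothetical flat rate, in cents."""
        owed = Counter()
        for (publisher, _doc_id), count in self.retrievals.items():
            owed[publisher] += count * cents_per_retrieval
        return dict(owed)


log = DocumentLog()
log.record(101, "Publisher A")
log.record(101, "Publisher A")
log.record(202, "Publisher B")
owed = log.royalties(cents_per_retrieval=5)
```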
Brief Description of the Drawings
In order that the manner in which the above-recited and other advantages and objects of the invention are obtained and can be appreciated, a more particular description of the invention briefly described above will be rendered by reference to a specific embodiment thereof which is illustrated in the appended drawings. Understanding that these drawings depict only a typical embodiment of the invention and are not therefore to be considered limiting of its scope, the invention and the presently understood best mode thereof will be described and explained with additional specificity and detail through the use of the accompanying drawings.
Figure 1 is a simplified block diagram showing an information retrieval system in accordance with a preferred embodiment of the present invention.
Figure 2 is a simplified process flow diagram illustrating a user session which may be performed with the information retrieval system shown in Figure 1, in accordance with a preferred embodiment of the present invention.
Figure 3 is a more detailed block diagram showing an information retrieval system in accordance with a preferred embodiment of the present invention.
Figure 4 is a more detailed process flow diagram illustrating a user session which may be performed with the information retrieval system shown in Figure 3, in accordance with a preferred embodiment of the present invention.
Figure 4A is a diagram illustrating an exemplary search results list displayed in an open window on a user's personal computer, in accordance with a preferred embodiment of the present invention.
Figure 4B is an exemplary diagram illustrating first and second open windows on a user's personal computer which respectively display text and video information corresponding to document and multi-media files selected by the user for retrieval, in accordance with a preferred embodiment of the present invention.
Figure 4C is an exemplary diagram illustrating first and second open windows on a user's personal computer which respectively display text and video information corresponding to document and multi-media files selected by the user for retrieval, and a composite document window in which the user has built a composite document based on the text and video information in the first and second windows, in accordance with a preferred embodiment of the present invention.
Figure 5 is a diagram illustrating preferred data structures for storing a document information directory table, a dependent image table, and publisher information table, in accordance with a preferred embodiment of the present invention.
Figure 5A is a diagram illustrating a preferred data structure for implementing a document index database, in accordance with a preferred embodiment of the present invention.
Figure 5B is a diagram illustrating a preferred data storage format for implementing an image/text database, in accordance with a preferred embodiment of the present invention.
Figure 6 is a block diagram illustrating the operation of software systems for implementing the session and query managers shown in Figure 4, in accordance with a preferred embodiment of the present invention.
Figure 6A is a state flow diagram showing the operation of a session manager software system, in accordance with a preferred embodiment of the present invention.
Figure 6B is a flow diagram showing the operation of a search engine software system, in accordance with a preferred embodiment of the present invention.
Figure 7A is a block diagram of a hidden Markov model suitable for parsing full sentences.
Figure 7B is a block diagram of a hidden Markov model for parsing sentence fragments, in accordance with a preferred embodiment of the present invention.
Figure 8A is a table of relevance normalization values for normalizing relevance scores output by a search engine, in accordance with a preferred embodiment of the present invention.
Figure 8B is a graph illustrating a system for normalizing relevance scores output by a search engine, in accordance with a preferred embodiment of the present invention.
Figure 9 is a block diagram representation of the data preparation component of the information retrieval system of Figure 3, in accordance with a preferred embodiment of the present invention.
Figure 9A is a block diagram representation of data flows within the data preparation component of Figure 9, in accordance with a preferred embodiment of the present invention.
Figure 10 is a block diagram representation of an automatic subjecting system for automatically determining the subject category of input documents, in accordance with a preferred embodiment of the present invention.
Figure 11 is a process flow representation of a method for generating subject lexicons for use in the automatic subjecting system of Figure 10, in accordance with a preferred embodiment of the present invention.
Figure 12 is a block diagram of a system for generating subject lexicons for use in the automatic subjecting system of Figure 10, in accordance with a preferred embodiment of the present invention.
Figure 13 is a representation of data structures within an accounting database, in accordance with a preferred embodiment of the present invention.
Detailed Description of the Invention
Referring now to Fig. 1, there is shown a simplified block diagram illustrating an information retrieval system 100, in accordance with a preferred embodiment of the present invention. The information retrieval system 100 includes a user station 102 for searching information files which have been collected from various publisher sources 112 and stored in data center 110. The user station 102 includes a personal computer (PC) 104 and user software 106 which resides on PC 104. User software 106 includes a graphical user interface (shown generally in Figures 4A, 4B and 4C). The user station 102 provides search queries by way of a communications channel 108 (such as, for example, a large volume public network or the Internet) coupled to the data center 110. The data center 110 includes session server 114 which includes means for receiving a search query from user station 102, means for sending the search query to a query server 116, means for receiving search results information from the query server 116, means for sending a search results list representative of the search results information across communications channel 108 to the user station 102, means for receiving a document retrieval request transmitted from user station 102 over communications channel 108 to session server 114, and means for retrieving a document from database 118 in response to the retrieval request and transmitting a file representative of the document to user station 102 over communications channel 108. The query server 116 at data center 110 includes means for receiving a search query from the session server 114, searching means for searching a document index database 117 (shown in Fig. 3) to identify documents responsive to the search query, and means for sending search results information representative of the documents identified by the searching means from the query server 116 to the session server 114.
Data center 110 also includes a library database 118 for storing text, image, audio or other multi-media information representative of files provided by a plurality of publishers 112. As explained more fully below, session server 114 retrieves (from library 118) documents identified by a search query and selected by a user of user station 102 for retrieval, and then transmits the selected documents to the user station 102 over channel 108.
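The division of work between the session server and the query server described above can be sketched as two cooperating objects. This is a simplified in-process illustration: the class and method names are hypothetical, the index is a plain dictionary, and the network transport, authentication and relevance ranking are omitted.

```python
class QueryServer:
    """Searches the document index and returns identifiers of matching documents."""

    def __init__(self, index):
        self.index = index  # term -> set of document identification numbers

    def search(self, query):
        doc_ids = set()
        for term in query.lower().split():
            doc_ids |= self.index.get(term, set())
        return sorted(doc_ids)


class SessionServer:
    """Faces the user: forwards queries, retrieves documents, and counts retrievals."""

    def __init__(self, query_server, library):
        self.query_server = query_server
        self.library = library  # document identification number -> stored file
        self.retrieval_counts = {}  # stand-in for records in the accounting database

    def handle_query(self, query):
        # The session server does not search; it delegates to the query server.
        return self.query_server.search(query)

    def handle_retrieval(self, doc_id):
        self.retrieval_counts[doc_id] = self.retrieval_counts.get(doc_id, 0) + 1
        return self.library[doc_id]


index = {"vietnam": {101, 102}, "soldiers": {101}}
library = {101: "Text of document 101", 102: "Text of document 102"}
session = SessionServer(QueryServer(index), library)

results = session.handle_query("Vietnam soldiers")
first = session.handle_retrieval(101)
again = session.handle_retrieval(101)
```

Keeping search on one server and retrieval plus accounting on the other lets each be scaled separately, which is the motivation the specification gives for the split.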
Referring now to Fig. 2, there is shown a simplified process flow diagram illustrating a user session 200 which may be performed with information retrieval system 100 shown in Figure 1, in accordance with a preferred embodiment of the present invention. In step 202 of user session 200, the user station 102 communicates to data center 110 (via channel 108) a description of the information that a user of user station 102 would like to identify at data center 110. More specifically, in step 202 a user of user station 102 sends a "natural language search query" to data center 110. As described more fully below in connection with Figure 4, the term "natural language search query" is used to refer to a question, sentence, sentence fragment, single word or term which describes (in natural language form) a particular topic or issue for which a user of user station 102 seeks to identify information. Based on the natural language query provided by user station 102, the query server 116 in data center 110 searches a document index database 117 (shown in Figs. 3 and 5A) coupled to the query server, and a list of files responsive to the search query is returned to user station 102, as shown in step 204. Next, in step 206, the user of user station 102 may select for retrieval one of the listed files identified by data center 110. In step 208, session server 114 in data center 110 retrieves the full text, image, audio or other multi-media information associated with a selected file from the library database 118. In step 210, the user of user station 102 is given the option of keeping or discarding portions of the retrieved file. As explained more fully below, the user of user station 102 may use some or all of any retrieved file to compose a document directed to the search query.
Detailed Overview of System Operation
Referring now to Fig. 3, there is shown a more detailed block diagram showing the overall operation of information retrieval system 100, in accordance with a preferred embodiment of the present invention. The information retrieval system shown in Figure 3 is a more detailed representation of the information retrieval system 100 shown in Figure 1. As shown in Figure 3, data center 110 includes image/text database 118 for storing document files representative of each of the publisher documents 112 received in data center 110. The image/text database 118 is coupled to session server 114 which, as explained more fully below, uses image/text database 118 to retrieve the full text/image of individual documents selected by user station 102 during a user session. An accounting database 119 is also coupled to session server 114. The accounting database 119 is used to store royalty and marketing information generated during operation of system 100. A document index database 117 is coupled to query server 116. As explained more fully below in connection with Figure 4, document index database 117 contains a list of search terms corresponding to potential search terms which may appear in a search query.
For each search term listed in the document index database 117, document index database 117 stores the document identification number corresponding to each document file (stored in image/text database 118) that includes that search term, along with location information corresponding to the location of the search term in each such document file. Although in the preferred embodiment of the present invention, database 118 stores document files representative of textual documents, still digital video images, motion digital video sequences, and digital audio sequences, it will be understood by those skilled in the art that any other type of multi-media object file, such as, for example, multi-media object files containing graphical images, may also be stored on database 118. Responsibility for executing the tasks associated with receiving queries from user station 102, processing and performing searches based on such queries, returning the results of the queries to user station 102, and retrieving and transmitting documents selected by a user to user station 102, is divided between session server 114 and query server 116, which are connected to each other and to channel 108 by Fiber Distributed Data Interface (FDDI) 124. For purposes of the present application, the term "server", when used in conjunction with "session" or "query", is used to refer to a physical machine formed from at least one computer processor having associated memory and software installed thereon for executing the functions to be performed by the server.
In the preferred embodiment of the present invention, the hardware platform used for implementing session server 114 consists of a Tandem® 4412 computer having 2 processors, 256 MB of memory, a 2 GB system disk, and a 2 GB RAID disk; a flow diagram illustrating the operation of a preferred software system 114a for implementing session server 114 on this hardware platform is shown in Figure 6 and discussed in connection with Figures 4 and 6A below. In the preferred embodiment of the present invention, the hardware platform used for implementing query server 116 consists of a Tandem® 4412 computer having 2 processors, 1 GB of memory, a 2 GB system disk, and an 8 GB RAID disk; a flow diagram illustrating the operation of a preferred software system 116a for implementing query server 116 on this hardware platform is shown in Figure 6 and discussed in connection with Figures 4 and 6B below. Although specific hardware is disclosed herein for implementing session server 114 and query server 116, it will be understood by those skilled in the art that other suitable hardware platforms may alternatively be used to implement servers 114, 116. However, the two Tandem® hardware systems described above for implementing servers 114, 116 are preferred because these systems allow servers 114, 116 to be hardware-scalable. This "hardware scalability" allows data center 110 to handle an increasing number of user stations 102 simply by adding further processors to the existing hardware used for servers 114, 116, without modification of the software running on such hardware. In addition to managing the search and flow of information during a user session, the session manager 114 attends to various accounting and subscriber services. The information required for performing the accounting and subscriber services is stored in a separate accounting database 119.
In performing these operations, session manager 114 functions in conjunction with enrollment means 120 and authentication means 122 which are also coupled to accounting database 119. As described more fully below in conjunction with Figure 4, enrollment means 120 functions to initiate new (first time) users into data center 110. Enrollment means 120 "enrolls" each new user into data center 110 by causing subscriber information representing the new user to be entered into accounting database 119. On subsequent accesses to data center 110 by a previously enrolled user, authentication means 122 determines that the user is already enrolled in the data center 110 by accessing records stored within the accounting database 119.
Data preparation within the document preparation/accounting means 126 of data center 110 includes means for document filtering, document loading and indexing of new documents received by the data center 110 from a plurality of document publishers 112. Documents received from a publisher 112 are in a publisher-specific format. In the filtering step of the document preparation process, a document received from a publisher 112 is converted to a universal publisher processing format. The universal publisher processing format is a single format used by data center 110 for storing and retrieving all documents received from various publishers 112.
Referring now to Figure 4, there is shown a process flow diagram illustrating the steps in an exemplary user session 300 which may be performed using the information retrieval system 100 shown in Figure 3, in accordance with a preferred embodiment of the present invention. Initially, in step 305, a user station 102 accesses the data center 110 by establishing a communications link over communications channel 108 between PC 104 and FDDI 124 within data center 110. As explained above, communications channel 108 may consist of a communications link formed over a public network such as the Internet. Alternatively, communications channel 108 may consist of a communications link formed between PC 104 and FDDI 124 over a commercial network. Thus, commercial networks such as, for example, the Prodigy® network, the CompuServe® network, or the Microsoft® network, may be used to establish a communications channel 108 for linking PC 104 and FDDI 124. Once this communications link is established, processing proceeds to step 310, where the authentication means 122 attempts to authenticate the user of user station 102. More particularly, based on information input by the user, authentication means 122 determines whether the user is authentic, i.e., whether the user is authorized to access data center 110.
If the user is authentic, but has not accessed data center 110 previously, processing proceeds to step 315, where enrollment means 120 optionally downloads a copy of software system 106 onto PC 104, and optionally executes software 106. In addition, in step 315 the enrollment means 120 requests the user to provide personal "enrollment information" about him/herself such as, for example, the user's full name, address, grade level, gender, occupation, the occupations of the user's parents, the numbers of brothers and/or sisters of the user, the type of computer used by the user, the user's modem speed, the display capabilities of the user's display, the size of the memory of the user's PC, and the identity of the communications link (e.g., the Internet, the Prodigy® network, the CompuServe® network, or the Microsoft® network) used for accessing the system. Each new user enrolled by enrollment means 120 is assigned a unique user identification number. The enrollment (or subscriber) information received by enrollment means 120 is stored on accounting database 119 together with the user identification number corresponding to the new user in the form of individual records corresponding to each enrolled user. In the embodiment shown in Figure 3, enrollment means 120 and authentication means 122 are implemented in software on a separate machine which is physically distinct from the processor(s) used for implementing session server 114. In an alternative embodiment (not shown), enrollment means 120 and authentication means 122 may be implemented in software as part of session server 114. In this alternative embodiment, enrollment means 120 and authentication means 122 may be implemented on the same processor(s) used for implementing session server 114.
Referring still to Figure 4, following steps 310, 315, processing proceeds to step 320, where software 106 on PC 104 waits for the user to input a search query. The search query represents a topic, issue or question about which the user desires to locate information. In the preferred embodiment, the search query is provided by the user in a natural language format. Thus, for example, if the user wants to locate relevant sources of information indicating how many soldiers were killed in Vietnam, the user might enter the query, "How many soldiers were killed in Vietnam." The query may be entered in the form of a complete sentence, or as a phrase or sentence fragment, such as, "Vietnam casualties". In an alternate embodiment, the query may be entered by the user in a non-natural language format wherein the user identifies one or more key words to be searched, whether the key words should be searched conjunctively or disjunctively, and, for key words to be searched conjunctively, the distance within which the key words must fall relative to one another to come within the scope of the search. Following the inputting of the search query, processing proceeds to step 325 where software system 106 spell checks the search query. The purpose of spell checking the search query prior to sending the query to data center 110 is to prevent the wasting of resources at data center 110 that would likely result if data center 110 attempted to process a search query with a spelling error. After the search query has been spell checked, processing proceeds to step 330, where the spell checked search query is transmitted by PC 104 over communications channel 108 to session server 114.
Upon receipt of the query by session server 114, processing proceeds to step 335, during which the search query is transmitted from the session server 114 to the query server 116 which processes the search query. More particularly, in this step, query server 116 scans its document index database 117 based on the user's search query in order to identify document identification numbers representing textual documents and multi-media files relevant to the search query.
The document index database 117 searched by query server 116 contains a list of search terms corresponding to potential search terms which may appear in or be related to words or terms in a search query. For each search term listed in the document index database 117, document index database 117 stores the document identification number corresponding to each document file (stored in database 118) that includes that search term, along with location information corresponding to the location of the search term in each such document file. Further details regarding the structure and operation of document index database 117 are shown in Figure 5B and discussed later in this specification.
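The index organization described above, in which each search term maps to document identification numbers and to the term's locations within each document, corresponds to what is commonly called a positional inverted index. The following is a minimal sketch under simplifying assumptions: whitespace tokenization, word offsets as the location information, and hypothetical document contents.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to {document id: [word positions of the term in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


# Hypothetical document files keyed by document identification number.
documents = {
    101: "soldiers killed in vietnam",
    102: "vietnam war memorial",
}
index = build_index(documents)
postings = index["vietnam"]
```

Looking up a term yields every document containing it together with the positions needed for later proximity-based scoring.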
The term "document file" is used in the present application to mean a file in database 118 which stores either a textual document or, alternatively, a multi-media record such as, for example, a still image, a sequence of motion digital video frames, or a sequence of digital audio frames. In contrast to files solely representing textual documents, each multi-media file in database 118 is stored along with a separate portion of text related to the multi-media record (referred to below as the "associated text field" or "image text-type information field" corresponding to each such multi-media file). This associated text field is used as the basis for generating document index information (for storage on document index database 117) corresponding to each multi-media file stored in database 118. More particularly, for each search term listed in the document index database 117, document index database 117 stores the document identification number corresponding to each multi-media file having an associated text field which includes that search term, along with location information corresponding to the location of the search term in the associated text field. In view of the fact that document index database 117 stores index information for both textual document files and multi-media files, the scan of document index database 117 by query server 116 in step 335 simultaneously identifies document identification numbers representing both textual documents and multi-media records relevant to the search query. In a preferred embodiment of the present invention, the associated text or image text-type information field associated with a multi-media record such as, for example, a still image, may consist of a caption, article or other text which a publisher 112 may have published in connection with the still image.
The identification numbers of the textual documents and multi-media files identified during the search are then used by the query server 116 to form a search results list which contains entries representing the document identification number of each textual document and multi-media file identified in the search. In the preferred embodiment, query server 116 performs a relevance ranking on each of the textual documents and multi-media files identified by the search by generating a relevance score corresponding to each of the entries on the search result list. This relevance scoring operation is performed by query server 116 based on, among other things, the term location information contained in index database 117. More particularly, the relevance score for each document file is based in part on the relative proximity within the document file of terms forming the search query, i.e., the closer various search terms are to one another in the document file, the higher the relevance score. Of course, for textual documents this proximity evaluation is performed based on the location of the search terms within the text document itself, whereas for multi-media files, this proximity evaluation is performed based on the location of the search terms within the associated text fields of such multi-media files. Once a relevance score has been determined for each document identification number returned by the search, the query server 116 then forms a relevance ordered search result list by ordering the document identification numbers (or entries) in the search result list in accordance with their respective relevance rankings such that an entry having a highest relevance ranking represents a first entry on the relevance ordered search result list. In the preferred embodiment, entries corresponding to textual documents and entries corresponding to multi-media files will typically be interspersed within the relevance ordered search result list.
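The proximity component of the relevance score can be illustrated with a small sketch. The scoring formula below (the sum, over pairs of query terms, of the reciprocal of the smallest distance between them) is an assumption chosen only to show the stated behavior that closer terms score higher; it is not the formula used by the preferred embodiment or by any commercial search engine.

```python
from itertools import combinations


def proximity_score(positions_by_term):
    """Score a document higher when its matched query terms occur closer together."""
    present = {term: pos for term, pos in positions_by_term.items() if pos}
    score = 0.0
    for a, b in combinations(sorted(present), 2):
        # Distinct terms cannot share a word position, so the gap is at least 1.
        gap = min(abs(i - j) for i in present[a] for j in present[b])
        score += 1.0 / gap
    return score


def rank(hits):
    """Order document ids so that the highest-scoring entry comes first."""
    scored = [(doc_id, proximity_score(positions)) for doc_id, positions in hits.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical term positions, e.g. taken from a positional index or an associated text field.
hits = {
    101: {"soldiers": [0], "vietnam": [3]},
    102: {"soldiers": [5], "vietnam": [6]},
}
ordered_ids = [doc_id for doc_id, _score in rank(hits)]
```

Because the positions for a multi-media file come from its associated text field, the same function can score textual documents and multi-media files and interleave them in one ordered list.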
The document index searching and relevance scoring operations performed by query server 116 are preferably implemented in part using commercially available searching software such as the Conquest™ search engine program marketed by Excalibur™ Technologies.
In step 340, following the generation of the relevance ordered search result list by query server 116, the relevance ordered search result list is transmitted from query server 116 to session server 114. In the event the search query initially provided by the user in step 320 was a recursive search query (i.e., the search query was to be performed only on the results of a prior query), a logical AND operation is performed between the document identification numbers on the current search results list and the document identification numbers on the previous search results list to form a new current relevance ordered search results list which session server 114 then transmits to user station 102 via channel 108; otherwise, if the query initially provided in step 320 was not a recursive search query, session server 114 simply transmits the relevance ordered search results list from query server 116 directly to user station 102. Following receipt of the relevance ordered search results list by user station 102, session server 114 retrieves bibliographical information (from database 118) corresponding to the document identification numbers in the search results list, and this bibliographical information is then transmitted to user station 102 over channel 108. For each document associated with a document identification number on the search results list, the bibliographical information retrieved and transmitted in this step includes, for example, the document title, author, publisher name, publication date, publication name and edition. A preferred data structure used for retrieving this bibliographical information from database 118 is discussed below in conjunction with Figure 5.
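The logical AND used for a recursive search query amounts to intersecting two lists of document identification numbers. The Python sketch below is one way to express it, assuming each results list is a sequence of (document identification number, relevance score) pairs already in relevance order; the function name `refine` and the sample values are hypothetical.

```python
def refine(current_results, previous_results):
    """Keep only current entries whose document ID was also in the prior list.

    Both arguments are lists of (doc_id, relevance_score) pairs. The order
    and scores of the current (most recent) query are preserved.
    """
    previous_ids = {doc_id for doc_id, _ in previous_results}
    return [(doc_id, score) for doc_id, score in current_results
            if doc_id in previous_ids]

previous = [(2001, 3.0), (1001, 2.25), (3001, 1.0)]
current = [(4001, 2.8), (1001, 2.5), (2001, 1.2)]
new_current = refine(current, previous)  # [(1001, 2.5), (2001, 1.2)]
```

Keeping the scores of the current query is a design assumption; the specification states only that the AND operation yields a new relevance ordered list.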
Upon receipt of this bibliographical information, user PC 104 together with software 106 display the search results list (or a portion thereof), together with (i) the relevance scores corresponding to each document identification number on the search results list, (ii) bibliographical information corresponding to each document identification number on the search results list, and (iii) an indication whether the document file corresponds to a textual document or a multi-media file such as, for example, a file representing a still image. This information is displayed in a display window on PC 104 in relevance ranked order such that the document file(s) with a highest relevance score is/are displayed at the top of the list. Figure 4A shows an exemplary search results list displayed in an open window 341 on PC 104. As seen in Figure 4A, the image displayed in window 341 includes relevance scores 342, bibliographical information 343, readability information 343a, size information 343b, and a file type indicator 344 (for indicating whether the document file corresponds to a textual document or a multi-media file). Relevance scores 342 are listed in descending order from highest to lowest, and files 345 representing textual documents and files 346 representing multi-media documents are interspersed in the list displayed in window 341. Once the document file information described immediately above is displayed in window 341, the user may either select one of the documents listed in window 341 for retrieval, or may instead perform a recursive search or a new search. If the user desires to perform a recursive search, processing proceeds to step 345, after which the process is repeated from step 325.
However, as described above, during this recursive searching operation a logical AND operation is performed between the document identification numbers on the current search results list and the document identification numbers on the previous search results list to form a new current search results list in step 340. If the user desires to perform a new (non-recursive) search, processing proceeds to step 325 and the process is again repeated. Alternatively, if, after step 340, the user wishes to retrieve a document file for display on PC 104, processing proceeds to step 350 where the user selects one of the documents in the search list for display, preferably by "clicking" with a computer mouse on the portion of window 341 wherein information corresponding to the selected document file is displayed. Following the selection of a document file by the user, PC 104 transmits a signal representative of the selected document file to session server 114 over channel 108. Upon receipt of this signal in step 355, session server 114 retrieves from database 118 and transmits to PC 104 the text (if the selected document file corresponds to a textual document), image bitmap (if the selected document file corresponds to a still image), sequence of video frames (if the selected document file corresponds to a motion video sequence) or sequence of digital audio frames (if the selected document file corresponds to a digital audio sequence) associated with the selected document file. In step 360, upon receipt of such text or multi-media information from session server 114, PC 104 together with software 106 display the text or multi-media information retrieved by session server 114 in a second display window open on PC 104. Following step 360, the user may repeat the process from step 350 to retrieve and display further text or multi-media information from other selected document files in further display windows open on PC 104.
Figure 4B is an exemplary drawing showing first and second windows 347, 348 which respectively display text information corresponding to a text document file 345 and video image information corresponding to multi-media file 346. As seen in Figure 4B, the information displayed in windows 347 and 348 may be displayed simultaneously with the search results list shown in window 341. In addition, software 106 is adapted to permit the user to also simultaneously open a document composition window 348a (shown in Figure 4C) on PC 104. Using the cut, copy and paste buttons on the graphical user interface shown in Figure 4C, the user may cut/copy/paste both text and video information from windows 347, 348 (or from further display windows not shown but which represent other selected document files) into the document composition window 348a in order to form a composite document that pertains to the search query input by the user in step 320. The composite document may, for example, include text information from one or more of the document files selected by the user in step 350, as well as multi-media information such as, for example, still images. Using the save and print buttons on the graphical user interface shown in Figure 4C, the user may selectively save (on PC 104) and/or print text or image information from windows 347, 348 or 348a.
Data Storage Formats
Referring now to Figures 5 and 5B, during operation of data center 110, session server 114 retrieves various document/image information from database 118. In the preferred embodiment of the present invention, a document information directory table is stored on database 118. For each textual document and multi-media file on database 118, the document information directory table includes an individual file header record 400 formed of approximately 25 separate information fields. Record 400 includes a document identification number field 401 for storing a unique document identification number that is associated with each textual document or multi-media file at the time the textual document or multi-media file is prepared for input into data center 110. A discussion of a preferred system for preparing textual documents and multi-media files for input into data center 110 is set forth later in this specification. Record 400 also includes a publisher identification number field 402 for storing a unique publisher identification number representing the publisher of the textual document or multi-media file associated with record 400. As discussed more fully below, the publisher identification number stored in field 402 corresponds to an entry point into a publisher information table which includes a plurality of publisher information records 440. For each publisher having a textual document and/or multi-media file stored on database 118, the publisher information table includes an individual record 440 formed of approximately 10 separate information fields for storing information about the publisher associated with the document or file.
Record 400 also includes several fields for storing bibliographical information about the textual document or multi-media file associated with record 400. The bibliographical information fields include a title field 403 for storing the title of the textual document or multi-media file associated with record 400, an author last name field 404 and an author first name field 405 for storing the name of the author of the textual document or multi-media file associated with record 400, a publisher name field 422 for storing the name of the publisher of the textual document or multi-media file associated with record 400, a publication date field 423 for storing the publication date of the textual document or multi-media file associated with record 400, a publication name field 424 for storing a publication name in which the textual document or multi-media file associated with record 400 appeared, and an edition field 425 for storing a number representing the edition of the publication in which the textual document or multi-media file associated with record 400 appeared. As described above in connection with Figure 4, session server 114 retrieves (from database 118) bibliographical information corresponding to the fields set forth immediately above in step 340 (of user session 300) for each document identification number included on the search results list transmitted from query server 116, and then transmits such information to PC 104 where it is displayed as bibliographical information 343.
Record 400 also includes several fields for storing copyright information about the textual document or multi-media file associated with record 400. The copyright information fields include a copyright date field 406 for storing a copyright date assigned to the textual document or multi-media file associated with record 400, a copyright holder field 407 for storing the copyright owner of the textual document or multi-media file associated with record 400, and a copyright message field 408 for storing a copyright message for the textual document or multi-media file associated with record 400. In a preferred embodiment of the present invention, when a user selects a document for retrieval in step 355 (of user session 300), the information stored in the copyright message field 408 associated with the selected document is retrieved (from database 118) by session server 114 and transmitted to PC 104. In this embodiment, the copyright message stored in field 408 is then displayed on user PC 104 (in step 360) together with the text, image bitmap, or sequence of video frames associated with the selected document.
Record 400 also includes a source type field 409 for storing information classifying the textual document or multi-media file associated with record 400 into one of a plurality of "source types." In a preferred embodiment of the present invention, the source types reflect not only the source from which the document may have originated, but also whether the document is a text file or, alternatively, a multi-media file. A list of exemplary source types is shown generally in box 349 of Figure 4A, and includes newspapers, maps, pictures (e.g., still or motion video images), magazines, books and TV/radio (transcripts). The system 126 (discussed more fully below) for preparing textual documents and multi-media files for input into the data center 110 assigns a source type to each textual document or multi-media file at the time it is input into data center 110. The source types shown in box 349 may be selected by the user in order to focus a query search on a subset of documents in database 118 associated with one or more source types selected by the user. Record 400 also includes a document readability field 410 for storing a readability score (or reading grade level) for the textual document or multi-media file associated with record 400. In a preferred embodiment, this readability score is determined (by system 126) for each document to be stored in database 118 using a standard readability algorithm such as, for example, the Flesch-Kincaid algorithm. In a preferred embodiment of step 340 of user session 300 (shown in Figure 4), session server 114 retrieves (from database 118) a readability score (from field 410) for each document identification number included on the search results list transmitted from query server 116, and then transmits this readability information (along with bibliographical information for each such document identification number) to PC 104 where it is displayed as readability information 343a.
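For reference, the standard Flesch-Kincaid grade level formula can be written as a short Python function. The specification names the algorithm only as an example and does not say how system 126 counts words, sentences or syllables, so this sketch takes the three counts as inputs; the function name `flesch_kincaid_grade` is an assumption.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid reading grade level from pre-computed counts.

    grade = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Hypothetical counts for one document: 100 words, 5 sentences, 150 syllables.
grade = flesch_kincaid_grade(100, 5, 150)
```

A value computed this way could be stored in readability field 410 at the time the document is prepared for input.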
Record 400 further includes a subject information field 411 for storing information classifying the textual document or multi-media file associated with record 400 into one or more of a plurality of "subjects." In a preferred embodiment of the present invention, the possible subjects into which a document may be classified include, for example, sports, politics, literature, etc. The system 126 (discussed more fully below) for preparing textual documents and multi-media files for input into the data center 110 assigns one or more subject types to each textual document or multi-media file at the time it is input into data center 110. In addition, for each subject to which the document is assigned, system 126 assigns a subject confidence score indicating the degree to which the document relates to the subject. The possible document subjects may be selected by the user (by clicking on the "use subjects" box in window 341 to call up subject selection window 349a, and then selecting one or more subjects from subject selection window 349a) in order to focus a query search on one or more selected subjects. In an alternative preferred embodiment of the present invention, search engine 142 may be adapted to use the search subjects selected by the user, together with the subject confidence scores associated with documents identified during a query search, to calculate relevance scores for the documents identified during the query search.
Record 400 also includes a document type field 412 for storing a "document type" assigned to the textual document or multi-media file associated with record 400. In a preferred embodiment of the present invention, the document type field 412 reflects whether the document is a text file, a still image, a sequence of motion video images, or a sequence of digital audio images. The system 126 (discussed more fully below) for preparing textual documents and multi-media files for input into the data center 110 assigns a document type to each textual document or multi-media file at the time it is input into data center 110. In a preferred embodiment of step 340 of user session 300 (shown in Figure 4), session server 114 retrieves (from database 118) a document type (from field 412) for each document identification number included on the search results list transmitted from query server 116, and then transmits this document type information (along with bibliographical information for each such document identification number) to PC 104 where it is displayed in the form of a file type indicator 344.
Record 400 also includes a last retrieval date field 413 for storing the date on which the textual document or multi-media file associated with record 400 was last retrieved for a user by document retrieval system 136 (discussed below in conjunction with Figure 6). The information stored in the last retrieval date field is updated each time document retrieval system 136 retrieves the textual document or multi-media file associated with record 400 in response to a user request.
Record 400 further includes three fields for storing information about when, and how often, the textual document or multi-media file associated with record 400 has been installed on database 118. In particular, an original install date field 414 is provided for storing the date the textual document or multi-media file associated with record 400 was first stored on database 118, a last install date field 415 for storing the most recent date on which the textual document or multi-media file associated with record 400 was installed on database 118, and an install count field 416 for storing the number of times the textual document or multi-media file associated with record 400 has been installed on database 118. Thus, if the textual document or multi-media file associated with record 400 had been installed on database 118 on three separate occasions, field 414 would store the first installation date, field 415 would store the last installation date, and field 416 would store an installation count of three. System 126 for preparing textual documents and multi-media files for input into the data center 110 updates fields 414, 415 and 416 for each textual document or multi-media file at the time it is input into data center 110 and loaded onto database 118.
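The update rule for the three installation fields can be sketched in Python as follows. The class name `InstallInfo`, the method name `record_install` and the sample dates are hypothetical; only the field numbers and their meanings come from the description above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class InstallInfo:
    original_install_date: Optional[date] = None  # field 414
    last_install_date: Optional[date] = None      # field 415
    install_count: int = 0                        # field 416

    def record_install(self, when: date) -> None:
        """Update all three fields when the file is loaded onto the database."""
        if self.original_install_date is None:
            self.original_install_date = when  # set once, on first install
        self.last_install_date = when
        self.install_count += 1

# Three installations on hypothetical dates.
info = InstallInfo()
for when in (date(1995, 3, 1), date(1995, 6, 15), date(1995, 9, 14)):
    info.record_install(when)
```

After the loop, the original install date is the first date, the last install date is the third date, and the count is three, matching the example in the text.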
Record 400 further includes a filename field 417 for storing a filename in database 118 wherein either (i) the full body of the textual document represented by record 400 (if record 400 represents a textual document), or (ii) a separate body of text associated with the digital video or audio file represented by record 400 (if record 400 represents a multi-media file) is stored. For records 400 representing textual documents, the full bodies of the textual documents are stored in the text-type information fields 470 (shown in Figure 5B). For records 400 representing multi-media files, the separate bodies of text associated with such records are stored in image text-type information fields 460 (shown in Figure 5B). Offset field 418 stores a location pointer indicating the starting location of either the text-type information field or the image text-type information field corresponding to the filename stored in field 417. Thus, if record 400 represents a textual document, offset field 418 points to the beginning of a text-type information field 470 wherein the full body of the textual document associated with record 400 is stored; alternatively, if record 400 represents a multi-media file, offset field 418 points to the beginning of an image text-type information field 460 wherein the associated text field corresponding to the record 400 is stored.
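The filename and offset pointers can be illustrated with a small Python sketch in which the storage files are modeled as in-memory byte strings. The description above defines only the starting location of a field; using a size value to find its end is an assumption of this sketch, as are the function name `read_field`, the file name and the sample contents.

```python
def read_field(storage, filename, offset, size):
    """Return `size` bytes starting at `offset` within the named file.

    storage: dict mapping filename -> file contents (bytes), standing in
    for the files held on the database.
    """
    data = storage[filename]
    return data[offset:offset + size]

# Hypothetical storage file holding two text-type information fields back to back.
storage = {"text_0001.dat": b"First article body.Second article body."}

# Hypothetical pointer values for the second document.
record = {"filename": "text_0001.dat", "offset": 19, "size": 20}

body = read_field(storage, record["filename"], record["offset"], record["size"])
```

The same lookup applies whether the pointers lead to the full body of a textual document or to the associated text of a multi-media file.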
Record 400 also includes a document size field 419 for storing a file size (in bytes) of the document associated with record 400. In a preferred embodiment of step 340 of user session 300 (shown in Figure 4), session server 114 retrieves (from database 118) a size (from field 419) for each document identification number included on the search results list transmitted from query server 116 corresponding to a textual document, and then transmits this size information (along with bibliographical information for each such document identification number) to PC 104 where it is displayed as size information 343b. Finally, record 400 includes Dewey decimal call number field 420 and Library of Congress call number field 421, for respectively storing any Dewey decimal or Library of Congress call numbers corresponding to the textual document or multi-media file represented by record 400.
Referring still to Figure 5, there is shown a dependent image record 430 used for forming a dependent image table for storing multi-media records on database 118, in accordance with a preferred embodiment of the present invention. The dependent image table includes a separate dependent image record 430 for each record 400 in the document information directory table which represents a multi-media file. Thus, each record 400 having a document type field 412 corresponding to either a still image, a sequence of motion video images, or a sequence of digital audio images, has a corresponding dependent image record 430 stored in the dependent image table. Record 430 includes a document identification number field 431 for storing the document identification number corresponding to the multi-media file represented by the record 430. For a given multi-media file stored in database 118, the same document identification number used in field 401 of record 400 is repeated in field 431 of record 430.
Thus, once session server 114 has a document identification number and type information (from document type field 412) indicating that the document identification number corresponds to a multi-media file, session server 114 can access further information about the multi-media file simply by using the document identification number from record 400 as an entry pointer to access a further record 430 in the dependent image table.
Each record 430 includes a filename field 432 for pointing to a file in database 118 wherein only digital video or audio information represented by record 430 is stored. For each record 430, this digital video or digital audio information is stored in an image-type field 490 (shown in Figure 5B) on database 118. Offset field 433 stores a location pointer indicating the starting location of the image-type field 490 in the file identified by field 432. Record 430 also includes a document size field 434 for storing a file size (in bytes) of the multi-media file associated with record 430. In a preferred embodiment of step 340 of user session 300 (shown in Figure 4), session server 114 retrieves (from database 118) a size (from field 434) for each document identification number included on the search results list transmitted from query server 116 which corresponds to a multi-media file, and then transmits this size information (along with bibliographical information for each such document identification number) to PC 104 where it is displayed as size information 343b.
Referring still to Figure 5, as mentioned above, the publisher identification number stored in field 402 of each record 400 corresponds to an entry point into a publisher information table which includes a plurality of publisher information records 440. Each publisher information record 440 includes several information fields for storing information about a publisher having textual documents or multi-media files stored on database 118. In particular, each publisher information record 440 includes a publisher identification number field 441 for storing a unique identification number associated with the publisher represented by the record 440, fields 442-448 for storing name and address information associated with the publisher represented by the record 440, and fields 449-451 for storing information identifying a contact person at the publisher represented by the record 440.
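The relationship among the three tables can be sketched in Python with dictionaries keyed by the shared identification numbers. Only the keying scheme (field 401 repeated in field 431, and field 402 matching field 441) comes from the description; the dictionary layout, the function name `describe`, the document type labels and all sample values are assumptions.

```python
# Records 400, keyed by document identification number (field 401).
directory_table = {
    1001: {"publisher_id": 7, "title": "Comet sighted", "document_type": "text"},
    2001: {"publisher_id": 7, "title": "Comet over the city",
           "document_type": "still image"},
}

# Records 430, keyed by the same document identification number (field 431).
dependent_image_table = {
    2001: {"filename": "images_0001.dat", "offset": 4096, "size": 51200},
}

# Records 440, keyed by publisher identification number (field 441).
publisher_table = {7: {"name": "Example Press"}}

MULTIMEDIA_TYPES = {"still image", "motion video", "digital audio"}

def describe(doc_id):
    """Collect header, publisher and (for multi-media files) media pointers."""
    header = directory_table[doc_id]
    info = {
        "title": header["title"],
        "publisher": publisher_table[header["publisher_id"]]["name"],
    }
    if header["document_type"] in MULTIMEDIA_TYPES:
        # The document ID doubles as the entry pointer into the image table.
        info["media"] = dependent_image_table[doc_id]
    return info
```

Calling `describe(2001)` follows both pointers, while `describe(1001)` never touches the dependent image table because the document type marks it as text.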
Referring now to Figure 5A, there is shown a diagram illustrating a preferred data structure 460 for implementing document index database 117, in accordance with a preferred embodiment of the present invention. Data structure 460 contains a list of search terms 461 corresponding to potential search terms which may appear in a search query. For each search term 461 listed in data structure 460, data structure 460 stores a document identification number 462 representing each document file (stored in database 118) that includes that search term, along with location information 463 corresponding to the location(s) of the search term in each such document file. More particularly, for a given search term 461 and a given document identification number 462, location information 463 represents the location of the search term 461 in either the text-type information field 470 or the image text-type information field 460 pointed to by the offset field 418 associated with the document identification number 462. Thus, if document identification number 462 represents a textual document, location information 463 will represent the location of the search term 461 in the actual body of the text document; alternatively, if the document identification number 462 represents a multi-media file, location information 463 will represent the location of the search term 461 in the associated text field corresponding to the multi-media file.
Session/Query Server Architecture
Referring now to Figure 6, there is shown a block diagram illustrating the operation of a session server software system 114a and a query server software system 116a for respectively implementing the session and query servers 114, 116 shown in Figure 4, in accordance with a preferred embodiment of the present invention. Although the operation of data center 110 as described above involves only a single user station 102 and a single user session 300, in the preferred embodiment of the present invention session and query servers 114, 116 in data center 110 are adapted to simultaneously receive and process different search queries from multiple user stations 102. In accordance with this multi-user aspect of data center 110, session server software 114a includes a plurality of session managers 130, each of which is responsible for monitoring and directing a single active user session 300 corresponding to an individual user station 102. For each active user session 300, the corresponding session manager (i) sends queries received from the user station 102 to a query engine interface 134, (ii) retrieves a search results list (described above) from the query engine interface 134, (iii) retrieves bibliographical information corresponding to the documents identified in the search results list from database 118a and transmits such information to user station 102, (iv) retrieves text and multi-media files identified by user station 102 from database 118b using document retrieval system 136 and transmits such files to user station 102, (v) transmits information representing each document retrieved from database 118b and user identification information corresponding to the user station 102 that requested retrieval of that document to accounting manager 138, and (vi) updates a query log of all search queries handled through the session manager 130. An active session manager component 132 is coupled to the session managers 130, and maintains a list of all active user sessions 300 by tracking the user identification number corresponding to each user station 102 engaged in an active session 300.
The document retrieval system 136 accessed by session managers 130 includes separate retrieval systems (not shown) for retrieving text and multi-media files (such as images) from database 118b. For each document file stored in database 118, records 400 (and 430, if applicable) corresponding to the document file and representing information such as, for example, a document identification number, bibliographical data and publisher data are stored in database 118a together with one or more pointers (417, 418, 432, 433) corresponding to the location of the full text/image corresponding to the document file in database 118b. During retrieval of a textual document or multi-media file from database 118, document retrieval system 136 initially retrieves pointer information corresponding to the full text/image of the file from the Document Information Directory Table and Dependent Image Table on database 118a, based on the document identification number of the document file to be retrieved. Based on this pointer information, document retrieval system 136 then retrieves the full text/image corresponding to the file from database 118b.
Query server software 116a includes a plurality of search engines 142, each of which is responsible for (i) executing a search query passed to the search engine 142 by session server 114a, and (ii) returning a relevance ordered search results list corresponding to the query to session server 114a. The search engines 142 are each coupled to document index database 117, and schedulers 144 monitor and queue the searches performed by search engines 142 on document index database 117. Parameters representing search queries and search results are transmitted between a search engine 142 and session server 114a using a search engine application program interface (API) 140.
Operation of Session Manager
Referring now to Figure 6A, there is shown a state flow diagram illustrating the operation of session manager software system 130, in accordance with a preferred embodiment of the present invention. Session manager software system 130 maintains the state of each ongoing user session in system 100. Entry into session manager software system 130 begins with initial/profile state 130a, which is invoked when a user login request is passed to data center 110 through communications channel 108. In the initial/profile state 130a, session manager software system 130 functions with enrollment means 120 and authentication means 122 to verify user authenticity and/or enroll new users into data center 110. The tasks performed by enrollment means 120 and authentication means 122 in state 130a correspond substantially to the enrollment and authentication functions described above in connection with steps 310 and 315 of user session 300. Once session manager software 130 authenticates a user and verifies that there are no other user sessions corresponding to the user's identification number which are active on data center 110, session manager software system 130 enters a wait for user action state 130b, which waits for a user station 102 to request services from session manager 130. As each user request is received, session manager software system 130 transitions the session server 114 into an appropriate state, and then returns back to the wait for user action state 130b after the request has been serviced. In the preferred embodiment, entry into the document retrieval state 130d and the recursive query state 130e may occur only after query server 116 has processed a search query for the user station 102, and returned a valid search results list for the search query to session server 114.
When the wait for user action state 130b is initially entered following a transition from initial/profile state 130a, the only valid transition out of the wait for user action state 130b is to query state 130c. The query state 130c is responsible for collecting search parameters input to the user station 102, sending the search parameters to the query server 116, retrieving a search results list from the query server 116, and then passing search results back to the user station 102. In the preferred embodiment, the search parameters collected from the user station 102 and forwarded to query server 116 include, at a minimum, a query string representing a natural language query entered by the user. The search parameters may also include one or more document source selections which the user of user station 102 may optionally make using box 349 (of Figure 4A). Such document source selection(s) permit the user of user station 102 to focus the user's search only on document records having source type fields 409 corresponding to one or more selected document sources. The search parameters may also include one or more document subject selections which the user of user station 102 may optionally make using box 349a (of Figure 4A). Such document subject selection(s) permit the user of user station 102 to focus the user's search only on document records having subject information fields 411 corresponding to one or more selected subjects. Upon successful completion of the search, the query state 130c is responsible for receiving a relevance ordered search result list from the query server 116, and then passing a relevance score 342, bibliographical information 343, readability information 343a, size information 343b and a file type indicator 344 corresponding to each document identification number on the search result list to user station 102 for display in window 341.
Query state 130c is responsible for forming a "current search results list" by saving the search results list generated by query engine 116. The "current search results list" may be used in recursive query state 130e to allow the user of user station 102 to perform recursive searching. Upon completion of the above tasks, query state 130c returns control to the wait for user action state 130b.
In the event the user of user station 102 selects a document (from window 341) for retrieval, session manager software 130 enters document retrieval state 130d. In this state, session server 114 receives a signal from user station 102 representing the document identification number of a particular document from window 341 that the user wishes to retrieve into a window 347, 348. State 130d is responsible for retrieving (using retrieval system 136) the full text, digital video or digital audio information corresponding to the selected document identification number from database 118, and then transmitting this document information back to user station 102. Once the appropriate document information is passed to user station 102, control is passed back to the wait for user action state 130b.
If, after receiving the results of a first query, the user of user station 102 indicates a desire to perform a recursive query search, session manager software system 130 enters recursive query state 130e. The recursive query state 130e is responsible for collecting a further set of search parameters input by the user of user station 102, sending these search parameters to the query server 116, and retrieving a search results list from the query server 116. A logical AND operation is then performed on the document identification numbers in this search results list and the document identification numbers in the search results list previously saved (either during state 130c or 130e) in order to form a new current search results list which is then passed back to the user station 102. Once the new current search result list has been generated and saved, the recursive query state 130e is responsible for passing a relevance score 342, bibliographical information 343, readability information 343a, size information 343b and a file type indicator 344 corresponding to each document identification number on the new current search result list to user station 102 for display in window 341. Upon completion of the above tasks, recursive query state 130e returns control to the wait for user action state 130b.
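The logical AND performed in recursive query state 130e can be illustrated with a minimal Python sketch. The function name refine_results and the sample document identification numbers are hypothetical, and the sketch assumes that surviving documents keep the relevance score and ordering of the newer list, which the description does not specify.

```python
def refine_results(current, new):
    """Keep only documents that appear in both search results lists.

    current, new: lists of (doc_id, relevance) pairs in relevance order.
    Assumption: survivors keep the score and order of the newer list.
    """
    saved_ids = {doc_id for doc_id, _ in current}
    return [(doc_id, score) for doc_id, score in new if doc_id in saved_ids]


# Hypothetical results of a first query and of a follow-up recursive query.
first = [(101, 92), (205, 88), (317, 75)]
second = [(317, 90), (440, 85), (101, 60)]

# The intersection becomes the new "current search results list".
current = refine_results(first, second)
print(current)
```

Because the result is saved as the new current list, a further recursive query would be intersected with it in the same way.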
When the user of user station 102 desires to terminate a user session, session manager software 130 enters an exit state 130f, wherein session manager software 130 causes the active session manager component 132 to remove the user identification number corresponding to the user of user station 102 from the list of active user sessions.
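The state flow described above can be sketched as a small state machine. This is only an illustrative Python sketch: the class and method names are hypothetical, and it reads the text literally in allowing only the query state 130c until a first search results list exists (the description does not say whether exit state 130f is reachable before a first query).

```python
from enum import Enum


class State(Enum):
    INITIAL_PROFILE = "130a"
    WAIT_FOR_USER_ACTION = "130b"
    QUERY = "130c"
    DOCUMENT_RETRIEVAL = "130d"
    RECURSIVE_QUERY = "130e"
    EXIT = "130f"


class Session:
    def __init__(self):
        self.state = State.INITIAL_PROFILE
        self.has_results = False  # True once a valid results list exists

    def login_ok(self):
        # Authentication succeeded and no other session is active.
        self.state = State.WAIT_FOR_USER_ACTION

    def allowed(self):
        """Requests that may be serviced from the wait state."""
        if self.state is not State.WAIT_FOR_USER_ACTION:
            return set()
        if not self.has_results:
            return {State.QUERY}  # literal reading of the text (assumption)
        return {State.QUERY, State.DOCUMENT_RETRIEVAL,
                State.RECURSIVE_QUERY, State.EXIT}

    def handle(self, request):
        if request not in self.allowed():
            raise ValueError(f"{request.name} is not valid in this session")
        if request is State.EXIT:
            self.state = State.EXIT
            return
        if request in (State.QUERY, State.RECURSIVE_QUERY):
            self.has_results = True
        # After servicing the request, control returns to the wait state.
        self.state = State.WAIT_FOR_USER_ACTION
```

A session would call login_ok() once and then handle() for each user request, with every serviced request ending back in the wait state.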
Operation of Search Engine
Referring now to Fig. 6B, there is shown a flow diagram illustrating the operation of a software system for implementing search engine 142, in accordance with a preferred embodiment of the present invention. When a query is applied to the query server 116, all the individual terms in the query have an equal baseline weight. Thus, all terms in the query are equally likely to contribute to the selection of a document by the query server 116. However, in order to improve the search results obtained by query server 116, the weights of the terms in the query can be adjusted relative to each other in accordance with the query term weight adjustment method shown in Figure 6B.
In step 142a, the query fields, which include a string of terms corresponding to a natural language search, are received from the search engine API 140 by the search engine 142. In step 142b, the part of speech of each term in the natural language query is determined. The method for determining the parts of speech of the natural language query terms in step 142b includes the use of an improved Hidden Markov Model discussed below. An adjustment of the weight of each term in the query relative to the weight of the other terms of the query is determined. More particularly, in step 142c, a determination is made whether a particular term in the natural language query (or query word) is a proper noun. If the query term is not a proper noun, as determined in step 142c, a determination is made in step 142f whether the query term is a slow word. Slow words are words that have been determined, through experience, to usually be of little help in improving the results of searches. However, in some cases they are useful in improving results. Therefore, in the preferred embodiment of the invention, the weight of slow words is reduced, but the slow words are not eliminated from the query. The weight of slow words can be adjusted, for example, to about half of the baseline value. Some examples of slow words are best, better, say, really and now. If the query term is not a slow word or proper noun, it is applied to a semantic network in step 142d to determine expansion words corresponding to the query term. For purposes of the present invention, the term "semantic network" means a database in which each term carries with it semantic information such as, for example, part of speech and synonym information. For a given query word, only those expansion words from the semantic network which are of the same part of speech as the query term are added to the terms in the natural language query in step 142f.
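The weight adjustment and expansion steps can be sketched in Python as follows. This is a hedged illustration, not the Conquest™ interface: the dictionary SEMANTIC_NETWORK is a hypothetical stand-in for the semantic network, the proper noun factor of about one and a half times the baseline is the example value given for step 142g, and giving expansion words the baseline weight is an assumption.

```python
BASELINE = 1.0
SLOW_WORDS = {"best", "better", "say", "really", "now"}

# Hypothetical stand-in for the semantic network:
# term -> list of (expansion word, part of speech).
SEMANTIC_NETWORK = {
    "car": [("automobile", "noun"), ("drive", "verb")],
}


def weight_and_expand(tagged_terms):
    """tagged_terms: list of (term, part_of_speech). Returns term -> weight."""
    weights = {}
    for term, pos in tagged_terms:
        if pos == "proper_noun":
            # Proper nouns are boosted and bypass the semantic network.
            weights[term] = BASELINE * 1.5
        elif term.lower() in SLOW_WORDS:
            # Slow words are kept in the query at a reduced weight.
            weights[term] = BASELINE * 0.5
        else:
            weights[term] = BASELINE
            for expansion, exp_pos in SEMANTIC_NETWORK.get(term.lower(), []):
                if exp_pos == pos:  # same part of speech only
                    # Assumption: expansion words get the baseline weight.
                    weights.setdefault(expansion, BASELINE)
    return weights


query = [("best", "adjective"), ("car", "noun"), ("Bush", "proper_noun")]
print(weight_and_expand(query))
```

In this toy query, "drive" is not added because it is a verb while the query term "car" is a noun, and "Bush" is never looked up in the network.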
In step 142j, the weight adjusted terms from the original natural language query, together with any additional search words obtained by the semantic network expansion, are applied to index database 117 in order to (i) identify documents (stored in database 118) that are relevant to the query search, and (ii) determine a relevance score for each document identified. In the preferred embodiment, step 142j is implemented using the Conquest™ Search engine marketed by Excalibur Technologies (hereinafter, the Conquest™ searching software).
If a query word is a proper noun, as determined in step 142c, its weight is increased in step 142g, and this increased searching weight is then used by the Conquest™ searching software in calculating relevance scores for documents identified by the search engine. The weight of proper nouns may be adjusted, for example, to about one and a half times the baseline value. Additionally, proper nouns are not applied to the semantic network for expansion. It has been determined that by-passing the semantic network in this manner for proper nouns decreases the amount of noise in the results obtained by the database search. For example, if the user submits a query search that includes the name "George Bush," a semantic network that was used to expand the term "Bush" might return, for example, words related to shrubbery because the words "bush" and "shrubbery" are synonyms. Since all search hits generated as a result of the words related to shrubbery would constitute unnecessary noise in a search directed at George Bush, the present invention avoids the addition of such noise by omitting proper nouns during the semantic network expansion of step 142d. The output of the search performed in step 142j includes a relevance score for each document identified during the search. In step 142h (described more fully below under the heading "Relevance Normalization"), the output relevance scores are normalized. Finally, in step 142L, the document identification numbers (identified in step 142j) and normalized relevance scores corresponding to such document identification numbers are transmitted to search engine API 140.
Hidden Markov Model
Referring now to Fig. 7a, there is shown a diagram illustrating hidden Markov model 700.
The hidden Markov model 700 is a conventional hidden Markov model used for determining parts of speech of words in natural language queries in the form of complete sentences. The hidden Markov model 700 includes an initial transition matrix 710 and a subsequent transition matrix 720. The initial transition matrix 710, sometimes referred to as a Pi matrix, contains a plurality of different probabilities pa, pb, pc that indicate the likelihood that the first word of a query is a selected part of speech as indicated on the vertical edge of the initial transition matrix 710. For example, the probability that the first word of the query is a determiner is pa, and the probability that the first word of the query is a noun is pb. The probabilities of the Pi matrix are obtained by counting the occurrences of each part of speech at the beginning of the sentences of a sample set of text documents, wherein the part of speech tagging can be performed by a human. The subsequent transition matrix 720 includes a probability that indicates the likelihood that the next word of a query will be a particular part of speech given that the current word has a known part of speech. These probabilities are also obtained by counting the occurrences of the various parts of speech in a sample set of text documents which are tagged by a human. The use of hidden Markov models for the purpose of parsing complete sentences is well known to those skilled in the art of parsing natural language.
Referring now to Fig. 7b, there is shown a modified hidden Markov model 750 used for determining parts of speech in natural language queries formed from sentence fragments, in accordance with a preferred embodiment of the present invention. In the modified hidden Markov model 750, all of the probabilities in the initial transition matrix 710 are made equal to each other. The probabilities in the subsequent transition matrix 720 are unchanged. When determining the part of speech of a word in a query using the modified hidden Markov model 750, the modified Markov model 750 is applied to the words in the same manner as the hidden Markov model 700 is applied. The use of the modified hidden Markov model in this manner provides better parsing results, and therefore better searching results, when the query applied to the search engine is a sentence fragment rather than a full sentence. In the preferred embodiment of the hidden Markov model 750, the value p is preferably set at 1/n, where n is the number of rows in the initial transition matrix 710.
Relevance Normalization
Referring now to Figs. 8A and 8B, there are shown the relevance normalization table 800 and the relevance normalization curve 850 for normalizing the relevance scores output in step 142j by the Conquest™ searching software. As previously described, when a search results list is returned to the user station 102, each document listed on the search results list is provided with its own relevance score. The relevance score is an indication of how relevant the document is in accordance with the words in the query. For a given document identified during a query search, the relevance score is determined according to, for example, the proximity of query search words located in either the image-text information field 460 or text-type information field 470 associated with the document.
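Returning briefly to the modified hidden Markov model 750 described under the preceding heading, the effect of setting every initial probability to 1/n can be sketched in Python with a standard Viterbi decoder. This is a toy illustration: the tag set, transition values and word (emission) probabilities below are invented for the example, and the description does not state how word likelihoods are obtained.

```python
def uniform_pi(tags):
    """Initial matrix of the modified model: every entry equals 1/n."""
    return {tag: 1.0 / len(tags) for tag in tags}


def viterbi(words, tags, pi, trans, emit, floor=1e-6):
    """Return the most likely tag sequence for the given words."""
    # best[tag] = (probability, path) of the best sequence ending in tag
    best = {t: (pi[t] * emit[t].get(words[0], floor), [t]) for t in tags}
    for word in words[1:]:
        nxt = {}
        for t in tags:
            prob, path = max(
                ((best[p][0] * trans[p][t], best[p][1]) for p in tags),
                key=lambda item: item[0],
            )
            nxt[t] = (prob * emit[t].get(word, floor), path + [t])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical toy model (values are illustrative only).
TAGS = ["det", "noun", "verb"]
TRANS = {
    "det": {"det": 0.05, "noun": 0.9, "verb": 0.05},
    "noun": {"det": 0.1, "noun": 0.3, "verb": 0.6},
    "verb": {"det": 0.5, "noun": 0.4, "verb": 0.1},
}
EMIT = {
    "det": {"the": 0.9},
    "noun": {"dogs": 0.5, "bark": 0.1},
    "verb": {"bark": 0.5, "dogs": 0.01},
}

# A sentence fragment with no leading determiner.
print(viterbi(["dogs", "bark"], TAGS, uniform_pi(TAGS), TRANS, EMIT))
```

With a uniform initial matrix, the first word of a fragment is not penalized for being a part of speech that rarely begins a complete sentence; the subsequent transition matrix is used unchanged.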
It has been observed that the relevance scores determined by the Conquest™ searching software in step 142j are influenced by the number of words in the query. If the query is short, the relevance scores returned are likely to be higher because it is easier for documents to contain all of the words of the query within a fixed size search window. If the query is long, it is less likely that a document will have all the words of the query within the fixed size search window, and when the probability of including all of the query words within the fixed size search window decreases, the computed relevance of the document decreases. Thus, in accordance with the method of the present invention, the relevance scores output by the Conquest™ searching software in step 142j are normalized to make the relevance scores independent of query length.
The graphical representation 800 is composed of a number of range fields such as range fields 804a, b, c. The values within the range fields of the curve 800 correspond to the number of words in a query that were found outside a search window used to search the documents represented in the database 117. For a given document, as the number of words in the query falling outside the search window decreases, the relevance score assigned by the Conquest™ searching software to the document increases. For example, if none of the words are outside the search window, the document receives a high relevance score as indicated by the occurrence of range fields 804a containing zeros on the left hand side of the graphical representation 800. If one of the words in the query falls outside the search window, the relevance decreases as shown by the occurrence of range fields 804b in graphical representation 800 to the right of range fields 804 containing zeros. In relevance normalization table 800, the number of words in a query, ranging from one to ten, is shown on the vertical edge. Each division along the horizontal edge of the table 800 corresponds to a window segment n. Each window segment n has a range of relevance scores returned by the searching software, wherein n is the number of query words outside the search window. The window segments range from ws0 to ws9 along the horizontal axis of the table 800. The relevance scores of the relevance normalization table 800 are then curved to span the various relevance score ranges shown in table 800. These ranges are defined by the numbers along the rows of table 850 within the window segments n. The curving of the scores is performed in order to make the normalization provided to the user station 102 independent of query length in accordance with the graphical representation 850, wherein the relevance normalization curve 850 is merely a graphical mapping of the scores of the relevance normalization table 800.
For example, consider the case where a query has a length of five words and a returned document has a relevance between 49 and 59. In this example, assume two query words are outside the search window. This example corresponds to row 5 of graphical representation 850. In row 5 of graphical representation 850, a string formed of the number 2, representing the two query words outside the window segment, spans the range of relevance values between approximately 49 and 59. Referring to the relevance normalization table 800, it can be seen that on row 5 the relevance scores below ws2 corresponding to window segment 2 are 79 and 75. This indicates that returned relevance values between 49 and 59 are mapped into relevance scores in the range of 75 to 79 under these circumstances when relevance score normalization is performed. The mapped values in the range of 75 to 79 are the ones provided to the user station 102.
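The worked example above can be expressed as a short Python sketch. Only the single table entry quoted in the text (row 5, window segment ws2) is filled in; the remaining entries of table 800 are not reproduced here. The linear interpolation between the range endpoints is an assumption, since the description says only that scores are "curved" into the normalized range.

```python
def normalize(score, raw_range, normalized_range):
    """Map a returned score from raw_range onto normalized_range.

    Assumption: the mapping within a window segment is linear.
    """
    raw_lo, raw_hi = raw_range
    norm_lo, norm_hi = normalized_range
    fraction = (score - raw_lo) / (raw_hi - raw_lo)
    return norm_lo + fraction * (norm_hi - norm_lo)


# (query length, words outside window) -> (returned range, normalized range)
TABLE = {(5, 2): ((49, 59), (75, 79))}

raw, norm = TABLE[(5, 2)]
for returned in (49, 54, 59):
    print(returned, "->", normalize(returned, raw, norm))
```

Under this assumption a returned score of 54, midway through the 49 to 59 range, would be reported to the user station as 77.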
Input File Preparation and Loading
Referring now to Figs. 9 and 9A, there is shown a data preparation component 900 of the information retrieval system 100 of the present invention and a block diagram 980 of data flows within the data preparation component 900. The operation of data preparation component 900 is preferably performed within document preparation/accounting block 126 of data center 110. The data preparation component 900 receives information representative of documents from a plurality of publishers or document sources 112 such as, for example, image-type information 480. The document information received from the sources 112 is converted from the input format of the publisher 112 into the universal publisher processing format within the data preparation component 900. When the document information is converted into the universal publisher processing format, the data preparation component 900 stores the converted document information in the image/text database 118 for retrieval in response to a query from a user station 102. The publisher format conversion block 912 of the data preparation component 900 includes N publisher format conversion units 916. The publisher format conversion units 916 of the format conversion block 912 control the parsing and conversion of the input document information from the various publisher-specific formats into the universal publisher processing format supported by the information retrieval system 100. The publisher format conversion block 912 therefore contains one publisher format conversion unit 916 for each of the N different publishers 112 in order to perform the conversion of the information provided by the publishers 112.
Additional publisher format conversion units 916 can be added to the publisher format conversion block 912 as new publishers 112 are added to the list of publishers 112 that provide input information for storage and searching by the information retrieval system 100. Thus, the data preparation component 900 has a modularity that permits it to be easily modified to support any number of new publishers 112 as they are added to the information retrieval system 100. The input to the N different publisher format conversion units 916 of the publisher format conversion block 912 is the raw data provided by the individual publishers 112 to data center 110. The output of the publisher format conversion units 916 is a converted document suitable for processing, searching and retrieval within the information retrieval system 100. The publisher format conversion units 916 also provide a document information data structure in response to each input document. A unique document identification number is also provided for each input document.
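The modular arrangement of one conversion unit per publisher can be sketched as a registry of converter functions. This is an illustrative Python sketch only: the publisher names, the two raw formats and the dictionary used as the common output format are all hypothetical.

```python
import itertools

CONVERTERS = {}                  # publisher name -> conversion function
_next_id = itertools.count(1)    # source of unique document numbers


def register(publisher):
    """Decorator that adds a conversion unit for one publisher."""
    def wrap(func):
        CONVERTERS[publisher] = func
        return func
    return wrap


@register("publisher_a")
def convert_a(raw):
    # Hypothetical input format: "TITLE|BODY"
    title, body = raw.decode("ascii").split("|", 1)
    return {"title": title, "text": body}


@register("publisher_b")
def convert_b(raw):
    # Hypothetical input format: title on the first line, body after it
    title, _, body = raw.decode("ascii").partition("\n")
    return {"title": title, "text": body}


def prepare(publisher, raw):
    """Convert raw publisher data and attach a unique document number."""
    record = CONVERTERS[publisher](raw)
    record["doc_id"] = next(_next_id)
    return record


print(prepare("publisher_a", b"Moby Dick|Call me Ishmael."))
print(prepare("publisher_b", b"Walden\nI went to the woods."))
```

Supporting a new publisher then amounts to registering one more function, without changing prepare() or the existing converters.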
The document information data structure is a data structure containing the document information obtained from the publisher 112, parsed by the format conversion units 916 and made available within data center 110. The elements included in the document information data structure provided by the publisher format conversion units 916 are based upon the processing and retrieval requirements of the information retrieval system 100 and the individual varying publisher formats that are received for conversion. The list of data structure elements is therefore not intended to be exhaustive. Additionally, the data structure element list can change as more publishers 112 and more publisher formats are added to the information retrieval system 100. Not all elements in the document data structure are applicable to every document. The document information data structure provided by a publisher format conversion unit 916 can include, for example, in the case of a novel, such elements as a name of the author of the novel, a parent title, such as the title of the novel, and child titles, for example, the titles of the chapters within the novel. There can be data structure elements for the Dewey decimal call number and the Library of Congress call number of the novel. There can also be a data structure element for the volume number of the novel if there are multiple volumes. Additionally, there can be data structure elements for the date of publication and the place of publication of the novel, as well as for the ISSN and ISBN. The document information data structure can also include elements for indicating the number of charts or the number of diagrams set forth within the novel. Copyright information can also be available in the document information data structure if the novel has been copyrighted.
There can also be a data structure element to indicate the number of pages in the novel, the size of the novel, and so on. It will be understood by those skilled in the art that any number of elements such as these can be present in the document information data structure.
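One possible shape for such a document information data structure is sketched below in Python. The field names are hypothetical renderings of the elements listed above, and every element other than the document identification number is optional because not all elements apply to every document.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DocumentInfo:
    doc_id: int                                   # unique document number
    author: Optional[str] = None
    parent_title: Optional[str] = None            # e.g. title of a novel
    child_titles: List[str] = field(default_factory=list)  # e.g. chapters
    dewey_call_number: Optional[str] = None
    loc_call_number: Optional[str] = None
    volume_number: Optional[int] = None
    publication_date: Optional[str] = None
    publication_place: Optional[str] = None
    issn: Optional[str] = None
    isbn: Optional[str] = None
    chart_count: Optional[int] = None
    diagram_count: Optional[int] = None
    copyright_notice: Optional[str] = None
    page_count: Optional[int] = None


novel = DocumentInfo(
    doc_id=1,
    author="Herman Melville",
    parent_title="Moby-Dick",
    child_titles=["Loomings", "The Carpet-Bag"],
)
print(novel.parent_title, novel.child_titles, novel.isbn)
```

Elements that a given publisher format does not supply simply remain unset, which matches the non-exhaustive, evolving element list described above.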
As documents are received for conversion by the publisher format conversion units 916, they can be saved in temporary storage in order to increase document conversion throughput. The temporary storage of input documents lined up for reformatting by the document preparation component 900 is often best performed on a hard disk drive (not shown) because hard disk drives usually have a higher data transfer rate than the type of peripherals from which the input documents are commonly received from the publishers 112, such as CD-ROM, DAP, tape drives and floppy diskettes.
Each of the publisher format conversion units 916 transmits document information to the image reformatting unit 904 and the text reformatting unit 908 within the document preparation component 900. The image reformatting unit 904 is responsible for converting graphics and video such as image-type information 480 in supported publisher image formats into the image format supported by the information retrieval system 100 of the present invention, and providing the converted graphics and video as image-type information 480. The image format supported by the information retrieval system 100 can be any conventional image format. For example, the image format supported by the information retrieval system 100 can be the conventional JPEG file interchange format as described below. The image reformatting unit 904 includes image conversion routines that perform the operations required to reformat the graphical or video input image information received from the publisher format conversion units 916 of the publishers 112. The image conversion routines of the image reformatting unit 904 can be obtained from public domain software packages. For example, the image conversion routines can include PBM Plus. The PBM Plus image conversion routine is an extended portable bit map tool kit known to those skilled in the art. The output of the publisher format conversion units 916 of the data preparation component 900 can have different formats depending on whether the data type of the document is image-type information or text-type information. Converted image-type information 480 from the format conversion units 916 can be stored in one or more different formats in data center 110, including, for example, any of the conventional JPEG image file interchange formats known to those skilled in the art of information processing. The conventional JPEG image format includes a standard image compression mechanism developed by the Joint Photographic Experts Group.
The JPEG standard does not require any particular file format. The JPEG file interchange format used in the information retrieval system 100 is a file format that permits JPEG files to be exchanged with the PC, MAC and UNIX workstation platforms. The JPEG file interchange format of the information retrieval system 100 can conform to the JPEG Draft International Standard (ISO DIS 10918-1). The standard mechanism of the JPEG standard typically compresses image data with a compression ratio between 10:1 and 20:1 without visible loss of image quality. Compression ratios between 30:1 and 50:1 are also possible with small to moderate loss of image quality using the JPEG standard. The JPEG standard is used for graphic data as well as image data within the information retrieval system 100.
The image reformatting unit 904 is invoked when a publisher format conversion unit 916 within the publisher format conversion block 912 is provided with an image format converter function. The image format converter function serves as a gateway to the image reformatting unit 904. The format converter function gateway is responsible for routing unconverted images and graphics to the image reformatting unit 904 and for routing converted images and graphics from the image reformatting unit 904. When a format converter function gateway applies an unconverted input image to the image reformatting unit 904, the image reformatting unit 904 receives a file pointer. The file pointer points to a file containing the input image to be converted. The image reformatting unit 904 also receives the offset within the file to the image that is to be converted. In response, the image reformatting unit 904 returns a converted image 480 to the appropriate publisher format conversion unit 916. The returned converted image 480 is suitable for storage in database 118.
The publisher format conversion units 916 of the publisher format conversion block 912 also communicate with the text reformatting unit 908 within the data preparation component 900. The text reformatting unit 908 converts supported publisher text formats into the text format supported by the information retrieval system 100. The text format supported by the system 100 can be any conventional text format. For example, the text reformatting unit 908 can support ASCII text format. Additionally, the text reformatting unit 908 can provide support for tables and charts using ASCII characters. Conventional public domain text conversion routines can be used to perform the text reformatting operations of the text reformatting unit 908 of the data preparation component 900. Additionally, the text reformatting unit 908 can include program blocks designed to receive input information in proprietary publisher data formats and convert the input information from the proprietary formats to the supported format. The text reformatting unit 908 receives a file pointer from a publisher format conversion unit 916 within the publisher format conversion block 912. The pointer received by the text reformatting unit 908 points to the input file containing the text to be reformatted by the text reformatting unit 908. Additionally, the publisher format conversion unit 916 applies an offset for use by the text reformatting unit 908 in performing the text reformatting operations. The offset indicates the location within the file at which the text that is to be reformatted by the text reformatting unit 908 resides. When the input text is reformatted, the text reformatting unit 908 returns the reformatted text to the appropriate publisher format conversion unit 916.
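The file pointer and offset interface of the text reformatting unit can be sketched in Python as follows. The function name reformat_text is hypothetical, as are the length argument (the description mentions only a pointer and an offset) and the choice of Latin-1 as the example publisher encoding.

```python
import io


def reformat_text(fp, offset, length, source_encoding="latin-1"):
    """Read text at the given offset and return it as plain ASCII.

    fp: open binary file object (the "file pointer")
    offset: byte position at which the text to be reformatted begins
    length: number of bytes to read (assumption, not in the description)
    """
    fp.seek(offset)
    raw = fp.read(length)
    text = raw.decode(source_encoding)
    # Characters with no ASCII equivalent are replaced with "?".
    return text.encode("ascii", errors="replace").decode("ascii")


# Hypothetical publisher file: a 6-byte header followed by the text.
publisher_file = io.BytesIO(b"HEADERcaf\xe9 menu")
print(reformat_text(publisher_file, offset=6, length=9))
```

The same pointer-plus-offset calling pattern is described for the image reformatting unit 904, with an image converter in place of the text conversion.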
The publisher format conversion units 916 within the publisher format conversion block 912 are also coupled to a text update unit 928. The text update unit 928 stores converted text information in image/text database 118 in text-type information fields 470 and image text-type information fields 460. Each document or image can be stored in a subject category within the image/text database 118 in the preferred embodiment of the information retrieval system 100. The text update unit 928 receives a pointer to the document information data structure of the text (i.e., the image text-type information field 460 or text-type information field 470), and the name of a file (stored in field 417 of record 400) that contains the information that the information retrieval system 100 requires for indexing the document. Text update unit 928 uses this and other information to build a record 400 for each text document processed by the unit.
The text update unit 928 temporarily stores documents as the documents are received. It begins queuing the information that is necessary to index the documents that are added to the system 100. When all documents for a data preparation session are added to the information retrieval system 100, the text update unit 928 invokes the index update unit 932 and passes to the index update unit 932 the necessary information to add the documents to the image/text database 118.
The image update unit 924 also receives information from the publisher format conversion units 916 of the publisher format conversion block 912 within the data preparation component 900. The image update unit 924 stores a converted image document received by the publisher format conversion unit 916 from the image reformatting unit 904 in an image-type information field 480 in the image/text database 118. In the preferred embodiment of the information retrieval system 100, the image can be stored in a subject category. The image update unit 924 receives a pointer (stored in fields 432, 433 of a record 430) corresponding to the document information structure output by the publisher format conversion units 916. The image update unit 924 temporarily stores documents as they are received from the conversion units 916.
Thus, as previously described, the information retrieval system 100 can receive both (1) unconverted text-type information 908a representative of text, and (2) image-type information 480 representative of images. The information retrieval system 100 can also store both types of information in the image/text database 118. In addition to image-type information 480 and conventional text-type information 908a, the information retrieval system 100 also receives unconverted image text-type information 908b. Image text-type information 908b is text information that is associated with a specific item of image-type information 480. For example, the image text-type information 908b associated with an item of image-type information 480 can be a human readable caption associated with the image or a string of words associated with an image for the purpose of computer searching. Additionally, some image-type information 480 can be received by the information retrieval system 100 without any associated image text-type information 908b, and associated image text-type information 908b can be generated at data center 110 for association with the image-type information 480 prior to storage of the image text-type information field 460 and the image-type information field 480 in the image/text database 118.
Thus, in accordance with the method of the present invention, the image-type information 480 received is stored in the correct image format by the image update unit 924. Additionally, both the text-type information 470 and the image text-type information 460 associated with the image-type information 480 are stored by the text update unit 928 in the image/text database 118. The text-type information 470 and the image text-type information 460 are stored in the image/text database 118 in the same text information format after conversion by the data preparation component 900. Because they are stored in a common text information format, the image text-type information 460 associated with the image and the text-type information 470 not associated with an image are not distinguishable from each other in database searches, and they can be searched by the information retrieval system 100 at the same time in response to a single query. Images located using image text information in this manner can be referred to as dependent images. When image text-type information 460 is located by a search within the information retrieval system 100, the image-type information 480 associated with the located image text-type information 460 can be retrieved and provided to the user station 102 in accordance with the located image text-type information 460. If the user station 102 has windows software, the image and the text can be displayed in multiple windows 347, 348 or copied into a single window 348a. The image text-type information 460 associated with the image-type information 480 that is used in the search process to locate an image is not necessarily displayed on the user station 102.
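The single-query search over both kinds of text can be illustrated with a minimal in-memory Python sketch. The record layout, the sample records and the simple any-term matching rule are hypothetical; they stand in for the image/text database 118 and the search engine only for the purpose of showing how a caption match leads to its dependent image.

```python
# Both record kinds carry searchable text in the same format.
RECORDS = [
    {"doc_id": 1, "kind": "text",
     "text": "history of the space shuttle program"},
    {"doc_id": 2, "kind": "image_text",
     "text": "space shuttle launch photograph", "image": "shuttle.jpg"},
    {"doc_id": 3, "kind": "text",
     "text": "gardening tips for small spaces"},
]


def search(query):
    """Return hits from text and image-text records in one pass."""
    terms = set(query.lower().split())
    hits = []
    for record in RECORDS:
        if terms & set(record["text"].split()):
            hit = {"doc_id": record["doc_id"]}
            if record["kind"] == "image_text":
                # The dependent image is returned; the caption that
                # matched need not be shown to the user.
                hit["image"] = record["image"]
            hits.append(hit)
    return hits


print(search("shuttle"))
```

The search loop never branches on the record kind when matching, which mirrors the statement that the two kinds of text are not distinguishable from each other in database searches.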
It will be understood by those skilled in the art that a similar storage and retrieval method can be provided for audio-type information. The audio-type information can be digitally stored in a system 100 database in a conventional audio information format. The audio-type information can also be provided with associated audio text information. The audio text information can be stored by the text update unit 928 in the same manner as previously described with respect to the image text information. Thus, text-type information not associated with either image information or audio information, the image text information associated with image information, and the audio text information associated with audio information are all stored in a single common information format in the image/text database 118. Additionally, they are all searched simultaneously within the information retrieval system 100 in response to a single query from the user station 102. When the audio text information is identified by a search, the audio information associated with it can be selected and retrieved.
The publisher format conversion units 916 are also coupled to a readability analysis unit 920. The readability analysis unit 920 produces an index of readability for documents processed by the data preparation component 900. The readability analysis unit 920 can also provide a value corresponding to the scholastic grade level of an input document from a publisher 112. Once calculated by the data preparation component 900, the readability index is stored permanently as an attribute of the document in field 410 of a record 400. For documents that are updated regularly, for example, newspapers, the readability index is assumed to be the same for all issues. Thus, once calculated by the readability analysis unit 920 for a single issue or volume, the readability index value is propagated to all subsequent issues of the document.
In the prior art, for example, in the Flesch-Kincaid method, it is known to base the readability index upon parameters such as the average number of syllables per word or the average sentence length in a document. However, these prior art methods require extensive look-up tables or parsing. Therefore, in the system of the present invention, the readability index is preferably determined based upon the rate at which new terms are introduced into the text as well as by the distribution of words found in the documents compared to distributions of categorized text.
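By way of illustration only, the rate at which new terms are introduced into a text can be sketched as follows. The specification does not give a formula for this rate, so the ratio of first occurrences to total tokens used here, the tokenizing pattern, and the function name new_term_rate are assumptions made solely for the purpose of the example. The comparison of word distributions against categorized text is not shown.

```python
import re


def new_term_rate(text: str) -> float:
    """Return the fraction of tokens that are first occurrences of a term.

    Assumption: a "term" is a lowercase run of letters or apostrophes.
    A higher value means new vocabulary is introduced more quickly.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    seen = set()
    new_terms = 0
    for token in tokens:
        if token not in seen:
            seen.add(token)
            new_terms += 1
    return new_terms / len(tokens)
```

A text that repeats a small vocabulary yields a low rate, while a text in which nearly every word is new yields a rate near one. How such a rate would be mapped to a scholastic grade level is not specified and is left open here.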
An index update unit 932 is also coupled to the publisher format conversion block 912 of the data preparation component 900. The index update unit 932 updates the document indexes within the document index database 117 when a new document or group of documents is added to the data center 110. Additionally, the index update unit 932 updates the document index database 117 when documents are purged from the data center 110.
The index update unit 932 updates the document information directory table. When purging documents from the data center 110, the index update unit 932 deletes the document text or image from the document information directory and/or dependent image tables in database 118, and marks the document as deleted in the index system.
Automatic Document Subjecting System
Documents stored in the image/text database 118 are divided into a number of different subject categories. The user of the information retrieval system 100 is permitted to specify a subject or subjects for search in response to a user query. For example, the user may wish to search only sports literature or only art literature in the image/text database 118. This avoids searching through documents in the database 118 that are unlikely to be of interest to the user. In order to provide this capability to the user station 102, the input documents applied to the information retrieval system 100 must be categorized or subjected after they arrive from the publishers 112. Categorizing of input documents can be a very laborious and time consuming process when performed by a human viewer because a very large number of incoming documents can be added regularly to the data center 110. Additionally, a determination can be made for each input document as to whether it should be categorized in more than one subject category.
Referring now to Fig. 10, there is shown an automatic document subjecting system 1000. The automatic document subjecting system 1000 receives documents processed by the data preparation component 900 and stored in a document pool 1004. The automatic document subjecting system 1000 automatically determines the subject category of the received documents without any review of the received document by a human viewer. It will be understood that the automatic document subjecting performed within the data center 110 thus eliminates the need for the laborious and time consuming subjecting operations performed in the prior art wherein humans reviewed the input documents received by an information retrieval system in order to determine the subject categories of each received document for storage in a system database such as the image/text database 118.
The automatic document subjecting system 1000 includes a plurality of individual subject databases 1016 which can reside within the image/text database 118 of the information retrieval system 100. Each subject database 1016 contains documents relevant to a separate subject category. For example, one subject database 1016 can contain only documents relating to the subject of art. Another subject database 1016 can contain only documents relating to the subject of sports. In the preferred embodiment of the invention, there can be any number of different subject categories, each having its own subject database 1016. Subject categories have also been provided, for example, for science, religion, health, education, geography and recipes.
The automatic document subjecting system 1000 also includes an individual subject lexicon 1020 coupled to each subject database 1016. Each subject lexicon 1020 includes a list of classifier words whose occurrence in a document has been determined to be an indicator that the document is likely to be classified in the subject category corresponding to the subject lexicon 1020. The behavior of the classifier words included in the subject lexicons 1020 can be determined according to a statistical analysis described hereinbelow. In addition to the classifier words, the subject lexicon 1020 includes a discriminator weight for each classifier word in the subject lexicon 1020. The discriminator weight, as described in more detail hereinbelow, provides a measure of the difference between the behavior of a classifier word in a subject database 1016 and its behavior in the overall document pool 1004 which contains documents from all subject categories 1016. The automatic document subjecting system 1000 includes an automatic subjecting block 1012. The automatic subjecting block 1012 performs a comparison of the input documents from the publishers 112, stored in the document pool 1004, with the various subject lexicons 1020 in order to categorize them into their correct subject category. In the autosubjecting process, the words of the input document in the document pool 1004 to be subjected are compared by the automatic subjecting block 1012 with the classifier words of the various subject lexicons 1020 by way of the word comparison line 1022. Based upon this comparison, the input documents stored in the document pool 1004 are stored in selected subject databases 1016 by way of the appropriate subject category line 1018.
In order to determine whether a document stored in the document pool 1004 should be stored in a subject database 1016, a determination is made whether each classifier word in the subject lexicon 1020 of the subject database 1016 occurs in the document. More particularly, for each subject database 1016, a sum of the discriminator weights of each classifier word occurring in the document is determined. Thus, for a given document, several separate sums will be determined, each of which corresponds to the sum of discriminator weights determined for a particular subject database 1016. The magnitude of the sum corresponding to a particular database 1016 represents a confidence score which reflects how strongly the document relates to the subject represented by the particular database 1016. Accordingly, for a given document, there will be a plurality of confidence scores, each of which corresponds to a particular subject database 1016. In a preferred embodiment, the determination as to whether to include a document in a subject database 1016 is based on a comparison of the various confidence scores determined for the document. For example, in one embodiment, if a confidence score associated with a particular subject exceeds the confidence score associated with all other subjects by 30% or more, then the document is determined as being associated with the particular subject and is selected for inclusion in the database 1016 associated with the particular subject. It will be understood by those skilled in the art that other thresholds may be used in comparing the various confidence scores to determine whether to select a document for inclusion in a particular subject database 1016.
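By way of illustration only, the confidence score computation and the 30% comparison described above can be sketched as follows. The function names, the representation of a subject lexicon as a mapping from classifier word to discriminator weight, and the reading of "exceeds by 30% or more" as "at least 1.3 times the next highest score" are assumptions of this example rather than details taken from the specification.

```python
def confidence_scores(doc_words, lexicons):
    """Sum the discriminator weights of the classifier words found in a document.

    lexicons maps a subject name to {classifier_word: discriminator_weight}.
    Each classifier word counts once, however often it occurs (an assumption).
    """
    present = set(doc_words)
    return {
        subject: sum(weight for word, weight in lexicon.items() if word in present)
        for subject, lexicon in lexicons.items()
    }


def select_subject(scores, margin=0.30):
    """Return the subject whose score beats every other score by the margin.

    Returns None when no subject is sufficiently dominant.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_subject, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_score > 0 and best_score >= runner_up * (1 + margin):
        return best_subject
    return None
```

For example, with a hypothetical sports lexicon {"goal": 2.0, "team": 1.5} and a hypothetical art lexicon {"canvas": 2.5, "team": 0.2}, a document containing "team" and "goal" scores 3.5 for sports and 0.2 for art, so sports is selected. The margin parameter corresponds to the adjustable threshold mentioned above.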
Referring now to Figs. 11 and 12, there are shown the subject lexicon generation process flow 1100 and the subject lexicon generation system 1200 of the present invention. The subject lexicon generation process flow 1100 and the subject lexicon generation system 1200 describe the generation of the subject lexicons 1020 of the automatic document subjecting system 1000 in order to permit the automatic subjecting system 1000 to receive input documents from the publishers 112 stored in the document pool 1004, and automatically determine a subject area category for the received input documents. When a subject area of an input document is determined using a subject lexicon 1020 generated by the subject lexicon generation process flow 1100 and the subject lexicon generation system 1200, the input document can be sorted into a selected subject database 1016 as previously described.
In order to generate a subject lexicon 1020 for a selected subject category, a miniature corpus 1232 including a small number of documents, for example five or ten documents, known to be typical of the subject category is selected by a human viewer as shown in block 1104 of the process flow 1100. This selection can be performed using the query program 1204 to select the small number of subjected documents from the document pool 1004. The documents selected from the document pool 1004 by the query program 1204 thus form the miniature corpus 1232. The miniature corpus 1232 of the selected subject can be referred to as the seed 1232 because a subject lexicon 1020 is developed from it by the system 1200 following the process flow 1100. Statistics on the words in the selected documents of the miniature corpus 1232 are determined as shown in block 1108. The statistics include, for example, the frequency of occurrence of each word in the miniature corpus, and the number of documents in the miniature corpus in which each such word appears. The generation of these statistics is performed by the build_lexs program 1236. As shown in block 1112, a preliminary lexicon 1238 is developed by determining the words most likely to appear in the documents of the miniature corpus 1232 based upon the statistics determined by the build_lexs program 1236. The build_lexs program 1236 can build a preliminary lexicon 1238 for each subject category database 1016 in this manner. The build_lexs program 1236 then compares the behavior of the words in the miniature corpus 1232 of the selected subject area, as determined in block 1108, with the behavior of the same words in the overall document pool 1004. In order to permit this comparison, the pool stats determine program 1208 determines the statistics of the words in the document pool 1004.
The statistics determined by the pool stats determine program 1208 include the frequency of occurrence of the words and the number of documents in which each word appears. These statistics are stored in pool statistics 1008. Since these statistics are required on a very large number of different words, it may be convenient to perform the tasks in segments and combine the results using the comb_stats program 1212. These statistics in pool_stats 1008 are applied to the build_lexs program 1236 by way of the line 1210.
A discriminator weight is determined by the build_lexs program 1236 for each classifier word (or unigram) in the subject lexicon 1020. The discriminator weight of a word is equal to log (PWmc/PWc), where PWmc is the probability of the word in the miniature corpus 1232, and PWc is the probability of the word being in the document pool 1004. The discriminator weights calculated in this manner are stored in the subject lexicon 1020 for each classifier word. In this manner the subject lexicon generation process flow 1100 and the subject lexicon generation system 1200 determine classifier words of the various subject lexicons 1020. Additionally, a discriminator weight can be determined for "bigrams" that can act as classifier words in the manner described above with respect to unigrams. Bigrams are word pairs that occur together in documents and can therefore provide useful information about the subject area of the subject database 1016. The discriminator weights for bigrams are calculated using the mutual information formula set forth in equation (1) below:
MI(a,b) = log( P(ab) / (P(a) P(b)) )     (1)
where P(a) is the probability that word a of a bigram occurs in the document pool 1004, P(b) is the probability that word b of the bigram appears in the document, and P(ab) is the probability that words a and b of the bigram appear next to each other in a document of the subject miniature corpus. If word a and word b are statistically independent, then MI(a,b) = 0. Alternatively, if word a and word b are statistically related, then MI(a,b) > 0.
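By way of illustration only, the unigram discriminator weight log (PWmc/PWc) and the mutual information of equation (1) can be sketched as follows. Estimating each probability as a relative word frequency, using the natural logarithm, and skipping seed words that are absent from the pool are assumptions of this example, as are the function names; the specification does not state the estimator or the base of the logarithm.

```python
import math
from collections import Counter


def word_probabilities(docs):
    """Estimate P(word) as its relative frequency over all tokens in docs."""
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def discriminator_weights(seed_docs, pool_docs):
    """Return log(PWmc / PWc) for each word of the miniature corpus.

    Words missing from the pool are skipped to avoid division by zero;
    in the described system the seed is drawn from the pool, so this
    case should not normally arise.
    """
    p_mc = word_probabilities(seed_docs)
    p_c = word_probabilities(pool_docs)
    return {w: math.log(p / p_c[w]) for w, p in p_mc.items() if w in p_c}


def mutual_information(p_ab, p_a, p_b):
    """Equation (1): MI(a,b) = log(P(ab) / (P(a) * P(b)))."""
    return math.log(p_ab / (p_a * p_b))
```

A word that is twice as frequent in the miniature corpus as in the pool receives a weight of log 2, while a word that is equally frequent in both receives a weight of zero and therefore contributes nothing to a confidence score.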
A determination is made at step 1116 whether the construction of the subject lexicon 1020 is complete. On the first pass through the subject lexicon generation process flow 1100, path 1120 is followed. The preliminary lexicon 1238 of the selected subject area is then applied to the subj_pool program 1216 by way of line 1214. As shown in block 1124 of the subject lexicon generation process flow 1100, a rough corpus 1220 of documents is selected from the documents in the document pool 1004 by the subj_pool program 1216. The documents selected by the subj_pool program 1216 are the documents in the document pool 1004 that have the highest confidence of being in the selected subject area, as determined using the classifier words and confidence levels of the preliminary lexicon 1238 built by the build_lexs program 1236 in the manner described with respect to auto-subjector 1012. The confidence threshold used by the subj_pool program is preferably the same as that described above for auto-subjector 1012. Thus, if a confidence score associated with a particular subject exceeds the confidence score associated with all other subjects by 30% or more, then the document is determined as being associated with the particular subject by the subj_pool program. In the preferred embodiment of the subject lexicon generation system 1200, the subj_pool program 1216 is provided with adjustable confidence criteria. The rough corpus 1220 can include, for example, two-hundred documents from the document pool 1004. A human reviewer then manually reviews the documents selected by the subj_pool program 1216 for inclusion in the rough corpus 1220. The human reviewer can use the chooser program 1224 for this review process. The human reviewer eliminates documents selected for inclusion in the rough corpus 1220 using the preliminary lexicon 1238 if the human reviewer determines that the reviewed document does not belong in the selected subject.
This purging operation for removing documents incorrectly selected using the preliminary lexicon 1238 is set forth in block 1128 of the subject lexicon generation process flow 1100. Removal of the incorrectly included documents of the rough corpus 1220 provides the refined corpus 1228.
The refined corpus 1228 is then operated upon by the build_lexs program 1236 in the manner described previously with respect to the operation of the build_lexs program 1236 upon the miniature corpus 1232. In accordance with the refined corpus 1228, the build_lexs program 1236 can produce a new preliminary lexicon 1238. The new preliminary lexicon 1238 can be used as a subject lexicon 1020 in the automatic document subjecting system 1000 if a determination is made in step 1116 that lexicon generation is complete and path 1122 is followed. Alternately, path 1120 can be repeated and the new preliminary lexicon 1238 can be operated upon by the subj_pool program 1216 to provide a new rough corpus 1220. The loop through the subject lexicon generation system 1200 including the path 1120 can be repeated in this manner any number of times. It is believed that the optimum number of repetitions in this manner is two or three. It is also believed that the use of bigrams for selecting documents is most useful in the first pass through path 1120.
Royalty Database Management
In order to perform royalty and marketing information management within the information retrieval system 100, the accounting manager 138 of the data center 110 provides generalized reporting capabilities with respect to a wide range of accounting database information within the accounting database 119 (shown in Figure 13). The accounting database information within the accounting database 119 is organized, for example, into databases 119a, b, d, e, and g, as described below, in order to facilitate report generating and management functions at the data center 110. For example, the accounting database 119 of the data center 110 includes a subscriber profile database 119a for storing subscriber profile information. The subscriber profile information in the subscriber profile database 119a can include a subscriber identification, a subscriber name, address, telephone number, date of birth, and gender. The subscriber profile information can also include the number of siblings of the subscriber, by gender, the occupation of the mother and father of the subscriber, and any foreign language spoken by the subscriber. Additionally, the subscriber profile information can include the favorite subjects in school, the hobbies, the extracurricular activities and the favorite magazines of the subscriber. The college attendance plans and possible future profession of the subscriber can also be included in the subscriber profile information stored in the database 119a. The subscriber profile information in the subscriber profile database 119a can also include the configuration of the computer used by the subscriber. The accounting database 119 also includes an accounting records database 119b containing a plurality of accounting records 119c. The accounting record 119c within the accounting records database 119b can contain a document identification, a publisher identification and a subscriber identification.
Additionally, the accounting record 119c contains a time stamp indicating the time of access of the document by a user station 102 and a link to the subscriber profile information in the subscriber profile database 119a. The accounting records 119c permit the performance of accounting and marketing functions within the data center 110. One accounting record 119c is stored within the accounting records database 119b for each document that is received by a user station 102 in response to a query. Multiple retrievals of the same document in response to a single query by the user station 102 result in a single accounting record 119c within the accounting records database 119b. Multiple retrievals of a document in response to different queries result in the addition of multiple accounting records 119c to the accounting records database 119b. An accounting record 119c for a document can be added to the accounting records database 119b within the accounting database 119 at the time the retrieved document is transferred to the user station 102.
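By way of illustration only, the rule that repeated retrievals of a document under a single query yield one accounting record, while retrievals under different queries yield multiple records, can be sketched as follows. The class and field names are hypothetical. In particular, the query_id field is an assumption of this example: the accounting record 119c as described lists document, publisher and subscriber identifications and a time stamp, and some query identifier is needed to tell one query from another.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AccountingRecord:
    document_id: str
    publisher_id: str
    subscriber_id: str
    query_id: str          # assumed field, used only to deduplicate
    accessed_at: datetime  # time stamp of access by the user station


class AccountingRecordsDB:
    """In-memory stand-in for the accounting records database."""

    def __init__(self):
        self.records = []
        self._seen = set()  # (query_id, document_id) pairs already logged

    def log_retrieval(self, document_id, publisher_id, subscriber_id,
                      query_id, accessed_at):
        """Add a record unless this document was already logged for this query.

        Returns True when a new record was stored, False otherwise.
        """
        key = (query_id, document_id)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.records.append(AccountingRecord(
            document_id, publisher_id, subscriber_id, query_id, accessed_at))
        return True
```

Keying the duplicate check on the pair of query and document keeps royalty counts from being inflated when a user reopens the same document within one query, which appears to be the intent of the rule stated above.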
Additionally, the accounting database 119 includes a publisher information database 119d for storing publisher database information. The publisher information database 119d includes a publisher information record 440, as previously described, for storing information on each publisher 112 that provides documents to the information retrieval system 100. The information stored for a publisher 112 in a publisher information record 440 can include fields such as a publisher identification field 441, a name field 442, and address fields 443, 444. A contact name field 449, a contact title, a telephone number field 451 and an address field 450 associated with the publisher 112 can also be stored in the publisher information record 440. Additionally, the publisher information record 440 can contain the royalty percentage and the royalty amount by month of the publisher 112. Each time a new publisher 112 is added to the information retrieval system 100, a new publisher information record 440 is added to the publisher information database 119d.
Using the information in the subscriber profile database 119a, the accounting records 119c within the accounting records database 119b and the publisher information records 440 within the publisher information database 119d of the accounting database 119, the accounting manager 138 can perform various royalty, marketing and auditing operations. Using this information, data center 110 can provide reports on document usage, reports on queries that result in no hits or in low relevance hits, reports on the categories of users and reports on the demographics of the various users. The data center 110 can also provide time profiles such as time profiles per session, time profiles per week and time profiles on the think time required by the users. It is also possible for the accounting manager 138 of the information retrieval system 100 to determine and report what queries are submitted by user stations 102, a list of documents returned to each user station 102 in response to each query submitted, and the relevance of each document returned for the queries. The document retrieval and the user feedback information can also be determined and reported by the accounting manager 138.
The royalty payment computations of royalty database information management require a generalized reporting capability of the accounting database 119. For example, the accounting records 119c within the accounting records database 119b can be summarized and sorted according to a number of different criteria. The criteria include, for example, publisher 112, document identification number, date and frequency of access.
A query log table database 119e within the accounting database 119 is used by the data center 110 for maintaining a history of queries made by users. The query log table database 119e contains a query log entry 119f for each query applied to the information retrieval system 100 by a user station 102. Each query log entry 119f identifies the raw natural language and fields, if any, entered by the user to perform a query. The query log entry 119f also includes the date and time of the query as well as the number of matches that resulted from the query. In addition, each log entry 119f cross-references the user identification number of the user that entered the query into the information retrieval system 100. A new query log entry 119f is added to the query log table database 119e each time a new query is received. When the new query is received, all attributes of the query, except the number of documents matching the query, are placed into the query log entry 119f. The query attribute indicating the number of documents matching the query is incremented for each document found by the query server 116, provided the document found meets a minimum relevance threshold.
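By way of illustration only, a query log entry whose match count is incremented only for documents meeting a minimum relevance threshold can be sketched as follows. The class name, the field names, the numeric relevance scale and the use of a greater-than-or-equal comparison are assumptions of this example and are not taken from the specification.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class QueryLogEntry:
    user_id: str            # cross-reference to the user who issued the query
    raw_query: str          # raw natural language text entered by the user
    received_at: datetime   # date and time of the query
    fields: dict = field(default_factory=dict)  # optional fielded terms
    match_count: int = 0    # filled in as matching documents are found


def record_matches(entry, relevance_scores, min_relevance):
    """Increment the match count once per document meeting the threshold."""
    for score in relevance_scores:
        if score >= min_relevance:
            entry.match_count += 1
    return entry
```

All attributes other than match_count are set when the entry is created, which mirrors the order of operations described above: the entry is logged on receipt of the query and the count is updated as the query server reports documents.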
A retrieval account table database 119g is also provided within the accounting database 119 of the data center 110. The retrieval account table database 119g is used by the information retrieval system 100 to log all document retrievals performed in response to queries. When a document retrieval is initiated within the data center 110, a document retrieval entry 119h is inserted in the document retrieval account table database 119g and the time and date of the retrieval is recorded in the document retrieval entry 119h. Each document retrieval entry 119h is a record that stores information identifying the document retrieved by cross-referencing the document information directory table. Each document retrieval entry 119h also identifies the publisher 112 of the document by cross-referencing the publisher information database 119d. The reference to the publisher information database 119d is used to credit the publisher for the retrieval for the purpose of royalty management. Each document retrieval entry 119h also cross-references the query and the user identification number of the user that issued the query for further accounting purposes using the accounting database 119. The attributes of the retrieval account table entries 119h include the document identification number, the publisher identification number, the subscriber identification number, a query identification number and the retrieval time.
In order to start an account reporting session for generating reports using the information in the accounting database 119, the report type must be specified. For example, a publisher retrieval account report can be specified. Additionally, the starting date and the ending date of the report must be specified. These reports include documents that accumulate and present any of the information set forth in any of the data structures and databases set forth.
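By way of illustration only, a publisher retrieval account report over a specified starting date and ending date can be sketched as follows. The entry fields follow the attributes listed above for the retrieval account table entries; the class name, the function name, the treatment of both dates as inclusive and the choice of a simple retrieval count per publisher as the report content are assumptions of this example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class RetrievalEntry:
    document_id: str
    publisher_id: str
    subscriber_id: str
    query_id: str
    retrieved_at: datetime


def publisher_retrieval_report(entries, start, end):
    """Count retrievals per publisher between start and end, inclusive."""
    counts = Counter()
    for entry in entries:
        if start <= entry.retrieved_at.date() <= end:
            counts[entry.publisher_id] += 1
    return dict(counts)
```

Such per-publisher counts could then be combined with the royalty percentage held in the publisher information record to compute a royalty amount for the period, although the specification does not set out that computation.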
A plurality of marketing reports can be generated. For example, reports indicating subscriber profile statistics such as the average age of the users or the percentage of users of each gender can be generated. In addition, for each document retrieved by document retrieval system 136, a report indicating the demographic makeup of the users that selected the document for retrieval can be generated. Subscriber computer statistics such as the percentage of subscribers with different kinds of computers and common modem speeds can be provided. Subscriber/query statistics such as the average number of queries per user and the average number of queries per gender can be obtained. System utilization statistics are also available. For example, retrieval daily peaks and valleys and the average number of retrievals per day can be determined.
Furthermore, it is to be understood that although the present invention has been described with reference to a preferred embodiment, various modifications, known to those skilled in the art, may be made to the structures and process steps presented herein without departing from the invention as recited in the several claims appended hereto.

Claims

What is claimed is:
1. A method for identifying textual documents and multi-media files corresponding to a search topic, comprising the steps of:
(A) storing document records each of which is representative of one of a plurality of textual documents, said document records having text information fields associated therewith, each of said text information fields representing text from one of said plurality of textual documents,
(B) storing multi-media records each of which is representative of one of a plurality of multi-media files, said multi-media records having multi-media information fields for representing only digital video or audio information and associated text fields, each of said associated text fields representing text associated with one of said multi-media information fields,
(C) receiving a single search query corresponding to said search topic,
(D) searching an index database in accordance with said single search query to simultaneously identify document records and multi-media records related to said single search query, said index database having a plurality of search terms corresponding to terms represented by said text information fields and said associated text fields, said index database including a table for associating each of said document and multi-media records with one or more of said search terms,
(E) generating a search result list having entries representative of both textual documents and multi-media files related to said single search query in accordance with said document records and said multi-media records identified in step (D),
(F) retrieving text corresponding to said search topic by selecting entries from said search result list representing selected document records to be retrieved, and then retrieving text represented by text information fields associated with said selected document records, and
(G) retrieving digital video or audio information corresponding to said search topic by selecting entries from said search result list representing selected multi-media records to be retrieved, and then retrieving digital video or audio information represented by multi-media information fields associated with said selected multi-media records.
2. The method of claim 1, wherein said document records and said multi-media records are formed from header files stored in a single common format on said database.
3. The method of claim 2, wherein said multi-media records include a plurality of still image records each of which is representative of a still image.
4. The method of claim 3, wherein said multi-media records include a plurality of motion video records each of which is representative of a sequence of motion video frames.
5. The method of claim 4, wherein said multi-media records include a plurality of digital audio records each of which is representative of a sequence of digital audio frames.
6. The method of claim 1, wherein step (E) further comprises the step of relevance ranking said document and multi-media records identified in step (D) by generating a relevance score corresponding to each of said entries in said search result list.
7. The method of claim 6, wherein step (E) further comprises the step of forming a relevance ordered search result list by ordering said entries in said search result list in accordance with said relevance ranking such that an entry with a highest relevance ranking represents a first entry on said relevance ordered search result list.
8. The method of claim 7, wherein entries corresponding to said document records identified in step (D) and entries corresponding to said multi-media records identified in step (D) are interspersed within said relevance ordered search result list.
9. The method of claim 1, wherein said single search query is in a natural language format.
10. An apparatus for identifying textual documents and multi-media files corresponding to a search topic, comprising:
(A) means for storing document records each of which is representative of one of a plurality of textual documents and multi-media records each of which is representative of one of a plurality of multi-media files, said document records having text information fields associated therewith, each of said text information fields representing text from one of said plurality of textual documents, said multi-media records having multi-media information fields for representing only digital video or audio information and associated text fields, each of said associated text fields representing text associated with one of said multi-media information fields,
(B) means for receiving a single search query corresponding to said search topic;
(C) searching means, coupled to an index database and said means for receiving said single query, for searching said database in accordance with said single search query to simultaneously identify document records and multi-media records related to said single search query, said index database having a plurality of search terms corresponding to terms represented by said text information fields and said associated text fields, said index database including a table for associating each of said document and multi-media records with one or more of said search terms;
(D) search result list generation means, coupled to said searching means, for generating a search result list having entries representative of both textual documents and multi-media files related to said single search query in accordance with said document records and said multi-media records identified by said searching means,
(E) means for receiving signals representing selected document records and selected multi-media records identified on said search results list,
(F) first means for retrieving, from said means for storing, text represented by text information fields associated with said selected document records; and
(G) second means for retrieving, from said means for storing, digital video or audio infoimation represented by multi-media information fields associated with said selected multi-media records 1 1 The apparatus of claim 10, wherein said document records and said multi¬ media records are formed from header files stored in a single common format on said database 12 The apparatus of claim 11, wherein said multi-media records stored on said database include a plurality of still image records each of which is representative ofa still image.
13. The apparatus of claim 12, wherein said multi-media records stored on said database include a plurality of motion video records each of which is representative of a sequence of motion video frames.
14. The apparatus of claim 13, wherein said multi-media records stored on said database include a plurality of digital audio records each of which is representative of a sequence of digital audio frames.
15. The apparatus of claim 10, wherein said search result list generating means includes means for relevance ranking said document and multi-media records identified by said searching means by generating a relevance score corresponding to each of said entries in said search result list.
16. The apparatus of claim 15, wherein said result list generating means further comprises means for forming a relevance ordered search result list by ordering said entries in said search result list in accordance with said relevance ranking such that an entry with a highest relevance ranking represents a first entry on said relevance ordered search result list.
17. The apparatus of claim 16, wherein entries corresponding to said document records identified by said searching means and entries corresponding to said multi-media records identified by said searching means are interspersed within said relevance ordered search result list.
18. The apparatus of claim 10, wherein said single search query is in a natural language format.
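The arrangement recited in claims 10 and 15 to 17 (one index table mapping search terms to both document records and multi-media records, with results ranked by relevance and interspersed) can be illustrated with a minimal sketch. Everything named here (`Record`, `build_index`, `search`, the sample records, and the term-count scoring) is a hypothetical illustration and not the patent's implementation; the claims do not specify a scoring formula.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical record: a document, or a media file plus its associated text."""
    record_id: str
    kind: str  # "document", "image", "video" or "audio"
    text: str  # document text, or the text associated with the media file


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Build one table: search term -> {record_id: occurrences}, for all record kinds."""
    index = defaultdict(lambda: defaultdict(int))
    for record in records:
        for term in tokenize(record.text):
            index[term][record.record_id] += 1
    return index


def search(index, records_by_id, query):
    """Score records by summed term counts (an assumed, simplistic relevance score)."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for record_id, count in index.get(term, {}).items():
            scores[record_id] += count
    # Highest score first; ties broken by record id so the order is stable.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(records_by_id[record_id], score) for record_id, score in ranked]


records = [
    Record("doc1", "document", "The space shuttle launch was delayed. The shuttle crew waited."),
    Record("img1", "image", "Photograph of the space shuttle on the launch pad"),
    Record("doc2", "document", "Weather delays a rocket launch"),
    Record("aud1", "audio", "Radio interview about jazz music"),
]
index = build_index(records)
by_id = {record.record_id: record for record in records}
results = search(index, by_id, "shuttle launch")
print([(record.record_id, record.kind, score) for record, score in results])
```

With these sample records, the image record ranks between the two document records, which is the interspersing that claim 17 describes. A real system would use a more robust relevance measure than raw term counts.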
19. A computer-implemented method for composing a composite document on a selected topic from a plurality of information sources by searching said plurality of information sources and identifying, displaying and copying files corresponding to said selected topic, comprising the steps of:
(A) storing records each of which is representative of one of a plurality of information files in a database;
(B) receiving a single search query corresponding to said search topic;
(C) searching said database in accordance with said single search query to identify records related to said single search query;
(D) generating a search result list having entries representative of information files identified during said searching step and displaying said search result list in a first display window open on a user display;
(E) receiving signals representative of at least first and second selected entries from said search result list, said first and second selected entries respectively corresponding to first and second information files from said plurality of information files;
(F) opening a second display window on said user display, different from said first display window, for displaying at least a portion of said first information file;
(G) opening a third display window on said user display, different from said first and second display windows, for displaying at least a portion of said second information file; and
(H) opening a document composition window on said user display, different from said first, second and third display windows, for receiving said portion of said first information file and said portion of said second information file; and
(I) composing said composite document by copying said portion of said first information file from said second display window to said document composition window and copying said portion of said second information file from said third display window to said document composition window.
20. The method of claim 19, wherein said plurality of information files represented by said records stored in step (A) include a first group of information files representing textual documents.
21. The method of claim 20, wherein said plurality of information files represented by said records stored in step (A) include a second group of information files representing still images, said first information file is representative of a selected textual document, said second information file is representative of a selected still image, and said composite document composed in step (G) is formed by copying said selected textual document and said selected still image from said second and third windows, respectively, to said document composition window.
22. The method of claim 21, wherein said plurality of information files represented by said records stored in step (A) include a third group of information files each of which is representative of a sequence of motion video frames; step (E) further comprises receiving signals representative of a third selected entry from said search result list, said third selected entry corresponding to a third information file from said plurality of information files and being representative of a selected sequence of motion video frames; step (G) further comprises, while said first display window remains open, opening a fourth display window on said user display, different from said first, second and third display windows, for displaying at least a portion of said selected sequence of motion video frames; and step (H) further comprises forming said composite document by copying said selected textual document, said selected still image, and at least a portion of said selected sequence of motion video images, from said second, third and fourth windows, respectively, to said document composition window.
23. The method of claim 22, wherein said plurality of information files represented by said records stored in step (A) include a fourth group of information files each of which is representative of a sequence of digital audio frames; step (E) further comprises receiving signals representative of a fourth selected entry from said search result list, said fourth selected entry corresponding to a fourth information file from said plurality of information files and being representative of a selected sequence of digital audio frames; step (G) further comprises playing at least a portion of said selected sequence of digital audio frames; and step (H) further comprises forming said composite document by copying said selected textual document, said selected still image, at least said portion of said selected sequence of motion video images, and at least said portion of said selected sequence of digital audio frames to said document composition window.
24. The method of claim 19, wherein step (H) comprises the steps of: (i) copying said portion of said first information file from said second display window to a clipboard window and copying said portion of said second information file from said third display window to said clipboard window, and (ii) composing said composite document by copying said portion of said first information file from said clipboard window to said document composition window and copying said portion of said second information file from said clipboard window to said document composition window.
25. The method of claim 19, wherein step (D) further comprises the step of relevance ranking said records identified in said searching step by generating a relevance score corresponding to each of said entries in said search result list.
26. The method of claim 25, wherein step (D) further comprises the step of forming a relevance ordered search result list by ordering said entries in said search result list in accordance with said relevance scores such that an entry with a highest relevance score represents a first entry on said relevance ordered search result list.
27. The method of claim 26, wherein entries corresponding to said document records identified in step (C) and entries corresponding to said multi-media records identified in step (C) are interspersed within said relevance ordered search result list.
28. The method of claim 27, wherein said multi-media records identified in step (C) include files representing still images.
29. The method of claim 28, wherein said multi-media records identified in step (C) further include files representing sequences of motion video frames, and files representing sequences of digital audio frames.
30. The method of claim 19, wherein said single search query is in a natural language format.
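The composition flow of claims 19 and 24 (portions copied from separate display windows to a clipboard, then from the clipboard into a composition window) can be modeled with a short sketch. The `Window` and `Clipboard` classes and the `compose` function are hypothetical names chosen for illustration, and windows are reduced to plain strings; the claims do not prescribe any particular data structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Window:
    """Hypothetical display window holding the displayed portion of a file as text."""
    title: str
    content: str = ""


@dataclass
class Clipboard:
    """Hypothetical clipboard window that accumulates copied portions in order."""
    items: List[str] = field(default_factory=list)

    def copy_from(self, window: Window, start: int = 0, end: Optional[int] = None) -> None:
        # Copy a slice of the window's content; the default copies everything.
        self.items.append(window.content[start:end])


def compose(clipboard: Clipboard, composition: Window) -> Window:
    """Copy every clipboard item, in order, into the composition window."""
    composition.content = "\n".join(clipboard.items)
    return composition


second = Window("second display window", "Mars is the fourth planet from the Sun.")
third = Window("third display window", "Caption: Mars seen from orbit")
clipboard = Clipboard()
clipboard.copy_from(second, 0, 4)  # copy only a portion of the first file
clipboard.copy_from(third)         # copy the whole displayed portion of the second file
composite = compose(clipboard, Window("document composition window"))
print(composite.content)
```

Keeping the clipboard as an ordered list means the composite document preserves the order in which the user copied the portions, which is one reasonable reading of step (ii) of claim 24.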
31 An apparatus for composing a composite document on a selected topic from a plurality of information sources by searching said plurality of information sources and identifying, displaying and copying files corresponding to said selected topic, comprising
(A) a database for storing records each of which is representative of one of a plurality of information files,
(B) means for receiving a single search query corresponding to said search topic,
(C) searching means, coupled to said means for receiving said single search query and to said database, for searching said database in accordance with said single search query to identify records related to said single search query;
(D) search result list generating means, coupled to said searching means, for generating a search result list having entries representative of information files identified by said searching means;
(E) display means, coupled to said search result generating means, for displaying said search result list in a first display window open on a user display;
(F) file retrieval means for receiving signals representative of at least first and second selected entries from said search result list, said first and second selected entries respectively corresponding to first and second information files from said plurality of information files, and for retrieving said first and second information files;
(G) display window opening means, coupled to said file retrieval means, for opening a second display window for displaying at least a portion of said first information file, a third display window for displaying at least a portion of said second information file and a document composition window for displaying at least a portion of said second information file, said second display window being different from said first display window, said third display window being different from said first and second display windows, and said document composition window being different from said first, second and third display windows; and
(H) composing means for composing said composite document by copying said portion of said first information file from said second display window to said document composition window and copying said portion of said second information file from said third display window to said document composition window.
32. The apparatus of claim 31, wherein said plurality of information files represented by said records stored in said database include a first group of information files representing textual documents.
33. The apparatus of claim 32, wherein said plurality of information files represented by said records stored in said database include a second group of information files representing still images, said first information file is representative of a selected textual document, said second information file is representative of a selected still image, and said composing means includes means for forming said composite document by copying said selected textual document and said selected still image from said second and third windows, respectively, to said document composition window.
34. The apparatus of claim 33, wherein said plurality of information files represented by said records stored in said database include a third group of information files each of which is representative of a sequence of motion video frames, said file retrieval means further comprises means for receiving signals representative of a third selected entry from said search result list, said third selected entry corresponding to a third information file from said plurality of information files and being representative of a selected sequence of motion video frames, said display window opening means further comprises means for opening a fourth display window on said user display, different from said first, second and third display windows, for displaying at least a portion of said selected sequence of motion video frames while said first display window remains open, and said composing means includes means for forming said composite document by copying said selected textual document, said selected still image, and at least a portion of said selected sequence of motion video images, from said second, third and fourth windows, respectively, to said document composition window.
35. The apparatus of claim 34, wherein said plurality of information files represented by said records stored in said database include a fourth group of information files each of which is representative of a sequence of digital audio frames, said file retrieval means further comprises means for receiving signals representative of a fourth selected entry from said search result list, said fourth selected entry corresponding to a fourth information file from said plurality of information files and being representative of a selected sequence of digital audio frames, said apparatus further comprising (I) means for playing at least a portion of said selected sequence of digital audio frames, wherein said composing means includes means for forming said composite document by copying said selected textual document, said selected still image, at least said portion of said selected sequence of motion video images, and at least said portion of said selected sequence of digital audio frames to said document composition window.
36. The apparatus of claim 31, wherein said composing means comprises (i) means for copying said portion of said first information file from said second display window to a clipboard window and copying said portion of said second information file from said third display window to said clipboard window, and (ii) means for composing said composite document by copying said portion of said first information file from said clipboard window to said document composition window and copying said portion of said second information file from said clipboard window to said document composition window.
37. The apparatus of claim 31, further comprising relevance ranking means, coupled to said searching means for relevance ranking said records identified by said searching means by generating a relevance score corresponding to each of said entries in said search result list.
38. The apparatus of claim 37, wherein said search result list generating means is coupled to said relevance ranking means, said search result list generating means further including means for forming a relevance ordered search result list by ordering said entries in said search result list in accordance with said relevance scores such that an entry with a highest relevance score represents a first entry on said relevance ordered search result list.
39. The apparatus of claim 38, wherein entries corresponding to said document records identified by said searching means and entries corresponding to said multi-media records identified by said searching means are interspersed within said relevance ordered search result list.
40. The apparatus of claim 39, wherein said multi-media records identified by said searching means include files representing still images.
41. The apparatus of claim 40, wherein said multi-media records identified by said searching means further include files representing sequences of motion video frames, and files representing sequences of digital audio frames.
42. The apparatus of claim 31, wherein said single search query is in a natural language format.
43. A method for storing information in an information retrieval system having a database for retrieval of said input information in response to a query, comprising the steps of:
(A) receiving text information representative of text,
(B) receiving image information representative of an image,
(C) receiving image text information representative of text associated with said image,
(D) storing said image information in an image information format,
(E) storing said text information and said image text information in a common text information format whereby the format of said stored text information is identical to the format of said stored image text information,
(F) searching said text information and said image text information in said common text information format and retrieving said text information and said image text information in response to a single query, and
(G) retrieving said image information associated with said retrieved image text information whereby said text information and said image information are retrieved in accordance with said single query.
44. The method for storing information in an information retrieval system of claim 43, wherein said information retrieval system includes a video display system, further comprising the step of displaying said retrieved text information and said retrieved image information on said display system.
45. The method for storing information in an information retrieval system of claim 44, wherein said video display system includes display system windows, further comprising the step of displaying said retrieved text information and said retrieved image information in a single video display system window.
46. The method for storing information in an information retrieval system of claim 43, further comprising the steps of:
(H) receiving audio information representative of an audio sound,
(I) receiving audio text information representative of text associated with said audio information,
(J) storing said audio information in an audio information format,
(K) storing said audio text information in said common text information format whereby the format of said stored audio text information is identical to the format of said stored image text information; and
(L) retrieving said audio text information in response to said single query.
47. The method for storing information in an information retrieval system of claim 46, further comprising the step of selecting said audio information associated with said retrieved audio text information.
48. The method for storing information in an information retrieval system of claim 43, further comprising the steps of viewing said image and generating said image text information in response to said viewing.
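The storage scheme of claim 43 (plain text and image-associated text kept in one common text format, image data kept separately in its own format, and a single query retrieving both) can be sketched as follows. The `TextRecord` and `InfoStore` names, the `media_ref` field, the whole-word matching, and the placeholder image bytes are all assumptions made for illustration; the claims do not define the common format or the search technique.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class TextRecord:
    """Hypothetical common text format shared by plain text and image text."""
    record_id: str
    text: str
    media_ref: Optional[str] = None  # key of the associated image; None for plain text


class InfoStore:
    """Hypothetical store: one list of text records plus a separate image store."""

    def __init__(self) -> None:
        self.text_records: List[TextRecord] = []
        self.images: Dict[str, bytes] = {}  # image data kept in its own format

    def add_text(self, record_id: str, text: str) -> None:
        self.text_records.append(TextRecord(record_id, text))

    def add_image(self, record_id: str, image_bytes: bytes, image_text: str) -> None:
        # The image text goes into the same format as ordinary text records.
        self.images[record_id] = image_bytes
        self.text_records.append(TextRecord(record_id, image_text, media_ref=record_id))

    def query(self, query: str) -> List[Tuple[TextRecord, Optional[bytes]]]:
        """Search all text records once; attach image data where a record has one."""
        terms = set(query.lower().split())
        hits = [r for r in self.text_records if terms & set(r.text.lower().split())]
        return [(r, self.images.get(r.media_ref) if r.media_ref else None) for r in hits]


store = InfoStore()
store.add_text("t1", "The Eiffel Tower is in Paris")
store.add_image("i1", b"GIF89a", "Photo of the Eiffel Tower at night")  # placeholder bytes
store.add_text("t2", "Recipe for sourdough bread")
hits = store.query("eiffel")
print([(record.record_id, image is not None) for record, image in hits])
```

Because both kinds of text share one record type, the search code needs no special case for images; only the final retrieval step consults the separate image store.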
49. A method for storing information on an information retrieval system having a database for retrieval of said input information in response to a query, comprising the steps of:
(A) receiving text information representative of text;
(B) receiving audio information representative of an audio;
(C) receiving audio text information representative of text associated with said audio information;
(D) storing said audio information in an audio information format;
(E) storing said text information and said audio text information in said database in a common text information format whereby the format of said stored text information is identical to the format of said stored text information;
(F) searching said text information and said audio text information in said common text information format and retrieving said text information and said audio text information in response to a single query;
(G) selecting said audio information associated with said retrieved audio text information; and
(H) retrieving said selected audio information whereby said text information and said audio information are retrieved in accordance with the same query.
50. The method for storing information in an information retrieval system of claim 49, further comprising the steps of:
(I) receiving image information representative of an image;
(J) receiving image text information representative of text associated with said image;
(K) storing said video information in a video information format,
(L) storing said image text information in said common text information format, and
(M) retrieving said image text information in response to said single query.
51. A system for storing information in an information retrieval system having a database for retrieval of said input information in response to a query, comprising:
text information representative of text,
image information representative of an image,
image text information representative of text associated with said image,
an image information format for storing said image information,
a common text information format for storing said text information and said image text information whereby the format of said stored text information is identical to the format of said stored image text information,
a search engine for searching said text information and said image text information in said common text information format and retrieving said text information and said image text information in response to a single query; and
said search engine having a retrieval system for retrieving said image information associated with said retrieved image text information whereby said text information and said image information are retrieved in accordance with said single query.
52. The system for storing information in an information retrieval system of claim 51, wherein said information retrieval system includes a video display system, further comprising a display of said retrieved text information and said retrieved image information.
53. The system for storing information in an information retrieval system of claim 52, wherein said video display system includes display system windows, wherein said display comprises a simultaneous display of said retrieved text information and said retrieved image information in a single video display system window.
54. The system for storing information in an information retrieval system of claim 51, further comprising:
audio information representative of an audio sound,
audio text information representative of text associated with said audio information;
an audio information format storing said audio information;
said search engine having means for storing said audio text information in said common text information format whereby the format of said stored audio text information is identical to the format of said stored image text information; and
said search engine having a retrieval system for retrieving said audio text information in response to said single query.
55. The system for storing information in an information retrieval system of claim 54, wherein said search engine selects said audio information associated with said retrieved audio text information.
56. A system for storing information on an information retrieval system having a database for retrieval of said input information in response to a query, comprising:
text information representative of text;
audio information representative of an audio;
audio text information representative of text associated with said audio information;
an audio information format for storing said audio information;
a common text information format for storing said text information and said audio text information in said database whereby the format of said stored text information is identical to the format of said stored audio text information;
a search engine for searching said text information and said audio text information in said common text information format and retrieving said text information and said audio text information in response to a single query; and
said search engine having a retrieval system for selecting said audio information associated with said retrieved audio text information and retrieving said selected audio information whereby said text information and said audio information are retrieved in accordance with the same query.
57. The system for storing information in an information retrieval system of claim 56, further comprising:
image information representative of an image;
image text information representative of text associated with said image;
an image information format for storing said image information, said image text information being stored in said common text information format; and
said retrieval system being adapted to retrieve said image text information in response to said single query.
PCT/US1996/015664 1995-09-15 1996-09-13 Method and apparatus for identifying textual documents and multi-media files corresponding to a search topic WO1997010537A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU72026/96A AU7202696A (en) 1995-09-15 1996-09-13 Method and apparatus for identifying textual documents and multi-media files corresponding to a search topic

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US08/529,233 1995-09-15
US08/529,250 US5742816A (en) 1995-09-15 1995-09-15 Method and apparatus for identifying textual documents and multi-mediafiles corresponding to a search topic
US08/529,233 US5675788A (en) 1995-09-15 1995-09-15 Method and apparatus for generating a composite document on a selected topic from a plurality of information sources
US08/528,683 US5659742A (en) 1995-09-15 1995-09-15 Method for storing multi-media information in an information retrieval system
US08/529,250 1995-09-15
US08/528,683 1995-09-15

Publications (2)

Publication Number Publication Date
WO1997010537A2 true WO1997010537A2 (en) 1997-03-20
WO1997010537A3 WO1997010537A3 (en) 1997-04-24

Family

ID=27415002

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1996/015664 WO1997010537A2 (en) 1995-09-15 1996-09-13 Method and apparatus for identifying textual documents and multi-media files corresponding to a search topic

Country Status (2)

Country Link
AU (1) AU7202696A (en)
WO (1) WO1997010537A2 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1998057280A1 (en) * 1997-06-10 1998-12-17 Daikin U.S. Corporation A computer implemented method of generating virtual files for sharing information of a physical information file
EP1183686A1 (en) * 2000-02-23 2002-03-06 Havin Co., Ltd. Apparatus and method for searching digital audio data file from media where digital audio data files are recorded
GB2379051A (en) * 2001-06-04 2003-02-26 Hewlett Packard Co System and Method for Combining Voice Annotation and Recognition Search Criteria With Traditional Search Criteria into Metadata
EP1292910A1 (en) * 2000-05-20 2003-03-19 Young-Hie Leem On demand contents providing method and system
US6961700B2 (en) 1996-09-24 2005-11-01 Allvoice Computing Plc Method and apparatus for processing the output of a speech recognition engine
US7734680B1 (en) 1999-09-30 2010-06-08 Koninklijke Philips Electronics N.V. Method and apparatus for realizing personalized information from multiple information sources
USRE42262E1 (en) 1999-08-30 2011-03-29 Frederick Monocacy Llc Method and apparatus for representing and navigating search results

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5241671A (en) * 1989-10-26 1993-08-31 Encyclopaedia Britannica, Inc. Multimedia search system using a plurality of entry path means which indicate interrelatedness of information
US5404506A (en) * 1985-03-27 1995-04-04 Hitachi, Ltd. Knowledge based information retrieval system
US5404435A (en) * 1991-07-29 1995-04-04 International Business Machines Corporation Non-text object storage and retrieval
US5524193A (en) * 1991-10-15 1996-06-04 And Communications Interactive multimedia annotation method and apparatus

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5404506A (en) * 1985-03-27 1995-04-04 Hitachi, Ltd. Knowledge based information retrieval system
US5241671A (en) * 1989-10-26 1993-08-31 Encyclopaedia Britannica, Inc. Multimedia search system using a plurality of entry path means which indicate interrelatedness of information
US5241671C1 (en) * 1989-10-26 2002-07-02 Encyclopaedia Britannica Educa Multimedia search system using a plurality of entry path means which indicate interrelatedness of information
US5404435A (en) * 1991-07-29 1995-04-04 International Business Machines Corporation Non-text object storage and retrieval
US5524193A (en) * 1991-10-15 1996-06-04 And Communications Interactive multimedia annotation method and apparatus

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6961700B2 (en) 1996-09-24 2005-11-01 Allvoice Computing Plc Method and apparatus for processing the output of a speech recognition engine
WO1998057280A1 (en) * 1997-06-10 1998-12-17 Daikin U.S. Corporation A computer implemented method of generating virtual files for sharing information of a physical information file
USRE44794E1 (en) 1999-08-30 2014-03-04 Frederick Monocacy Llc Method and apparatus for representing and navigating search results
USRE42262E1 (en) 1999-08-30 2011-03-29 Frederick Monocacy Llc Method and apparatus for representing and navigating search results
US7734680B1 (en) 1999-09-30 2010-06-08 Koninklijke Philips Electronics N.V. Method and apparatus for realizing personalized information from multiple information sources
EP1183686A1 (en) * 2000-02-23 2002-03-06 Havin Co., Ltd. Apparatus and method for searching digital audio data file from media where digital audio data files are recorded
EP1183687A1 (en) * 2000-02-23 2002-03-06 Havin Co., Ltd. Apparatus and method for recording digital audio data file
EP1183686A4 (en) * 2000-02-23 2004-10-20 Havin Co Ltd Apparatus and method for searching digital audio data file from media where digital audio data files are recorded
EP1183687A4 (en) * 2000-02-23 2004-10-27 Havin Co Ltd Apparatus and method for recording digital audio data file
EP1292910A4 (en) * 2000-05-20 2006-02-08 Young-Hie Leem On demand contents providing method and system
EP1292910A1 (en) * 2000-05-20 2003-03-19 Young-Hie Leem On demand contents providing method and system
GB2379051B (en) * 2001-06-04 2005-12-07 Hewlett Packard Co System and method for combining voice annotation and recognition search criteria with traditional search criteria into metadata
GB2379051A (en) * 2001-06-04 2003-02-26 Hewlett Packard Co System and Method for Combining Voice Annotation and Recognition Search Criteria With Traditional Search Criteria into Metadata

Also Published As

Publication number Publication date
WO1997010537A3 (en) 1997-04-24
AU7202696A (en) 1997-04-01

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AU CA CN JP MX NZ

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA