Publication number: US20040163034 A1
Publication type: Application
Application number: US 10/685,479
Publication date: 19 Aug 2004
Filing date: 16 Oct 2003
Priority date: 17 Oct 2002
Also published as: US7292977, US7389229, US7424427, US20040083090, US20040083104, US20040138894, US20040172250, US20040176946, US20040204939, US20040230432, US20050038649
Publication number: 10685479, 685479, US 2004/0163034 A1, US 2004/163034 A1, US 20040163034 A1, US 20040163034A1, US 2004163034 A1, US 2004163034A1, US-A1-20040163034, US-A1-2004163034, US2004/0163034A1, US2004/163034A1, US20040163034 A1, US20040163034A1, US2004163034 A1, US2004163034A1
Inventors: Sean Colbath, Francis Kubala
Original Assignee: Sean Colbath, Kubala Francis G.
Systems and methods for labeling clusters of documents
US 20040163034 A1
Abstract
A system (520) generates labels for clusters of documents. The system (520) identifies topics associated with the documents in the clusters and determines whether the topics are associated with approximately half or more of the documents in the clusters. The system (520) then generates labels for the clusters using the topics that are associated with approximately half or more of the documents in the clusters.
Claims(19)
What is claimed is:
1. A method of creating labels for clusters of documents, comprising:
identifying topics associated with the documents in the clusters;
determining whether the topics are associated with at least half of the documents in the clusters;
adding ones of the topics that are associated with at least half of the documents in the clusters to cluster lists; and
forming labels for the clusters from the cluster lists.
2. The method of claim 1, wherein the identifying topics includes:
using a probabilistic Hidden Markov Model to determine the topics.
3. The method of claim 1, wherein the forming labels includes:
ranking the ones of the topics, and
placing the ones of the topics in the labels in ranked order.
4. The method of claim 3, wherein the ranking the ones of the topics includes:
assigning ranks to the ones of the topics based on a number of the documents with which the ones of the topics are associated.
5. The method of claim 1, further comprising:
ranking the ones of the topics based on a number of the documents with which the ones of the topics are associated.
6. The method of claim 5, wherein when a first one of the ones of the topics, as a first topic, is associated with a majority of the documents in one of the clusters and a second one of the ones of the topics, as a second topic, is associated with less than the majority of the documents in the one of the clusters, the first topic is ranked higher than the second topic.
7. The method of claim 5, wherein the ranking the ones of the topics includes:
assigning higher ranks to first ones of the ones of the topics that are associated with larger numbers of the documents than second ones of the ones of the topics that are associated with smaller numbers of the documents.
8. The method of claim 5, wherein the forming labels includes:
sorting the cluster lists based on the rankings of the ones of the topics.
9. A system for generating a label for a cluster of documents, comprising:
means for identifying topics associated with the documents in the cluster;
means for determining whether the topics are associated with at least half of the documents in the cluster; and
means for generating a label for the cluster based on one or more of the topics that are associated with at least half of the documents in the cluster.
10. The system of claim 9, further comprising:
means for ranking the one or more of the topics based on a number of the documents with which the one or more of the topics are associated.
11. The system of claim 10, wherein the means for generating a label includes:
means for sorting the one or more of the topics based on the ranking to form the label for the cluster.
12. A system for creating a label for a cluster of documents, comprising:
logic configured to identify topics associated with the documents in the cluster;
logic configured to determine whether the topics are associated with approximately half or more of the documents in the cluster;
logic configured to rank ones of the topics that are associated with approximately half or more of the documents in the cluster; and
logic configured to generate a label for the cluster using the ones of the topics in ranked order.
13. The system of claim 12, wherein when a first one of the ones of the topics, as a first topic, is associated with a majority of the documents in the cluster and a second one of the ones of the topics, as a second topic, is associated with less than the majority of the documents in the cluster, the first topic is ranked higher than the second topic.
14. The system of claim 12, wherein the logic configured to rank ones of the topics includes:
logic configured to assign higher ranks to first ones of the ones of the topics that are associated with larger numbers of the documents than second ones of the ones of the topics that are associated with smaller numbers of the documents.
15. The system of claim 12, wherein the logic configured to generate a label includes:
logic configured to sort the ones of the topics based on the rankings of the ones of the topics.
16. A topic detection system, comprising:
a decision engine configured to:
receive a plurality of documents, and
group the documents into a plurality of clusters; and
a label engine configured to:
identify topics associated with the documents in the clusters,
determine whether the topics are associated with at least half of the documents in the clusters, and
form labels for the clusters using ones of the topics that are associated with at least half of the documents in the clusters.
17. The system of claim 16, wherein the label engine is further configured to:
rank the ones of the topics based on a number of the documents with which the ones of the topics are associated.
18. A method for creating labels for clusters of documents, comprising:
identifying topics associated with the documents in the clusters;
determining whether the topics are associated with a predetermined portion of the documents in the clusters; and
generating labels for the clusters using ones of the topics that are associated with approximately half or more of the documents in the clusters.
19. The method of claim 18, wherein the predetermined portion of the documents is equal to approximately half of the documents.
Description
    RELATED APPLICATION
  • [0001]
    This application is related to U.S. application Ser. No. 10/ ______ (Docket No. 02-4034), entitled “SYSTEMS AND METHODS FOR INTERACTIVE CLUSTERING OF DOCUMENTS,” filed concurrently herewith and incorporated herein by reference.
  • [0002]
    This application claims priority under 35 U.S.C. 119 based on U.S. Provisional Application No. 60/419,214, filed Oct. 17, 2002, the contents of which are incorporated herein by reference.
  • GOVERNMENT CONTRACT
  • [0003]
    The U.S. Government may have a paid-up license in this invention and the right in limited circumstances to require the patent owner to license others on reasonable terms as provided for by the terms of Contract No. N66001-00-C-8008 awarded by the Defense Advanced Research Projects Agency.
  • BACKGROUND OF THE INVENTION
  • [0004]
    1. Field of the Invention
  • [0005]
    The present invention relates generally to multimedia environments and, more particularly, to systems and methods for labeling clusters of similar documents.
  • [0006]
    2. Description of Related Art
  • [0007]
    When trying to organize large collections of documents, it is sometimes useful to organize these documents into similar groupings, where similarity is determined by some metric, such as the topics of the documents or their relevance to a particular event. Conventional systems typically receive streams of documents and group the documents into clusters that ideally concern a single event, or more typically, a single topic.
  • [0008]
    One particular conventional system includes an event or topic detection system that uses natural language techniques to make a decision about each of the documents it receives. The decision involves the determination of whether a particular document relates to a new event (or topic) that the system has not seen before or an existing event (or topic) that the system has seen before. If the document relates to a new event, then the system creates a new cluster and assigns the document to this new cluster. If the document, instead, relates to an existing event, then the system assigns the document to an existing cluster relating to the event.
  • [0009]
    The system usually operates based on a set of rules. One rule is that a document can only be assigned to one cluster. Another rule is that the clusters can only grow and may never be broken. To this effect, the system may never revisit documents that have already been assigned to clusters to determine whether the documents should have been assigned to different clusters.
  • [0010]
    The conventional system usually presents the clusters to an end user with no labeling other than, possibly, the number of documents in the clusters. This is of limited usefulness to a user looking for a document in one of the clusters.
  • [0011]
    As a result, there is a need for a labeling scheme that creates cluster labels that are indicative of the documents in the clusters and are meaningful to an end user.
  • SUMMARY OF THE INVENTION
  • [0012]
    Systems and methods consistent with the present invention address this and other needs by creating labels for clusters based on document topics that are associated with at least half of the documents in the clusters. The topics may be ranked based on the number of documents relating to the corresponding topics. The topics may then be presented in rank order as labels for the clusters.
  • [0013]
    In one aspect consistent with the principles of the invention, a system that generates labels for clusters of documents is provided. The system identifies topics associated with the documents in the clusters and determines whether the topics are associated with approximately half or more of the documents in the clusters. The system then generates labels for the clusters using the topics that are associated with approximately half or more of the documents in the clusters.
  • [0014]
    In another aspect consistent with the present invention, a method of creating labels for clusters of documents is provided. The method includes identifying topics associated with the documents in the clusters; determining whether the topics are associated with at least half of the documents in the clusters; adding ones of the topics that are associated with at least half of the documents in the clusters to cluster lists; and forming labels for the clusters from the cluster lists.
  • [0015]
    In yet another aspect consistent with the present invention, a system for creating a label for a cluster of documents is provided. The system is configured to identify topics associated with the documents in the cluster and determine whether the topics are associated with approximately half or more of the documents in the cluster. The system is further configured to rank the topics that are associated with approximately half or more of the documents in the cluster and generate a label for the cluster using the ranked topics.
  • [0016]
    In a further aspect consistent with the present invention, a topic detection system is provided. The topic detection system includes a decision engine and a label engine. The decision engine is configured to receive documents and group the documents into clusters. The label engine is configured to identify topics associated with the documents in the clusters, determine whether the topics are associated with at least half of the documents in the clusters, and form labels for the clusters using the topics that are associated with at least half of the documents in the clusters.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • [0017]
    The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate the invention and, together with the description, explain the invention. In the drawings,
  • [0018]
    FIG. 1 is a diagram of a system in which systems and methods consistent with the present invention may be implemented;
  • [0019]
    FIG. 2 is an exemplary diagram of the server system of FIG. 1 according to an implementation consistent with the principles of the invention;
  • [0020]
    FIG. 3 is an exemplary diagram of the server of FIG. 2 according to an implementation consistent with the principles of the invention;
  • [0021]
    FIG. 4 is an exemplary diagram of a portion of the indexing system of FIG. 2 according to an implementation consistent with the principles of the invention;
  • [0022]
    FIG. 5 is an exemplary diagram of the event detection system of FIG. 2 according to an implementation consistent with the present invention;
  • [0023]
    FIG. 6 is a flowchart of exemplary processing for grouping documents into clusters according to an implementation consistent with the principles of the invention;
  • [0024]
    FIG. 7 is a flowchart of exemplary processing for creating a label for a cluster according to an implementation consistent with the principles of the invention; and
  • [0025]
    FIGS. 8A and 8B are exemplary diagrams of a graphical user interface that may be presented to a user according to an implementation consistent with the principles of the invention.
  • DETAILED DESCRIPTION
  • [0026]
    The following detailed description of the invention refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements. Also, the following detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims and equivalents.
  • [0027]
    Systems and methods consistent with the present invention create cluster labels that are indicative of the documents in the clusters and are meaningful to an end user. The labels may be based on document topics that are associated with at least half of the documents in the clusters. The topics may be ranked based on their occurrence in the documents of the cluster. The topics may then be presented in rank order as a label for the cluster.
  • [0028]
    In the discussion that follows, a document corresponds to a body of media that is contiguous in time (from beginning to end or from time A to time B). Documents might include audio documents (e.g., radio broadcasts), video documents (e.g., television broadcasts), and/or text documents (e.g., word processing documents) in any language.
  • Exemplary System
  • [0029]
    FIG. 1 is a diagram of an exemplary system 100 in which systems and methods consistent with the present invention may be implemented. System 100 may include clients 110 connected to server system 120 via a network 130. Network 130 may include any type of network, such as a local area network (LAN), a wide area network (WAN), a public telephone network (e.g., the Public Switched Telephone Network (PSTN)), a virtual private network (VPN), or a combination of networks. Clients 110 and server system 120 may connect to network 130 via wired, wireless, and/or optical connections.
  • [0030]
    Generally, clients 110 may interact with server system 120 to obtain documents of interest. A user of one of clients 110 may then cause the documents to be automatically grouped into clusters on demand. A client 110 may include a personal computer, a laptop, a personal digital assistant, or another type of device that is capable of interacting with server system 120 to obtain documents of interest. A client 110 may present the documents to a user via a graphical user interface (GUI), possibly within a web browser window.
  • [0031]
    Generally, server system 120 may process and maintain documents. Server system 120 may receive documents in a wide variety of formats (e.g., audio, video, and text) and process the documents to extract features and other relevant information from the documents. Server system 120 may also group documents into clusters and, when requested, provide documents to clients 110.
  • [0032]
    FIG. 2 is an exemplary diagram of server system 120 according to an implementation consistent with the principles of the invention. Server system 120 may include a server 210, an indexing system 220, an event detection system 230, and a database 240 connected via a network 250. Network 250 may include a LAN, WAN, the Internet, network 130, or other types of direct or indirect connections.
  • [0033]
    Server 210 may include a computer or another type of device capable of interacting with clients 110. In one implementation consistent with the principles of the invention, server 210 includes indexing system 220 and/or event detection system 230.
  • [0034]
    FIG. 3 is an exemplary diagram of server 210 according to an implementation consistent with the principles of the invention. Server 210 may include bus 310, processor 320, main memory 330, read only memory (ROM) 340, storage device 350, input device 360, output device 370, and communication interface 380. Bus 310 permits communication among the components of server 210.
  • [0035]
    Processor 320 may include any type of conventional processor or microprocessor that interprets and executes instructions. Main memory 330 may include a random access memory (RAM) or another type of dynamic storage device that stores information and instructions for execution by processor 320. ROM 340 may include a conventional ROM device or another type of static storage device that stores static information and instructions for use by processor 320. Storage device 350 may include a magnetic and/or optical recording medium and its corresponding drive.
  • [0036]
    Input device 360 may include one or more conventional mechanisms that permit an operator to input information to server 210, such as a keyboard, a mouse, a pen, voice recognition and/or biometric mechanisms, etc. Output device 370 may include one or more conventional mechanisms that output information to the operator, including a display, a printer, a speaker, etc. Communication interface 380 may include any transceiver-like mechanism that enables server 210 to communicate with other devices and/or systems. For example, communication interface 380 may include mechanisms for communicating with another device or system via a network, such as network 250 or network 130.
  • [0037]
    As will be described in detail below, server 210, consistent with the present invention, may interact with clients 110, event detection system 230, and/or database 240 to provide documents of interest. Server 210 may perform these tasks in response to processor 320 executing sequences of instructions contained in, for example, memory 330. Alternatively, hardwired circuitry may be used in place of or in combination with software instructions to implement processes consistent with the present invention. Thus, processes performed by server 210 are not limited to any specific combination of hardware circuitry and software.
  • [0038]
    Returning to FIG. 2, indexing system 220 may receive document data, including real time data, in a variety of formats (e.g., audio, video, and text), process the data to extract features and other relevant information from the documents, and record the date and time at which the documents were created. In one implementation consistent with the principles of the invention, indexing system 220 may include mechanisms, such as the ones described in John Makhoul et al., “Speech and Language Technologies for Audio Indexing and Retrieval,” Proceedings of the IEEE, Vol. 88, No. 8, August 2000, pp. 1338-1353, which is incorporated herein by reference.
  • [0039]
    FIG. 4 is an exemplary diagram of a portion of indexing system 220 according to an implementation consistent with the principles of the invention. The portion of indexing system 220 shown in FIG. 4 operates upon audio documents. Indexing system 220 may include similar or dissimilar mechanisms for operating upon other types of media, such as video and text.
  • [0040]
    As shown in FIG. 4, indexing system 220 includes audio classification logic 410, speech recognition logic 420, speaker clustering logic 430, speaker identification logic 440, name spotting logic 450, topic classification logic 460, and story segmentation logic 470. Audio classification logic 410 may distinguish speech from silence, noise, and other audio signals in input audio data. For example, audio classification logic 410 may analyze each thirty second window of the input data to determine whether it contains speech. Audio classification logic 410 may also identify boundaries between speakers in the input stream. Audio classification logic 410 may group speech segments from the same speaker and send the segments to speech recognition logic 420.
  • [0041]
    Speech recognition logic 420 may perform continuous speech recognition to recognize the words spoken in the segments that it receives from audio classification logic 410. Speech recognition logic 420 may generate a transcription of the speech using a statistical language model. Speaker clustering logic 430 may identify all of the segments from the same speaker in a single document and group them into speaker clusters. Speaker clustering logic 430 may then assign each of the speaker clusters a unique label. Speaker identification logic 440 may identify the speaker in each speaker cluster by name or gender.
  • [0042]
    Name spotting logic 450 may locate the names of people, places, and organizations in the transcription. Name spotting logic 450 may extract the names and store them in a database. Topic classification logic 460 may use a probabilistic Hidden Markov Model (HMM) to assign topics to the transcription. In one implementation consistent with the present invention, topic classification logic 460 uses a technique similar to the one described in John Makhoul et al., “Speech and Language Technologies for Audio Indexing and Retrieval,” Proceedings of the IEEE, Vol. 88, No. 8, August 2000, pp. 1338-1353, which was previously incorporated by reference. Topic classification logic 460 may generate a rank-ordered list of all possible topics and corresponding scores for the transcription.
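The rank-ordered list of topics and scores mentioned above can be pictured with a minimal Python sketch. It does not implement the Hidden Markov Model itself; the topic names and score values are hypothetical, and the helper name rank_topics is an assumption made only for illustration.

```python
def rank_topics(topic_scores):
    """Return (topic, score) pairs ordered from highest to lowest score.

    topic_scores is a hypothetical mapping of topic name to a score that
    some upstream classifier has already produced. Ties are broken
    alphabetically, which is an assumption and not stated in the text.
    """
    return sorted(topic_scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores for one transcription.
scores = {"economy": 0.12, "elections": 0.61, "weather": 0.05}
ranked = rank_topics(scores)
```

Downstream components would then read topics from the front of this list.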
  • [0043]
    Story segmentation logic 470 may change the continuous stream of words in the transcription into document-like units with coherent sets of topic labels and other document features generated or identified by the components of indexing system 220. This information may constitute metadata corresponding to the input audio data. Story segmentation logic 470 may store the metadata in database 240.
  • [0044]
    Returning to FIG. 2, event detection system 230 may group documents into clusters based on events or topics to which the documents relate. FIG. 5 is an exemplary diagram of event detection system 230 according to an implementation consistent with the principles of the invention. Event detection system 230 may include a decision engine 510 and a label engine 520. The decision engine 510 may include a conventional event or topic detection system, such as the Topic Detection Tracking system developed by the University of Massachusetts, Amherst, as described in J. Allan et al., “UMass at TDT2000,” November 2000, pages 109-115.
  • [0045]
    Decision engine 510 may include logic that receives a stream of documents over time from, for example, indexing system 220 and/or server 210, and determines, for each of the documents, whether the document is related to an event or topic that decision engine 510 has seen before. If the document is related to a new event or topic (i.e., one that has not yet been seen by decision engine 510), then decision engine 510 may create a new cluster relating to the event or topic and assign the document to the new cluster. If the document is, instead, related to an existing event or topic, then decision engine 510 may assign the document to an existing cluster that is also related to the event or topic.
  • [0046]
    Decision engine 510 may follow the same rules as conventional systems. In other words, decision engine 510 may assign a document to only one cluster. Decision engine 510 may also get only one chance to make a decision about a document and, thereafter, may not change its decision regarding the cluster to which the document is assigned. Decision engine 510 may store its document assignment decisions in an internal memory or, alternatively, in database 240.
  • [0047]
    Label engine 520 may include logic that creates labels for the clusters generated by decision engine 510. In another implementation, the functions of label engine 520 are performed by server 210. For each of the clusters, label engine 520 may examine the topics assigned to the cluster documents by indexing system 220. Label engine 520 may then label the cluster with the topics that appear on at least half of the documents in the cluster. The theory is that if a topic does not appear on at least half of the documents in the cluster, then the topic is not representative of the cluster.
  • [0048]
    Label engine 520 may rank the topics assigned to a cluster. For example, a topic that is associated with more of the documents in the cluster may be ranked higher than a topic associated with fewer of the documents in the cluster. This ranked list of topics may form a label for the cluster. The clusters with attached labels may be presented to a user via client 110.
  • [0049]
    Returning to FIG. 2, database 240 may include a relational database that stores documents from indexing system 220 and, possibly, cluster information from event detection system 230. The contents of database 240 may be accessible to users via clients 110.
  • Exemplary Processing
  • [0050]
    FIG. 6 is a flowchart of exemplary processing for grouping documents into clusters according to an implementation consistent with the principles of the invention. Processing may begin with decision engine 510 receiving a stream of documents over time (act 610). Decision engine 510 may receive the documents from indexing system 220 and/or server 210.
  • [0051]
    Decision engine 510 may operate upon the documents to group the documents into clusters (act 620). For example, decision engine 510 may determine, for each of the documents, whether the document relates to a new event (or topic) that decision engine 510 has not seen before or an existing event (or topic) that decision engine 510 has seen before. If the document relates to a new event (or topic), then decision engine 510 creates a new cluster and assigns the document to this new cluster. If the document, instead, relates to an existing event (or topic), then decision engine 510 assigns the document to an existing cluster relating to the event (or topic).
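The new-versus-existing decision of act 620 can be sketched as a single-pass clusterer in Python. The description does not say how relatedness is measured, so the word-overlap (Jaccard) similarity, the 0.3 threshold, and the names Cluster, DecisionEngine, and assign are all assumptions chosen for illustration; they are not the method of the cited Topic Detection Tracking system.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    cluster_id: int
    doc_ids: list = field(default_factory=list)
    vocabulary: set = field(default_factory=set)


def jaccard(a, b):
    """Word-set overlap in [0, 1]; an assumed stand-in similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class DecisionEngine:
    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed cut-off for "existing event"
        self.clusters = []

    def assign(self, doc_id, text):
        """Assign one document to exactly one cluster; never revisited."""
        words = set(text.lower().split())
        best, best_score = None, 0.0
        for cluster in self.clusters:
            score = jaccard(words, cluster.vocabulary)
            if score > best_score:
                best, best_score = cluster, score
        if best is None or best_score < self.threshold:
            # New event or topic: create a new cluster for the document.
            best = Cluster(cluster_id=len(self.clusters))
            self.clusters.append(best)
        best.doc_ids.append(doc_id)
        best.vocabulary |= words
        return best.cluster_id
```

Because each document is handled once and clusters only grow, the sketch follows the one-cluster-per-document rule described for the decision engine.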
  • [0052]
    Label engine 520 may create labels for the clusters generated by decision engine 510 (act 630). Label engine 520 may create a label or reassess a previous label assignment for a cluster on a periodic basis, when a new document is assigned to the cluster, or when cluster information is requested by a user (via client 110).
  • [0053]
    FIG. 7 is a flowchart of exemplary processing for creating a label for a cluster according to an implementation consistent with the principles of the invention. Processing may begin with label engine 520 identifying the topics assigned to the documents in the cluster (act 710). In one implementation, label engine 520 obtains the topic information from indexing system 220. In another implementation, label engine 520 generates the topic information, possibly using a technique similar to the one described in John Makhoul et al., “Speech and Language Technologies for Audio Indexing and Retrieval,” Proceedings of the IEEE, Vol. 88, No. 8, August 2000, pp. 1338-1353, which was previously incorporated by reference. In yet another implementation, label engine 520 obtains the topic information in some other manner.
  • [0054]
    Label engine 520 may then examine each of the topics in the cluster. For example, label engine 520 may determine whether a topic M (where M≧1) is associated with at least half of the documents in the cluster (act 720). If so, label engine 520 may add topic M to a cluster list (act 730). If topic M is not associated with at least half of the documents in the cluster, label engine 520 may determine whether all of the topics in the cluster have been considered (act 740). If one or more topics have not yet been considered, then label engine 520 may examine the next topic (M+1), returning to act 720.
  • [0055]
    If all of the topics have been considered, then label engine 520 may rank the topics in the cluster list to form a label for the cluster (act 750). For example, label engine 520 may rank a topic that is associated with the majority of the documents in the cluster higher than all other topics. Label engine 520 may rank the topic associated with the next highest majority of the documents in the cluster higher than all other remaining topics, and so on down to one or more topics that are associated with half of the documents in the cluster. Label engine 520 may use this ranked list of topics to form a label for the cluster.
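Acts 710 through 750 can be sketched in Python as follows. The input shape (one set of topic names per document), the function name label_cluster, and the alphabetical tie-break between equally frequent topics are assumptions made for illustration; the description only requires that topics on at least half of the documents be kept and ranked by document count.

```python
from collections import Counter


def label_cluster(doc_topics):
    """Build a ranked topic label for one cluster.

    doc_topics: list with one collection of topic names per document
    (hypothetical input shape).
    """
    total = len(doc_topics)
    # Act 710: count, for each topic, the documents it is associated with.
    counts = Counter(topic for topics in doc_topics for topic in set(topics))
    # Acts 720-740: keep topics associated with at least half of the documents.
    cluster_list = [topic for topic, n in counts.items() if 2 * n >= total]
    # Act 750: rank by document count, most widely shared topic first.
    cluster_list.sort(key=lambda topic: (-counts[topic], topic))
    return cluster_list
```

For a four-document cluster in which "elections" appears on all four documents, "economy" on two, and "weather" and "sports" on one each, the sketch returns "elections" followed by "economy".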
  • [0056]
    Label engine 520, or event detection system 230, may store cluster information in database 240. The cluster information, in this case, may include information regarding the clusters to which the documents are assigned and the labels associated with those clusters.
  • [0057]
    Returning to FIG. 6, server 210 may present the cluster information to a user upon request (act 640). For example, server 210 may send the cluster information to client 110 for display via, for example, a graphical user interface, such as a browser interface. The cluster information may be presented to the user as a list of clusters that may be sorted based on the number of documents contained in the clusters. In other words, clusters containing larger numbers of documents may be presented higher on the list than clusters containing smaller numbers of documents. The clusters may include assigned labels to make the cluster list meaningful to the user.
  • [0058]
    FIGS. 8A and 8B are exemplary diagrams of a graphical user interface that may be presented to a user according to an implementation consistent with the principles of the invention. If the user requests to view the clusters generated by event detection system 230, the user may be presented with a graphical user interface, possibly in the form of a browser interface, such as graphical user interface 800 in FIG. 8A. Graphical user interface 800 may include cluster data 810, barchart view option 820, and timeline view option 830.
  • [0059]
    Cluster data 810 may include data that identifies the current document count and the current cluster count. The current document count may specify the total number of documents that have been received and processed by event detection system 230. The current cluster count may specify the total number of clusters in which the documents have been grouped.
  • [0060]
    Barchart view option 820 and timeline view option 830 are two manners by which the clusters may be presented to the user. In other implementations consistent with the present invention, there are more or fewer ways of presenting the clusters to the user. Barchart view option 820 may display the clusters in the form of a barchart. Timeline view option 830 may display the clusters in the form of a timeline.
  • [0061]
    FIG. 8B is an exemplary diagram of graphical user interface 800 that may be presented when providing clusters in barchart form according to an implementation consistent with the principles of the invention. Graphical user interface 800 may present the clusters as a series of bars, the lengths of which relate to the number of documents in the clusters. The bars may be sorted by cluster size, with larger clusters being presented first. Each of the bars may have an associated label that corresponds to the label generated for the cluster by label engine 520.
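    One way to picture the barchart view is a plain-text rendering in which each bar's length is proportional to the cluster's document count. This is only a sketch: the function name, the `max_width` parameter, and the `#` character used for bars are choices made for the example, and the actual interface of FIG. 8B is graphical.

```python
def render_barchart(cluster_sizes, max_width=40):
    """Return text lines, one bar per cluster, largest cluster first.

    cluster_sizes maps a cluster label to its document count.
    """
    if not cluster_sizes:
        return []
    ordered = sorted(cluster_sizes.items(), key=lambda item: item[1], reverse=True)
    largest = ordered[0][1]
    label_width = max(len(label) for label, _ in ordered)
    lines = []
    for label, count in ordered:
        # Scale each bar relative to the largest cluster.
        bar_len = round(max_width * count / largest) if largest else 0
        lines.append(f"{label.ljust(label_width)} | {'#' * bar_len} {count}")
    return lines


# Hypothetical cluster sizes for illustration.
sizes = {"weather": 10, "elections": 5, "sports": 20}
chart = render_barchart(sizes, max_width=20)
```

    Printing `chart` line by line gives one labeled bar per cluster, with the largest cluster at the top.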
  • [0062]
    The user may select one of the bars to view the documents included in the cluster. The documents may then be presented to the user in chronological order (i.e., sorted based on the date and time at which the document was created), with the more recent documents being presented first. In other implementations, the documents are presented in other ways.
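    The ordering described here amounts to a descending sort on creation time. A minimal sketch, assuming each document is a dict with a hypothetical `created` field holding a `datetime`:

```python
from datetime import datetime


def most_recent_first(documents):
    """Return the documents sorted by creation time, newest first."""
    return sorted(documents, key=lambda d: d["created"], reverse=True)


# Hypothetical documents belonging to one selected cluster.
docs = [
    {"id": "a", "created": datetime(2003, 10, 1, 9, 30)},
    {"id": "b", "created": datetime(2003, 10, 3, 14, 0)},
    {"id": "c", "created": datetime(2003, 10, 2, 8, 15)},
]
ordered_docs = most_recent_first(docs)
```

    Because `sorted` returns a new list, the cluster's stored document order is left unchanged, which leaves room for the other presentation orders mentioned above.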
  • CONCLUSION
  • [0063]
    Systems and methods consistent with the present invention create labels for clusters of documents, such that the labels are indicative of the documents in the cluster and are valuable to an end user seeking a document in one of the clusters. The labels may be based on document topics that are associated with at least half of the documents in the clusters. The topics may be ranked based on the number of documents with which the topics are associated. The topics may then be presented in rank order as a label for the cluster.
  • [0064]
    The foregoing description of preferred embodiments of the present invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention.
  • [0065]
    For example, it has been described that only topics that are associated with at least half of the documents in the cluster are used for the cluster label. In other implementations, the criterion may be changed to include topics associated with more or fewer than half of the documents.
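    The labeling rule summarized above, with the threshold left adjustable, can be sketched as follows. The function name, the `min_fraction` and `separator` parameters, and the alphabetical tie-break between equally frequent topics are assumptions made for this example and are not taken from the description.

```python
from collections import Counter


def make_label(cluster_docs, min_fraction=0.5, separator=", "):
    """Build a cluster label from per-document topic lists.

    cluster_docs holds one collection of topics per document.
    Topics found in at least min_fraction of the documents are kept
    and joined in rank order (most documents first).
    """
    if not cluster_docs:
        return ""
    counts = Counter()
    for topics in cluster_docs:
        counts.update(set(topics))  # count a topic once per document
    threshold = min_fraction * len(cluster_docs)
    kept = [(topic, n) for topic, n in counts.items() if n >= threshold]
    # Rank by document count; ties broken alphabetically (an assumption).
    kept.sort(key=lambda item: (-item[1], item[0]))
    return separator.join(topic for topic, _ in kept)


# Hypothetical cluster of four documents.
cluster = [
    ["election", "senate"],
    ["election", "budget"],
    ["election", "senate", "weather"],
    ["budget"],
]
label_half = make_label(cluster)          # "election, budget, senate"
label_strict = make_label(cluster, 0.75)  # "election"
```

    Raising `min_fraction` yields shorter, more selective labels, while lowering it admits topics that describe only a minority of the cluster.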
  • [0066]
    While series of acts have been described with regard to FIGS. 6 and 7, the order of the acts may be varied in other implementations consistent with the principles of the invention. Also, non-dependent acts may be performed in parallel.
  • [0067]
    Further, certain portions of the invention have been described as “logic” that performs one or more functions. This logic may include hardware, such as an application specific integrated circuit or a field programmable gate array, software, or a combination of hardware and software.
  • [0068]
    No element, act, or instruction used in the description of the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used. The scope of the invention is defined by the claims and their equivalents.
Classifications
U.S. Classification: 715/230
International Classification: G06F17/20, G06F17/21, G10L15/00, G10L13/08, G10L15/14, G10L11/00, G06F17/00
Cooperative Classification: G10L15/32, G10L15/28
Legal Events
Date: 16 Oct 2003; Code: AS; Event: Assignment
Owner name: BBNT SOLUTIONS LLC, TEXAS
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:COLBATH, SEAN;KUBALA, FRANCIS G.;REEL/FRAME:014608/0961;SIGNING DATES FROM 20031001 TO 20031003
Date: 12 May 2004; Code: AS; Event: Assignment
Owner name: FLEET NATIONAL BANK, AS AGENT, MASSACHUSETTS
Free format text: PATENT & TRADEMARK SECURITY AGREEMENT;ASSIGNOR:BBNT SOLUTIONS LLC;REEL/FRAME:014624/0196
Effective date: 20040326
Date: 24 Sep 2004; Code: AS; Event: Assignment
Owner name: BBNT SOLUTIONS LLC, MASSACHUSETTS
Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE ASSIGNEE S ADDRESS, PREVIOUSLY RECORDED AT REEL 014608 FRAME 0961;ASSIGNORS:COLBATH, SEAN;KUBALA, FRANCIS G.;REEL/FRAME:015815/0330;SIGNING DATES FROM 20031001 TO 20031003
Date: 2 Mar 2006; Code: AS; Event: Assignment
Owner name: BBN TECHNOLOGIES CORP., MASSACHUSETTS
Free format text: MERGER;ASSIGNOR:BBNT SOLUTIONS LLC;REEL/FRAME:017274/0318
Effective date: 20060103
Date: 27 Oct 2009; Code: AS; Event: Assignment
Owner name: BBN TECHNOLOGIES CORP. (AS SUCCESSOR BY MERGER TO
Free format text: RELEASE OF SECURITY INTEREST;ASSIGNOR:BANK OF AMERICA, N.A. (SUCCESSOR BY MERGER TO FLEET NATIONAL BANK);REEL/FRAME:023427/0436
Effective date: 20091026