Publication number: US 20040024585 A1
Publication type: Application
Application number: US 10/610,696
Publication date: 5 Feb 2004
Filing date: 2 Jul 2003
Priority date: 3 Jul 2002
Also published as: US7290207, US7337115, US7801838, US8001066, US20040006481, US20040006576, US20040006737, US20040006748, US20040024582, US20040024598, US20040030550, US20040117188, US20040199495, US20110004576
Inventors: Amit Srivastava, Francis Kubala
Original Assignee: Amit Srivastava, Francis Kubala
Linguistic segmentation of speech
US 20040024585 A1
Abstract
A linguistic segmentation tool (115) includes an acoustic feature extraction component (302) and a lexical feature extraction component (311). The acoustic feature extraction component (302) extracts prosodic features from speech (e.g., pauses, pitch, energy, and rate). The lexical feature extraction component (311) extracts lexical features from a transcribed version of the speech (e.g., words, syntactic classifications of the words, and word structure). A language model is constructed based on the lexical features and an acoustic model is constructed based on the acoustic features. A statistical framework combines the outputs of the language model and the acoustic model to generate indications of potential linguistic features.
Claims (33)
What is claimed:
1. A linguistic segmentation tool comprising:
a lexical feature extraction component configured to receive text and generate lexical feature vectors relating to the text, the lexical feature vectors including words from the text and syntactic classes of the words;
an acoustic feature extraction component configured to receive an audio version of the text and generate acoustic feature vectors relating to the audio version of the text; and
a statistical framework component configured to generate linguistic features associated with the text based on the acoustic feature vectors and the lexical feature vectors.
2. The linguistic segmentation tool of claim 1, wherein the linguistic features include periods, quotation marks, exclamation marks, commas, and phrasal boundaries.
3. The linguistic segmentation tool of claim 1, further comprising:
a transcription component configured to generate the text based on the audio version of the text.
4. The linguistic segmentation tool of claim 1, wherein the statistical framework includes:
an acoustic model configured to estimate a probability of an occurrence of the linguistic features based on the acoustic feature vectors.
5. The linguistic segmentation tool of claim 4, wherein the statistical framework includes:
a language model configured to estimate a probability that one of the lexical feature vectors corresponds to a text boundary.
6. The linguistic segmentation tool of claim 5, wherein the statistical framework includes:
a maximum likelihood estimator configured to generate the linguistic features based on the probabilities generated by the acoustic model and the language model.
7. The linguistic segmentation tool of claim 1, wherein the lexical feature vectors additionally include an identification of a structured speech member of the word.
8. The linguistic segmentation tool of claim 1, wherein the acoustic feature vectors are based on prosodic features including at least one of pause, rate, energy, and pitch.
9. The linguistic segmentation tool of claim 1, wherein the syntactic classes are indicative of a role of the word in the text.
10. The linguistic segmentation tool of claim 9, wherein the syntactic classes include syntactic classes based on affixes of the words.
11. The linguistic segmentation tool of claim 10, wherein the syntactic classes include syntactic classes based on frequently occurring words.
12. A method for determining linguistic information for words corresponding to a transcribed version of an audio input stream including speech, the method comprising:
generating lexical features for the words, including a syntactic class associated with at least one of the words;
generating acoustic features for the audio input stream, the acoustic features being based on at least one of speaker pauses, speaker rate, speaker energy, and speaker pitch; and
generating the linguistic information based on the lexical features and the acoustic features.
13. The method of claim 12, further comprising:
automatically transcribing the audio input stream to generate the words corresponding to the transcribed version of the speech.
14. The method of claim 12, further comprising:
creating a language model configured to estimate a probability that the lexical features correspond to a word boundary based on the lexical features.
15. The method of claim 14, further comprising:
creating an acoustic model configured to estimate a probability of an occurrence of the linguistic information based on the acoustic features.
16. The method of claim 15, wherein generating the linguistic information based on the lexical features and the acoustic features includes using a maximum likelihood estimator configured to estimate a final probability of an occurrence of the linguistic information based on the probabilities generated by the acoustic model and the language model.
17. The method of claim 12, wherein the syntactic class is indicative of the role of the at least one of the words.
18. The method of claim 12, wherein the syntactic class is based on affixes of the words.
19. The method of claim 12, wherein the syntactic class is based on word frequency.
20. The method of claim 12, wherein the linguistic information includes periods, quotation marks, exclamation marks, commas, and phrasal boundaries.
21. A computing device for determining linguistic information for words corresponding to a transcribed version of an audio input stream that includes speech, the computing device comprising:
a processor; and
a computer memory coupled to the processor and containing programming instructions that when executed by the processor cause the processor to:
generate lexical features for the words, including a syntactic class associated with at least one of the words,
generate acoustic features for the audio input stream, the acoustic features being based on at least one of speaker pauses, speaker rate, speaker energy, and speaker pitch,
generate the linguistic information based on the lexical features and the acoustic features, and
output the generated linguistic information as meta-information embedded in the transcribed version of the audio input stream.
22. The computing device of claim 21, wherein the syntactic class is indicative of the role of the at least one of the words.
23. The computing device of claim 21, wherein the syntactic class is based on affixes of the words.
24. The computing device of claim 21, wherein the syntactic class is based on word frequency.
25. A method for associating meta-information with a document transcribed from speech, the method comprising:
building a language model based on lexical feature vectors extracted from the document, the lexical feature vectors including words and syntactic classifications of the words;
building an acoustic model based on acoustic feature vectors extracted from the speech; and
combining outputs of the language model and the acoustic model in a statistical framework that estimates a probability for associating the meta-information with the document.
26. The method of claim 25, wherein the meta-information relates to linguistic features of the document.
27. The method of claim 26, wherein the linguistic features include periods, quotation marks, exclamation marks, commas, and phrasal boundaries.
28. The method of claim 25, wherein the acoustic feature vectors are based on prosodic features including pause, rate, energy, and pitch.
29. The method of claim 25, wherein the syntactic class is indicative of the role of the at least one of the words.
30. The method of claim 25, wherein the syntactic class is based on affixes of the words.
31. The method of claim 25, wherein the syntactic class is based on word frequency.
32. A device comprising:
means for building a language model based on lexical feature vectors extracted from a document transcribed from human speech, the lexical feature vectors including a word and a syntactic classification of the word;
means for building an acoustic model based on acoustic feature vectors extracted from the speech; and
means for combining outputs of the language model and the acoustic model to estimate a probability for associating a linguistic feature with the document.
33. A computer-readable medium containing program instructions for execution by a processor, the program instructions, when executed by the processor, cause the processor to perform a method comprising:
generating lexical features for words corresponding to a transcribed version of speech, the lexical features including a syntactic class associated with at least one of the words;
generating acoustic features for the speech, the acoustic features based on at least one of speaker pauses, speaker rate, speaker energy, and speaker pitch; and
generating linguistic information for the words based on the lexical features and the acoustic features.
Description
    RELATED APPLICATIONS
  • [0001]
    This application claims priority under 35 U.S.C. 119 based on U.S. Provisional Application Nos. 60/394,064 and 60/394,082 filed Jul. 3, 2002 and Provisional Application No. 60/419,214 filed Oct. 17, 2002, the disclosures of which are incorporated herein by reference.
  • GOVERNMENT INTEREST
  • [0002]
    The U.S. Government has a paid-up license in this invention as provided by the terms of contract No. N66001-00-C-8008 awarded by the Defense Advanced Research Projects Agency (DARPA).
  • BACKGROUND OF THE INVENTION
  • [0003]
    A. Field of the Invention
  • [0004]
    The present invention relates generally to speech processing and, more particularly, to linguistic segmentation of transcribed speech.
  • [0005]
    B. Description of Related Art
  • [0006]
    Speech has not traditionally been valued as an archival information source. As effective as the spoken word is for communicating, archiving spoken segments in a useful and easily retrievable manner has long been a difficult proposition. Although the act of recording audio is not difficult, automatically transcribing and indexing speech in an intelligent and useful manner can be difficult.
  • [0007]
    Speech is typically received into a speech recognition system as a continuous stream of words. In order to effectively use the speech in information management systems (e.g., information retrieval, natural language processing, real-time alerting), the speech recognition system initially transcribes the speech to generate a textual document. A simple transcription, however, will generally not contain significant information that was present in the original speech. For example, the transcription may be a mere stream of words that lack many of the linguistic features that a listener of the speech would normally identify.
  • [0008]
    Linguistic features that may be lacking in a simple transcription of speech include linguistic features that are visible in text, such as periods, quotation marks, exclamation marks, commas, and direct quotation marks. Additionally, linguistic features may include non-visible information, such as phrasal boundaries.
  • [0009]
    There is a need in the art to be able to automatically generate linguistic information, including visible and non-visible linguistic features, for audio input streams.
  • SUMMARY OF THE INVENTION
  • [0010]
    Systems and methods consistent with the principles of this invention provide a linguistic segmentation tool that generates a comprehensive set of linguistic information for a document transcribed based on human speech.
  • [0011]
    One aspect of the invention is directed to a linguistic segmentation tool. The linguistic segmentation tool includes a lexical feature extraction component configured to receive text and generate lexical feature vectors relating to the text. The linguistic segmentation tool further includes an acoustic feature extraction component that receives a spoken version of the text and generates acoustic feature vectors relating to the spoken version of the text. Finally, the linguistic segmentation tool includes a statistical framework component configured to generate linguistic features associated with the text based on the acoustic feature vectors and the lexical feature vectors.
  • [0012]
    A second aspect of the invention is directed to a method for determining linguistic information for words corresponding to a transcribed version of speech. The method includes generating lexical features for the words, including a syntactic class associated with the words and generating acoustic features for the speech. The acoustic features are based on speaker pauses, speaker rate, speaker energy, and/or speaker pitch. The method further includes generating the linguistic information based on the lexical features and the acoustic features.
  • [0013]
    Yet another aspect consistent with the invention is directed to a computing device for determining linguistic information for words corresponding to a transcribed version of speech. The computing device includes a processor and a computer memory coupled to the processor and containing programming instructions. The program instructions, when executed by the processor, cause the processor to generate lexical features for the words, including a syntactic class associated with at least one of the words, and generate acoustic features for the speech. The acoustic features are based on speaker pauses, speaker rate, speaker energy, and/or speaker pitch. The program instructions further cause the processor to generate the linguistic information based on the lexical features and the acoustic features, and output the generated linguistic information as meta-information embedded in the transcribed version of the speech.
  • [0014]
    Yet another aspect of the invention is a method for associating meta-information with a document transcribed from speech. The method includes building a language model based on lexical feature vectors extracted from the document, where the lexical feature vectors include a word and a syntactic classification of the word. The method further includes building an acoustic model based on acoustic feature vectors extracted from the speech and combining outputs of the language model and the acoustic model in a statistical framework that estimates a probability for associating the meta-information with the document.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • [0015]
    The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate the invention and, together with the description, explain the invention. In the drawings,
  • [0016]
    FIG. 1 is a diagram illustrating an exemplary system in which concepts consistent with the invention may be implemented;
  • [0017]
    FIG. 2 is a block diagram conceptually illustrating linguistic segmentation;
  • [0018]
    FIG. 3 is a block diagram of a linguistic segmentation tool consistent with the present invention;
  • [0019]
    FIG. 4 is a diagram illustrating a series of words;
  • [0020]
    FIG. 5 is a flow chart illustrating methods for assigning a syntactic class to a word; and
  • [0021]
    FIG. 6 is a flow chart illustrating methods for estimating the probability of the occurrence of a linguistic feature.
  • DETAILED DESCRIPTION
  • [0022]
    The following detailed description of the invention refers to the accompanying drawings. The same reference numbers may be used in different drawings to identify the same or similar elements. Also, the following detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims and equivalents of the claim limitations.
  • [0023]
    Linguistic segmentation of spoken audio is performed by a linguistic segmentation tool based on a transcribed version of the speech and the original speech. The linguistic segmentation tool analyzes both lexical and acoustical features of the speech in generating the linguistic segments. The lexical features include syntactic classifications of the words in the transcribed text. The acoustical features include measured pauses, speaking rate, speaker energy, and speaker pitch. Speech models based on the acoustic and lexical features are combined to achieve a final probability of a particular linguistic feature occurring.
  • [0024]
    System Overview
  • [0025]
    Linguistic segmentation, as described herein, may be performed on one or more processing devices or networks of processing devices. FIG. 1 is a diagram illustrating an exemplary system 100 in which concepts consistent with the invention may be implemented. System 100 includes a computing device 101 that has a computer-readable medium 109, such as random access memory, coupled to a processor 108. Computing device 101 may also include a number of additional external or internal devices, such as, without limitation, a mouse, a CD-ROM, a keyboard, and a display.
  • [0026]
    In general, computing device 101 may be any type of computing platform, and may be connected to a network 102. Computing device 101 is exemplary only. Concepts consistent with the present invention can be implemented on any computing device, whether or not connected to a network.
  • [0027]
    Processor 108 can be any of a number of well-known computer processors, such as processors from Intel Corporation, of Santa Clara, Calif. Processor 108 executes program instructions stored in memory 109.
  • [0028]
    Memory 109 contains an application program 115. In particular, application program 115 may implement the linguistic segmentation tool described below. The linguistic segmentation tool 115 may receive input data, such as an audio stream and/or its transcribed text, from other application programs executing in computing device 101 or in other computing devices, such as those connected to computing device 101 through network 102. Linguistic segmentation tool 115 processes the input data to generate indications of linguistic features.
  • [0029]
    Linguistic Segmentation Tool
  • [0030]
    FIG. 2 is a block diagram conceptually illustrating linguistic segmentation as performed by linguistic segmentation tool 115. An audio input stream having speech is initially processed by a lexical model 201 and an acoustic model 202. Lexical model 201 operates on a number of lexical feature vectors that describe lexical features of the input speech. Acoustic model 202 operates on prosodic features, such as, for example, pauses, speaker rate, energy, and pitch.
  • [0031]
    The prosodic features and the lexical features are combined by statistical framework 203 to generate the linguistic features. The linguistic features may be generally referred to as meta-information that further defines meaning beyond the plain words in the input speech. The meta-information may include visible meta-information that is normally associated with written documents, such as periods, quotation marks, exclamation marks, commas, and direct quotation marks. Additionally, meta-information that is invisible to the traditional written document, such as phrasal boundaries and structured speech locations, may also be generated by statistical framework 203. The complete output document, including the plain words of the document and the linguistic meta-information, is output by statistical framework 203 and may be used by other information management systems (e.g., information retrieval and natural language processing systems) that add value to the archived speech.
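    As a rough illustration of what such an enriched output document could look like, the short sketch below interleaves hypothetical meta-information labels with the plain transcribed words; the tag format, label names, and function are assumptions made for illustration, not a format prescribed by the patent.

        # Hypothetical sketch: embed linguistic meta-information in a transcript.
        # Label names and the inline-tag format are illustrative only.

        def embed_meta_information(words, boundary_labels):
            """Interleave predicted linguistic features with transcribed words.

            words           -- list of transcribed words
            boundary_labels -- dict mapping a word index to a feature label
                               predicted at the boundary following that word
            """
            pieces = []
            for i, word in enumerate(words):
                pieces.append(word)
                label = boundary_labels.get(i)
                if label == "PERIOD":
                    pieces.append(".")                 # visible meta-information
                elif label is not None:
                    pieces.append("<" + label + ">")   # invisible meta-information
            return " ".join(pieces)

        print(embed_meta_information(
            ["profits", "rose", "sharply", "analysts", "were", "surprised"],
            {2: "PERIOD", 5: "PERIOD"}))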
  • [0032]
    FIG. 3 is a block diagram illustrating elements of linguistic segmentation tool 115 in additional detail. Segmentation tool 115 includes a speech recognition system 301, an acoustic feature extraction component 302, and a statistical framework component 303.
  • [0033]
    Speech recognition system 301 includes transcription component 310 and lexical feature extraction component 311.
  • [0034]
    Transcription component 310 receives the input audio stream and generates a textual transcription of the audio stream. Transcription component 310 may be an automated transcription tool or its output text may be generated through a manual transcription process. The output of transcription component 310 is received by lexical feature extraction component 311. Lexical feature extraction component 311 generates the lexical feature vectors that describe the lexical features of the input speech (described in more detail below).
  • [0035]
    Acoustic feature extraction component 302 generates acoustic feature vectors based on the input audio information (described in more detail below). The acoustic feature vectors describe prosodic information from the input audio.
  • [0036]
    Statistical framework component 303 receives the lexical and acoustic vectors from lexical feature extraction component 311 and acoustic feature extraction component 302, respectively, and based on these vectors, generates a language model (LM) 315 and an acoustic model (AM) 316 for the speech. Statistical framework component 303 combines the outputs of these models to generate the final linguistic features. Statistical framework component 303 may output a linguistically segmented document, which includes the originally transcribed text with meta-information describing the linguistic features.
  • [0037]
    Acoustic Feature Extraction
  • [0038]
    Acoustic feature extraction component 302 extracts acoustic feature vectors that correspond to boundaries between words. FIG. 4 is a diagram illustrating a series of words (labeled as words w1, w2, w3, w4). In one implementation, an acoustic feature vector is generated at each word boundary, labeled as boundaries 401.
  • [0039]
    Each acoustic feature can be thought of as a function based on the acoustic information to the left of a particular boundary 401 (Info_L) and the acoustic information to the right of a particular boundary 401 (Info_R). In other words, an acoustic feature for a particular boundary is defined as
  • f(Info_L, Info_R),
  • [0040]
    where f indicates the function. In one implementation, this function may be implemented as a difference operation. In an alternate implementation, this function may be implemented as log(Info_L / Info_R).
  • [0041]
    The information assigned to Info_L and Info_R may be based on four basic prosodic features: (1) speaker pauses (e.g., pause duration), (2) speaker rate (e.g., duration of vowels; either the absolute value of vowel durations or differences in vowel durations), (3) speaker energy (signal energy), and (4) pitch (e.g., absolute pitch values or changes in pitch). In one implementation, function f is applied 26 times to 26 different combinations of Info_L and Info_R that are selected from prosodic features (1)-(4). In this manner, for each boundary 401, acoustic feature extraction component 302 generates an acoustic feature vector containing 26 acoustic features.
  • [0042]
    One of ordinary skill in the art will recognize that in alternate implementations, more or fewer than 26 acoustic features can be used in a single acoustic vector.
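    A minimal sketch of this boundary-feature computation is given below, assuming per-word prosodic measurements (pause, vowel duration, energy, pitch) are already available from the aligned audio; the particular feature pairings and values are placeholders rather than the 26 combinations actually used.

        import math

        # Illustrative per-word prosodic measurements; in practice these come
        # from the audio signal aligned against the recognizer output.
        PROSODIC_KEYS = ("pause", "vowel_duration", "energy", "pitch")

        def boundary_feature(info_left, info_right, use_log_ratio=False):
            """One acoustic feature f(Info_L, Info_R) at a word boundary:
            a difference, or log(Info_L / Info_R) as in the alternate
            implementation (which assumes positive measurements)."""
            if use_log_ratio:
                return math.log(info_left / info_right)
            return info_left - info_right

        def acoustic_feature_vector(left_word, right_word):
            """Build a feature vector for the boundary between adjacent words.
            Only one pairing per prosodic measurement is shown here; the
            patent describes 26 combinations."""
            return [boundary_feature(left_word[k], right_word[k])
                    for k in PROSODIC_KEYS]

        w_left = {"pause": 0.30, "vowel_duration": 0.12, "energy": 62.0, "pitch": 180.0}
        w_right = {"pause": 0.05, "vowel_duration": 0.10, "energy": 58.0, "pitch": 175.0}
        print(acoustic_feature_vector(w_left, w_right))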
  • [0043]
    Lexical Feature Extraction
  • [0044]
    Lexical feature extraction component 311 generates lexical feature vectors for the series of words it receives from transcription component 310. A lexical feature vector includes an indication of a word and a syntactic class of the word (described below). Other features may be included in the lexical feature vector, such as the structured speech member of the word (e.g., whether the word is a proper name, a number, an email address, or a URL).
  • [0045]
    The syntactic class of a word is an indication of the role of the word relative to its surrounding words. For example, possible syntactic classes may indicate whether a word tends to start a sentence, connect phrases together, or end phrases. In one implementation, potential syntactic classes are automatically generated by lexical feature extraction component 311.
  • [0046]
    FIG. 5 is a flow chart illustrating methods for assigning a syntactic class to a word.
  • [0047]
    A first set of syntactic classes is based on word affixes. The suffix or prefix of a word often implies the role of the word. Lexical feature component 311 stores a list of word suffixes/prefixes that are known to have a strong probability of implying the role of the word. The list of suffixes/prefixes may be manually generated by a human expert or the list may be automatically learned by feature extraction component 311 from training document(s). In one implementation, for the English language, approximately 30-40 suffixes/prefixes are used, corresponding to 30-40 suffix/prefix classes.
  • [0048]
    For each word, lexical feature component 311 begins by determining if the word has a suffix or prefix that matches the predefined list of suffixes/prefixes (act 501). If so, the word is assigned a syntactic class corresponding to its suffix/prefix (act 502).
  • [0049]
    In addition to assigning syntactic classes based on suffixes/prefixes, lexical feature component 311 assigns classes based on a predefined set of “function words.” The list of function words is based on word frequency. For example, in one implementation, the approximately 2000 most frequently occurring words in a language may be considered function words. If the word being examined by lexical feature component 311 is one of these function words (act 503), the word is assigned a syntactic class corresponding to the function word (act 504).
  • [0050]
    Lexical feature component 311 assigns words that do not have matching suffixes/prefixes and are not function words to an undefined “catch-all” syntactic class (act 505).
  • [0051]
    In the manner described above, words are assigned to one of approximately 2030 classes (30 suffix/prefix classes, 2000 function word classes, one catch-all class) by lexical feature component 311. The class assignment, along with the word itself, and possibly along with other lexical features, such as the structured speech member of the word, defines the word's lexical feature vector.
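    The class-assignment logic of FIG. 5 can be sketched roughly as follows; the affix list, function-word list, and class labels are small placeholders standing in for the roughly 30-40 affixes and 2000 function words described above.

        # Placeholder lists; the real lists would be expert-built or learned
        # from training documents, as described above.
        AFFIX_CLASSES = {"ing": "SUFFIX_ing", "ly": "SUFFIX_ly", "un": "PREFIX_un"}
        FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is"}

        def syntactic_class(word):
            """Assign a syntactic class: affix class first (acts 501-502),
            then function-word class (acts 503-504), else catch-all (act 505)."""
            lower = word.lower()
            for affix, cls in AFFIX_CLASSES.items():
                if lower.startswith(affix) or lower.endswith(affix):
                    return cls
            if lower in FUNCTION_WORDS:
                return "FW_" + lower
            return "CATCH_ALL"

        def lexical_feature_vector(word):
            """A lexical feature vector: the word plus its syntactic class.
            Other features (e.g., structured-speech membership) could be added."""
            return (word, syntactic_class(word))

        print([lexical_feature_vector(w) for w in ["quickly", "the", "market"]])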
  • [0052]
    Generation of Linguistic Segmentation Information
  • [0053]
    Statistical framework component 303 receives the acoustic feature vectors from acoustic feature extraction component 302 and the lexical feature vectors from lexical feature component 311. More specifically, statistical framework component 303 may construct a language model 315 (LM) based on the lexical feature vectors and an acoustic model 316 (AM) based on the acoustic feature vectors. The language model and the acoustic model are combined using maximum likelihood estimation techniques to obtain a final probability that a particular one of the linguistic features (e.g., period, exclamation, phrasal boundary, etc.) is present at the location corresponding to the acoustic and lexical feature vector.
  • [0054]
    In one implementation, language model 315 is a tri-gram model that estimates the probability of a particular language vector corresponding to a word boundary based on the present language vector and the two previous language vectors. Stated more formally, the language model 315 may be defined as:
  • LM: P(LV_i | LV_(i-1), LV_(i-2)),
  • [0055]
    where P is the probability of a boundary at the ith language vector (LV_i) given the previous two language vectors (LV_(i-1) and LV_(i-2)).
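    As one illustration of the trigram idea, a count-based estimate of P(LV_i | LV_(i-1), LV_(i-2)) might be sketched as below. Representing each language vector by a single class label, padding with start symbols, and add-one smoothing are all assumptions made for the sketch, not details taken from the patent.

        from collections import defaultdict

        class TrigramBoundaryLM:
            """Toy trigram model: estimate the probability of the current
            lexical token given the two previous tokens.  Tokens here are
            syntactic-class labels; add-one smoothing is assumed."""

            def __init__(self):
                self.trigram_counts = defaultdict(int)
                self.bigram_counts = defaultdict(int)
                self.vocab = set()

            def train(self, class_sequence):
                padded = ["<s>", "<s>"] + list(class_sequence)
                for a, b, c in zip(padded, padded[1:], padded[2:]):
                    self.trigram_counts[(a, b, c)] += 1
                    self.bigram_counts[(a, b)] += 1
                    self.vocab.add(c)

            def prob(self, current, prev1, prev2):
                num = self.trigram_counts[(prev2, prev1, current)] + 1
                den = self.bigram_counts[(prev2, prev1)] + len(self.vocab)
                return num / den

        lm = TrigramBoundaryLM()
        lm.train(["FW_the", "CATCH_ALL", "<boundary>",
                  "FW_the", "CATCH_ALL", "<boundary>"])
        print(lm.prob("<boundary>", prev1="CATCH_ALL", prev2="FW_the"))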
  • [0056]
    Acoustic model 316 may include one or more models that estimate the probability of occurrence of each of the potential linguistic features. In one implementation, acoustic model 316 may include a neural network, such as a three-layer neural network having 26 input nodes (one for each acoustic feature in the acoustic vector), 75 hidden-layer nodes, and a single output node. The neural network may be trained as a conventional feed-forward back-propagation neural network. The value at the output node is a score that signifies how likely it is that the particular linguistic feature is present. For example, acoustic model 316 may include a neural network trained to output a score indicating whether its input acoustic feature vector corresponds to a linguistic feature, such as a period. Additional neural networks may be trained for additional linguistic features (e.g., quotation marks, exclamation marks, commas, invisible phrasal boundaries). The scores from the neural networks are assumed to have a Gaussian distribution, which allows acoustic model 316 to convert the scores to probabilities using the conventional Gaussian density function.
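    A rough sketch of such a scorer (26 inputs, 75 hidden nodes, one output) and the Gaussian conversion of a score to a probability is shown below; the random weights, the tanh activation, and the Gaussian parameters are placeholders, since a real model would be trained by back-propagation and its score distribution estimated from data.

        import numpy as np

        rng = np.random.default_rng(0)

        # Placeholder weights for a 26-75-1 feed-forward network; in practice
        # these would be learned with back-propagation on labeled boundaries.
        W1, b1 = rng.normal(size=(75, 26)), np.zeros(75)
        W2, b2 = rng.normal(size=(1, 75)), np.zeros(1)

        def acoustic_score(feature_vector):
            """Score how strongly this 26-dimensional acoustic feature vector
            suggests a particular linguistic feature (e.g., a period)."""
            hidden = np.tanh(W1 @ feature_vector + b1)   # hidden layer
            return float((W2 @ hidden + b2)[0])          # single output node

        def score_to_probability(score, mean=0.0, std=1.0):
            """Convert a score to a density value under the assumed Gaussian
            score distribution (mean and std are assumed here)."""
            return float(np.exp(-0.5 * ((score - mean) / std) ** 2)
                         / (std * np.sqrt(2.0 * np.pi)))

        print(score_to_probability(acoustic_score(rng.normal(size=26))))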
  • [0057]
    FIG. 6 is a flow chart illustrating methods for estimating a probability that one of the linguistic features is present at the location corresponding to the acoustic and lexical feature vector. Language model 315 estimates, based on the lexical vectors, the probability of a particular lexical vector corresponding to a boundary (act 601). Acoustic model 316 estimates the probability of the potential linguistic features occurring (act 602). Finally, statistical framework 303 combines the probabilities output from the language model 315 and the acoustic model 316 to generate the final probability that the linguistic feature is present (act 603). Statistical framework 303 may estimate this probability using maximum likelihood estimation (MLE) techniques. MLE techniques are well known in the art.
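    A final combination step might look like the sketch below. The log-linear interpolation weight and the decision threshold are assumptions; the patent only states that the language-model and acoustic-model probabilities are combined using maximum likelihood estimation techniques.

        import math

        def combined_probability(lm_prob, am_prob, lm_weight=0.5):
            """Combine language-model and acoustic-model probabilities into a
            final score for a candidate linguistic feature at a boundary
            (shown here as a simple log-linear interpolation)."""
            return math.exp(lm_weight * math.log(lm_prob)
                            + (1.0 - lm_weight) * math.log(am_prob))

        def detect_feature(lm_prob, am_prob, threshold=0.5):
            """Hypothesize the linguistic feature when the combined
            probability exceeds a (hypothetical) threshold."""
            return combined_probability(lm_prob, am_prob) >= threshold

        print(detect_feature(lm_prob=0.6, am_prob=0.7))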
  • [0058]
    Conclusion
  • [0059]
    As described herein, a linguistic segmentation tool 115 generates linguistic features for a transcribed document. The linguistic features are associated with the original document as meta-information that enriches the content of the document.
  • [0060]
    The foregoing description of preferred embodiments of the invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. Moreover, while a series of acts has been presented with respect to FIGS. 5 and 6, the order of the acts may be different in other implementations consistent with the present invention.
  • [0061]
    Certain portions of the invention have been described as software that performs one or more functions. The software may more generally be implemented as any type of logic. This logic may include hardware, such as an application-specific integrated circuit or a field-programmable gate array, software, or a combination of hardware and software.
  • [0062]
    No element, act, or instruction used in the description of the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used.
  • [0063]
    The scope of the invention is defined by the claims and their equivalents.
Classifications
U.S. Classification: 704/10
International Classification: G06F17/00, G06F7/00, G10L11/00, G10L15/26, G10L15/00, G06F17/28, G10L21/00, G06F17/21
Cooperative Classification: G10L25/78, G10L15/26, Y10S707/99943
European Classification: G10L25/78, G10L15/26A
Legal Events
Date / Code / Event / Description
2 Jul 2003 / AS / Assignment
Owner name: BBNT SOLUTIONS LLC, TEXAS
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SRIVASTAVA, AMIT;KUBALA, FRANCIS;REEL/FRAME:014265/0660;SIGNING DATES FROM 20030617 TO 20030618
12 May 2004 / AS / Assignment
Owner name: FLEET NATIONAL BANK, AS AGENT, MASSACHUSETTS
Free format text: PATENT & TRADEMARK SECURITY AGREEMENT;ASSIGNOR:BBNT SOLUTIONS LLC;REEL/FRAME:014624/0196
Effective date: 20040326
2 Mar 2006 / AS / Assignment
Owner name: BBN TECHNOLOGIES CORP., MASSACHUSETTS
Free format text: MERGER;ASSIGNOR:BBNT SOLUTIONS LLC;REEL/FRAME:017274/0318
Effective date: 20060103
27 Oct 2009 / AS / Assignment
Owner name: BBN TECHNOLOGIES CORP. (AS SUCCESSOR BY MERGER TO
Free format text: RELEASE OF SECURITY INTEREST;ASSIGNOR:BANK OF AMERICA, N.A. (SUCCESSOR BY MERGER TO FLEET NATIONAL BANK);REEL/FRAME:023427/0436
Effective date: 20091026