US20100250240A1 - System and method for training an acoustic model with reduced feature space variation


Info

Publication number: US20100250240A1 (application US12/413,896)
Authority: US (United States)
Prior art keywords: specific, renamed, combined, generating, transcription
Legal status: Granted; currently Active, adjusted expiration (legal status and assignee listings are Google assumptions, not legal conclusions)
Other versions: US8301446B2
Inventor: Chang-Qing Shu
Current assignee: Adacel Systems Inc
Original assignee: Adacel Systems Inc

Events:
    • Application filed by Adacel Systems Inc; priority to US12/413,896
    • Assigned to ADACEL SYSTEMS, INC. (assignor: SHU, Chang-qing)
    • Publication of US20100250240A1
    • Application granted; publication of US8301446B2
    • Security interest assigned to BANK OF MONTREAL (assignor: ADACEL SYSTEMS, INC.)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units


Abstract

Feature space variation associated with specific text elements is reduced by training an acoustic model with a phoneme set, dictionary and transcription set configured to better distinguish the specific text elements and at least some specific phonemes associated therewith. The specific text elements can include the most frequently occurring text elements from a text data set, which can include text data beyond the transcriptions of a training data set. The specific text elements can be identified using a text element distribution table sorted by occurrence within the text data set. Specific phonemes can be limited to consonant phonemes to improve speed and accuracy.

Description

    FIELD OF THE INVENTION
  • The present invention relates to speech recognition systems and methods, and more particularly, to training acoustic models for speech recognition engines.
  • BACKGROUND OF THE INVENTION
  • Referring to FIG. 8, in a typical speech recognition engine 1000, a signal 1002 corresponding to speech 1004 is fed into a front end module 1006. The front end module 1006 extracts feature data 1008 from the signal 1002. The feature data 1008 is input to a decoder 1010, which outputs recognized speech 1012. The recognized speech 1012 is then available as an input to an application 1014.
  • An acoustic model 1018 and a language model 1020 also supply inputs to the decoder 1010. Generally, the acoustic model 1018, also called a voice model, identifies the phonemes to which the feature data 1008 most likely correlate. The language model 1020 consists of context-dependent text elements, such as words, phrases and sentences, selected based on assumptions about what a user is likely to say. The language model 1020 cooperates with the acoustic model 1018 to assist the decoder 1010 in further constraining the possible options for the recognized speech 1012.
  • Referring to FIG. 9, by methods known in the art, the acoustic model 1018 is trained by a training system 1110. The training system 1110 includes a training module 1112 using a phoneme set 1114, a dictionary 1116 and a training data set. The dictionary includes a plurality of text elements, such as words and/or phrases, and their phonetic spellings using phonemes from the phoneme set. The training data set includes an audio file set 1118 including a plurality of audio files, such as wave files of recorded speech, and a transcription set 1120 including a plurality of transcriptions corresponding to the recorded speech in the audio files. Typically, the transcriptions are grouped into a single transcription file including a plurality of lines of transcribed text, each line of transcribed text including the name of the corresponding audio file.
  • In practice, the textual content of the training data set, represented by the transcriptions, is generally selected to cover a wide range of related applications. The resultant acoustic model can then be “shared” by all the related applications. While this approach saves the expense of training a separate acoustic model for individual applications, there is a corresponding loss in accuracy for the individual applications.
  • SUMMARY OF THE INVENTION
  • In view of the foregoing, it is an object of the present invention to provide an improved system and method for training an acoustic model. According to an embodiment of the present invention, an acoustic model training system includes a combined phoneme set with renamed specific phonemes, a combined dictionary with renamed specific text elements and corresponding renamed phonetic spellings, an audio file set, a combined transcription set corresponding to the audio file set and including transcriptions with renamed specific text elements, and a training module configured to train the acoustic model based on the audio file set, the combined transcription set, the combined phoneme set and the combined dictionary.
  • According to a method aspect of the present invention, a method of training an acoustic model includes generating a specific text element set as a subset of a general text element set corresponding to a text data set. A combined phoneme set and a combined dictionary are also generated. The combined phoneme set includes renamed specific phonemes corresponding to the specific text element set and the combined dictionary includes the specific text element set with phonetic spellings using the renamed specific phonemes. A combined transcription set is additionally generated, the combined transcription set including transcriptions with specific text elements from the specific text element set. The acoustic model is trained using the combined phoneme set, the combined dictionary, the combined transcription set and an audio file set.
  • According to another aspect of the present invention, the text data set is not limited to text data from a training data set for the acoustic model, and can include text data from additional and/or alternate texts. According to a further aspect of the present invention, the specific text element set includes the most frequently occurring text elements from the text data set. The most frequently occurring text elements can be identified using a text element distribution table sorted by occurrence within the text data set, or by other methods. According to another aspect of the present invention, specific phonemes can be limited to consonant phonemes.
  • Thus, the present invention reduces feature space variation associated with specific text elements by training an acoustic model with a phoneme set, dictionary and transcription set configured to better distinguish the specific text elements and at least some specific phonemes associated therewith. These and other objects, aspects and advantages of the present invention will be better appreciated in view of the following detailed description of a preferred embodiment and accompanying drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a plot of word distribution in a sample text data set;
  • FIG. 2 is a schematic overview of a system for acoustic model training, according to an embodiment of the present invention;
  • FIG. 3 is a flow diagram of a method of training an acoustic model, according to a method aspect of the present invention;
  • FIG. 4 is a portion of an exemplary word distribution table for use in connection with the method of FIG. 3;
  • FIG. 5 is a portion of an exemplary dictionary for use in connection with the method of FIG. 3;
  • FIG. 6 is a chart of phoneme sets for use in connection with the method of FIG. 3;
  • FIG. 7 is a portion of another exemplary dictionary for use in connection with the method of FIG. 3;
  • FIG. 8 is a schematic overview of a typical speech recognition engine; and
  • FIG. 9 is a schematic overview of a typical acoustic model training system.
  • DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT
  • Referring to FIG. 1, word distribution in a sample text data set is presented. The sample text data set includes 17,846 words with 1,056,134 tokens. A “token” is any occurrence of a text element (in this example, a word) within the text data set, regardless of whether the text element has occurred before. For instance, the word “the” might occur 83,000 times in the text data set. Thus, the word “the” contributes 83,000 tokens to the text data set.
  • In FIG. 1, the x-axis represents the words in the sample text data set sorted into a distribution table from most frequently occurring to least frequently occurring. Thus, the first word in the table is the most frequently occurring and the 17,846th word in the table is the least frequently occurring. The y-axis represents the cumulative occurrences of the words, as a percentage of the total number of tokens in the text data set. Thus, if the first word contributes 6% of the tokens and the second word contributes 4% of the tokens, the cumulative occurrence of the first word is 6%, the cumulative occurrence of the first two words is 10%, and so on. The cumulative occurrence of all 17,846 words is 100%.
  • It will be appreciated from FIG. 1 that a relatively small minority of words contributes a substantial majority of the tokens. The top 120 words, only 0.7% of the 17,846 total words, have a cumulative occurrence of approximately 50%. The top 1185 words, only 2.2% of the 17,846 total words, have a cumulative occurrence of approximately 80%. While these percentages will vary between text data sets, a text element distribution table for a given text data set will typically reveal that a small percentage of the total number of text elements contributes a large percentage of the total number of tokens.
  • Existing acoustic model training does not adequately differentiate between more and less frequently occurring text elements, such as words, in a data set. Because the same phoneme set is used for all text elements, the variation in feature space for infrequently occurring text elements can adversely affect the variation in feature space for the most frequently occurring text elements. By effectively differentiating specific text elements during acoustic model training, the recognition error rate for those specific text elements can be significantly reduced. Moreover, if the specific text elements are selected from among the most frequently occurring text elements in the text data set, the overall accuracy of the speech recognition engine can be increased significantly.
  • Referring to FIG. 2, according to another embodiment of the present invention, an acoustic model training system 10 includes a training module 12 that receives inputs from a combined phoneme set 14, a combined dictionary 16, a combined transcription set 18 and an audio file set 20. The system 10 further includes a feature space reduction module 30 that generates the combined phoneme set 14, combined dictionary 16, and combined transcription set 18 based on inputs from an ordinate phoneme set 32, a general dictionary 34 and a general transcription set 36. The feature space reduction module 30 can also accept inputs of additional text data 38 and weighting coefficients 40.
  • It will be appreciated that speech recognition engines are inherently machine processor-based. Accordingly, the systems and methods herein are realized by at least one processor executing machine-readable code, and the inputs to and outputs of the system or method are stored, at least temporarily, in some form of machine-readable memory. However, the present invention is not necessarily limited to particular processor types, numbers or designs, to particular code formats or languages, or to particular hardware or software memory media.
  • The combined phoneme set 14 includes ordinate phonemes and renamed specific phonemes. The renamed specific phonemes advantageously correspond to specific text elements selected from frequently used text elements in a text data set. The text data set can include, for example, the text data from the transcription set and/or the additional text data. Advantageously, the renamed specific phonemes correspond to only a portion of the phonemes required to phonetically spell the specific text elements. For example, the renamed specific phonemes can include only phonemes corresponding to consonants.
  • The combined dictionary 16 includes all the text elements in the training data set and renamed specific text elements corresponding to the specific text elements. The combined dictionary 16 further includes phonetic spellings for all the text elements therein. The spellings for the renamed specific text elements include the renamed specific phonemes. It will be appreciated that the spellings for the specific text elements can also include some ordinate phonemes. For example, if the renamed specific phonemes include only phonemes corresponding to consonants, the vowel phonemes in the spellings for the renamed specific text elements will be ordinate phonemes.
  • The combined transcription set 18 includes transcriptions of the recorded speech in the audio file set, including transcriptions having renamed specific text elements therein, as well as transcriptions without any renamed specific text elements. The audio files 20 are listed in an audio file list and the transcriptions are included in a transcription file.
  • Based on the combined phoneme set 14, combined dictionary 16, combined transcription set 18, and audio file set 20, the training module 12 trains the acoustic model according to various methods known in the art. In connection with acoustic model training, the training module 12 can be configured to identify context-dependent phonemes and sub-phonemes, as well as to develop other acoustic model attributes.
  • According to a method aspect of the present invention, referring to FIG. 3, a method 100 for training an acoustic model for a speech recognition engine begins at block 102. At block 104, a text element distribution table is generated from a text data set. The text data set can include, but is not limited to, text data from the transcriptions in a training data set for the acoustic model. The text data set can also include text data from alternate or additional texts.
  • The distribution table is generated by determining the occurrences of each text element within the training data set and sorting the text elements by occurrence. Advantageously, the cumulative occurrence of the text elements is also determined, beginning from the most frequently occurring text element in the distribution table.
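  • As an illustration, the following Python sketch (the function and variable names are ours, not the patent's) builds such a distribution table, with cumulative occurrence percentages, from a list of transcription strings:

```python
from collections import Counter

def build_distribution_table(transcriptions):
    """Count word occurrences across all transcriptions, sort by frequency,
    and attach each word's cumulative occurrence as a percentage of tokens."""
    counts = Counter(word for line in transcriptions for word in line.split())
    total_tokens = sum(counts.values())
    table, cumulative = [], 0.0
    for word, count in counts.most_common():
        cumulative += 100.0 * count / total_tokens
        table.append((word, count, cumulative))
    return table
```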
  • Referring to FIG. 4, a portion of an exemplary distribution table is presented, in which the top 22 words are shown from a text data set with 1056 words and 132,008 tokens. In this example, the top 22 words, 2.1% of the total number of words, have a cumulative occurrence of 50.11%. The ten words for single digit numbers (e.g., zero, one, two, etc., shaded in FIG. 4) have a cumulative occurrence of 32.91%.
  • Referring again to FIG. 3, at block 106, a specific text element set (TEs) is generated. TEs is a subset of a general text element set (TE1), such as all the words in the text data set. For example, TEs includes the ten words for single digit numbers from TE1, where TE1 includes all 1056 words. Advantageously, TEs is selected to include a low number of the most frequently occurring text elements whose spellings require only a small number of phonemes.
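  • One way to realize this selection is a cumulative-occurrence cutoff over the distribution table (the rule suggested by claim 6 below); in the current example, TEs is instead picked by hand as the ten digit words. A minimal Python sketch, with an illustrative threshold:

```python
def select_specific_text_elements(table, cumulative_target=33.0):
    """TEs: take words from the top of the distribution table until their
    cumulative occurrence reaches the target percentage."""
    te_s = set()
    for word, _count, cumulative in table:
        te_s.add(word)
        if cumulative >= cumulative_target:
            break
    return te_s
```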
  • At block 108, an intermediate specific phoneme (PHs) set is generated. PHs is a subset of an ordinate phoneme set (PH1) and includes only those phonemes needed for phonetic spellings of the text elements in TEs. The phonetic spellings of the text elements in TEs are determined from a general dictionary (DICT1), including TE1 and the corresponding phonetic spellings. An example of DICT1 entries for the single digit numbers is seen in FIG. 5. Referring to FIG. 6, in the current example, PH1 includes 41 ordinate phonemes, of which only 20 are required for PHs.
  • The present applicant has determined that consonant phonemes are particularly significant to speech recognition accuracy. Advantageously, the size of PHs is further reduced by selecting only the consonant phonemes from PHs to generate a further intermediate specific phoneme set (PHss). From FIG. 6, it will be appreciated that PHss uses only 10 of the 20 phonemes in PHs.
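  • With DICT1 represented as a mapping from each word to its phoneme list, PHs and PHss could be derived as in the sketch below; the consonant list shown is an assumed, ARPAbet-style stand-in for the consonant members of the 41-phoneme ordinate set:

```python
# Assumed consonant members of the ordinate phoneme set (ARPAbet-style).
CONSONANT_PHONEMES = {"B", "CH", "D", "F", "G", "JH", "K", "L", "M", "N",
                      "NG", "P", "R", "S", "SH", "T", "TH", "V", "W", "Z"}

def intermediate_specific_phonemes(te_s, dict1):
    """PHs: only those ordinate phonemes needed to spell the words in TEs."""
    return {phoneme for word in te_s for phoneme in dict1[word]}

def consonant_only(ph_s):
    """PHss: PHs reduced to its consonant phonemes."""
    return ph_s & CONSONANT_PHONEMES
```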
  • Referring again to FIG. 3, at block 110, a renamed specific phoneme set PHsr is generated. PHsr is generated by assigning a unique, new name to each phoneme in PHss. For example, the “B” phoneme is renamed “DB”, which new name does not occur elsewhere in PH1 or PHsr. At block 112, a combined phoneme set (PHcom) is generated by combining PH1 and PHsr. Referring again to FIG. 6, PHcom includes 51 phonemes: the 41 ordinate phonemes from PH1 and the 10 renamed specific phonemes from PHsr.
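  • A sketch of the renaming and combination steps follows; the "D" prefix mirrors the B to DB example and is only one possible unique-naming scheme:

```python
def rename_specific_phonemes(ph_ss, ph1, prefix="D"):
    """PHsr: map each specific phoneme to a unique new name (e.g., B -> DB)."""
    ph_sr = {phoneme: prefix + phoneme for phoneme in ph_ss}
    # The new names must not collide with any ordinate phoneme name.
    assert not set(ph_sr.values()) & set(ph1), "renamed phoneme collides with PH1"
    return ph_sr

def combined_phoneme_set(ph1, ph_sr):
    """PHcom: all ordinate phonemes plus the renamed specific phonemes."""
    return set(ph1) | set(ph_sr.values())
```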
  • Referring again to FIG. 3, at block 120, an intermediate specific dictionary (DICTs) is generated. DICTs is a subset of DICT1 including only the text elements from TEs and the corresponding phonetic spellings. Thus, in the current example, DICTs includes only those entries of DICT1 seen in FIG. 5. Referring again to FIG. 3, at block 122, a renamed specific dictionary (DICTsr) is generated. DICTsr is generated by using renamed specific phonemes from PHsr to replace ordinate phonemes in the DICTs spellings and by renaming each of the text elements from DICTs with a unique, new name. For example, “ONE” is renamed “DDONE”, which new name does not occur elsewhere in DICT1 or DICTsr. The DICTsr entries for the current example are seen in FIG. 7.
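  • Under the same dictionary representation, the DICTs and DICTsr steps might look like the following sketch; the "DD" word prefix mirrors the ONE to DDONE example, and ph_sr is the phoneme renaming map from the previous sketch:

```python
def intermediate_specific_dictionary(te_s, dict1):
    """DICTs: the DICT1 entries for the specific text elements only."""
    return {word: dict1[word] for word in te_s}

def renamed_specific_dictionary(dict_s, ph_sr, word_prefix="DD"):
    """DICTsr: rename each word and substitute renamed specific phonemes
    into its spelling (vowels keep their ordinate names)."""
    return {word_prefix + word: [ph_sr.get(ph, ph) for ph in spelling]
            for word, spelling in dict_s.items()}
```

  Applied to the FIG. 5 entries, this reproduces entries of the kind shown in FIG. 7, e.g. DDONE spelled DW AH DN.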
  • Referring again to FIG. 3, at block 124, a reduced dictionary (DICT1r) is generated. DICT1r is generated by removing the DICTs text elements and the corresponding spellings from DICT1. At block 126, a combined dictionary DICTcom is generated. The combined dictionary is generated from DICT1r and DICTsr, and can also include DICTs, as follows:

  • DICTcom = DICT1r + DICTsr + (δ × DICTs);
  • where δ = 1 or 0.
  • If δ=0, the TEs text elements will only be represented in DICTcom by the DICTsr entries during subsequent acoustic model training, but if δ=1 the TEs text elements will be represented by both the DICTsr and DICTs entries.
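  • Treating each dictionary as a Python mapping, the reduction and combination (blocks 124 and 126) could be sketched as follows:

```python
def combined_dictionary(dict1, dict_s, dict_sr, delta=1):
    """DICTcom = DICT1r + DICTsr + (delta x DICTs), with delta either 0 or 1."""
    dict_1r = {word: spelling for word, spelling in dict1.items()
               if word not in dict_s}              # reduced dictionary DICT1r
    dict_com = {**dict_1r, **dict_sr}
    if delta == 1:
        dict_com.update(dict_s)  # retain the original names and spellings too
    return dict_com
```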
  • At block 130, an intermediate specific transcription set (TRANSs) is generated. TRANSs is a subset of a general transcription set (TRANS1), including only transcriptions having one or more text elements from TEs. The transcriptions of TRANS1 correspond to the recorded speech in a plurality of audio files in a general audio file list (AUD1).
  • At block 132, a renamed specific transcription set (TRANSsr) is generated. TRANSsr is generated by renaming the text elements from TEs in TRANSs with the corresponding new, unique names from DICTsr. At block 134, a reduced transcription set (TRANS1r) is generated. TRANS1r is generated by removing the TRANSs transcriptions from TRANS1. At block 136, a combined transcription set (TRANScom) is generated. TRANScom is generated by combining TRANS1r and TRANSsr, and can also include TRANSs, as follows:

  • TRANScom = (λ1 × TRANS1r) + (λ2 × TRANSsr) + (λ3 × TRANSs);
  • where λ1, λ2 and λ3 are weighting coefficients; λ1 and λ2 can be 1, 2, ..., N and λ3 can be 0, 1, ..., M. However, λ3 should be 0 only if δ = 0.
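  • The patent does not prescribe a mechanism for the weighting coefficients; one natural reading is that each λ is an integer repetition count controlling its subset's share of the combined set, since standard trainers consume flat transcription lists. Under that assumption:

```python
def combined_transcription_set(trans1, te_s, rename, lam1=1, lam2=1, lam3=0):
    """TRANScom = lam1*TRANS1r + lam2*TRANSsr + lam3*TRANSs, where each
    lambda repeats its subset that many times; rename maps each TEs word
    to its new name (e.g., ONE -> DDONE)."""
    trans_s = [t for t in trans1 if any(w in te_s for w in t.split())]
    specific = set(trans_s)
    trans_1r = [t for t in trans1 if t not in specific]
    trans_sr = [" ".join(rename.get(w, w) for w in t.split()) for t in trans_s]
    return trans_1r * lam1 + trans_sr * lam2 + trans_s * lam3
```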
  • Once TRANScom is generated, AUD1 should be updated to reflect the transcriptions in TRANScom. The training of the acoustic model (block 140) can then be performed using PHcom, DICTcom, TRANScom and the audio files in the updated list (AUD2).
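  • Assuming each transcription line begins with the name of its audio file, as described for the transcription file above, the audio list update could be sketched as:

```python
def updated_audio_list(aud1, trans_com):
    """AUD2: keep only the audio files still referenced by TRANScom."""
    referenced = {line.split()[0] for line in trans_com}
    return [audio_file for audio_file in aud1 if audio_file in referenced]
```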
  • At block 142, the method can further include generating a dictionary for the decoder (DICTdec) of the speech recognition engine. Generating DICTdec includes removing the renamed text elements from DICTcom. If DICTcom does not include the unrenamed specific text elements (e.g., if δ=0), then the unrenamed specific text elements are substituted for the renamed specific text elements in the DICTdec entries. In addition to the spelling with renamed specific phonemes, the entry for each specific text element in DICTdec is supplemented with the spelling using only ordinate phonemes. If DICTcom includes the specific text elements with their original names and spellings (e.g., if δ=1), then the renamed spellings from DICTcom are associated with the entries of specific text elements as an alternate spelling for the DICTdec entries.
  • In either case (e.g., δ=0 or δ=1), each of the resulting entries in DICTdec for the specific text elements includes the unrenamed specific text element, the spelling with only ordinate phonemes and the spelling that includes renamed specific phonemes. For example, using the spellings for “ONE” and “DDONE” of the current example (see FIGS. 5 and 7), the entry for “ONE” in DICTdec would include:
  • ONE (1) W AH N
    (2) DW AH DN
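  • A sketch of the DICTdec construction under the same assumptions, producing for each specific word exactly the two spellings shown above:

```python
def decoder_dictionary(dict1, dict_sr, te_s, word_prefix="DD"):
    """DICTdec: every word keeps its original name; each specific word lists
    its ordinate spelling plus its renamed-phoneme spelling as an alternate."""
    dict_dec = {word: [spelling] for word, spelling in dict1.items()}
    for word in te_s:
        dict_dec[word].append(dict_sr[word_prefix + word])
    return dict_dec
```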
  • The method 100 ends at block 144, although the method can be repeated as desired or required to further optimize the acoustic model. In particular, the contents of TEs and PHss, and the values of δ, λ1, λ2 and λ3, can be varied to optimize the accuracy of the acoustic model.
  • For instance, by setting δ=0, the ordinate phonemes may be more completely isolated from the renamed specific phonemes. However, this can eliminate a large amount of the available audio data used to determine feature space for the remaining general text elements and ordinate phonemes during acoustic model training, which may reduce recognition accuracy for the remaining general text elements. Depending on the size of the training data set, it may be advantageous to increase the audio data available for ordinate phonemes by setting δ=1 and further adjusting λ1, λ2 and λ3. In this way, the audio data associated with the specific text elements and specific phonemes can still be utilized in determining the feature space for the remaining general text elements and their ordinate phonemes.
  • Additionally, it will be appreciated that all the method steps enumerated above are not necessary for every execution of the method 100 for training an acoustic model. Also, the steps are not necessarily limited to the sequence described, and many steps can be performed in other orders, in parallel, or iteratively. Moreover, the present invention can be employed for the initial training of an acoustic model, as well as for re-training a previously trained acoustic model.
  • Also, while it is advantageous to form the specific text element set from the most frequently occurring text elements in the data set based on a text distribution table, the present invention is not necessarily limited to a particular mechanism for identifying the most frequently occurring text elements, or to selecting only the most frequently occurring text elements. For example, the specific text element set could include text elements that occur less frequently but are critical to the success or safety of a particular application.
  • From the foregoing, it will be appreciated that aspects of the present invention allow several improvements to acoustic model training, and correspondingly, to the accuracy of the speech recognition engines. For instance, the present invention allows improved isolation of the feature space for specific text elements from the remaining text elements. Additionally, while permitting this improved isolation, the present invention still allows acoustic data associated with tokens of the specific phonemes and specific text elements to contribute to the development of feature space for ordinate phonemes and the remaining text elements. Moreover, the present invention allows a reduction in the number of renamed specific phonemes, thus reducing the size of the combined phoneme set used to train the acoustic model, while still generally increasing both speed and accuracy.
  • Furthermore, the present invention provides an effective method for selecting the most frequently occurring specific text elements from a text data set, which can include use of a text distribution table. By allowing text data from additional texts beyond those of the training data set to be considered when determining text element occurrence, the present invention can allow for more accurate identification of the text elements most frequently used in a given application.
  • In general, the foregoing description is provided for exemplary and illustrative purposes; the present invention is not necessarily limited thereto. Rather, those skilled in the art will appreciate that additional modifications, as well as adaptations for particular circumstances, will fall within the scope of the invention as herein shown and described and the claims appended hereto.

Claims (30)

1. A method of training an acoustic model, the method comprising:
generating a specific text element set as a subset of a general text element set;
generating a combined phoneme set, the combined phoneme set including renamed specific phonemes corresponding to the specific text element set;
generating a combined dictionary, the combined dictionary including renamed specific text elements from the specific text element set with phonetic spellings including the renamed specific phonemes;
generating a combined transcription set, the combined transcription set including transcriptions with the renamed specific text elements; and
training the acoustic model using the combined phoneme set, the combined dictionary, the combined transcription set and an audio file set.
2. The method of claim 1, wherein the specific text element set is at least one of a specific word set or a specific phrase set.
3. The method of claim 1, wherein the renamed specific phonemes include only phonemes corresponding to consonants.
4. The method of claim 1, further comprising generating a text element distribution table, the distribution table including the general text element set sorted by occurrence within a text data set.
5. The method of claim 4, wherein generating the specific text element set includes selecting specific text elements from the distribution table based on frequency of occurrence within the data set.
6. The method of claim 5, wherein selecting specific text elements from the distribution table based on frequency of occurrence includes selecting a minimum number of specific text elements sufficient to contribute a predetermined cumulative occurrence within the data set.
7. The method of claim 4, wherein the text data set includes text data from texts different than the transcriptions.
8. The method of claim 1, wherein generating the combined phoneme set includes generating an intermediate specific phoneme set, the intermediate specific phoneme set including specific phonemes from an ordinate phoneme set, the specific phonemes including only those ordinate phonemes needed for phonetic spellings of the specific text element set.
9. The method of claim 8, wherein only the specific phonemes corresponding to consonants are retained in the intermediate specific phoneme set.
10. The method of claim 8, wherein generating the combined phoneme set further includes generating a renamed specific phoneme set by renaming the specific phonemes with unique, new names.
11. The method of claim 10, wherein generating the combined phoneme set further includes combining the renamed specific phoneme set with the ordinate phoneme set.
12. The method of claim 1, wherein generating the combined dictionary includes generating an intermediate specific dictionary by selecting, from a general dictionary, only specific text elements from the specific text element set and corresponding phonetic spellings.
13. The method of claim 12, wherein generating the combined dictionary further includes generating a renamed specific dictionary by renaming the specific text elements with unique, new names and respelling the renamed specific text elements with the renamed specific phonemes.
14. The method of claim 12, wherein generating the combined dictionary further includes generating a reduced dictionary by removing the specific text elements and the corresponding phonetic spellings from the general dictionary, and combining the reduced dictionary and the renamed specific dictionary.
15. The method of claim 14, wherein generating the combined dictionary further includes selectively combining the reduced dictionary and the renamed specific dictionary with the intermediate specific dictionary.
16. The method of claim 1, wherein generating the combined transcription set includes generating an intermediate specific transcription set including only those transcriptions from a general transcription set that include at least one specific text element from the specific text element set.
17. The method of claim 16, wherein generating the combined transcription set further includes generating a renamed specific transcription set by replacing the specific text elements with the renamed specific text elements.
18. The method of claim 17, wherein generating the combined transcription set further includes generating a reduced transcription set by removing the transcriptions of the intermediate specific transcription set from the general transcription set, and combining at least the reduced transcription set and the renamed specific transcription set.
19. The method of claim 18, wherein generating the combined transcription set further includes selectively combining the reduced transcription set and the renamed specific transcription set with the intermediate specific transcription set.
20. The method of claim 19, wherein selectively combining the reduced transcription set and the renamed specific transcription set with the intermediate specific transcription set further includes applying weighting coefficients to each transcription set.
21. An acoustic model training system comprising:
a combined phoneme set including renamed specific phonemes;
a combined dictionary including renamed specific text elements with corresponding phonetic spellings using the renamed specific phonemes;
an audio file set;
a combined transcription set corresponding to the audio file set and including transcriptions with the renamed specific text elements; and
a training module configured to train the acoustic model based on the audio file set, the combined transcription set, the combined phoneme set and the combined dictionary.
22. The system of claim 21, wherein the combined phoneme set further includes ordinate phonemes.
23. The system of claim 21, wherein the combined dictionary further includes general text elements with corresponding ordinate phonetic spellings.
24. The system of claim 23, wherein the general text elements include unrenamed specific text elements.
25. The system of claim 21, wherein the combined transcription set further includes transcriptions without any renamed specific text elements.
26. A method of training an acoustic model, the method comprising:
generating a specific word set based on frequency of occurrence within a text data set;
generating a phoneme set including renamed specific phonemes used in phonetic spellings of the specific word set;
generating a dictionary including renamed specific words of the specific word set with phonetic spellings including the renamed specific phonemes;
generating a transcription set including transcriptions having the renamed specific words therein; and
training the acoustic model based on the phoneme set, the dictionary, the transcription set and an audio file set.
27. The method of claim 26, wherein the text data set differs from a training data set for the acoustic model.
28. The method of claim 26, wherein the renamed specific phonemes include only phonemes corresponding to consonants.
29. The method of claim 26, wherein generating the dictionary and the transcription set include selectively including unrenamed specific words with phonetic spellings without renamed specific phonemes.
30. The method of claim 29, wherein selectively including unrenamed specific words with phonetic spellings without renamed specific phonemes includes setting weighting coefficients to determine a relative contribution of the unrenamed specific words.
US12/413,896 2009-03-30 2009-03-30 System and method for training an acoustic model with reduced feature space variation Active 2031-08-31 US8301446B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/413,896 US8301446B2 (en) 2009-03-30 2009-03-30 System and method for training an acoustic model with reduced feature space variation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US12/413,896 US8301446B2 (en) 2009-03-30 2009-03-30 System and method for training an acoustic model with reduced feature space variation

Publications (2)

Publication Number Publication Date
US20100250240A1 (en) 2010-09-30
US8301446B2 (en) 2012-10-30

Family

ID=42785340

Family Applications (1)

Application Number Title Priority Date Filing Date
US12/413,896 Active 2031-08-31 US8301446B2 (en) 2009-03-30 2009-03-30 System and method for training an acoustic model with reduced feature space variation

Country Status (1)

Country Link
US (1) US8301446B2 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110022386A1 (en) * 2009-07-22 2011-01-27 Cisco Technology, Inc. Speech recognition tuning tool
US20110307252A1 (en) * 2010-06-15 2011-12-15 Microsoft Corporation Using Utterance Classification in Telephony and Speech Recognition Applications
WO2012145519A1 (en) * 2011-04-20 2012-10-26 Robert Bosch Gmbh Speech recognition using multiple language models
US20150066503A1 (en) * 2013-08-28 2015-03-05 Verint Systems Ltd. System and Method of Automated Language Model Adaptation
US20150066502A1 (en) * 2013-08-28 2015-03-05 Verint Systems Ltd. System and Method of Automated Model Adaptation
US20160189103A1 (en) * 2014-12-30 2016-06-30 Hon Hai Precision Industry Co., Ltd. Apparatus and method for automatically creating and recording minutes of meeting

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9135912B1 (en) * 2012-08-15 2015-09-15 Google Inc. Updating phonetic dictionaries
US9443433B1 (en) 2015-04-23 2016-09-13 The Boeing Company Method and system to monitor for conformance to a traffic control instruction
TWI610294B (en) 2016-12-13 2018-01-01 財團法人工業技術研究院 Speech recognition system and method thereof, vocabulary establishing method and computer program product

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5865626A (en) * 1996-08-30 1999-02-02 Gte Internetworking Incorporated Multi-dialect speech recognition method and apparatus
US6085160A (en) * 1998-07-10 2000-07-04 Lernout & Hauspie Speech Products N.V. Language independent speech recognition
US6236965B1 (en) * 1998-11-11 2001-05-22 Electronic Telecommunications Research Institute Method for automatically generating pronunciation dictionary in speech recognition system
US7139708B1 (en) * 1999-03-24 2006-11-21 Sony Corporation System and method for speech recognition using an enhanced phone set
US6434521B1 (en) * 1999-06-24 2002-08-13 Speechworks International, Inc. Automatically determining words for updating in a pronunciation dictionary in a speech recognition system
US6389394B1 (en) * 2000-02-09 2002-05-14 Speechworks International, Inc. Method and apparatus for improved speech recognition by modifying a pronunciation dictionary based on pattern definitions of alternate word pronunciations
US20020173958A1 (en) * 2000-02-28 2002-11-21 Yasuharu Asano Speech recognition device and speech recognition method and recording medium
US20020173966A1 (en) * 2000-12-23 2002-11-21 Henton Caroline G. Automated transformation from American English to British English
US20020111805A1 (en) * 2001-02-14 2002-08-15 Silke Goronzy Methods for generating pronunciation variants and for recognizing speech
US20020156627A1 (en) * 2001-02-20 2002-10-24 International Business Machines Corporation Speech recognition apparatus and computer system therefor, speech recognition method and program and recording medium therefor
US7113908B2 (en) * 2001-03-07 2006-09-26 Sony Deutschland Gmbh Method for recognizing speech using eigenpronunciations
US20030110035A1 (en) * 2001-12-12 2003-06-12 Compaq Information Technologies Group, L.P. Systems and methods for combining subword detection and word detection for processing a spoken input
US20050256715A1 (en) * 2002-10-08 2005-11-17 Yoshiyuki Okimoto Language model generation and accumulation device, speech recognition device, language model creation method, and speech recognition method
US7467087B1 (en) * 2002-10-10 2008-12-16 Gillick Laurence S Training and using pronunciation guessers in speech recognition
US20040176946A1 (en) * 2002-10-17 2004-09-09 Jayadev Billa Pronunciation symbols based on the orthographic lexicon of a language
US20040210438A1 (en) * 2002-11-15 2004-10-21 Gillick Laurence S Multilingual speech recognition
US20040172247A1 (en) * 2003-02-24 2004-09-02 Samsung Electronics Co., Ltd. Continuous speech recognition method and system using inter-word phonetic information
US20050197835A1 (en) * 2004-03-04 2005-09-08 Klaus Reinhard Method and apparatus for generating acoustic models for speaker independent speech recognition of foreign words uttered by non-native speakers
US20070294082A1 (en) * 2004-07-22 2007-12-20 France Telecom Voice Recognition Method and System Adapted to the Characteristics of Non-Native Speakers
US20060224384A1 (en) * 2005-03-31 2006-10-05 International Business Machines Corporation System and method for automatic speech recognition
US20090024392A1 (en) * 2006-02-23 2009-01-22 Nec Corporation Speech recognition dictionary compilation assisting system, speech recognition dictionary compilation assisting method and speech recognition dictionary compilation assisting program
US20100145707A1 (en) * 2008-12-04 2010-06-10 At&T Intellectual Property I, L.P. System and method for pronunciation modeling

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110022386A1 (en) * 2009-07-22 2011-01-27 Cisco Technology, Inc. Speech recognition tuning tool
US9183834B2 (en) * 2009-07-22 2015-11-10 Cisco Technology, Inc. Speech recognition tuning tool
US20110307252A1 (en) * 2010-06-15 2011-12-15 Microsoft Corporation Using Utterance Classification in Telephony and Speech Recognition Applications
US8972260B2 (en) 2011-04-20 2015-03-03 Robert Bosch Gmbh Speech recognition using multiple language models
WO2012145519A1 (en) * 2011-04-20 2012-10-26 Robert Bosch Gmbh Speech recognition using multiple language models
US20150066503A1 (en) * 2013-08-28 2015-03-05 Verint Systems Ltd. System and Method of Automated Language Model Adaptation
US20150066502A1 (en) * 2013-08-28 2015-03-05 Verint Systems Ltd. System and Method of Automated Model Adaptation
US9508346B2 (en) * 2013-08-28 2016-11-29 Verint Systems Ltd. System and method of automated language model adaptation
US9633650B2 (en) * 2013-08-28 2017-04-25 Verint Systems Ltd. System and method of automated model adaptation
US9990920B2 (en) 2013-08-28 2018-06-05 Verint Systems Ltd. System and method of automated language model adaptation
US10733977B2 (en) 2013-08-28 2020-08-04 Verint Systems Ltd. System and method of automated model adaptation
US11545137B2 (en) * 2013-08-28 2023-01-03 Verint Systems Inc. System and method of automated model adaptation
US20160189103A1 (en) * 2014-12-30 2016-06-30 Hon Hai Precision Industry Co., Ltd. Apparatus and method for automatically creating and recording minutes of meeting

Also Published As

Publication number Publication date
US8301446B2 (en) 2012-10-30

Similar Documents

Publication Publication Date Title
US8301446B2 (en) System and method for training an acoustic model with reduced feature space variation
US20070255567A1 (en) System and method for generating a pronunciation dictionary
Schuster et al. Japanese and Korean voice search
US8126714B2 (en) Voice search device
Kwon et al. Korean large vocabulary continuous speech recognition with morpheme-based recognition units
US6738741B2 (en) Segmentation technique increasing the active vocabulary of speech recognizers
Masmoudi et al. A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition.
US20080027725A1 (en) Automatic Accent Detection With Limited Manually Labeled Data
Alghamdi et al. Automatic restoration of Arabic diacritics: a simple, purely statistical approach
Kirchhoff et al. Novel speech recognition models for Arabic
US20060241936A1 (en) Pronunciation specifying apparatus, pronunciation specifying method and recording medium
Diehl et al. Morphological decomposition in Arabic ASR systems
Lileikytė et al. Conversational telephone speech recognition for Lithuanian
Niesler et al. Phonetic analysis of Afrikaans, English, Xhosa and Zulu using South African speech databases
Zhao et al. Automatic interlinear glossing for under-resourced languages leveraging translations
Alsharhan et al. Evaluating the effect of using different transcription schemes in building a speech recognition system for Arabic
Masmoudi et al. Phonetic tool for the Tunisian Arabic
Ananthakrishnan et al. Automatic diacritization of Arabic transcripts for automatic speech recognition
Kłosowski Statistical analysis of orthographic and phonemic language corpus for word-based and phoneme-based Polish language modelling
Nikulasdóttir et al. Open ASR for Icelandic: Resources and a baseline system
Manghat et al. Hybrid sub-word segmentation for handling long tail in morphologically rich low resource languages
Pellegrini et al. Automatic word decompounding for ASR in a morphologically rich language: Application to Amharic
Schubotz et al. Y’know vs. you know: What phonetic reduction can tell us about pragmatic function
Shan et al. Search by voice in Mandarin Chinese
Pellegrini et al. Investigating automatic decomposition for ASR in less represented languages

Legal Events

Date Code Title Description

AS Assignment
Owner name: ADACEL SYSTEMS, INC., FLORIDA
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:SHU, CHANG-QING;REEL/FRAME:022988/0783
Effective date: 20090720

STCF Information on status: patent grant
Free format text: PATENTED CASE

FPAY Fee payment
Year of fee payment: 4

MAFP Maintenance fee payment
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2552); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY
Year of fee payment: 8

AS Assignment
Owner name: BANK OF MONTREAL, CANADA
Free format text: SECURITY INTEREST;ASSIGNOR:ADACEL SYSTEMS, INC.;REEL/FRAME:061367/0019
Effective date: 20220930