WO2000055842A2 - Speech synthesis - Google Patents
- Publication number
- WO2000055842A2 (PCT/GB2000/000854)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- text
- input
- prosodic
- words
- boundaries
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
Definitions
- the present invention relates to a method and apparatus for converting text to speech.
- Although text-to-speech conversion apparatus has improved markedly over recent years, the sound of such apparatus reading a piece of text is still distinguishable from the sound of a human reading the same text.
- text-to-speech converters occasionally apply phrasing that differs from that which would be applied by a human reader. This makes speech synthesised from text more onerous to listen to than speech read by a human.
- a method of converting text to speech comprising the steps of: receiving an input word sequence in the form of text; comparing said input word sequence with each one of a plurality of reference word sequences provided with phrasing information; identifying one or more reference word sequences which most closely match said input word sequence; and predicting phrasing for a synthesised spoken version of the input text on the basis of the phrasing information included with said one or more most closely matching reference word sequences.
- the method involves the matching of syntactic characteristics of words or groups of words. It could instead involve the matching of the words themselves, but that would require a large amount of storage and processing power.
- the method could compare the role of the words in the sentence - i.e. it could identify words or groups of words as the subject, verb or object of a sentence etc. and then look for one or more reference sentences with a similar pattern of subject, verb, object etc.
- the method further comprises the step of identifying clusters of words in the input text which are unlikely to include prosodic phrase boundaries.
- the reference sentences are further provided with information identifying such clusters of words within them.
- the comparison step then comprises a plurality of per-cluster comparisons.
- Measures of similarity between the input clusters and reference clusters which might be used include:
- c) measures of similarity in the number of words or syllables in the input cluster and the reference cluster;
- d) measures of similarity in the role (e.g. subject, verb, object) of the input cluster and the reference cluster;
- f) measures of similarity in word grouping information, such as the start and end of sentences and paragraphs;
- g) measures of similarity in whether new or previously given information is being presented in the cluster.
- the comparison comprises measuring the similarity in the positions of prosodic boundaries previously predicted for the input sentence and the positions of the prosodic boundaries in the reference sequences. In a preferred embodiment a weighted combination of all the above measures is used.
- a text to speech conversion apparatus comprising: a word sequence store storing a plurality of reference word sequences which are provided with prosodic boundary information; a program store storing a program; a processor in communication with said program store and the word sequence store; means for receiving an input word sequence in the form of text; wherein said program controls said processor to: compare said input word sequence with each one of a plurality of said reference word sequences; identify one or more reference word sequences which most closely match said input word sequence; and derive prosodic boundary information for the input text on the basis of the prosodic boundary information included with said one or more most closely matching reference word sequences.
- a program storage device readable by a computer, said device embodying computer readable code executable by the computer to perform a method according to the first aspect of the present invention.
- Figure 1 shows the hardware used in providing a first embodiment of the present invention;
- Figures 2A and 2B show the top-level design of a text-to-speech conversion program which controls the operation of the hardware shown in Figure 1;
- Figures 3A and 3B show the text analysis process of Figure 2A in more detail;
- Figure 4 is a diagram showing part of a syntactic classification of words.
- Figure 5 is a flow chart illustrating the prosodic structure assignment process of Figure 2B.
- Figure 1 shows a hardware configuration of a personal computer operable to provide a first embodiment of the present invention.
- the computer has a central processing unit 10 which is connected by data lines to a Random Access Memory (RAM) 12, a hard disc 14, a CD-ROM drive 16, input/output peripherals 18, 20, 22 and two interface cards 24, 28.
- the input/output peripherals include a visual display unit 18, a keyboard 20 and a mouse 22.
- the interface cards comprise a sound card 24 which connects the computer to a loudspeaker 26 and a network card 28 which connects the computer to the Internet 30.
- a CD-ROM 32 carries: a) software which the computer can execute to provide the user with a text-to-speech facility; and b) five databases used in the text-to-speech conversion process.
- the user loads the CD-ROM 32 into the CD-ROM drive 16 and then, using the keyboard 20 and the mouse 22, causes the computer to copy the software and databases from the CD-ROM 32 to the hard disc 14.
- the user can then select a text-representing file (such as an e-mail loaded into the computer from the Internet 30) and run the text-to-speech program to cause the computer to produce a spoken version of the e-mail via the loudspeaker 26.
- On running the text-to-speech program both the program itself and the databases are loaded into the RAM 12.
- the text-to-speech program then controls the computer to carry out the functions illustrated in Figures 2A and 2B.
- the computer first carries out text analysis process 42 on the e-mail (shown as text 40) which the user has indicated he wishes to be converted to speech.
- the text analysis process 42 uses a lexicon 44 (the first of the five databases stored on the CD-ROM 32) to generate word grouping data 46, syntactic information 48 and phonetic transcription data 49 concerning the text-file 40.
- the output data 46, 48, 49 is stored in the RAM 12.
- the program controls the computer to carry out the prosodic structure prediction process 50.
- the process 50 operates on the syntactic data 48 and word grouping data 46 stored in RAM 12 to produce phrase boundary data 54.
- the phrase boundary data 54 is also stored in RAM 12.
- the prosodic structure prediction process 50 uses the prosodic structure corpus 52 (which is the second of the five databases stored on the CD-ROM 32). The process will be described in more detail (with reference to Figures 4 and 5) below.
- the program controls the computer to carry out prosody prediction process (Figure 2B, 56) to generate performance data 58 which includes data on the pitch, amplitude and duration of phonemes to be used in generating the output speech 72.
- a description of the prosody prediction process 56 is given in Edgington M et al: 'Overview of current text-to-speech techniques part 2 - prosody and speech synthesis', BT Technology Journal, Volume 14, No. 1, pp 84-99 (January 1996). The disclosure of that paper (hereinafter referred to as part 2 of the BTTJ article) is hereby incorporated herein by reference.
- the computer performs a speech sound generation process 62 to convert the phonetic transcription data 49 to a raw speech waveform 66.
- the process 62 involves the concatenation of segments of speech waveforms stored in a speech waveform database 64 (the speech waveform database is the third of the five databases stored on the CD-ROM 32).
- Suitable methods for carrying out the speech sound generation process 62 are disclosed in the applicant's European patent no. 0 712 529 and European patent application no. 95302474.9. Further details of such methods can be found in part 2 of the BTTJ article.
- the computer carries out a prosody and speech combination process 70 to manipulate the raw speech waveform data 66 in accordance with the performance data 58 to produce speech data 72.
- suitable software to carry out combination process 70.
- Part 2 of the BTTJ article describes the process 70 in more detail.
- the program then controls the computer to forward the speech data 72 to the sound card 24 where it is converted to an analogue electrical signal which is used to drive loudspeaker 26 to produce a spoken version of the text file 40.
- the text analysis process 42 is illustrated in more detail in Figures 3A and 3B.
- the program first controls the computer to execute a segmentation and normalisation process (Figure 3A, 80).
- the normalisation aspect of the process 80 involves the expansion of numerals, abbreviations, and amounts of money into the form of words, thereby generating an expanded text file 88.
- for example, '£100' in the text file 40 is expanded to 'one hundred pounds' in the expanded text file 88.
- the normalisation makes use of an abbreviations database 82, which is the fourth of the five databases stored on the CD-ROM 32.
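The kind of expansion performed during normalisation can be illustrated with a minimal Python sketch. It is not the program described here: the function names, the two-entry abbreviation table standing in for the abbreviations database 82, and the restriction to whole amounts below one thousand are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for the abbreviations database 82.
ABBREVIATIONS = {"Mr": "Mister", "Dr": "Doctor"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 in British English."""
    if not 0 <= n < 1000:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" and " + number_to_words(rest) if rest else "")


def normalise(text: str) -> str:
    """Expand money amounts, then bare numerals, then known abbreviations."""
    def money(match):
        n = int(match.group(1))
        return number_to_words(n) + (" pound" if n == 1 else " pounds")

    text = re.sub(r"£(\d{1,3})\b", money, text)
    text = re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())), text)
    pattern = r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b\.?"
    return re.sub(pattern, lambda m: ABBREVIATIONS[m.group(1)], text)
```

Amounts are expanded before bare numerals so that the currency word ends up after the number, as in the '£100' example.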
- the segmentation aspect of the process 80 involves the addition of start-of-sentence, end-of-sentence, start-of-paragraph and end-of-paragraph markers to the text, thereby producing the word grouping data (Figure 2A:46) which comprises sentence markers 86 and paragraph markers 87.
- the segmentation and normalisation process 80 is conventional; a fuller description of it can be found in Edgington M et al: 'Overview of current text-to-speech techniques part 1 - text and linguistic analysis', BT Technology Journal, Volume 14, No. 1, pp 68-83 (January 1996).
- the disclosure of that paper (hereinafter referred to as part 1 of the BTTJ article) is hereby incorporated herein by reference.
- the computer is then controlled by the program to run a pronunciation and tagging process 90 which converts the expanded text file 88 to an unresolved phonetic transcription file 92 and adds tags 93 to words indicating their syntactic characteristics (or a plurality of possible syntactic characteristics).
- the process 90 makes use of the lexicon 44 which outputs possible word tags 93 and corresponding phonetic transcriptions of input words.
- the phonetic transcription 92 is unresolved to the extent that some words (e.g. 'live') are pronounced differently when playing different roles in a sentence.
- the pronunciation process is conventional - more details are to be found in part 1 of the BTTJ article.
- the program then causes the computer to run a conventional parsing process 94.
- a more detailed description of the parsing process can be found in part 1 of the BTTJ article.
- the parsing process 94 begins with a stochastic tagging procedure which resolves the syntactic characteristic associated with each one of the words for which the pronunciation and tagging process 90 has given a plurality of possible syntactic characteristics.
- the unresolved word tags data 93 is thereby turned into word tags data 95. Once that has been done, the correct pronunciation of the word is identified to form phonetic transcription data 97.
- the parsing process 94 then assigns syntactic labels 96 to groups of words.
- SENTSTART and SENTEND represent the sentence markers 86, RR, NP1 etc. represent the word tag data 95, and <ADV ADV>, (NR NR) etc. represent the syntactic groups 96.
- the meanings of the word tags used in this description will be understood by those skilled in the art. A subset of the word tags used is given in Table 1 below; a full list can be found in Garside, R., Leech, G. and Sampson, G. eds 'The Computational Analysis of English: A Corpus-based Approach', Longman (1987).
- in chunking process 98, the program controls the computer to label 'chunks' in the input sentence.
- the syntactic groups shown in Table 2 below are identified as chunks.
- NR: noun phrase (non referent), e.g. (NR Dinamo_NP1 Kiev_NP1 NR)
- Chunks are regarded as elements, as are sentence markers, paragraph markers, punctuation marks and words which do not fall inside chunks. Each chunk has a marker applied to it which identifies it as a chunk. These markers constitute chunk markers 99.
- phrasetag((NR) Chinese_NP1 phrasetag([VG) became_VVD phrasetag(<ADJ) popular_JJ phrasetag[pp after_ICS (NR a_AT1 rumour_NN1 NR) pp] phrasetag[VG got_VVD about_RP VG] that_CST phrasetag(NR Mrs_NNSB1 Thatcher_NP1 NR) phrasetag[VG had_VHD declared_VVN VG] phrasetag(NR open_JJ house_NNL1 NR)
- the computer then carries out classification process 100 under control of the program.
- the classification process 100 uses a classification of words and pronunciation database 100A.
- the classification database 100A is the fifth of the five databases stored on the CD-ROM 32.
- the classification database is divided into classes which broadly correspond to parts-of-speech.
- verbs, adverbs and adjectives are classes of words. Punctuation is also treated as a class of words.
- the classification is hierarchical, so many of the classes of words are themselves divided into sub-classes.
- the sub-classes contain a number of word categories which correspond to the word tags 95 applied to words in the input text 40 by the parsing process 94. Some of the sub-classes contain only one member, so they are not divided further.
- Part of the classification (the part relating to verbs, prepositions and punctuation) used in the present embodiment is given in Table 4 below.
- the left-hand column of Table 4 contains the classes, the central column contains the sub-classes and the right-hand column contains the word categories.
- Figure 4 shows part of the classification of verbs.
- the class of words 'verbs' includes four sub-classes, one of which contains only the word category 'RP'.
- the other sub-classes ('beverbs', 'doverbs', and 'past') each contain a plurality of word categories.
- the sub-class 'doverbs' contains the word categories corresponding to the word tags VDO, VDG, VDN, and VDZ.
- the computer first identifies a core word contained within each chunk in the input text 40.
- the core word in a prepositional chunk i.e. one labelled 'pp' or 'vpp'
- the core word in a chunk labelled 'WH' or 'WHADV' is the first word in the chunk.
- in all other cases, the core word is the last word in the chunk.
- the computer uses the classification of words 100A to label each chunk with the class, sub-class and word category of the core word.
- Each non-chunk word is similarly labelled on the basis of the classification of words 100A, as is each piece of punctuation.
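The core-word selection and labelling just described can be sketched in Python as follows. This is an illustrative sketch only. The lookup table is a hypothetical fragment: only the membership of 'doverbs' is taken from the text, and the name given to the single-member sub-class holding 'RP' is an assumption. The rule for prepositional chunks is not stated in full in this text, so the sketch deliberately leaves it unimplemented.

```python
# Hypothetical fragment of the classification of words 100A:
# word category -> (class, sub-class).
CLASSIFICATION = {
    "VDO": ("verbs", "doverbs"),
    "VDG": ("verbs", "doverbs"),
    "VDN": ("verbs", "doverbs"),
    "VDZ": ("verbs", "doverbs"),
    "RP": ("verbs", "RP"),  # sub-class name assumed
}


def core_word(chunk_label, words):
    """Pick the core (word, tag) pair from a chunk given as a list of pairs."""
    if chunk_label in ("WH", "WHADV"):
        return words[0]
    if chunk_label in ("pp", "vpp"):
        # The rule for prepositional chunks is not given in this text.
        raise NotImplementedError("prepositional chunk rule not specified")
    return words[-1]


def classify_chunk(chunk_label, words):
    """Label a chunk with the class, sub-class and category of its core word."""
    _, tag = core_word(chunk_label, words)
    word_class, sub_class = CLASSIFICATION[tag]
    return (word_class, sub_class, tag)
```

For instance, a verb group ending in a word tagged VDN would be labelled ('verbs', 'doverbs', 'VDN') under this assumed table.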
- the classifications 101 for the elements generated by the classification process 100 are stored in RAM 12.
- the syntactic information 48 and word grouping data 46 are stored in the RAM 12 by the text analysis process 42.
- the syntactic information 48 comprises word tags 95, syntactic groups 96, chunk markers 99 and element classifications 101.
- the word grouping data comprises the sentence markers 86 and paragraph markers 87.
- each of the reference sentences within the corpus is divided into elements and has similar syntactic information relating to each of the elements contained within it. Furthermore, the corpus contains data indicating where a human would insert prosodic boundaries when reading each of the example sentences. The type of the boundary is also indicated.
- the prosodic structure prediction process 50 involves the computer in finding the sequence of elements in the corpus which best matches a search sequence taken from the input sentence.
- the degree of matching is found in terms of syntactic characteristics of corresponding elements, length of the elements in words and a comparison of boundaries in the reference sentence and those already predicted for the input sentence.
- the process 50 will now be described in more detail with reference to Figure 5.
- FIG. 5 shows that the process 50 begins with the calculation of measures of similarity between each element of the input sentence and each element of the corpus 52. This part of the program is presented in the form of pseudo-code below:
- e_i increments from 1 to the number of elements in the input sentence, and e_r increments from 1 to the number of elements in the corpus.
- the program controls the computer to find: a) whether the core words of the two elements are in the same class;
- a match in both cases might, for example, be given a score of 2, a score of 1 being given for a match in one case, and a score of 0 being given otherwise.
- the program controls the computer to find to what level of the hierarchical classification the corresponding words in the elements are syntactically similar.
- a match of word categories might be given a score of 5, a match of sub-classes a score of 2 and a match of classes a score of 1.
- for example, if the reference sentence has [VG is_VBZ argued_VVN VG] and the input sentence has [VG was_VBDZ beaten_VVN VG], then 'is_VBZ' only matches 'was_VBDZ' to the extent that both are classified as verbs. Therefore a score of 1 would be given on the basis of the first word.
- the second words, 'beaten_VVN' and 'argued_VVN', fall into identical word categories and hence would be given a score of 5. The two scores are then added to give a total score of 6.
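The hierarchical scoring of corresponding words can be written out as a short Python sketch. The scores of 5, 2 and 1 are those suggested above. The placement of the three tags in sub-classes is an assumption chosen only so that the worked example is reproduced, and pairing words by position (ignoring any unpaired trailing words) is likewise an assumption.

```python
# Assumed toy classification: word tag -> (class, sub-class).
TAG_CLASSES = {
    "VBZ": ("verbs", "beverbs"),
    "VBDZ": ("verbs", "past"),
    "VVN": ("verbs", "past"),
}


def word_match(tag_a, tag_b):
    """Score two word tags by the deepest level at which they agree."""
    if tag_a == tag_b:
        return 5  # same word category
    class_a, sub_a = TAG_CLASSES[tag_a]
    class_b, sub_b = TAG_CLASSES[tag_b]
    if class_a != class_b:
        return 0  # nothing in common
    return 2 if sub_a == sub_b else 1  # same sub-class, or same class only


def chunk_match(reference_tags, input_tags):
    """Add the per-word scores for words in corresponding positions."""
    return sum(word_match(r, i) for r, i in zip(reference_tags, input_tags))


# Worked example: [VG is_VBZ argued_VVN VG] against [VG was_VBDZ beaten_VVN VG]
total = chunk_match(["VBZ", "VVN"], ["VBDZ", "VVN"])  # 1 + 5 = 6
```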
- the third component of each element similarity measure is the negative magnitude of the difference in the number of words in the reference element, e_r, and the number of words in the element of the input sentence, e_i. For example, if an element of the input sentence has one word and an element of the reference sequence has three words, then the third component is -2.
- a weighted addition is then performed on the three components to yield an element similarity measure (match(e_i, e_r) in the above pseudo-code).
- the table calculation step 102 results in the generation of a table giving element similarity measures between every element in the corpus 52 and every element in the input sentence.
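A minimal Python sketch of the table calculation follows. The first two components are passed in as ready-made scores, since the text leaves their exact definition open, and the equal default weights are an assumption; the text only says that a weighted addition is performed. All function names are hypothetical.

```python
def length_component(n_reference_words, n_input_words):
    """Third component: negative magnitude of the word-count difference."""
    return -abs(n_reference_words - n_input_words)


def element_similarity(core_score, word_score, n_reference_words,
                       n_input_words, weights=(1.0, 1.0, 1.0)):
    """Weighted addition of the three components (weights are assumed)."""
    w_core, w_word, w_length = weights
    return (w_core * core_score
            + w_word * word_score
            + w_length * length_component(n_reference_words, n_input_words))


def similarity_table(input_elements, corpus_elements, match):
    """Row per input element, column per corpus element."""
    return [[match(e_in, e_ref) for e_ref in corpus_elements]
            for e_in in input_elements]
```

Computing the whole table once up front means that the later sequence comparisons only need to look values up rather than recompute them.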
- a subject element counter (m) is initialised to 1.
- the value of the counter indicates which of the elements of the input sentence is currently subject to a determination of whether it is to be followed by a boundary.
- the program controls the computer to execute an outermost loop of instructions (steps 104 to 125) repeatedly. Each iteration of the outermost loop of instructions corresponds to a consideration of a different subject element of the input sentence.
- each execution of the final instruction (step 125) in the outermost loop results in the next iteration of the outermost loop looking at the element in the input sentence which immediately follows the input sentence element considered in the previous iteration.
- Step 124 ensures that the outermost loop of instructions ends once the last element in the input sentence has been considered.
- the outermost loop of instructions begins with the setting of a best match value to zero (step 104). Also, a current reference element count (e_r) is initialised to 1 (step 106).
- the program controls the computer to repeat some or all of an intermediate loop of instructions (steps 108 to 121) as many times as there are elements in the prosodic structure corpus 52.
- Each iteration of the intermediate loop of instructions (steps 108 to 121) therefore corresponds to a particular subject element in the input sentence (determined by the current iteration of the outermost loop) and a particular reference element in the corpus 52 (determined by the current iteration of the intermediate loop).
- Steps 120 and 121 ensure that the intermediate loop of instructions (steps 108 to 121) is carried out for every element in the corpus 52 and ends once the final element in the corpus has been considered.
- the intermediate loop of instructions starts by defining (step 108) a search sequence around the subject element of the input sentence.
- srch_seq_start = max(1, m - no_of_elements_before)
- srch_seq_end = min(no_of_input_sentence_elements, m + no_of_elements_after)
- no_of_elements_before is chosen to be 10
- no_of_elements_after is chosen to be 4. It will be realised that the search sequence therefore includes the current element m, up to 10 elements before it and up to 4 elements after it.
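The bounds of the search sequence can be sketched in Python as below. Element numbers are 1-based, as in the description. The start is clamped with `max` and the end with `min`, which is what keeping the window inside the input sentence requires; the function name is hypothetical.

```python
def search_sequence_bounds(m, n_input_elements, before=10, after=4):
    """Return the (start, end) element numbers, both inclusive and 1-based."""
    start = max(1, m - before)
    end = min(n_input_elements, m + after)
    return start, end
```

For a 20-element sentence, the window around element 15 runs from element 5 to element 19, while the window around element 1 is cut short to elements 1 to 5.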
- a sequence similarity measure is reset to zero.
- a measure of the similarity between the search sequence and a sequence of reference elements is calculated.
- the reference sequence has the current reference element (i.e. that set in the previous execution of step 121) as its core element.
- the reference sequence contains this core element as well as the ten elements that precede it and the four elements that follow it (i.e. the reference sequence is of the same length as the search sequence).
- the calculation of the sequence similarity measure involves carrying out first and second innermost loops of instructions. Pseudo-code for the first innermost loop of instructions is given below:
- the subject element of the input sentence (set in step 103 or 125) is aligned with the core reference element. Once those elements are aligned, the element similarity measure between each element of the search sequence and the corresponding element in the reference sequence is found. A weighted addition of those element similarity measures is then carried out to obtain a first component of a sequence similarity measure. The measures of the degree of matching are found in the values obtained in step 102. The weight applied to each of the constituent element matching measures generally increases with proximity to the subject element of the input sentence. Those skilled in the art will be able to find suitable values for the weights by trial and error.
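The aligned, weighted addition described above might look like the following Python sketch. It uses 0-based indices into a precomputed table of element similarity measures (one row per input element, one column per corpus element). The weighting function 1/(1 + |offset|) is purely an assumption that satisfies "increases with proximity"; the text leaves the weights to trial and error.

```python
def sequence_similarity(table, m, r, before=10, after=4,
                        weight=lambda offset: 1.0 / (1 + abs(offset))):
    """First component of the sequence similarity measure.

    table[i][j] is the similarity between input element i and corpus
    element j; m is the subject element and r the core reference element.
    """
    n_input = len(table)
    n_corpus = len(table[0]) if table else 0
    total = 0.0
    for offset in range(-before, after + 1):
        i, j = m + offset, r + offset  # keep the two sequences aligned
        if 0 <= i < n_input and 0 <= j < n_corpus:
            total += weight(offset) * table[i][j]
    return total
```

Positions that fall outside either the input sentence or the corpus are simply skipped, which is one possible way of handling sequence edges.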
- the second innermost loop of instructions then supplements the sequence similarity measure by taking into account the extent to which the boundaries (if any) already predicted for the input sentence match the boundaries present in the reference sequence. Only the part of the search sequence before the subject element is considered since no boundaries have yet been predicted for the subject element or the elements which follow it. Pseudo-code for the second innermost loop of instructions is given below:
- the boundary matching measure between two elements is set to two if both the input sentence and the reference sentence have a boundary of the same type after the qth element, one if they have boundaries of different types, zero if neither has a boundary, minus one if one has a minor boundary and the other has none, and minus two if one has a strong boundary and the other has none.
- a weighted addition of the boundary matching measures is applied, those inter-element boundaries close to the current element being given a higher weight. The weights are chosen so as to penalise heavily sentences whose boundaries do not match.
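The boundary matching rule and its weighted addition can be sketched in Python as follows. Boundaries are represented here as `None`, `"minor"` or `"strong"`, which is an assumed encoding, and the 1/distance weighting is an assumption as well; the text says only that nearer boundaries receive higher weight and that mismatches are penalised heavily.

```python
VALID_BOUNDARIES = (None, "minor", "strong")


def boundary_match(input_boundary, reference_boundary):
    """Score one inter-element position according to the rule above."""
    if (input_boundary not in VALID_BOUNDARIES
            or reference_boundary not in VALID_BOUNDARIES):
        raise ValueError("boundary must be None, 'minor' or 'strong'")
    if input_boundary is None and reference_boundary is None:
        return 0
    if input_boundary is None or reference_boundary is None:
        present = input_boundary or reference_boundary
        return -2 if present == "strong" else -1
    return 2 if input_boundary == reference_boundary else 1


def boundary_similarity(input_boundaries, reference_boundaries,
                        weight=lambda distance: 1.0 / distance):
    """Weighted sum over the positions before the subject element.

    Both lists cover the same positions, nearest-to-subject last.
    """
    n = len(input_boundaries)
    return sum(weight(n - k) * boundary_match(a, b)
               for k, (a, b) in enumerate(zip(input_boundaries,
                                              reference_boundaries)))
```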
- the carrying out of the first and second innermost loop of instructions results in the generation of a sequence similarity measure for the subject element of the input sentence and the reference element of the corpus 52. If the sequence similarity measure is the highest yet found for the subject element of the input sentence, then the best match value is updated to equal that measure (step 116) and the number of the associated element is recorded (step 118). Once the final element has been compared, the computer ascertains whether the core element in the best matching sequence has a boundary after it. If it does, a boundary of a similar type is placed into the input sentence at that position (step 122).
- boundaries are predicted on the basis of the ten best matching sequences in the prosodic structure corpus. If the majority of those ten sequences feature a boundary after the current element then a boundary is placed after the corresponding element in the input sentence.
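The ten-best variant can be sketched in Python as a simple majority vote. The input format, a list of (sequence similarity, has-boundary) pairs with one pair per reference element, is an assumption, and the sketch decides only whether a boundary is placed, not its type.

```python
import heapq


def predict_boundary(scored_references, k=10):
    """Return True if most of the k best-matching sequences have a boundary.

    scored_references: iterable of (similarity, has_boundary) pairs.
    """
    best = heapq.nlargest(k, scored_references, key=lambda pair: pair[0])
    votes = sum(1 for _, has_boundary in best if has_boundary)
    return votes > len(best) / 2
```

With `k=1` the vote reduces to asking whether the single best-matching sequence has a boundary after its core element.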
- pattern matching was carried out which compared an input sentence with sequences in the corpus that included sequences bridging consecutive sentences.
- Alternative embodiments can be envisaged, where only reference sequences which lie entirely within a sentence are considered.
- a further constraint can be placed on the pattern matching by only considering reference sequences that have an identical position in the reference sentence to the position of the search sequence in the input sentence.
- Other search algorithms will occur to those skilled in the art.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/913,462 US6996529B1 (en) | 1999-03-15 | 2000-03-08 | Speech synthesis with prosodic phrase boundary information |
AU29316/00A AU2931600A (en) | 1999-03-15 | 2000-03-08 | Speech synthesis |
CA002366952A CA2366952A1 (en) | 1999-03-15 | 2000-03-08 | Speech synthesis |
EP00907852A EP1163663A2 (en) | 1999-03-15 | 2000-03-08 | Speech synthesis |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB9905904.0 | 1999-03-15 | ||
GBGB9905904.0A GB9905904D0 (en) | 1999-03-15 | 1999-03-15 | Speech synthesis |
EP99305349 | 1999-07-06 | ||
EP99305349.5 | 1999-07-06 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2000055842A2 true WO2000055842A2 (en) | 2000-09-21 |
WO2000055842A3 WO2000055842A3 (en) | 2000-12-21 |
Family
ID=26153528
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/GB2000/000854 WO2000055842A2 (en) | 1999-03-15 | 2000-03-08 | Speech synthesis |
Country Status (5)
Country | Link |
---|---|
US (1) | US6996529B1 (en) |
EP (1) | EP1163663A2 (en) |
AU (1) | AU2931600A (en) |
CA (1) | CA2366952A1 (en) |
WO (1) | WO2000055842A2 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1668630A1 (en) * | 2003-09-29 | 2006-06-14 | Motorola, Inc. | Improvements to an utterance waveform corpus |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7725307B2 (en) * | 1999-11-12 | 2010-05-25 | Phoenix Solutions, Inc. | Query engine for processing voice based queries including semantic decoding |
US7392185B2 (en) * | 1999-11-12 | 2008-06-24 | Phoenix Solutions, Inc. | Speech based learning/training system using semantic decoding |
US7050977B1 (en) | 1999-11-12 | 2006-05-23 | Phoenix Solutions, Inc. | Speech-enabled server for internet website and method |
US9076448B2 (en) * | 1999-11-12 | 2015-07-07 | Nuance Communications, Inc. | Distributed real time speech recognition system |
KR100463655B1 (en) * | 2002-11-15 | 2004-12-29 | 삼성전자주식회사 | Text-to-speech conversion apparatus and method having function of offering additional information |
US7328157B1 (en) * | 2003-01-24 | 2008-02-05 | Microsoft Corporation | Domain adaptation for TTS systems |
JP4407305B2 (en) * | 2003-02-17 | 2010-02-03 | 株式会社ケンウッド | Pitch waveform signal dividing device, speech signal compression device, speech synthesis device, pitch waveform signal division method, speech signal compression method, speech synthesis method, recording medium, and program |
US7937263B2 (en) * | 2004-12-01 | 2011-05-03 | Dictaphone Corporation | System and method for tokenization of text using classifier models |
CN101202041B (en) * | 2006-12-13 | 2011-01-05 | 富士通株式会社 | Method and device for making words using Chinese rhythm words |
US8583438B2 (en) * | 2007-09-20 | 2013-11-12 | Microsoft Corporation | Unnatural prosody detection in speech synthesis |
US10957310B1 (en) | 2012-07-23 | 2021-03-23 | Soundhound, Inc. | Integrated programming framework for speech and text understanding with meaning parsing |
US11295730B1 (en) | 2014-02-27 | 2022-04-05 | Soundhound, Inc. | Using phonetic variants in a local context to improve natural language understanding |
RU2639684C2 (en) * | 2014-08-29 | 2017-12-21 | Общество С Ограниченной Ответственностью "Яндекс" | Text processing method (versions) and constant machine-readable medium (versions) |
US10095686B2 (en) * | 2015-04-06 | 2018-10-09 | Adobe Systems Incorporated | Trending topic extraction from social media |
US11210470B2 (en) * | 2019-03-28 | 2021-12-28 | Adobe Inc. | Automatic text segmentation based on relevant context |
CN112071300B (en) * | 2020-11-12 | 2021-04-06 | 深圳追一科技有限公司 | Voice conversation method, device, computer equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5463713A (en) * | 1991-05-07 | 1995-10-31 | Kabushiki Kaisha Meidensha | Synthesis of speech from text |
EP0821344A2 (en) * | 1996-07-25 | 1998-01-28 | Matsushita Electric Industrial Co., Ltd. | Method and apparatus for synthesizing speech |
EP0833304A2 (en) * | 1996-09-30 | 1998-04-01 | Microsoft Corporation | Prosodic databases holding fundamental frequency templates for use in speech synthesis |
US5832435A (en) * | 1993-03-19 | 1998-11-03 | Nynex Science & Technology Inc. | Methods for controlling the generation of speech from text representing one or more names |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0680653B1 (en) * | 1993-10-15 | 2001-06-20 | AT&T Corp. | A method for training a tts system, the resulting apparatus, and method of use thereof |
US5913193A (en) * | 1996-04-30 | 1999-06-15 | Microsoft Corporation | Method and system of runtime acoustic unit selection for speech synthesis |
US5950162A (en) * | 1996-10-30 | 1999-09-07 | Motorola, Inc. | Method, device and system for generating segment durations in a text-to-speech system |
JP3587048B2 (en) * | 1998-03-02 | 2004-11-10 | 株式会社日立製作所 | Prosody control method and speech synthesizer |
DE69940747D1 (en) * | 1998-11-13 | 2009-05-28 | Lernout & Hauspie Speechprod | Speech synthesis by linking speech waveforms |
GB2376394B (en) * | 2001-06-04 | 2005-10-26 | Hewlett Packard Co | Speech synthesis apparatus and selection method |
-
2000
- 2000-03-08 US US09/913,462 patent/US6996529B1/en not_active Expired - Lifetime
- 2000-03-08 CA CA002366952A patent/CA2366952A1/en not_active Abandoned
- 2000-03-08 WO PCT/GB2000/000854 patent/WO2000055842A2/en active Application Filing
- 2000-03-08 AU AU29316/00A patent/AU2931600A/en not_active Abandoned
- 2000-03-08 EP EP00907852A patent/EP1163663A2/en not_active Withdrawn
Non-Patent Citations (3)
Title |
---|
KIM ET AL.: "Prediction of prosodic phrase boundaries considering variable speaking rate" INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING (ICSLP '96), 3 October 1996 (1996-10-03) - 6 October 1996 (1996-10-06), pages 1505-1508 vol.3, XP002124437 PHILADELPHIA, PA, US ISBN: 0-7803-3555-4 * |
WANG ET AL.: "Predicting intonational boundaries automatically from text: the ATIS domain" PROCEEDINGS OF THE DARPA SPEECH AND NATURAL LANGUAGE WORKSHOP, February 1991 (1991-02), pages 378-383, XP000856817 cited in the application * |
ZHU ET AL.: "Learning mappings between Chinese isolated syllables and syllables in phrase with back propagation neural nets" PROCEEDINGS OF THE 1998 ARTIFICIAL NEURAL NETWORKS IN ENGINEERING CONFERENCE, vol. 8, 1 - 4 November 1998, pages 723-727, XP000856953 ST. LOUIS, MO, US * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1668630A1 (en) * | 2003-09-29 | 2006-06-14 | Motorola, Inc. | Improvements to an utterance waveform corpus |
EP1668630A4 (en) * | 2003-09-29 | 2008-04-23 | Motorola Inc | Improvements to an utterance waveform corpus |
Also Published As
Publication number | Publication date |
---|---|
EP1163663A2 (en) | 2001-12-19 |
AU2931600A (en) | 2000-10-04 |
CA2366952A1 (en) | 2000-09-21 |
US6996529B1 (en) | 2006-02-07 |
WO2000055842A3 (en) | 2000-12-21 |
Similar Documents
Publication | Title |
---|---|
Hirschberg | Pitch accent in context predicting intonational prominence from text |
US8015011B2 (en) | Generating objectively evaluated sufficiently natural synthetic speech from text by using selective paraphrases |
US6996529B1 (en) | Speech synthesis with prosodic phrase boundary information |
Allen | Synthesis of speech from unrestricted text |
Sproat et al. | A corpus-based synthesizer. |
Qian et al. | Automatic prosody prediction and detection with conditional random field (crf) models |
CA2614840C (en) | System, program, and control method for speech synthesis |
US7136802B2 (en) | Method and apparatus for detecting prosodic phrase break in a text to speech (TTS) system |
US20120191457A1 (en) | Methods and apparatus for predicting prosody in speech synthesis |
RU2421827C2 (en) | Speech synthesis method |
US7069216B2 (en) | Corpus-based prosody translation system |
US20100250254A1 (en) | Speech synthesizing device, computer program product, and method |
Chou et al. | Automatic generation of prosodic structure for high quality Mandarin speech synthesis |
Stöber et al. | Speech synthesis using multilevel selection and concatenation of units from large speech corpora |
Carlson et al. | Linguistic processing in the KTH multi-lingual text-to-speech system |
JP4636673B2 (en) | Speech synthesis apparatus and speech synthesis method |
JP4004376B2 (en) | Speech synthesizer, speech synthesis program |
Dong et al. | Pitch contour model for Chinese text-to-speech using CART and statistical model |
Apel et al. | Have a break! Modelling pauses in German speech |
Pan | Prosody modeling in concept-to-speech generation |
Rangarajan et al. | Acoustic-syntactic maximum entropy model for automatic prosody labeling |
JP2004138661A (en) | Voice piece database creating method, voice synthesis method, voice piece database creator, voice synthesizer, voice database creating program, and voice synthesis program |
Chou et al. | Selection of waveform units for corpus-based Mandarin speech synthesis based on decision trees and prosodic modification costs |
Lee et al. | Automatic corpus-based tone and break-index prediction using k-tobi representation |
Ferri et al. | A complete linguistic analysis for an Italian text-to-speech system |
Legal Events
Code | Title | Description |
---|---|---|
AK | Designated states | Kind code of ref document: A2; Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
AL | Designated countries for regional patents | Kind code of ref document: A2; Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
AK | Designated states | Kind code of ref document: A3; Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW |
AL | Designated countries for regional patents | Kind code of ref document: A3; Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101) | |
WWE | WIPO information: entry into national phase | Ref document number: 09913462; Country of ref document: US |
WWE | WIPO information: entry into national phase | Ref document number: 2000907852; Country of ref document: EP |
ENP | Entry into the national phase | Ref document number: 2366952; Country of ref document: CA; Ref country code: CA; Ref document number: 2366952; Kind code of ref document: A; Format of ref document f/p: F |
WWP | WIPO information: published in national office | Ref document number: 2000907852; Country of ref document: EP |
REG | Reference to national code | Ref country code: DE; Ref legal event code: 8642 |