This invention relates to the field of speech recognition and speech synthesis. This invention is particularly applicable to the generation of speech recognition dictionaries including phrasal transcriptions for use in speech recognition systems as may be used in a telephone directory assistance system, voice activated dialing (VAD) system, personal voice dialing system and other speech recognition enabled services. This invention is also applicable to text-to-speech synthesizers for generating suitable pronunciations of phrases.
Speech recognition enabled services are increasingly popular. Such services may include stock quotes, directory assistance, reservations and many others.
In typical speech recognition systems, the user enters his request using isolated word, connected word or continuous speech via a microphone or telephone set. If valid speech is detected, the speech recognition layer of the system is invoked in an attempt to recognize the unknown utterance. Typically, entries in a speech recognition dictionary, usually including transcriptions associated to labels, are scored in order to determine the most likely match to the utterance. The recognition of speech involves aligning an input audio signal with the most appropriate target speech model. The target speech model for a particular vocabulary item is built by concatenating the speech models of the transcription or transcriptions associated to that particular vocabulary item.
Of particular interest here are speech recognizers capable of recognizing complete phrases. Speech recognition dictionaries used in such speech recognition systems often comprise transcriptions for complete phrases, herein designated as phrasal transcriptions. A phrasal transcription is a representation of the pronunciation of the associated complete phrase when uttered by a human. Each phrasal transcription is associated to a label indicative of the orthographic representation of the phrase, herein designated as the orthographic phrase. Typically, multiple phrasal transcriptions are provided for each orthographic phrase thereby allowing for different pronunciations of the phrase. A limit on the total number of phrasal transcriptions in a speech recognition dictionary is imposed due to the inherent computational limits of the speech recognizer as well as due to the memory requirements for storing the phrasal transcriptions. Typically, the limit on the total number of phrasal transcriptions is put into practice by limiting the maximum number of phrasal transcriptions stored for each phrase.
A number of methods have been explored for generating a set of phrasal transcriptions to be included in a speech recognition dictionary. Common methods make use of outer-product procedures to generate the set of phrasal transcriptions. In a typical interaction a group of word transcriptions is generated for each vocabulary item in the orthographic phrase. Following this, permutations of the word transcriptions are used to generate the phrasal transcriptions. A commonly used permuting rule, herein referred to as the F(i) permuting rule, can be mathematically defined as follows:

$$F(i) = \begin{cases} 1 + \prod_{x=1}^{i-1} N_x & \text{for } i > 1 \\ 1 & \text{for } i = 1 \end{cases}$$
where $N_i$ is the number of word transcriptions in the group of word transcriptions associated with the ith vocabulary item of the orthographic phrase. This permuting rule permutes the ith vocabulary item every F(i) phrasal transcription. A specific example will better illustrate this permuting rule. Consider the following orthographic phrase “Mary's little lamb” comprising three vocabulary items, namely “Mary's”, “little” and “lamb”. The vocabulary items are transcribed using a standard word transcription tool and yield a group of word transcriptions for each vocabulary item.
Mary's (i=1) --> /mEriz/, /mAriz/, /m*riz/
little (i=2) --> /lIt*l/, /lId*l/, /lIt*/, /lId*/
lamb (i=3) --> /lamb/, /lam/
Each word transcription has a word transcription probability associated to it. In this specific example, the word transcription probabilities are as follows:
p(/mEriz/|“Mary's”)=0.7
p(/mAriz/|“Mary's”)=0.2
p(/m*riz/|“Mary's”)=0.1
p(/lIt*l/|“little”)=0.46
p(/lId*l/|“little”)=0.44
p(/lIt*/|“little”)=0.06
p(/lId*/|“little”)=0.04
p(/lamb/|“lamb”)=0.6
p(/lam/|“lamb”)=0.4
The word transcription probabilities are used to order and truncate the list of word transcriptions. Typically, the word transcriptions are sorted by likelihood, meaning that the first word transcription has the highest transcription probability. Assuming a limit of two word transcriptions per vocabulary item, the two word transcriptions having the highest probabilities are kept and the remaining word transcriptions are discarded. In this specific example, this results in the following word transcription groups for the vocabulary items in the orthographic phrase:
Mary's --> /mEriz/, /mAriz/
little --> /lIt*l/, /lId*l/
lamb --> /lamb/, /lam/
In the above word transcription groups, the 3rd word transcription for “Mary's” and the 3rd and 4th word transcriptions for “little” have been deleted from the original list. The word transcriptions are then permuted according to the F(i) permuting rule and concatenated, leading to the following phrasal transcriptions:
mEriz lIt*l lamb
mAriz lIt*l lamb
mEriz lId*l lamb
mAriz lId*l lamb
mEriz lIt*l lam
mAriz lIt*l lam
mEriz lId*l lam
mAriz lId*l lam
For this specific example, the F(i) permuting rule generated eight permutations of the word transcriptions, with variations of the first word transcription occurring between each phrasal transcription, with variations of the second word transcription occurring every second phrasal transcription and variations of the third word transcription occurring every fourth phrasal transcription. Assuming a phrasal transcription limit of 4 transcriptions per phrase, we then have:
mEriz lIt*l lamb
mAriz lIt*l lamb
mEriz lId*l lamb
mAriz lId*l lamb
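
The procedure walked through above can be reproduced with a minimal Python sketch. The function names (`truncate`, `f`, `outer_product`, `prior_art_set`) and the data layout are illustrative assumptions, not part of the described method; the transcriptions and probabilities are copied from the example.

```python
from itertools import product
from math import prod

# Word transcription groups with their probabilities, copied from the example.
GROUPS = [
    [("mEriz", 0.7), ("mAriz", 0.2), ("m*riz", 0.1)],                    # Mary's
    [("lIt*l", 0.46), ("lId*l", 0.44), ("lIt*", 0.06), ("lId*", 0.04)],  # little
    [("lamb", 0.6), ("lam", 0.4)],                                       # lamb
]


def truncate(group, limit):
    """Keep the `limit` most probable word transcriptions, most probable first."""
    ranked = sorted(group, key=lambda item: item[1], reverse=True)
    return [transcription for transcription, _ in ranked[:limit]]


def f(i, sizes):
    """F(i) as defined above; i is 1-based and sizes[x - 1] holds N_x."""
    return 1 if i == 1 else 1 + prod(sizes[: i - 1])


def outer_product(groups):
    """List every phrasal transcription, with the first word varying fastest."""
    # itertools.product varies its last argument fastest, so the groups are
    # reversed on the way in and each combination is reversed on the way out.
    return [" ".join(reversed(combo)) for combo in product(*reversed(groups))]


def prior_art_set(groups, word_limit, phrase_limit):
    """Truncate each word group, permute, then keep the first `phrase_limit`."""
    kept = [truncate(group, word_limit) for group in groups]
    return outer_product(kept)[:phrase_limit]


if __name__ == "__main__":
    for phrase in prior_art_set(GROUPS, word_limit=2, phrase_limit=4):
        print(phrase)
```

With a word limit of 2 and a phrase limit of 4, every retained phrasal transcription ends in "lamb": the alternative /lam/ never appears because the third vocabulary item first changes only at the fifth phrasal transcription, which lies beyond the cut-off.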
A deficiency of the above-described method is that it emphasizes variations from left to right. More specifically, the vocabulary item in the first position in the phrase, in the set of selected phrasal transcriptions, has its word transcriptions permuted several times while vocabulary items appearing later in the phrase are varied less frequently or not at all, as the above example illustrates. Consequently, variations in pronunciation for vocabulary items appearing later in a phrase are modeled less effectively than variations for vocabulary items appearing closer to the beginning of the phrase.
Another deficiency of the above noted method is that it does not reflect any probability information associated to the word transcriptions other than to truncate the groups of word transcriptions. Additionally, the above-described method does not provide any mechanism for including language probability information in the selection of the set of phrasal transcriptions.
Thus, there exists a need in the industry to refine the process of selecting a set of transcriptions so as to obtain an improved set of phrasal transcriptions capable of being used by a speech recognition dictionary or by a text-to-speech synthesizer.
The present invention is directed to the generation of phrasal transcriptions.
In accordance with a broad aspect, the invention provides a method for generating a set of phrasal transcriptions suitable for use in a speech recognition dictionary. The method comprises providing an orthographic phrase comprising a set of vocabulary items. The method further comprises generating a group of word transcriptions for each vocabulary item in the orthographic phrase, each word transcription in the group of word transcriptions for a given vocabulary item being associated to an ordering data element. The ordering data elements establish a relationship between the word transcriptions in the group of word transcriptions. The method further comprises permuting the word transcriptions to generate a plurality of phrasal transcriptions, each word transcription of a phrasal transcription in the plurality of phrasal transcriptions being selected from the group of word transcriptions associated to the corresponding vocabulary item. The method further comprises computing a score data element for each phrasal transcription in the plurality of phrasal transcriptions on a basis of ordering data elements associated to the word transcriptions in a phrasal transcription. The set of phrasal transcriptions is then selected from the plurality of phrasal transcriptions at least in part on a basis of the score data elements. The set of phrasal transcriptions is then stored in a format suitable for use by a speech recognition dictionary.
In accordance with another broad aspect, the invention further provides an apparatus for implementing the above-described method.
In accordance with another broad aspect, the invention provides a computer readable medium containing a program element suitable for execution by a computing apparatus for implementing the above-described method.
In accordance with another broad aspect, the invention further provides a computer readable medium containing a speech recognition dictionary comprising phrasal transcriptions generated by the above-described method.
An advantage of the present invention is that variations in word transcriptions do not depend on the position of the word but on the score data element associated to the phrasal transcriptions, the score data elements being derived on a basis of ordering data elements.
In a specific example of implementation, the ordering data elements are word transcription probabilities. Advantageously, the use of word transcription probabilities in computing the score data elements allows reflecting probability information associated to the word transcriptions in the selection of the set of phrasal transcriptions. Consequently, variations in pronunciations for vocabulary items are not dependent on the position of the vocabulary item in the phrase.
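
A minimal Python sketch of score-based selection follows, using the "Mary's little lamb" probabilities from the background example. The score here is taken to be the product of the word transcription probabilities; that combination rule is an assumption made for illustration, since the text states only that the score is computed on a basis of the ordering data elements. The function name `select_by_score` is likewise hypothetical.

```python
from itertools import product
from math import prod

# Word transcription groups with their probabilities (background example).
GROUPS = [
    [("mEriz", 0.7), ("mAriz", 0.2), ("m*riz", 0.1)],                    # Mary's
    [("lIt*l", 0.46), ("lId*l", 0.44), ("lIt*", 0.06), ("lId*", 0.04)],  # little
    [("lamb", 0.6), ("lam", 0.4)],                                       # lamb
]


def select_by_score(groups, phrase_limit):
    """Score every permutation and keep the `phrase_limit` best ones.

    Assumption: the score of a phrasal transcription is the product of the
    probabilities of its word transcriptions.
    """
    scored = []
    for combo in product(*groups):
        phrase = " ".join(transcription for transcription, _ in combo)
        score = prod(probability for _, probability in combo)
        scored.append((score, phrase))
    # Highest score first; the position of a word in the phrase plays no role.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:phrase_limit]


if __name__ == "__main__":
    for score, phrase in select_by_score(GROUPS, phrase_limit=4):
        print(f"{score:.4f}  {phrase}")
```

Under this assumed score, the four selected phrasal transcriptions vary "little" and "lamb" while keeping /mEriz/, because the alternatives for "Mary's" are much less probable than those for the later words. Which words vary is determined by the probabilities, not by word position.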
Preferably but not essentially, each word transcription is associated to a language probability data element, the score data element being further derived on a basis of the language probability data element. Alternatively, each phrasal transcription is associated to a language probability data element, the score data element being further derived on a basis of the language probability data element.
Advantageously, the use of language probability in the computation of the score data element provides a mechanism for including language probability information in the selection of the set of phrasal transcriptions.
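
One way a language probability could enter the score is sketched below in Python. Everything specific here is an assumption: the text does not state how language probabilities are obtained or combined with word transcription probabilities, so the sketch simply multiplies them, and the values in `HYPOTHETICAL_LANGUAGE_PROB` are invented for illustration only.

```python
from math import prod


def phrase_score(words, language_prob):
    """Score one phrasal transcription.

    words: list of (word_transcription, word_transcription_probability).
    language_prob: dict mapping a word transcription to its language
    probability; transcriptions missing from the dict default to 1.0.
    Assumption: the two probabilities are combined by multiplication.
    """
    return prod(p * language_prob.get(t, 1.0) for t, p in words)


def rank(candidates, language_prob, limit):
    """Return the `limit` highest-scoring candidates, best first."""
    ordered = sorted(
        candidates,
        key=lambda words: phrase_score(words, language_prob),
        reverse=True,
    )
    return ordered[:limit]


# Two candidate phrasal transcriptions for a two-word phrase.
CANDIDATE_A = [("mEriz", 0.7), ("lamb", 0.6)]
CANDIDATE_B = [("mAriz", 0.2), ("lamb", 0.6)]

# Invented values, for illustration only.
HYPOTHETICAL_LANGUAGE_PROB = {"mEriz": 0.2, "mAriz": 0.9}

if __name__ == "__main__":
    print(rank([CANDIDATE_A, CANDIDATE_B], {}, 1))
    print(rank([CANDIDATE_A, CANDIDATE_B], HYPOTHETICAL_LANGUAGE_PROB, 1))
```

Without language probabilities, the candidate containing /mEriz/ ranks first; with the invented values it drops behind the candidate containing /mAriz/, showing how language information can change which phrasal transcriptions are selected.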
In accordance with another broad aspect, the invention provides a method for generating a set of phrasal transcriptions for use in a speech recognition dictionary. The method comprises providing an orthographic phrase comprising a set of vocabulary items. The method further comprises generating for each vocabulary item in the set of vocabulary items a group of word transcriptions. A group of word transcriptions comprises Ni word transcriptions where i is the position of the vocabulary item in the orthographic phrase to which the group of word transcriptions is associated. The method further comprises permuting the word transcriptions to generate the set of phrasal transcriptions, each word transcription of a phrasal transcription of the set of phrasal transcriptions being selected from the group of word transcriptions associated to the corresponding vocabulary item. Permuting the word transcriptions is characterized by yielding a higher likelihood of variability between the word transcriptions associated to a common vocabulary item among the set of phrasal transcriptions than a permuting rule F(i) where i is an integer value indicative of the position of the vocabulary item in the orthographic phrase. The set of phrasal transcriptions is then stored in a format suitable for use by a speech recognition dictionary.
In accordance with another broad aspect, the invention provides an apparatus for implementing the above-described method.
In accordance with another broad aspect, the invention provides a computer readable medium containing a program element suitable for execution by a computing apparatus for implementing the above-described method.
In accordance with another broad aspect, the invention provides a computer readable medium containing a speech recognition dictionary. The speech recognition dictionary comprises a set of phrasal transcriptions associated to an orthographic phrase, the phrasal transcriptions being comprised of word transcriptions associated to respective vocabulary items in the orthographic phrase. The set of phrasal transcriptions is characterized in having higher variability between the word transcriptions associated to a common vocabulary item among the set of phrasal transcriptions than a permuting rule F(i) where i is an integer value indicative of the position of the vocabulary item in the orthographic phrase.
For the purpose of this specification the expression “word transcription” is used to designate the acoustic representation of a vocabulary item as a sequence of sub-word units representative of a pronunciation of the vocabulary item. A number of acoustic sub-word units can be used in a transcription, such as phonemes, allophones, triphones, syllables and dyads (demi-syllables). Commonly, the phoneme is used as the sub-word unit and the representation is designated as a “phonemic word transcription”.
For the purpose of this specification the expression “phrasal transcription” is used to designate the acoustic representation of a phrase as a sequence of word transcriptions. A phrasal transcription is representative of a pronunciation of the associated phrase.
For the purpose of this specification the expression “orthographic phrase” is used to designate the representation of a phrase in the form of symbols from a language alphabet. An orthographic phrase can have many pronunciations, each pronunciation being associated to a respective phrasal transcription.
Other aspects and features of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying figures.