Speech recognition systems commonly recognize utterances by comparing the sequences of sounds associated with an utterance against acoustic models of different words. In many such systems, an acoustic word model is represented by a sequence of acoustic phoneme models corresponding to the phonetic spelling of the word.
This is illustrated in FIG. 1 in which speech sounds generated by user 102 are converted into an analog electrical signal 104. The analog electrical representation is converted by analog-to-digital and DSP circuitry 106 into a sequence 108 of acoustic parameter frames 110. Each parameter frame represents the value of each of a set of acoustic parameters of the utterance during a given time period, such as a fiftieth or a hundredth of a second. The parameters can include spectral or cepstral parameters of the frame's associated sound or parameters based on derivatives of such parameters.
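The framing step described above can be sketched as follows. This is a minimal illustration only, not the circuitry 106 of FIG. 1: it assumes non-overlapping frames at one hundred frames per second and uses log energy and its frame-to-frame difference as stand-ins for the spectral, cepstral, and derivative-based parameters mentioned above. The function names `frame_signal`, `log_energy`, and `parameter_frames` are hypothetical.

```python
import math


def frame_signal(samples, sample_rate, frames_per_second=100):
    """Split digitized samples into consecutive, non-overlapping frames.

    Assumption: fixed-length frames with no overlap; any trailing samples
    that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate // frames_per_second
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def log_energy(frame):
    """Log of the mean squared amplitude, floored to avoid log(0)."""
    energy = sum(s * s for s in frame) / len(frame)
    return math.log(energy + 1e-10)


def parameter_frames(samples, sample_rate, frames_per_second=100):
    """Return one small parameter set per frame.

    Each frame carries a static parameter (log energy) and a
    derivative-based parameter (its change from the previous frame).
    """
    energies = [log_energy(f)
                for f in frame_signal(samples, sample_rate, frames_per_second)]
    result = []
    for i, e in enumerate(energies):
        delta = e - energies[i - 1] if i > 0 else 0.0
        result.append({"log_energy": e, "delta_log_energy": delta})
    return result
```

For one second of audio sampled at 8000 Hz, `parameter_frames(samples, 8000)` yields one hundred parameter frames of eighty samples each.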
This representation of the user's utterance is compared against each of a plurality of acoustic word models such as the acoustic word models 112 and 114, corresponding to the names “Fred” and “Brooks”, respectively, in FIG. 1. Each such word model is composed of a sequence of acoustic phoneme models 116 corresponding to the sequence of individual phonemes 118 contained within the phonetic spelling 120 associated with each such word model.
In the example of FIG. 1, the acoustic phoneme models 116 are triphone models, each of which represents its phoneme as a sequence of three acoustic parameter models that represent the sequence of sounds associated with the phoneme when it occurs in the context of a given preceding and given following phoneme.
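The way a phonetic spelling expands into a sequence of triphone models can be sketched as follows. This is an illustrative sketch under stated assumptions: the phoneme symbols, the `sil` boundary symbol, the `left-center+right` label format, and the function names `triphone_sequence` and `word_model_states` are all hypothetical and are not taken from FIG. 1.

```python
def triphone_sequence(phonemes, boundary="sil"):
    """Pair each phoneme with its preceding and following phoneme.

    Assumption: word edges are padded with a boundary symbol so the first
    and last phonemes also have a left and a right context.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


def word_model_states(phonemes, states_per_phoneme=3):
    """Expand a phonetic spelling into a flat list of state labels.

    Each triphone contributes three labels, standing in for the three
    acoustic parameter models that represent the phoneme in context.
    """
    states = []
    for left, center, right in triphone_sequence(phonemes):
        for k in range(states_per_phoneme):
            states.append(f"{left}-{center}+{right}[{k}]")
    return states
```

Under these assumptions, a four-phoneme spelling such as `["f", "r", "eh", "d"]` for “Fred” expands into four triphones and twelve state labels, beginning with `sil-f+r[0]`.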
We use the word “phoneme” to represent a class of speech sounds, each represented by a symbol, where each of multiple letters of the text alphabet corresponds to different phonemes in different words. This definition includes the sets of phonemes found in the phonetic word spellings of common dictionaries, but is not limited to such phoneme sets. This is because different linguists use different sets of phonemes to classify speech sounds; because speech recognition systems with access to different levels of computational and storage resources often use phoneme sets of different sizes; and because the classification of speech sounds most useful for a given speech recognition system might not be the one most useful for humans trying to understand how to pronounce words.
In many speech recognition systems the phonetic spellings for all, or most, of the words the system can recognize have been provided as a fixed part of the system. In most current systems such pre-stored phonetic spellings have been obtained from a dictionary or other relatively reliable sources. However, it is often desirable for a speech recognition system to be able to recognize words for which there is no pre-stored spelling.
For example, one context in which it is desirable to enable a user to add words to the recognition vocabulary is in cell phones that enable a user to voice dial, that is, to dial a person by speaking his or her name. Because there are a very large number of possible names for people (there are roughly two million different names in US phonebooks), and because most cell phone speech recognition systems have small vocabularies to enable them to fit into the relatively small memories of cell phones, it is currently impractical to include the phonetic spellings of all names in most voice-dial cell phones.
The prior art has traditionally dealt with the problem of enabling a speech recognition system to recognize words that have been entered into it by using a pronunciation guesser. This is normally a computer program that models the rules for pronouncing names from their text spellings.
Algorithms used for pronunciation guessing range from the very sophisticated to the very simple. For example, relatively sophisticated pronunciation guessing algorithms can include learning techniques such as hidden Markov modeling or decision tree classifiers to develop statistical models of which phonemes or sequences of phonemes tend to be associated with which letters and sequences of letters.
In this application, when we refer to a pronunciation guesser or a guessed pronunciation, we intend to cover all such algorithms.
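At the very simple end of that range, a pronunciation guesser can be sketched as a table of letter-to-phoneme rules applied greedily from left to right. The sketch below is illustrative only: the rule table, the phoneme symbols, and the names `DIGRAPH_RULES`, `LETTER_RULES`, and `guess_pronunciation` are hypothetical and do not represent any particular prior art guesser.

```python
# Hypothetical rules; a real guesser would use far more context.
DIGRAPH_RULES = {
    "ph": ["f"], "ch": ["ch"], "sh": ["sh"], "th": ["th"], "oo": ["uw"],
}
LETTER_RULES = {
    "a": ["ae"], "b": ["b"], "c": ["k"], "d": ["d"], "e": ["eh"],
    "f": ["f"], "g": ["g"], "h": ["hh"], "i": ["ih"], "j": ["jh"],
    "k": ["k"], "l": ["l"], "m": ["m"], "n": ["n"], "o": ["aa"],
    "p": ["p"], "q": ["k"], "r": ["r"], "s": ["s"], "t": ["t"],
    "u": ["ah"], "v": ["v"], "w": ["w"], "x": ["k", "s"], "y": ["y"],
    "z": ["z"],
}
RULES = {**DIGRAPH_RULES, **LETTER_RULES}
MAX_RULE_LEN = max(len(key) for key in RULES)


def guess_pronunciation(spelling):
    """Guess a phonetic spelling by greedy longest-match rule lookup."""
    text = spelling.lower()
    phonemes = []
    i = 0
    while i < len(text):
        # Try the longest letter sequence first, then shorter ones.
        for n in range(MAX_RULE_LEN, 0, -1):
            chunk = text[i:i + n]
            if chunk in RULES:
                phonemes.extend(RULES[chunk])
                i += len(chunk)
                break
        else:
            i += 1  # No rule matches (e.g. punctuation): skip the character.
    return phonemes
```

With this table, `guess_pronunciation("Fred")` gives `f r eh d`, while `guess_pronunciation("Brooks")` gives `b r uw k s`, whose vowel differs from the one most speakers use in that name. A guesser this simple therefore produces some phonetic spellings that do not match how the word is actually spoken.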
Because of the vagaries of language, some of the pronunciations predicted by a pronunciation guessing algorithm will be incorrect. The association of an incorrect phonetic spelling with a word normally will reduce the chance that such word will be correctly recognized. This is because the acoustic model of the word, being based on an incorrect phonetic spelling, corresponds to a sequence of sounds different from the pronunciation users are likely to actually say when seeking to have the word recognized.
The guessing of the pronunciation of people's names tends to be particularly difficult. This is in part because there are so many different names. As stated above, there are approximately two million names in US phonebooks. It is also because the pronunciation of names tends to be more irregular than the pronunciation of average words. Incorrect pronunciations of names exist because language styles shift and names change pronunciation over time; different dialects can have different pronunciations for the same text representation of a name; people with accents will not offer the same pronunciations as people with native fluency; foreign names may be pronounced inconsistently because native speakers may not know how to pronounce them; and the same foreign name is often imported into English by different people using different rules for converting from their native language.
It has been standard practice to train acoustic phoneme models used in name recognition based on the phonetic spellings of a large number of words and/or names, with either a single or multiple pronunciations for each word. Some such systems train acoustic models using both correct and known commonly mispronounced utterances of words.
A known common mispronunciation of a given word can be viewed, for purposes of speech recognition, as a correct pronunciation, since it is a pronunciation that is commonly used by people to represent the given word. Thus, in this application and the claims that follow, we consider a known common mispronunciation of a word or name to be a correct pronunciation, and when we refer to incorrect pronunciations or phonetic spellings of words we mean to exclude known common mispronunciations of words.
It is possible that some recognition systems in the past may have trained acoustic models with phonetic spellings generated by pronunciation guessers in situations in which such a pronunciation guesser could achieve a low enough error rate that the effect of phonetic misspellings on the acoustic models trained would be minimal. Such a situation could have occurred in the training of acoustic models for US English words if the pronunciation guesser used was unusually accurate. It might also have occurred if the acoustic models being trained were for words of a language in which the letter-to-phoneme rules are highly regular, such as Spanish, in which a relatively simple pronunciation guesser would be able to achieve a surprisingly high degree of accuracy.