Speech recognition systems usually rely on a fixed lexicon of pronunciations written by a linguist. As is known, the pronunciation of a word is represented in the lexicon of a speech recognition system as a sequence of phones called a “phonetic baseform.” However, many applications require new words to be added to the vocabulary, or new pronunciations of in-vocabulary words to be added to the lexicon (in-vocabulary words being words currently in the vocabulary of the speech recognition system, as opposed to out-of-vocabulary words, which are not), hence the need for techniques which can automatically derive phonetic baseforms. This need arises, for example: (i) in dictation systems that allow personalized vocabularies; (ii) in name dialer applications, where the user enrolls the names he or she wants to dial; and (iii) in any application where actual pronunciations differ from canonical pronunciations (as with non-native speakers), so that the robustness of linguist-written pronunciations needs to be improved.
In situations where the speech recognition engine is embedded in a small device, there may not be any input interface, such as a keyboard, to allow the user to enter the spelling of the words he or she wants to add to a personalized vocabulary. Even if such an interface were available, the spellings may be of little help, since these applications typically involve words whose pronunciation is highly unpredictable, such as proper names. In this context, it is difficult to use a priori knowledge, such as letter-to-sound rules, in a reliable way. Consequently, the user is asked to utter once or twice the words he or she wants to add to the personalized vocabulary, and phonetic baseforms for these words are derived from the acoustic evidence provided by the user's utterances. These approaches usually rely on the combined use of: (i) an existing set of acoustic models of subphone units (a subphone unit being a fraction of a phone); and (ii) a model of transition between these subphone units.
The way to optimally combine these models is an open issue, as it is not known in advance which of the models can most reliably describe the acoustic evidence observed for each new word to be enrolled. For example, when the enrolled words are proper names, the reliability of the model of transition between the subphone units is questionable, since proper names do not follow strict phonotactic rules. On the other hand, for common words pronounced in a noisy environment, the model of transition between the subphone units may turn out to be more reliable than the acoustic models. Current implementations of automatic baseform generation do not take into consideration the relative degree of confidence that should be put into either component.
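By way of illustration only, the effect of the relative confidence placed in each component can be sketched with a small Viterbi search in which the transition score is scaled by a weight. The log-linear weighting, the function names `viterbi` and `collapse`, the unit labels "A" and "B", and all probability values below are assumptions made for this sketch; none of them is taken from the systems described above.

```python
import math

def viterbi(acoustic_loglik, trans_logprob, weight=1.0):
    """Return the best unit sequence, one unit per frame.

    acoustic_loglik: list of dicts, unit -> acoustic log-likelihood per frame.
    trans_logprob:   dict, (previous unit, next unit) -> transition log-probability.
    weight:          hypothetical scale on the transition score
                     (0.0 ignores the transition model entirely).
    """
    units = list(acoustic_loglik[0].keys())
    score = {u: acoustic_loglik[0][u] for u in units}
    backpointers = []
    for frame in acoustic_loglik[1:]:
        new_score, pointers = {}, {}
        for u in units:
            # Best predecessor under the weighted transition score.
            prev = max(units, key=lambda p: score[p] + weight * trans_logprob[(p, u)])
            new_score[u] = score[prev] + weight * trans_logprob[(prev, u)] + frame[u]
            pointers[u] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final unit.
    path = [max(units, key=lambda u: score[u])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

def collapse(path):
    """Merge consecutive repeats so the frame labels read as a unit sequence."""
    out = []
    for u in path:
        if not out or out[-1] != u:
            out.append(u)
    return out

# Toy data (assumed): the middle frame weakly favors "B" acoustically,
# while the transition model strongly favors staying in the same unit.
frames = [
    {"A": math.log(0.8), "B": math.log(0.2)},
    {"A": math.log(0.4), "B": math.log(0.6)},
    {"A": math.log(0.9), "B": math.log(0.1)},
]
trans = {
    ("A", "A"): math.log(0.9), ("A", "B"): math.log(0.1),
    ("B", "A"): math.log(0.1), ("B", "B"): math.log(0.9),
}

print(collapse(viterbi(frames, trans, weight=0.0)))  # ['A', 'B', 'A']
print(collapse(viterbi(frames, trans, weight=1.0)))  # ['A']
```

In this toy case the same utterance yields two different unit sequences depending on the weight, which is why the choice of relative confidence matters when neither model is known in advance to be the more reliable one.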