Pronunciation modeling is the process of assigning to each word in a given vocabulary a suitable sequence of phonemes (or phones). Good pronunciation modeling is critical to all speech-to-text and text-to-speech applications. In automatic speech recognition (ASR), the search module relies on phoneme sequences stored in the underlying (phonetic) dictionary to select appropriate acoustic models against which to score the input utterance. Similarly, in text-to-speech (TTS) synthesis, phonemic expansion is required for the selection of the proper TTS units from which to generate the desired waveform.
Given an input vocabulary, there are two complementary ways to construct a set of pronunciations. The most obvious method is to have one or several trained linguists manually create each entry. This is typically a time-consuming task, often prone to inconsistencies, and inherently dependent on the language considered. Yet, over the past decade, phonetic dictionaries have been in increasingly high demand as many spoken language applications reached large-scale deployment worldwide. As a result, medium (10,000 entries) to large (100,000 entries) dictionaries, of varying quality, are now available for most major languages.
The second method is to automatically derive pronunciations from the word orthography, i.e., the sequence of letters used to convey the word. This is necessary for the real-time processing of out-of-vocabulary words, for which no entry exists in the underlying dictionary. In fact, the dynamic nature of language makes such automatic derivation an instrumental part of any system. Automatic pronunciation modeling relies on a set of so-called letter-to-sound rules, whose role is to capture language regularity and properly encapsulate it within a small number of general principles. Letter-to-sound rules come in two basic flavors. They can be (hand-)written by trained linguists, as in the case of morphological decomposition for example; this approach, however, tends to suffer from the same drawbacks as manual dictionary creation. Or they can be primarily data-driven, whereby a statistical algorithm leverages an existing dictionary to model the salient relationships between orthography and pronunciation.
In the latter case, the state-of-the-art is to train a decision tree to classify each letter sequence into the appropriate phoneme sequence. During training, the decision tree is presented with sequence pairs, aligned with the help of (language-dependent) letter mappings. During classification, the tree is traversed on the basis of questions asked about the context of each letter, until a leaf corresponding to a particular phoneme (or phoneme string) is reached. The final pronunciation is simply the concatenation of the results obtained at each leaf. Although data-driven, this procedure is not really unsupervised, since the letter mappings rely on expert human knowledge of the language considered.
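The classification phase described above can be illustrated with a minimal sketch. It is not taken from any actual system: the per-letter trees, the context questions (`next_is`, `prev_is`), and the lowercase consonant symbols are hypothetical and hand-built for illustration, whereas a real tree would be learned from aligned letter/phoneme pairs. The sketch only shows how each letter is routed by questions about its context to a leaf holding a phoneme string (possibly empty), and how the leaf outputs are concatenated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class Leaf:
    # Phoneme string emitted at this leaf; empty when the letter is silent.
    phonemes: List[str]


@dataclass
class Node:
    # A yes/no question about the context of the letter at position i.
    question: Callable[[str, int], bool]
    yes: "Tree"
    no: "Tree"


Tree = Union[Leaf, Node]


def next_is(ch: str) -> Callable[[str, int], bool]:
    """Question: is the following letter `ch`?"""
    return lambda word, i: i + 1 < len(word) and word[i + 1] == ch


def prev_is(ch: str) -> Callable[[str, int], bool]:
    """Question: is the preceding letter `ch`?"""
    return lambda word, i: i > 0 and word[i - 1] == ch


# Hypothetical hand-built trees, one per letter (toy coverage only).
TREES: Dict[str, Tree] = {
    "s": Node(next_is("h"), Leaf(["S"]), Leaf(["s"])),
    "h": Node(prev_is("s"), Leaf([]), Leaf(["h"])),
    "i": Leaf(["IH"]),
    "p": Leaf(["p"]),
}


def classify(tree: Tree, word: str, i: int) -> List[str]:
    """Traverse the tree for letter i until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.question(word, i) else tree.no
    return tree.phonemes


def pronounce(word: str) -> List[str]:
    """Concatenate the leaf outputs obtained for each letter in turn."""
    result: List[str] = []
    for i, letter in enumerate(word):
        result.extend(classify(TREES[letter], word, i))
    return result


print(pronounce("ship"))  # 's' followed by 'h' maps to a single phoneme
print(pronounce("sip"))
```

In this toy setup, a letter whose context was never covered by a question simply falls through to a default leaf, which hints at the generalization issue discussed next.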
“Letter-to-sound” rules work reasonably well for words that are fairly close to exemplars seen in training, but they often break down otherwise. The primary reason why current “letter-to-sound” rule implementations generalize poorly is that they attempt to capture language regularity and encapsulate it within a small number of general principles. This can be viewed as a “top-down” approach. The immediate consequence is that all rare phenomena are presumed “irregular” and, accordingly, ignored. This is a major drawback in a situation where most occurrences are infrequent, as in name pronunciation.
Perhaps even more importantly, by construction decision trees only ask contextual questions associated with phenomena that are sufficiently well represented in the training data. Contexts rarely seen in the underlying dictionary tend to be overlooked, regardless of their degree of similarity or relevance. For out-of-vocabulary words that largely conform to the general patterns of the language, as observed in the training words, this is relatively inconsequential. But many other words, such as names (especially those of foreign origin), may comprise a number of patterns rarely seen in the dictionary, for which this limitation may be more deleterious. To illustrate, consider the name of Indian origin “Krishnamoorthy,” whose correct pronunciation is given by:
k r IH S n AX m 1UH r T IY  (1)
using a standard phonetic notation known as AppleBet from Apple Computer, Inc., the assignee of the present invention. In contrast, the pronunciation produced by a typical letter-to-sound rule decision tree trained on 56,000 names of predominantly Western European origin and which produces average results, is given by:
k r IH S n AE m 1UW UX r D IY  (2)
In particular, three errors stand out in (2): the schwa “AX” in sixth position is replaced by the full vowel “AE,” the penultimate unvoiced “T” is replaced by the voiced version “D,” and the stressed vowel “1UH” is replaced by the improper compound “1UW UX.” These errors can all be traced to the poor generalization properties of the decision tree framework. Specifically, the ending “UX r D IY” results from the influence of a number of names in the training dictionary ending in “orthy,” such as “Foxworthy.” The vowel compound comes from the inability of this pattern to account for “oo,” hence the awkward attempt to have it both ways by concatenating the two vowels.
Finally, the full vowel “AE,” commonly seen in names like “McNamara,” points to an obvious failure to connect “Krishnamoorthy” with the more semantically related “Krishna.” This example underscores the importance of exploiting all relevant contexts, regardless of how sparsely seen they may have been in the training data, to increase the ability of a pronunciation model to generalize.
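The three errors discussed above can be located mechanically by aligning pronunciations (1) and (2). The following is a minimal sketch, assuming Python's standard `difflib.SequenceMatcher` as the alignment tool; the function name `diff_pronunciations` is hypothetical, and this is only an illustrative way to compare two phoneme strings, not an evaluation procedure described in the text.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Pronunciations (1) and (2) from the text, split into phoneme symbols.
REFERENCE = "k r IH S n AX m 1UH r T IY".split()
PREDICTED = "k r IH S n AE m 1UW UX r D IY".split()


def diff_pronunciations(
    ref: List[str], hyp: List[str]
) -> List[Tuple[int, List[str], List[str]]]:
    """Return (1-based position in ref, ref span, hyp span) for each mismatch."""
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            mismatches.append((i1 + 1, ref[i1:i2], hyp[j1:j2]))
    return mismatches


for position, expected, produced in diff_pronunciations(REFERENCE, PREDICTED):
    print(f"position {position}: expected {expected}, got {produced}")
```

Each reported span should correspond to one of the errors named in the text: the schwa replaced by a full vowel, the stressed vowel replaced by a two-vowel compound, and the unvoiced consonant replaced by its voiced counterpart.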