Speech synthesis and recognition systems are well known. Although the dictionaries used in such systems are generally quite good for standard English and even for proper names, the dictionaries can never provide perfect performance, and user-provided exceptions must be allowed. For example, some customers use text-to-speech (TTS) systems for applications with specialized vocabularies, such as individuals' names, pharmaceuticals, street names, company names, etc. This often requires that the dictionaries be customized by the end-users to include these specialized vocabularies. In addition, the fact that a single name may be pronounced differently in various geographic locales ensures that customization is often necessary. For instance, the most obvious pronunciation of "Peabody" is not at all correct near Boston, Mass.; in Boston, the middle syllable of "Peabody" is deaccented, i.e., it becomes a very short "buh" sound. Currently, such corrections are made by typing the desired pronunciation into, e.g., an "exception" dictionary. The person making the entry must know the proper phonetic transcription of the word in terms of the particular phonetic alphabet used by the TTS system. End-users are typically not willing to devote the time and energy needed to become proficient in constructing alternative pronunciations.
This situation would be improved only marginally by allowing end-users to enter pronunciations in a standardized phonetic alphabet such as the International Phonetic Alphabet (IPA). Although the IPA is a "standard", it is not necessarily any easier for end-users to learn, and usage by American linguists differs substantially from the official standard, leading to further confusion.
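The exception-dictionary mechanism described above can be illustrated with a minimal sketch, assuming a simple word-to-phoneme mapping in which user-entered exceptions take precedence over the default dictionary. The names `DEFAULT_DICTIONARY`, `add_exception`, and `lookup_pronunciation` are hypothetical, and the phoneme strings use an illustrative notation rather than the alphabet of any particular TTS system.

```python
# Illustrative default dictionary; the phoneme notation is an assumption
# made for this sketch and does not belong to any specific TTS system.
DEFAULT_DICTIONARY = {"peabody": "P IY1 B AA2 D IY0"}

# User-provided corrections, keyed by lowercased spelling.
exception_dictionary = {}


def add_exception(word, phonemes):
    """Store a user-typed phonetic transcription for a word."""
    exception_dictionary[word.lower()] = phonemes


def lookup_pronunciation(word):
    """Return the exception entry if one exists, else the default entry.

    Returns None when the word appears in neither dictionary.
    """
    key = word.lower()
    if key in exception_dictionary:
        return exception_dictionary[key]
    return DEFAULT_DICTIONARY.get(key)
```

In this sketch the end-user must still type the transcription passed to `add_exception` by hand, which is the burden noted above: the entry is only as good as the user's command of the phonetic alphabet.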
Other approaches have been suggested, generally involving much trial and error. For example, the end-user could use "creative spelling" to find an alternate way of typing the name so that the TTS system happens to produce the desired result. The system could then enter the internally generated phonetics into the exception dictionary. This often works, but takes time and may not always produce satisfactory results. Another approach is to automatically generate the "N-best" (i.e., most likely) pronunciations for a given word, let the end-user listen to all N attempts (as spoken by the TTS system), and identify which is closest. This also may involve quite a bit of time on the end-user's part, and the number of examples (N) may need to be very large to reach unusual or unexpected pronunciations.
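The "N-best" approach above can be sketched as follows, assuming each candidate pronunciation carries a likelihood score. The function names, candidate strings, and scores are hypothetical and chosen only to show why a small N may fail to reach an unusual pronunciation.

```python
import heapq


def n_best(candidates, n):
    """Return the n highest-scoring (pronunciation, score) pairs, best first."""
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])


def choose(candidates, n, pick):
    """Return the pronunciation the end-user picked from the N-best list.

    `pick` is the zero-based position the user selected after listening;
    None is returned if the position is outside the list.
    """
    shortlist = n_best(candidates, n)
    if 0 <= pick < len(shortlist):
        return shortlist[pick][0]
    return None


# Hypothetical candidates and scores for "Peabody" (illustrative only).
candidates = [
    ("P IY1 B AA2 D IY0", 0.60),
    ("P IY1 B OW2 D IY0", 0.25),
    ("P IY1 B AH0 D IY0", 0.05),  # the unusual local pronunciation
    ("P EH1 B AH0 D IY0", 0.10),
]

# With N = 2 the unusual pronunciation is never offered to the user.
offered = [pron for pron, _ in n_best(candidates, 2)]
```

Under these assumed scores, the local pronunciation is offered only when N is raised to 4, and the end-user must then listen to every candidate ranked ahead of it.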
Alternatively, the TTS system could simply record the corrected word as spoken by the end-user, but this falls short in several respects. First, the application must store the recordings of each corrected word indefinitely, and such recordings require much more storage than printable transcriptions. Second, the application must interrupt the synthetic voice to find and play the recorded word. Third, the recorded voice will not match the synthetic voice. In general, the two voices will belong to different speakers, perhaps even speakers of different genders. Even if the speaker is unchanged, the recorded word is fixed and cannot easily change speaking rate, pitch, or other voice characteristics to match the current context. Lastly, only playback benefits from the recorded word: speech recognition accuracy is not improved.
Accordingly, there is a need for an efficient method for end-users of speech synthesis and recognition systems to customize the pronunciations of words in the dictionaries used by those systems.