The present invention relates generally to speech recognition and speech synthesis systems. More particularly, the invention relates to pronunciation generation.
Computer-implemented speech technology today draws on many areas of expertise, ranging from linguistics and psycho-acoustics to digital signal processing and computer science. The problems of text-to-speech (TTS) synthesis and automatic speech recognition (ASR) present many opportunities to share technology. Traditionally, however, speech recognition and speech synthesis have been addressed as entirely separate disciplines, with little of the cross-pollination that could benefit both.
We have discovered techniques, described in this document, for combining speech recognition and speech synthesis technologies to the mutual advantage of both disciplines in generating pronunciation dictionaries. Having a good pronunciation dictionary is key to both text-to-speech and automatic speech recognition applications. In the case of text-to-speech, the dictionary serves as the source of pronunciation for words entered by graphemic or spelled input. In automatic speech recognition applications the dictionary serves as the lexicon of words that are known by the system. When training the speech recognition system, this lexicon identifies how each word is phonetically spelled, so that the speech models may be properly trained for each of the words.
In both speech synthesis and speech recognition applications, the quality and performance of the application may be highly dependent on the accuracy of the pronunciation dictionary. Typically it is expensive and time-consuming to develop a good pronunciation dictionary, because the only way to obtain accurate data has heretofore been to employ professional linguists, preferably a single linguist to ensure consistency. The linguist painstakingly steps through each word and provides its phonetic transcription.
Phonetic pronunciation dictionaries are available for most of the major languages, although these dictionaries typically have limited word coverage and do not adequately handle proper names, unusual and compound nouns, or foreign words. Publicly available dictionaries likewise fall short when used to obtain pronunciations for a dialect different from the one for which the system was trained or intended.
Currently available dictionaries also rarely match all of the requirements of a given system. Some systems (such as text-to-speech systems) need high accuracy, whereas other systems (such as some automatic speech recognition systems) can tolerate lower accuracy but may require multiple valid pronunciations for each word. In general, the diversity in system requirements compounds the problem. Because there is no "one size fits all" pronunciation dictionary, the construction of good, application-specific dictionaries remains expensive.
The present invention provides a system and method for automatically generating phonetic transcriptions, with little or no human involvement, depending on the desired accuracy of the dictionary. The invention provides a tool by which the user can specify a confidence level and the system automatically stores in the dictionary all generated pronunciations that fulfill the desired confidence level. Unlike other phonetic transcription tools, the invention requires no specific linguistic or phonetic knowledge to produce a pronunciation dictionary. The system can generate multiple pronunciations at different confidence levels, as needed, based on the requirements of the speech system being developed.
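By way of illustration only, the confidence-level selection described above might be sketched as follows. The names Candidate and build_dictionary, the use of a numeric score between 0 and 1, and the ordering of stored entries are assumptions made for this sketch and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Candidate:
    """A hypothetical generated pronunciation with its confidence score."""
    word: str
    phonemes: Tuple[str, ...]
    confidence: float  # assumed here to lie in the range 0.0 to 1.0


def build_dictionary(
    candidates: Iterable[Candidate], min_confidence: float
) -> Dict[str, List[Candidate]]:
    """Keep every candidate whose confidence meets the user-specified level.

    A word may retain several pronunciations; words with no candidate
    at or above the level are left out of the dictionary.
    """
    dictionary: Dict[str, List[Candidate]] = {}
    for cand in candidates:
        if cand.confidence >= min_confidence:
            dictionary.setdefault(cand.word, []).append(cand)
    # Assumed convention: list the most confident pronunciation first.
    for entries in dictionary.values():
        entries.sort(key=lambda c: c.confidence, reverse=True)
    return dictionary
```

Under these assumptions, a text-to-speech application might call build_dictionary with a high level and use only the first entry for each word, whereas a speech recognition application might use a lower level and keep every entry returned.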
One powerful advantage of the system and method of the invention is that it draws on multiple sources of information and integrates them, so that the combined result is better than any single source would provide on its own. Moreover, different words may be handled by different methods, resulting in a superior final product. A non-exhaustive list of information sources applicable to the present invention includes: expert systems based on letter-to-sound rules, on-line dictionaries, morph dictionaries with morph combining rules, trainable learning subsystems, dialect transformation rules, and output from automatic speech recognition, from an operator's voice or from other audio sources.
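One possible way to integrate several such sources is sketched below. This passage does not state how the sources are combined, so the weighted-voting scheme, the per-source weights, and the names Source and combine_sources are all assumptions introduced solely for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Pronunciation = Tuple[str, ...]
# Hypothetical interface: a source maps a word to scored pronunciations,
# or to an empty list when it has nothing to offer for that word.
Source = Callable[[str], List[Tuple[Pronunciation, float]]]


def combine_sources(
    word: str, sources: List[Tuple[Source, float]]
) -> List[Tuple[Pronunciation, float]]:
    """Merge proposals from several sources by weighted voting (assumed scheme).

    Each source contributes weight * score to every pronunciation it
    proposes. Totals are normalized by the weight of the sources that
    actually responded, so a word covered by only one source can still
    receive a usable score.
    """
    totals: Dict[Pronunciation, float] = defaultdict(float)
    weight_used = 0.0
    for source, weight in sources:
        proposals = source(word)
        if not proposals:
            continue  # this source does not cover the word
        weight_used += weight
        for pronunciation, score in proposals:
            totals[pronunciation] += weight * score
    if weight_used == 0.0:
        return []
    ranked = [(p, total / weight_used) for p, total in totals.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

In this sketch a pronunciation proposed by both an on-line dictionary and a letter-to-sound rule system accumulates score from each, which is one simple way of letting agreement between sources raise the resulting confidence.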
In accordance with one aspect of the invention, a trainable learning sub-system is included that can adapt or improve as new pronunciation information becomes available. The trainable learning sub-system can adapt to a speaker, for example, making it easy to adapt a lexicon to a new dialect.
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.