This invention relates in general to text-to-speech generators and, in particular, to an allophonic text-to-speech generator.
Many telephone assistance systems use pre-recorded words and announcements to assist callers. For example, a voice mail box may include a pre-recorded greeting with a space in the greeting for inserting the name of the mail box owner. Some systems are sophisticated enough to have a library of names that can be concatenated from pre-recorded voice files so that the same voice consistently speaks the announcement as well as the name of the called party.
Directory assistance systems are significantly more complex than voice mail systems. Directory assistance systems often require numerous individual announcements as well as a number of individual names, words, and phrases. These announcements, names, words, and phrases must be recorded in advance. All recordings are made by one person so that the caller hears one voice.
It is time-consuming to create or modify existing announcement systems. In order to change any of the announcements or individual words, the audio file must be re-recorded. That may be impossible if the original voice talent who recorded the announcement is no longer available to make future recordings. Even if the voice talent is available, modifications are still labor-intensive. They require sessions for recording, editing and concatenating the talent's voice in order to generate the desired announcements and words.
Others have proposed text-to-speech generators (U.S. Pat. Nos. 4,872,202, 5,384,893, and 5,463,715) and systems that synthesize human voice from computer files (see U.S. Pat. No. 4,602,152). The foregoing references show that it is possible to convert orthographic text into phonetic text and into speech. Nevertheless, the voice quality of such systems is unacceptable.
Orthographic text is the spelling of a spoken word. Phonetic text includes approximately 40 phonemes for translating orthographic English to phonetic English. A phoneme is an abstract unit that forms a basis for writing down a language systematically and unambiguously. The phonemes of a language are the minimal set of units that describe all and only the variations between sounds that cause a difference in meaning between the words of the language. For example, the /p/ and /t/ phonemes in the words "pin" and "tin" are distinct phonemes. However, audible speech also includes numerous minor but detectable differences in how a single phoneme is pronounced. Allophones are the variant forms of a phoneme: they differ subtly but distinctly from one another without changing the meaning of a word. For example, the aspirated /p/ of the word "pit" and the unaspirated /p/ of the word "spit" are allophones of the phoneme /p/.
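The distinction between phonemes and allophones can be illustrated with a minimal sketch. The lexicon, the phoneme symbols, the allophone labels, and the function names below are hypothetical and are chosen only for illustration. The aspiration rule is a deliberate simplification (word-initial /p/ is treated as aspirated, and every other /p/ as unaspirated) and is not taken from any of the cited references.

```python
# Hypothetical mini-lexicon: orthographic word -> phoneme sequence.
# The phoneme symbols are illustrative, not a standard inventory.
LEXICON = {
    "pin": ["p", "ih", "n"],
    "tin": ["t", "ih", "n"],
    "pit": ["p", "ih", "t"],
    "spit": ["s", "p", "ih", "t"],
}


def to_phonemes(word):
    """Translate an orthographic word to phonemes by lexicon lookup."""
    return LEXICON[word.lower()]


def to_allophones(phonemes):
    """Label each /p/ with an allophone based on its position.

    Simplified assumption: a word-initial /p/ is aspirated; any other
    /p/ (for example, one following /s/) is unaspirated.
    """
    result = []
    for index, phoneme in enumerate(phonemes):
        if phoneme == "p":
            if index == 0:
                result.append("p_aspirated")
            else:
                result.append("p_unaspirated")
        else:
            result.append(phoneme)
    return result


if __name__ == "__main__":
    for word in ("pin", "tin", "pit", "spit"):
        print(word, to_allophones(to_phonemes(word)))
```

In this sketch, "pin" and "tin" differ in their first phoneme, which changes the word, whereas "pit" and "spit" contain the same phoneme /p/ realized as two different allophones.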
In the references described above, others have translated orthographic text to phonetic text. After that translation, the phonetic text is converted to audio signals using pre-recorded phonemes and allophonic information. Pre-recorded phonemes are modified in accordance with different computer programs that alter the frequency, pitch, cadence, and rhythm of the phoneme in order to add allophonic information to the recorded phoneme and generate a truer audio representation of the input text. However, those prior art systems have complex software and have failed to provide acceptable reproductions of human voice for operator assistance services. Accordingly, there is a long-felt need for a reliable and less complex system that accurately produces audio signals representative of input orthographic text.