While speech-to-text applications have improved remarkably in accuracy and usefulness over the past ten or so years, pleasant, natural-sounding, easily intelligible text-to-speech functionality remains an elusive but sought-after goal.
This remains the case despite the apparent simplicity of converting known syllables with known sounds into speech. The difficulty lies in the subtlety of the audible cues in human speech, at least in the case of certain languages, such as English. Certain aspects of these audible cues have been identified, such as the increase in pitch at the end of a question which might otherwise be declaratory in form. However, more subtle expressions in pitch and energy, some speaker specific, some optional and general in nature, and still others word specific, combine with individual voice color in the human voice to result in realistic speech.
In accordance with the invention, elements of individual speaker color, randomness, and so forth are incorporated into output speech with varying degrees of implementation, to achieve a pseudo-random effect. In addition, speaker color is integrated with the same and combined with expressive models patterned on existing conventional speech coach to student voice training techniques. Such conventional techniques may include the Lessac system, which is aimed at improving intelligibility in the human voice in the context of theatrical and similar implementations of human speech.
In contrast to the inventive approach, conventional text to speech technology has concentrated on a mechanical, often high information density, approach. Perhaps the most convincing text to speech approach is the use of prerecorded entire phrases, such as those used in some of the more sophisticated telephone answering applications. An example of such an application is Wildfire (a trademark), a proprietary system available in the United States. In such systems, the objective is to minimize the number of dialog options in favor of prerecorded phrases whose character, content and tonality are convincing from an expressive standpoint. For example, such a system, on recognizing an individual's voice and noting a match to the phone number, might say: “Oh, hello Mr. Smith”, perhaps with an intonation of pleasure or surprise. On the other hand, if voice recognition software in the system determines that the voice is not likely that of Mr. Smith, despite the fact that the call has originated from his telephone line, the system may be programmed to say: “Is that you, Mr. Smith?”, but in an inquisitive tone. In both examples, the phrase is spoken by a human speaker and recorded in its entirety. However, the amount of memory required for just a very few responses is relatively high and versatility is not a practical objective.
Still another approach is so-called “phrase splicing”, such as that disclosed in Donovan, U.S. Pat. No. 6,266,637, where recorded human speech in the form of phrases is used to construct output speech. In addition, in accordance with this technology, the characteristics of segments of speech may be modified, for example by modifying them in duration, energy and pitch. While related approaches, such as utterance playback, solve some of the problems of more limited systems, such approaches tend to be both less intelligible and less natural than human speech. To a certain extent, blending prerecorded speech with synthetic speech will also solve some of these problems, but the output speech, while versatile and having a wider vocabulary, is still relatively mechanical in character.
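By way of illustration only, the modification of a recorded segment in energy and duration might be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of Donovan: the function names, the test tone and the sample rate are hypothetical, and the naive resampling used here to change duration also shifts pitch by the inverse factor, which a practical system would have to avoid with a more elaborate technique.

```python
import math

def scale_energy(samples, gain):
    """Multiply every sample by `gain`; signal energy scales by gain squared."""
    return [s * gain for s in samples]

def stretch_duration(samples, factor):
    """Resample by linear interpolation so the result is about `factor` times as long.

    Caveat (assumption of this sketch): naive resampling also divides the
    pitch by `factor`; it does not change duration independently of pitch.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if len(samples) < 2:
        return list(samples)
    n_out = max(2, round(len(samples) * factor))
    step = (len(samples) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), len(samples) - 2)  # left neighbour index
        frac = pos - lo                       # distance past the left neighbour
        out.append(samples[lo] * (1 - frac) + samples[lo + 1] * frac)
    return out

def energy(samples):
    """Sum of squared sample values."""
    return sum(s * s for s in samples)

# Hypothetical stand-in for a recorded segment: 0.1 s of a 100 Hz tone at 8 kHz.
segment = [math.sin(2 * math.pi * 100 * t / 8000) for t in range(800)]
louder = scale_energy(segment, 2.0)
longer = stretch_duration(segment, 1.5)
```

Here `louder` carries four times the energy of `segment`, and `longer` has one and a half times as many samples.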
Still another approach is to break up speech into its individual sounds or phonemes, and then to synthesize words from these sounds. Such phonemes may be initially recorded human speech, but may have their characteristics varied so that the resulting phoneme has a different duration, pitch, energy or other characteristic or characteristics as compared to the original recording. Still another approach is to make multiple recordings of the phonemes, or integrate multiple recordings of words with word generation using phoneme building blocks.
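Word generation from phoneme building blocks, as described above, may be sketched as a lookup followed by concatenation. The following is an illustrative sketch only: the lexicon, the phoneme labels, the sine-wave placeholders standing in for recorded units, and the linear crossfade at each joint are all assumptions made for the example.

```python
import math

SAMPLE_RATE = 8000  # assumed sample rate for this sketch

def placeholder_unit(freq_hz, duration_s=0.05):
    """Sine burst standing in for a recorded phoneme (hypothetical data)."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(n)]

# Hypothetical unit inventory and pronunciation lexicon.
UNITS = {
    "K": placeholder_unit(300),
    "AE": placeholder_unit(700),
    "T": placeholder_unit(400),
}
LEXICON = {"cat": ["K", "AE", "T"]}

def crossfade_concat(units, overlap):
    """Join units end to end, blending `overlap` samples at each joint."""
    out = list(units[0]) if units else []
    for unit in units[1:]:
        k = min(overlap, len(out), len(unit))
        base = len(out) - k
        for i in range(k):
            w = (i + 1) / (k + 1)  # weight of the incoming unit ramps up
            out[base + i] = out[base + i] * (1 - w) + unit[i] * w
        out.extend(unit[k:])
    return out

def synthesize_word(word, overlap=40):
    """Look up the phoneme sequence for `word` and splice the stored units."""
    return crossfade_concat([UNITS[p] for p in LEXICON[word]], overlap)

waveform = synthesize_word("cat")
```

The crossfade is one simple way to soften the joint between units; it does not by itself remove the audible discontinuities that arise when recorded units are spliced.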
Still a further refinement is the variation of the prosody, for example by independently changing the prosody of a voiced component and an unvoiced component of the input speech signal, as is taught by U.S. Pat. No. 6,253,182 of Acero. In addition, the frequency-domain representation of the output audio may be changed, as is also described in Acero.
Concatenative systems generate speech by combining small speech segments into output speech units derived from the input text. These output speech units are then concatenated, or played in sequence, to form the final speech output by the system. Speech may be generated using phonemes, diphones (two phonemes) or triphones (three phonemes). In accordance with the techniques described by Acero, the prosody of the speech unit, defined by its pitch and duration, may be varied to convey meaning, such as in the increase in pitch at the end of a question.
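The unit sizes and the question-final pitch rise mentioned above may be illustrated by the following sketch. It is not the technique of Acero; the silence marker `"_"`, the function names, and the pitch values and ratios are assumptions chosen only to make the example concrete.

```python
def to_units(phonemes, size):
    """Split a phoneme sequence into overlapping units of `size` phonemes.

    size=2 yields diphones and size=3 yields triphones. The sequence is
    padded with "_" (an assumed silence marker) at both ends.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    padded = ["_"] + list(phonemes) + ["_"]
    return [tuple(padded[i:i + size]) for i in range(len(padded) - size + 1)]

def pitch_targets(n_units, base_hz=120.0, is_question=False, rise=1.3, fall=0.85):
    """Return one pitch target per unit.

    Pitch drifts linearly from base_hz down to base_hz * fall; for a
    question the final unit is raised to base_hz * rise instead.
    """
    if n_units <= 0:
        return []
    if n_units == 1:
        targets = [base_hz]
    else:
        targets = [base_hz * (1 + (fall - 1) * i / (n_units - 1))
                   for i in range(n_units)]
    if is_question:
        targets[-1] = base_hz * rise
    return targets

diphones = to_units(["K", "AE", "T"], 2)
triphones = to_units(["K", "AE", "T"], 3)
statement = pitch_targets(len(diphones))
question = pitch_targets(len(diphones), is_question=True)
```

In this sketch the statement contour ends below its starting pitch, while the question contour ends above it.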
Still other text to speech technology involves the implementation of technical pronunciation rules in conjunction with the text to speech transformation of certain combinations of certain consonants and/or vowels in a certain order. See for example U.S. Pat. No. 6,188,984 of Manwaring et al. One aspect of this approach is recognizing the boundaries between syllables and applying the appropriate rules.
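The recognition of syllable boundaries by rule may be illustrated with a deliberately simple sketch. The two rules below are assumptions made for the example and are not the rules of Manwaring et al.; they operate on spelling rather than on sounds and fail on many English words, for example those ending in a silent "e".

```python
VOWELS = set("aeiouy")  # treating "y" as a vowel is a simplifying assumption

def syllabify(word):
    """Split `word` at boundaries chosen by two toy rules.

    Rule 1 (V-CV): a single consonant between vowel groups joins the
    following syllable.
    Rule 2 (VC-CV): with two or more consonants, the first closes the
    preceding syllable and the rest open the next one.
    """
    word = word.lower()
    nuclei = []  # (start, end) index pairs of maximal vowel runs
    i = 0
    while i < len(word):
        if word[i] in VOWELS:
            j = i
            while j < len(word) and word[j] in VOWELS:
                j += 1
            nuclei.append((i, j))
            i = j
        else:
            i += 1
    if len(nuclei) < 2:
        return [word] if word else []
    pieces, prev = [], 0
    for (_, end), (start, _) in zip(nuclei, nuclei[1:]):
        n_consonants = start - end
        boundary = end if n_consonants <= 1 else end + 1
        pieces.append(word[prev:boundary])
        prev = boundary
    pieces.append(word[prev:])
    return pieces

examples = {w: syllabify(w) for w in ("paper", "window", "cat")}
```

Under these rules "paper" divides as "pa" and "per", "window" as "win" and "dow", and "cat" remains a single syllable.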
As can be seen from the above, current approaches for text to speech applications range across a spectrum, from concatenated sentences, phrases and words at one end to word generation using phonemes at the other. While speech synthesis using sub-word units lends itself to large vocabularies, serious problems occur where sub-word units are spliced. Nevertheless, such an approach appears, at this time, to constitute the most likely model for versatile high vocabulary text to speech systems. Accordingly, addressing prosody issues is a primary focus. For example, U.S. Pat. No. 6,144,939 of Pearson suggests the possibility of a source-filter model that closely ties the source and filter synthesizer components to physical structures within the human vocal tract. Filter parameters are selected to model vocal tract effects, while source waveforms model the glottal source. Pearson is concerned, apparently, with low memory systems, to the extent that full syllables are not even stored in the system, but rather half syllables are preferred. Interestingly, this approach mimics the Assyro-Babylonian alphabet approach, which involved the use of consonants with various vowel additions respectively before and after each consonant, corresponding to sounds represented by individual characters.
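The general idea of a source-filter model may be illustrated by the following sketch, in which an impulse train stands in for the glottal source and a cascade of two-pole resonators stands in for the vocal tract filter. This is not the synthesizer of Pearson; the impulse-train source, the resonator form, and the formant frequencies and bandwidths are rough illustrative assumptions.

```python
import math

def glottal_source(f0_hz, n_samples, sample_rate):
    """Impulse train at the fundamental frequency (crude glottal stand-in)."""
    period = max(1, round(sample_rate / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Two-pole filter modelling a single vocal tract resonance (formant).

    Implements y[n] = x[n] + 2 r cos(theta) y[n-1] - r^2 y[n-2].
    """
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, below 1
    theta = 2 * math.pi * center_hz / sample_rate        # pole angle
    a1, a2 = 2 * r * math.cos(theta), -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(f0_hz, formants, n_samples=800, sample_rate=8000):
    """Pass the source through one resonator per (center_hz, bandwidth_hz)."""
    signal = glottal_source(f0_hz, n_samples, sample_rate)
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, sample_rate)
    return signal

# Illustrative values only: two formants loosely resembling an open vowel.
vowel = synthesize_vowel(100.0, [(700.0, 80.0), (1100.0, 90.0)])
```

Changing `f0_hz` alters the pitch of the source without touching the filter, while changing the formant list alters the vowel quality without touching the source, which is the separation such a model provides.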