The present invention relates generally to text-to-speech synthesis. More particularly, the invention relates to a technique for applying prosody information to the synthesized speech using prosody templates, based on a tree-structured look-up technique.
Text-to-speech systems convert character-based text (e.g., typewritten text) into synthesized spoken audio content. Text-to-speech systems are used in a variety of commercial applications and consumer products, including telephone and voicemail prompting systems, vehicular navigation systems, automated radio broadcast systems, and the like.
There are a number of different techniques for generating speech from supplied input text. Some systems use a model-based approach in which the resonant properties of the human vocal tract and the pulse-like waveform of the human glottis are modeled, parameterized, and then used to simulate the sounds of natural human speech. Other systems use short digitally recorded samples of actual human speech that are then carefully selected and concatenated to produce spoken words and phrases when the concatenated strings are played back.
To a greater or lesser degree, all of the current synthesis techniques sound unnatural unless prosody information is added. Prosody refers to the rhythmic and intonational aspects of a spoken language. When a human speaker utters a phrase or sentence, the speaker will usually, and quite naturally, place accents on certain words or phrases, to emphasize what is meant by the utterance. A text-to-speech apparatus can have great difficulty simulating the natural flow and inflection of the human-spoken phrase or sentence because the proper inflection cannot always be inferred from the text alone.
For example, in providing instructions to a motorist to turn at the next intersection, the human speaker might say “turn HERE,” emphasizing the word “here” to convey a sense of urgency. The text-to-speech apparatus, simply producing synthesized speech in response to the typewritten input text, would not know whether a sense of urgency was warranted, or not. Thus the apparatus would not place special emphasis on one word over the other. In comparison to the human speech, the synthesized speech would tend to sound more monotone and monotonous.
In an effort to inject more realism into synthesized speech, it is now possible to provide the text-to-speech synthesizer with additional prosody information, which is used to alter the way the synthesizer output is generated to give the resultant speech more natural rhythmic content and intonation.
In the typical speech synthesizer, prosody information affects the pitch contours and/or duration values of the sounds being generated in response to text input. In natural speech, stressed or accented syllables are produced by raising the pitch of one's voice and/or by increasing the duration of the vowel portion of the accented syllable. By performing these same operations, the text-to-speech synthesizer can mimic the prosody of human speech. We have developed a template-based system to organize and associate prosody information with a sequence of text, where the text is described in terms of some sort of linguistic unit, such as a word or phrase. In our template-based system, a library of templates is constructed for a collection of words or phrases that have different phonological characteristics. Then, given particular text input, the template with the best matching characteristics is selected and used to supply prosodic information for synthesis.
When only a small number of words or phrases needs to be spoken, it is feasible to construct templates for each and every possible word or phrase that may be generated by the synthesizer. However, as the size of the spoken domain increases, it becomes increasingly costly to store all of the required templates.
The present invention provides a solution to this problem by a technique that finds the closest matching template for a given target synthesis and by then finding an optimal mapping between a not-exactly-matching template and target. The system is capable of generating new templates using portions of existing templates when an exactly matching template is not found.
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.