The present invention relates to a process of producing natural sounding speech converted from text, and more particularly, to a method of prosody generation by unit selection from an imitation speech database.
Text to speech (TTS) conversion systems have achieved consistent quality prosody using rule based prosody generation systems. For purposes of this application, rule based systems are systems that rely on human analysis to extract explicit rules to generate the prosody for different cases. Alternatively, corpus based prosody generation methods automatically extract the requested data from a given labeled database. The rule based synthesizer systems have achieved a high level of intelligibility, although their unnatural prosody and synthetic voice quality prevent them from being widely used in communication systems. Natural prosody is one of the more important requirements for high quality speech synthesis, to which users can listen comfortably. In addition, the ability to personalize the prosody of a synthetic voice to that of a certain speaker can be useful for many applications.
Recently, corpus based prosody modeling and generation methods have been shown to be able to produce natural-sounding prosody for text to speech systems. On the other hand, rule based prosody generation systems have the advantage of giving consistent quality prosody. Compared with the corpus based methods, the rule based method allows a conveniently explicit way of handling various prosodic effects that are not currently optimized in corpus based modeling and generation methods.
The present invention provides a method to combine the robustness of the rule based method of text to speech generation with a more natural and speaker adaptive corpus based method. The rule based method produces a set of intonation events by selecting syllables on which there would be either a pitch peak or dip (or a combination), and produces the parameters which originally would be used to generate a final shape of the event. The synthetic shape generated by the rule based method is then utilized to select the best matching units from an imitation speech database of a speaker's prosody, which are then concatenated to produce the final prosody.
The database of the speaker's prosody is created by having the target speaker listen to a set of speech-synthesized sentences, and then imitate their prosody, while still trying to sound natural. The imitation speech is time aligned with the synthetic speech, and the time alignment is used to project the intonation events onto the imitation speech, thus avoiding the work intensive process of labeling the imitation speech database. After this processing, a database is formed of prosody events and their parameters. By using imitation speech, it is possible to reduce unwanted inconsistency and variability in the speaker's prosody, which otherwise can degrade the generated prosody. For prosody generation, a dynamic programming method is used to select a sequence of prosody events from the database, so as to be both close to the target event sequence and to connect to each other smoothly and naturally. The selected events are smoothly concatenated, and their intonation and duration are copied into the syllables and phonemes comprising the new sentence. The method can be used to easily and quickly personalize the prosody generation to that of a target speaker.
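By way of illustration only, the dynamic programming selection described above may be sketched as follows. The sketch assumes, hypothetically, that each prosody event is reduced to three pitch values (start, peak, end, in Hz), that the target cost is the sum of absolute pitch differences, and that the join cost is the pitch discontinuity between consecutive units. The names `Event`, `target_cost`, `join_cost`, and `select_units` are introduced for this example and do not appear in the specification; the actual event parameters and cost functions may differ.

```python
from typing import List, Tuple

# Hypothetical event representation: (start_f0, peak_f0, end_f0) in Hz.
Event = Tuple[float, float, float]


def target_cost(target: Event, unit: Event) -> float:
    """Assumed cost: how far a database unit is from the rule based target."""
    return sum(abs(t - u) for t, u in zip(target, unit))


def join_cost(prev_unit: Event, next_unit: Event) -> float:
    """Assumed cost: pitch jump between the end of one unit and the start of the next."""
    return abs(prev_unit[2] - next_unit[0])


def select_units(targets: List[Event], database: List[Event]) -> List[int]:
    """Return database indices minimizing total target cost plus join cost."""
    if not targets or not database:
        return []
    m = len(database)
    # best[j]: lowest cost of any path ending in database unit j.
    best = [target_cost(targets[0], u) for u in database]
    back: List[List[int]] = []  # back[t][j]: best predecessor of unit j at step t+1
    for tgt in targets[1:]:
        new_best, ptr = [], []
        for u in database:
            # Pick the predecessor giving the cheapest path into unit u.
            k = min(range(m), key=lambda i: best[i] + join_cost(database[i], u))
            new_best.append(best[k] + join_cost(database[k], u) + target_cost(tgt, u))
            ptr.append(k)
        best = new_best
        back.append(ptr)
    # Trace the cheapest path backwards from the best final unit.
    j = min(range(m), key=lambda i: best[i])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return path
```

In this sketch the target cost keeps the selected sequence close to the synthetic shape produced by the rule based method, while the join cost favors units that connect without pitch discontinuities. The search examines every database unit at every step, so a practical system would likely prune the candidates first.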
Further areas of applicability of the present invention will become apparent from the detailed description provided hereinafter. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.