The present invention relates generally to speech synthesis and, more particularly, to producing naturally computer-generated speech by identifying and applying speech patterns in a voice dialog scenario.
In a typical voice dialog scenario, the structure of the spoken messages is fairly well defined. Typically, the message consists of a fixed portion and a variable portion. For example, in a vehicle speech synthesis system, a spoken message may comprise the sentence xe2x80x9cTurn left of on Mason Street.xe2x80x9d The spoken message consists of a fixed or carrier portion and a variable or slot portion. In this example, xe2x80x9cTurn left on ______xe2x80x9d defines the fixed or carrier portion, and the name of the street xe2x80x9cMason Streetxe2x80x9d defines the variable or slot portion. As the identifier implies, the speech synthesis system may change the variable portion so that the speech synthesis system can direct a driver to follow directions involving multiple streets or highways.
Existing speech synthesis systems typically handle the insertion of the variable portion into the fixed portion rather poorly, creating a rather choppy and unnatural speech pattern. One approach to improving the quality for generating voice dialog can be found with reference to U.S. Pat. No. 5,727,120 (Van Coile), issued Mar. 10, 1998. The Van Coil patent receives a message frame having a fixed and variable portion and generates a markup for the entire message frame. The entirety of the message frame is broken down to phonemes, and necessarily requires a uniform presentation of the message frame. In the speech markup of an enriched phonetic transcription formulated with the phonemes, the control parameters are provided at the phoneme level. Such a markup does not guarantee optimal acoustic sound unit selection when rebuilding the message frame. Further, the pitch and duration of the message frame, known as the prosody, is selected for the entire message frame, rather than the individual fixed and variable portions. Such a message frame construction renders building the frame inflexible, as the prosody of the message frame remains fixed. Further, it is desirable to change the prosody of the variable portion of a given message frame.
The present invention takes a different, more flexible approach in building the fixed and variable portions of the message frame. The acoustic portion of each of the fixed and variable portions is constructed with predetermined set of acoustic sound units. A number of prosodic templates are stored in a prosodic template database, so that one or a number of prosodic templates can be applied to a particular fixed and variable portion of the message frame. This provides great flexibility in building the message frames. For example, one, two, or even more prosodic templates can be generated for association with each fixed and variable portion, thereby providing various inflections in the spoken message. Further, the prosodic templates for the fixed portion and variable portion can thus be generated separately, providing greater flexibility in building a library database of spoken messages. For example, the acoustic and prosodic fixed portion can be generated at the phoneme, word, or sentence level, or simply be pre-recorded. Similarly, templates for the variable portion may be generated at the phoneme, word, phrase level, or simply be pre-recorded. The different fixed and variable portions of the message frame are concatenated to define a unified acoustic template and a unified prosodic template.
For a more complete understanding of the invention, its objects and advantages, reference should be made to the following specification and to the accompanying drawings.