In general, a text-to-speech (TTS) system can convert input text into an acoustic waveform that is recognizable as speech corresponding to the input text. More specifically, speech generation involves, for example, transforming a string of phonetic and prosodic symbols into a synthetic speech signal. It is desirable for a TTS system to provide synthesized speech that is both intelligible and natural-sounding.
To synthesize natural-sounding speech, it is essential to control prosody. Prosody refers to the set of speech attributes that do not alter the segmental identity of speech segments, but rather affect the quality of the speech. An example of a prosodic element is lexical stress. The lexical stress pattern within a word plays a key role in determining the manner in which the word is synthesized, as stress in natural speech is typically realized physically by an increase in pitch and phoneme duration. Thus, acoustic attributes such as pitch and segmental duration patterns provide important information regarding prosodic structure, and modeling these attributes greatly improves the naturalness of synthetic speech.
Some conventional TTS systems operate on a pure text input and produce a corresponding speech output with little or no preprocessing or analysis of the received text to provide pitch information for synthesizing speech. Instead, such systems use flat pitch contours corresponding to a constant value of pitch, and consequently, the resulting speech waveforms sound unnatural and monotone.
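The difference between a flat pitch contour and one that reflects lexical stress can be illustrated with a minimal sketch. The function names, the base pitch of 120 Hz, the base duration of 180 ms, and the gain factors are all hypothetical values chosen for illustration; they are not taken from any particular TTS system.

```python
def flat_contour(syllables, base_pitch_hz=120.0, base_duration_ms=180.0):
    """Assign the same pitch and duration to every syllable (monotone output)."""
    return [(text, base_pitch_hz, base_duration_ms) for text, _stressed in syllables]


def stressed_contour(syllables, base_pitch_hz=120.0, base_duration_ms=180.0,
                     pitch_gain=1.2, duration_gain=1.3):
    """Raise pitch and lengthen duration on stressed syllables.

    The gain factors are illustrative assumptions, not measured values.
    """
    contour = []
    for text, stressed in syllables:
        pitch = base_pitch_hz * pitch_gain if stressed else base_pitch_hz
        duration = base_duration_ms * duration_gain if stressed else base_duration_ms
        contour.append((text, pitch, duration))
    return contour


# "welcome" split into two syllables, with stress on the first (assumed).
syllables = [("wel", True), ("come", False)]
flat = flat_contour(syllables)
stressed = stressed_contour(syllables)
```

In the flat contour every syllable receives an identical pitch value, which corresponds to the monotone output described above, whereas the stress-aware contour raises the pitch and duration of the stressed syllable only.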
Other conventional TTS systems are more sophisticated and can process text input to determine various attributes of the text which can influence the pronunciation of the text. The attributes enable the TTS system to customize the spoken outputs and/or produce more natural and human-like pronunciation of text inputs. The attributes can include, for example, semantic and syntactic information relating to a text input, stress, pitch, gender, speed, and volume parameters that are used for producing a spoken output. Other attributes can include information relating to the syllabic makeup or grammatical structure of a text input or the particular phonemes used to construct the spoken output.
Furthermore, other conventional TTS systems process annotated text inputs wherein the annotations specify pronunciation information used by the TTS system to produce more fluent and human-like speech. By way of example, some TTS systems allow the user to specify “marked-up” text, or text accompanied by a set of controls or parameters to be interpreted by the TTS engine.
FIG. 1 is a diagram that illustrates a conventional system for providing text-to-speech synthesis. The system (10) comprises a user interface (11) that allows a user to manually generate marked-up text that describes the manner in which text is to be synthesized based on, e.g., pronunciation, volume, pitch, and rate attributes.
For example, for a text input such as “Welcome to the IBM text-to-speech system”, a marked-up version of the text can be: “\prosody<rate=fast> Welcome to the \emphasis IBM text-to-speech system”, which instructs the synthesizer to produce fast speech, with emphasis on “IBM.” The marked-up text is processed by a TTS engine (12) that is capable of parsing and processing the marked-up text to generate a synthetic waveform in accordance with the markup specifications, using methods known to those of ordinary skill in the art. The TTS engine (12) can output the synthesized speech to a loudspeaker (13).
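As a rough illustration of the parsing step, the following sketch extracts the controls from the example string above. The tag grammar (a backslash-prefixed name with an optional `<key=value>` argument, and `\emphasis` applying to the next word only) is an assumption inferred from that single example, and `parse_markup` is a hypothetical helper rather than part of any actual TTS engine.

```python
import re

# Assumed grammar: "\name" or "\name<key=value>"; anything else is a word.
TOKEN = re.compile(r"\\(\w+)(?:<(\w+)=(\w+)>)?|(\S+)")


def parse_markup(marked_up):
    """Return (settings, words); words is a list of (word, emphasized) pairs."""
    settings = {}
    words = []
    emphasize_next = False
    for match in TOKEN.finditer(marked_up):
        tag, key, value, word = match.groups()
        if tag == "emphasis":
            # Assumption: emphasis applies only to the word that follows.
            emphasize_next = True
        elif tag is not None:
            if key is not None:
                settings[(tag, key)] = value
        else:
            words.append((word, emphasize_next))
            emphasize_next = False
    return settings, words


example = r"\prosody<rate=fast> Welcome to the \emphasis IBM text-to-speech system"
settings, words = parse_markup(example)
```

For the example string, `settings` records the fast speaking rate and `words` marks only “IBM” as emphasized; a real engine would then map these controls onto synthesis parameters.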
The process of manually generating marked-up text for TTS can be very burdensome. Indeed, in order to achieve a desired effect, the user will typically use trial-and-error to generate the desired marked-up text. Furthermore, although the conventional system (10) of FIG. 1 affords the user a certain degree of freedom for controlling the output speech, it is extremely difficult and tedious to achieve fine control of the pitch or duration using such a method. For example, the user would have to hypothesize a set of pitches and durations for each sound, test the output to see how closely he/she achieved the desired effect, and then iterate the process until the speech generated by the TTS system matched the prosodic characteristics desired by the user.