1. Technical Field
The present invention relates to the field of synthetic speech generation and, more particularly, to generating natural-sounding speech.
2. Description of the Related Art
Synthetic speech generation is used in a multitude of situations, such as interactive voice response (IVR) applications, devices to aid specific handicaps, embedded computing systems, educational systems for automated teaching, children's electronic toys, and the like. In many of these situations, customer acceptance of, and satisfaction with, the generated speech is critical.
For example, IVR applications can be designed for customer convenience and to reduce business operating costs by reducing telephone-related staffing requirements. If customers are dissatisfied with the IVR system, they will opt out of the IVR system to speak with a human agent, become generally disgruntled and factor their dissatisfaction into future purchasing decisions, or simply refuse to use the IVR system at all.
One reason many users dislike using systems that generate synthetic speech is that such speech can sound mechanical or unnatural and can be audibly unpleasant, or even difficult to comprehend. The unnaturalness of synthetic speech results from flawed prosodic characteristics of the speech. Prosodic characteristics include the rhythmic aspects of language or the suprasegmental phonemes of pitch, stress, rhythm, juncture, nasalization, and voicing. Speech segments can include many discernible prosodic features such as audible changes in pitch, loudness, and syllable length.
One manner of generating synthetic speech, concatenative text-to-speech (TTS), joins discrete acoustic units together to form words. The acoustic units used in concatenative TTS are originally extracted from human speech. A variety of factors (such as how large the acoustic units are, how many units are stored, how units are represented, and what algorithms are used to select among units) contribute to the overall quality of generated synthetic speech. Relatively minor flaws and inaccuracies within acoustic units can result in large distortions within synthetic speech generated by concatenative TTS applications.
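By way of illustration only, the joining of acoustic units can be sketched in a few lines of Python. This is a toy example under stated assumptions, not a description of any particular TTS system: the names `crossfade_join`, `synthesize`, and `inventory` are hypothetical, each unit is represented as a plain list of sample values, and a simple linear crossfade stands in for whatever smoothing a real system might apply at unit boundaries.

```python
from typing import Dict, List, Sequence


def crossfade_join(left: List[float], right: List[float], overlap: int) -> List[float]:
    """Join two sample lists, linearly crossfading over `overlap` samples.

    Hypothetical helper: real systems may use other boundary smoothing.
    """
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return left + right
    head = left[:-overlap]   # part of the left unit that is kept unchanged
    tail = right[overlap:]   # part of the right unit that is kept unchanged
    blended = []
    for i in range(overlap):
        # Weight shifts gradually from the left unit toward the right unit.
        w = (i + 1) / (overlap + 1)
        blended.append((1 - w) * left[len(left) - overlap + i] + w * right[i])
    return head + blended + tail


def synthesize(unit_names: Sequence[str],
               inventory: Dict[str, List[float]],
               overlap: int = 2) -> List[float]:
    """Concatenate stored acoustic units in the order given.

    `inventory` is an assumed mapping from unit name to recorded samples.
    """
    output: List[float] = []
    for name in unit_names:
        unit = inventory[name]  # raises KeyError if the unit was never recorded
        output = crossfade_join(output, unit, overlap) if output else list(unit)
    return output
```

Even in this simplified form, any flaw in a stored unit is copied directly into the output, which is consistent with the observation that minor inaccuracies within acoustic units can produce audible distortions.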
The speech samples used for generating acoustic units are derived from humans reading selected scripts. The content of scripts is varied and can include any type of material, such as excerpts from novels, newspapers, or magazines. The scripts can be accentuated heavily or read in a less dramatic, more professional manner. The selection of a linguistically clear and pleasant-sounding speaker, the script utilized, and the manner of reading a script all substantially affect the acoustic units used for concatenative TTS generation. Despite numerous approaches undertaken and considerable research into improving prosodic characteristics of synthetically generated speech, conventional TTS generation still produces unnatural-sounding speech that is generally disfavored by listeners.