1. Field of the Invention
This invention relates to a software tool used to convert text, speech synthesis markup language (SSML), and/or extended SSML to synthesized audio, and particularly to creating, viewing, playing, and editing the synthesized speech including editing pitch and duration targets, speaking type, paralinguistic events, and prosody.
2. Description of Background
Text-to-speech (TTS) systems still sometimes produce poor-quality audio. For customer applications where much of the text to be synthesized is known in advance and high quality is critical, relying solely on text-to-speech is not optimal.
The most common solution to this problem is to prerecord the application's fixed prompts and frequently synthesized phrases. The use of text-to-speech is then typically limited to the synthesis of dynamic text. This results in a high-quality system, but it can be very costly because voice talents and recording studios are needed to create the recordings. It is also impractical because any modification to the prompts depends on the availability of the voice talent and the studio.
Another drawback is that the voice talent used for prerecording prompts is different from the voice used by the text-to-speech system. This can result in an awkward change of voice within a sentence between prerecorded speech and dynamically synthesized speech.
Some systems try to address this problem by enabling customers to interact with the TTS engine to produce an application-specific prompt library. The acoustic editors of some systems enable users to modify the synthesis of a prompt by modifying the target pitch and duration of a phrase. Systems of this type overcome some frequent problems in synthesized speech, but they cannot solve many other types of problems. For example, there is no mechanism for specifying the speaking style, such as apologetic, for manipulating the pitch contour, for adding paralinguistics, or for providing a recording of the prompt from which the system extracts the prosodic parameters.