1. Field of Invention
The present invention generally relates to speech synthesis systems, and more particularly to a speech synthesis system that produces natural-sounding synthesized speech from text and contextual input.
2. Background of Art
There is an increasing need for speech synthesis systems that produce speech resembling realistic human speech. For example, realistic synthetic speech is needed wherever state-of-the-art text-to-speech technologies are applied, such as in automated voice systems, navigation devices, and e-mail readers. These systems are also particularly helpful for the disabled, and can provide a person's only means to communicate verbally or to receive electronic information. While synthetic speech systems currently exist, these systems suffer to varying degrees from unacceptable speech quality, and are insufficient for producing speech for long text passages or for other applications such as computer-based training modules, linguist training materials, or entertainment industry uses, such as cartoons. Accordingly, there is a need for more realistic, natural-sounding synthesized speech.
Current speech synthesis technologies that attempt to make a speaker sound more natural, or to enable the speaker to speak in another language or dialect, are largely limited to applying morphing or transform effects to an existing speech sample. These attempts to modify an already existing sample are ineffective at producing natural-sounding synthesized speech. A need exists in the art for a system that accepts both text and contextual data as input, and that converts the text into synthetic human speech with qualities appropriate to the context, such as the language and dialect of the speaker.
Further, current speech systems often rely on the stored speech inventory of a single speaker to produce speech that is more realistic and resembles that speaker. These systems are constrained by the speaker's limited speech inventory, and are insufficient for reproducing natural-sounding speech when the speech inventory does not contain all the phonetic elements necessary to synthesize a given text. Moreover, even when a speech inventory does contain the necessary phonetic elements, the features of a given phonetic element, such as its frequency, bandwidth, or amplitude, often do not match the features of the phonetic element that follows it, resulting in poor-quality synthesized speech. A need therefore exists in the art to expand a speaker's speech inventory so that the system has the resources to synthesize speech from any given text in a realistic, natural-sounding way.
For the foregoing reasons, there is a need for more realistic synthetic speech systems.