The following disclosure generally relates to information systems.
In general, conventional text-to-speech application programs produce audible speech from written text. The text can be displayed, for example, in an application program executing on a personal computer or other device. For example, a blind or sight-impaired user of a personal computer can have text from a web page read aloud by the personal computer. Other text-to-speech applications include those that read from a textual database and provide corresponding audio to a user by way of a communication device, such as a telephone, cellular telephone, portable music player, in-vehicle navigation system, or the like.
Speech from conventional text-to-speech applications typically sounds artificial or machine-like when compared to human speech. One reason for this result is that current text-to-speech applications often synthesize momentary pauses in speech with silence. The location and length of pauses are typically determined by parsing the written text and the punctuation in the text, such as commas, periods, and paragraph delimiters. However, using empty silence to synthesize pauses, as conventional synthesis applications do, can lead listeners to feel a sense of breathlessness, particularly after lengthy exposure to the results of such synthesis. In human-produced speech, pauses can actually consist of breath intakes, mouth clicks, and other non-speech sounds. These non-speech sounds provide subtle clues about the sounds and words that are about to follow. These clues are missing when pauses are synthesized as silence, thus requiring more listener effort to comprehend the synthesized speech.
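By way of illustration only, the conventional approach described above, in which pause locations and lengths are derived from punctuation and each pause is rendered as silence, might be sketched as follows. The function name plan_pauses, the segment labels, and the pause durations are hypothetical and are chosen solely for this example; they are not taken from any particular text-to-speech application.

```python
import re

# Hypothetical pause lengths in milliseconds (illustrative values only).
PAUSE_MS = {",": 200, ".": 500, "\n\n": 900}


def plan_pauses(text):
    """Split text into ("speech", words) and ("silence", ms) segments.

    Commas, periods, and blank-line paragraph delimiters each become a
    silent pause; everything between them is treated as speech.
    """
    segments = []
    # The capturing group keeps the delimiters in the split result.
    for piece in re.split(r"(\n\n|[,.])", text):
        if piece in PAUSE_MS:
            segments.append(("silence", PAUSE_MS[piece]))
        elif piece.strip():
            segments.append(("speech", piece.strip()))
    return segments


if __name__ == "__main__":
    for segment in plan_pauses("Hello, world.\n\nNext paragraph."):
        print(segment)
```

In such a sketch every pause is an interval of empty silence, so none of the breath intakes or other non-speech sounds found in human speech are present in the output.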
Some text-to-speech applications produce speech that can include emotive vocal gestures such as laughing, sobbing, crying, scoffing, and grunting. However, in general, such gestures do not improve comprehension of the resultant speech. Moreover, these techniques rely on explicitly annotated input text to determine where to include the vocal gestures in the speech. Such annotated text may, for example, appear as follows: “What? <laugh1> You mean to tell me this is an improvement? <laugh4>.” The text ‘<laugh1>’ is an example of a specific textual command that directs the synthesis to produce a specific associated sound (e.g., a mocking laugh).
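By way of illustration only, the handling of such explicitly annotated text might be sketched as follows. The tag syntax is assumed from the single example above (a lowercase name followed by a number, enclosed in angle brackets); the function name parse_annotated and the segment labels are hypothetical.

```python
import re

# Assumed tag form, based on the '<laugh1>' example: <name><number>.
TAG = re.compile(r"<([a-z]+)(\d+)>")


def parse_annotated(text):
    """Split annotated text into speech and vocal-gesture segments.

    Returns ("speech", words) tuples for ordinary text and
    ("gesture", name, variant) tuples for each explicit tag.
    """
    segments = []
    pos = 0
    for match in TAG.finditer(text):
        spoken = text[pos:match.start()].strip()
        if spoken:
            segments.append(("speech", spoken))
        segments.append(("gesture", match.group(1), int(match.group(2))))
        pos = match.end()
    # Any text remaining after the last tag is ordinary speech.
    tail = text[pos:].strip()
    if tail:
        segments.append(("speech", tail))
    return segments


if __name__ == "__main__":
    example = "What? <laugh1> You mean to tell me this is an improvement? <laugh4>"
    for segment in parse_annotated(example):
        print(segment)
```

As the sketch makes apparent, a gesture is produced only where the input text carries an explicit tag; text without annotations yields speech segments alone.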