Humans learn information in a variety of ways. Two of the most common are reading text and listening to speech. In many situations, it is desirable to convert information that is stored as written text into audible speech. For example, a parent may read a bedtime story to a child. In certain applications, however, it is not practical to employ a live human to read a text out loud every time someone wants to hear the information it contains. One approach for handling such situations is to record a human reading the text out loud, and then play back the recording whenever someone wants to hear that information. This approach is used, for example, to create audio recordings of books.
Unfortunately, even creating recordings of texts is not practical for many applications. For example, a news company may desire to have all of its news stories available as audible speech as well as written text. However, the volume of news stories may make it impractical to have someone read and record all of them. The cost of recording full-text readings becomes impractically high in many modern applications, such as services that present, as audible speech, information drawn from thousands or millions of electronic sources of textual information (for example, web pages on the World Wide Web).
For applications where full-text readings are impractical, it is possible to store partial-text readings and then combine the partial-text readings during playback. For example, a human can record a reading of every word in a dictionary, and the single-word recordings can then be played back in the sequence in which the words appear in a text. However, this only works when the reader can anticipate every word or phrase in the text. As a practical matter, it is impossible to pre-record all possible words and phrases without knowing the exact content of the texts involved. Thus, the partial-text reading technique works well when the content of all texts involved is known ahead of time, but does not work when it is not.
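The partial-text reading technique and its limitation can be illustrated with a minimal sketch. The clip library, the file names, and the function plan_playback below are hypothetical and chosen only for illustration; they are not taken from any particular system. The sketch maps each word of an input text to a pre-recorded clip and reports any word for which no recording exists.

```python
import re

# Hypothetical library of single-word recordings: word -> clip file name.
RECORDED_CLIPS = {
    "the": "the.wav",
    "cat": "cat.wav",
    "sat": "sat.wav",
}


def plan_playback(text, clips):
    """Split text into words and look up a pre-recorded clip for each.

    Returns a pair (playlist, missing): the clips to play, in text order,
    and the words that have no recording (unanticipated words).
    """
    # Lowercase the text and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    playlist = []
    missing = []
    for word in words:
        clip = clips.get(word)
        if clip is None:
            missing.append(word)   # no recording was made for this word
        else:
            playlist.append(clip)
    return playlist, missing


if __name__ == "__main__":
    print(plan_playback("The cat sat.", RECORDED_CLIPS))
    print(plan_playback("The cat purred.", RECORDED_CLIPS))
```

In this sketch, the first text is fully covered by the recorded clips, while the second contains a word ("purred") that was never recorded, so a complete reading cannot be assembled from the stored recordings alone.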
When the exact content of texts is not known ahead of time, the text is said to contain “unanticipated content”. One approach to providing text-to-speech service for texts that may contain unanticipated content involves the use of a “synthesized voice”. A synthesized voice is produced by programming a device (not an actual human) to pronounce words contained within an input text based on a complex set of pronunciation rules. Unfortunately, even the most sophisticated voice synthesis techniques produce “readings” of notoriously poor quality that many listeners find unacceptable.
Based on the foregoing, it is clearly desirable to provide improved text-to-speech techniques. In particular, it is desirable to provide improved text-to-speech techniques for situations in which the input texts may contain unanticipated content.