Text-to-speech conversion technologies take text phrases as input and generate audio phrases, that is, audio data encoding speech representative of the text phrases, which can then be “read aloud” via an audio interface of an electronic device.
Conventionally, the text phrases to be read aloud in this manner are converted into audio phrases on a word-for-word basis, such that each text word in a text phrase is converted into an audio word, and the audio words are assembled in the same order as their corresponding text words appear in the text phrase. The composition of the audio phrase and the composition of the text phrase therefore match word for word. For example, the text phrase “There are 2000 jelly beans in the jar” may be converted into an audio phrase which would be pronounced “There are two thousand jelly beans in the jar.”
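The conventional word-for-word approach described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and produces a pronounced-text string rather than real audio data; the names `number_to_words` and `word_for_word` are hypothetical helpers introduced only for illustration and do not belong to any particular text-to-speech system.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..9999 (hypothetical helper)."""
    if not 0 <= n <= 9999:
        raise ValueError("only 0..9999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return ONES[thousands] + " thousand" + (" " + number_to_words(rest) if rest else "")


def word_for_word(text_phrase: str) -> str:
    """Convert each text word independently, preserving the original order."""
    spoken = []
    for word in text_phrase.split():
        # Assumption: only bare numerals of up to four digits are expanded;
        # every other word is passed through unchanged.
        if re.fullmatch(r"[0-9]{1,4}", word):
            spoken.append(number_to_words(int(word)))
        else:
            spoken.append(word)
    return " ".join(spoken)


print(word_for_word("There are 2000 jelly beans in the jar"))
```

Because each word is converted without reference to its neighbors, the output has exactly one spoken item per input word, in the same order as the text phrase.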
In some cases, however, the context of a text phrase is such that an audio phrase generated on a word-for-word basis sounds unnatural when read aloud. As a simple example, the text phrase “The Tate Modern opened in 2000” may be converted into an audio phrase which would be pronounced “The Tate Modern opened in two thousand,” whereas an English speaker would more naturally say “The Tate Modern opened in the year two thousand.” Conventional word-for-word conversion can therefore produce unnatural-sounding audio phrases, and it may be desirable to modify such text phrases before text-to-speech conversion.
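As a purely illustrative sketch of modifying a text phrase before conversion, the snippet below applies one hand-written rule: a four-digit number that follows the word “in” at the end of a phrase is assumed to be a year. The rule, the pattern `YEAR_AT_END`, and the function `add_year_context` are assumptions made for this example only; they are not the method of this disclosure, and the heuristic would misfire on phrases such as “the total came in at 2000”.

```python
import re

# Assumption: "in" followed by a four-digit number at the end of the phrase
# (optionally followed by closing punctuation) denotes a calendar year.
YEAR_AT_END = re.compile(r"\bin ([0-9]{4})(?=[.,;:!?]?\s*$)")


def add_year_context(text_phrase: str) -> str:
    """Insert "the year" before a trailing year-like number (hypothetical rule)."""
    return YEAR_AT_END.sub(r"in the year \1", text_phrase)


print(add_year_context("The Tate Modern opened in 2000"))
print(add_year_context("There are 2000 jelly beans in the jar"))
```

The first phrase is rewritten so that a subsequent word-for-word conversion would yield the more natural wording, while the second phrase is left unchanged because its number does not follow “in” at the end of the phrase.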
There is therefore a need for improved methods for text processing.