1. Field of Invention
The techniques described herein are directed generally to the field of speech synthesis, and more particularly to techniques for performing prosody prediction in speech synthesis.
2. Description of the Related Art
Speech synthesis is the process of making machines, such as computers, “talk.” Speech synthesizers generally begin with an input text of a sentence or other utterance to be spoken, and convert the input text to an audio representation that can be played, for example, over a loudspeaker to a human listener. Various techniques exist for synthesizing speech from text, including formant synthesis, articulatory synthesis, hidden Markov model (HMM) synthesis, concatenative text-to-speech synthesis, and multiform synthesis.
Each of these types of speech synthesis attempts to predict the sequence of sound segments that will best render the input text as speech. Segments are discrete phonetic or phonological units, such as phonemes, that combine in a distinct temporal order to form a speech utterance encoding some lexical meaning. Often, segments are the aspects of speech that are encoded as alphabetic characters when speech is transcribed into writing. For example, for the input text “See Jack run,” a synthesis system would predict the phoneme sequence /s-ee-j-a-k-r-uh-n/. The synthesis system can then produce each of the sound segments in sequence (e.g., /s/ followed by /ee/, followed by /j/, etc.) to produce an audio utterance of the input text.
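By way of illustration only, the mapping from input text to a segment sequence in the "See Jack run" example can be sketched as a simple dictionary lookup. The sketch below is not the method of any particular synthesis system: the lexicon `TOY_LEXICON` and the functions `text_to_segments` and `format_segments` are hypothetical names introduced here, and the phoneme symbols simply reuse the informal notation of the example rather than a standard phoneme set.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration): each word maps
# to its segments, written in the informal notation of the example.
TOY_LEXICON = {
    "see": ["s", "ee"],
    "jack": ["j", "a", "k"],
    "run": ["r", "uh", "n"],
}


def text_to_segments(text):
    """Return the flat, temporally ordered segment list for the input text."""
    # Lowercase the text and keep only alphabetic word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    segments = []
    for word in words:
        if word not in TOY_LEXICON:
            raise KeyError(f"no pronunciation for {word!r} in the toy lexicon")
        # Append this word's segments in order.
        segments.extend(TOY_LEXICON[word])
    return segments


def format_segments(segments):
    """Join segments with hyphens between slashes, as in the example."""
    return "/" + "-".join(segments) + "/"


if __name__ == "__main__":
    print(format_segments(text_to_segments("See Jack run")))
    # prints /s-ee-j-a-k-r-uh-n/
```

A practical system would typically rely on a much larger pronunciation lexicon together with some fallback for words the lexicon does not contain. The lookup here captures only the segment sequence and carries no information about how the segments are to be spoken.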