Concatenative speech synthesis is a form of speech synthesis that generates speech from written text by concatenating acoustic units that correspond to speech waveforms. An unsolved problem in this area is the optimal selection and concatenation of the acoustic units to achieve fluent, intelligible, and natural-sounding speech.
In many conventional speech synthesis systems, the acoustic unit is a phonetic unit of speech, such as a diphone, phoneme, or phrase. A template or instance of a speech waveform is associated with each acoustic unit to represent the phonetic unit of speech. Merely concatenating a string of instances to synthesize speech often results in unnatural or "robotic-sounding" speech because of spectral discontinuities at the boundaries between adjacent instances. For the most natural-sounding speech, the concatenated instances must be generated with timing, intensity, and intonation characteristics (i.e., prosody) that are appropriate for the intended text.
Two common techniques are used in conventional systems to generate natural-sounding speech from the concatenation of instances of acoustic units: the use of smoothing techniques and the use of longer acoustic units. Smoothing attempts to eliminate the spectral mismatch between adjacent instances by adjusting the instances so that they match at their shared boundaries. The adjusted instances produce smoother-sounding speech, but the speech is typically unnatural because of the manipulations made to the instances to achieve the smoothing.
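The boundary adjustment described above can be illustrated with a minimal sketch. It is an illustration only, not the method of any particular conventional system: it substitutes a simple time-domain linear crossfade for true spectral smoothing, and the function name `crossfade_concat` and the `overlap` parameter are hypothetical.

```python
def crossfade_concat(left, right, overlap):
    """Join two sample sequences, linearly blending `overlap` samples at the boundary.

    Assumption: a time-domain crossfade stands in for spectral smoothing.
    """
    if overlap < 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must be between 0 and the shorter unit length")
    if overlap == 0:
        # Plain concatenation: any mismatch at the boundary is left as-is.
        return list(left) + list(right)

    start = len(left) - overlap
    head = list(left[:start])      # untouched part of the left instance
    tail = list(right[overlap:])   # untouched part of the right instance

    blended = []
    for i in range(overlap):
        # Weight of the right instance rises across the overlap region.
        w = (i + 1) / (overlap + 1)
        blended.append((1 - w) * left[start + i] + w * right[i])
    return head + blended + tail


if __name__ == "__main__":
    # Two toy "instances" with an abrupt level mismatch at the boundary.
    print(crossfade_concat([1.0] * 4, [0.0] * 4, 3))
```

Note that the blended samples no longer belong to either recorded instance, which is the kind of manipulation that can make smoothed speech sound unnatural.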
Choosing a longer acoustic unit usually entails employing diphones, since they capture the coarticulatory effects between phonemes. Coarticulatory effects are the effects on a given phoneme of the phoneme that precedes it and the phoneme that follows it. Using still longer units, with three or more phonemes per unit, reduces the number of boundaries that occur and captures the coarticulatory effects over a longer span. Longer units yield higher-quality speech, but at the expense of a significant amount of memory. In addition, using longer units with unrestricted input text can be problematic because the models are not guaranteed to cover every unit that the text requires.
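The memory and coverage trade-off can be made concrete with a small sketch. The figures are illustrative assumptions rather than values from any system: a hypothetical inventory of 40 phonemes is assumed, the count is only an upper bound (not every phoneme sequence occurs in a language), and the helper names `max_unit_inventory` and `missing_units` are invented for this example.

```python
def max_unit_inventory(num_phonemes, phonemes_per_unit):
    """Upper bound on distinct units spanning `phonemes_per_unit` phonemes."""
    return num_phonemes ** phonemes_per_unit


def missing_units(phonemes, inventory, k):
    """Return the k-phoneme sequences needed by `phonemes` that `inventory` lacks."""
    needed = [tuple(phonemes[i:i + k]) for i in range(len(phonemes) - k + 1)]
    return [unit for unit in needed if unit not in inventory]


if __name__ == "__main__":
    # Assumed phoneme count of 40; the bound grows rapidly with unit length.
    for k in (1, 2, 3):
        print(k, max_unit_inventory(40, k))

    # Toy coverage check: one of the two required 3-phoneme units is absent.
    stored = {("h", "e", "l")}
    print(missing_units(["h", "e", "l", "o"], stored, 3))
```

Under these assumptions the bound rises from 40 to 1,600 to 64,000 as the unit length goes from one to three phonemes, which suggests why longer units are costly to store and why unrestricted text may call for units that were never recorded.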