In recent years, text-to-speech (TTS) conversion has been extensively researched, and text-to-speech technology has appeared in a number of commercial applications. Recent progress in unit-selection speech synthesis and Hidden Markov Model (HMM) speech synthesis has led to considerably more natural-sounding synthetic speech, making such speech suitable for many types of applications.
Some contemporary text-to-speech systems adopt corpus-driven approaches, where a corpus is a representative body of utterances such as words or sentences, because such approaches can generate relatively natural speech. In general, these systems access a large database of segmental samples and retrieve from it the unit sequence with the minimum distortion cost, which is then used to generate the speech output.
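By way of illustration only, retrieving the unit sequence with minimum distortion cost can be sketched as a dynamic-programming search over the candidate units at each position. The sketch below rests on assumptions that the text above does not state: that the distortion cost splits into a per-unit target cost plus a join cost between adjacent units, and that each unit can be stood in for by a single number (for example, a pitch value). The names `select_units`, `target_cost`, and `join_cost` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def select_units(
    candidates: Sequence[Sequence[float]],
    targets: Sequence[float],
    target_cost: Callable[[float, float], float],
    join_cost: Callable[[float, float], float],
) -> Tuple[List[int], float]:
    """Return (chosen candidate index per position, total cost).

    Assumption: total distortion = sum of target costs + sum of join costs.
    """
    if not candidates or len(candidates) != len(targets):
        raise ValueError("need one non-empty candidate list per target")
    if any(len(units) == 0 for units in candidates):
        raise ValueError("every position needs at least one candidate")

    # best[j]: minimum cost of any path ending at candidate j of the
    # current position.
    best = [target_cost(targets[0], u) for u in candidates[0]]
    # back[t - 1][j]: best predecessor index for candidate j at position t.
    back: List[List[int]] = []

    for t in range(1, len(candidates)):
        prev_units = candidates[t - 1]
        new_best: List[float] = []
        pointers: List[int] = []
        for u in candidates[t]:
            costs = [best[i] + join_cost(prev_units[i], u)
                     for i in range(len(prev_units))]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[i_min] + target_cost(targets[t], u))
            pointers.append(i_min)
        best = new_best
        back.append(pointers)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path, total


# Toy data (made up for illustration): three positions, two candidates each.
toy_targets = [100.0, 110.0, 120.0]
toy_candidates = [[90.0, 101.0], [130.0, 108.0], [119.0, 150.0]]
toy_path, toy_total = select_units(
    toy_candidates,
    toy_targets,
    target_cost=lambda want, unit: abs(want - unit),
    join_cost=lambda left, right: 0.5 * abs(left - right),
)
print(toy_path, toy_total)  # [1, 1, 0] 13.0
```

In a real system the candidates would be segmental samples drawn from the database and the cost functions would compare acoustic and prosodic features; the scalar values here only keep the sketch short.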
However, although such a sample-based approach generally synthesizes speech with high intelligibility and naturalness, instability problems caused by critical errors and/or glitches occasionally occur and ruin the perception of the whole utterance. This is one factor that prevents text-to-speech from being widely accepted in applications such as commercial services.