Speech is the natural form of human communication, and it can enhance human-machine communication. A text-to-speech (TTS) system is one human-machine interface that uses speech. TTS systems, which can be implemented in software or hardware, convert normal language text into speech. They are used in many applications, such as car navigation systems, information retrieval over the telephone, voice mail, and speech-to-speech translation systems, with the goal of synthesizing speech with natural human voice characteristics.
Synthesized speech can be created by concatenating pieces of recorded speech from a data store, or it can be generated by a synthesizer that incorporates a model of the vocal tract and other human voice characteristics to create a completely synthetic voice output. In hidden Markov model (HMM) based synthesis, the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are then generated from the HMMs themselves based on the maximum likelihood criterion.
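As a rough illustration of generating parameters from HMMs under the maximum likelihood criterion, the sketch below makes several simplifying assumptions that are not stated in the text: each state has a single Gaussian output distribution, state durations are already fixed, and no dynamic (delta) features are used. Under those assumptions the most likely parameter trajectory is simply each state's mean vector repeated for that state's duration. The names `State` and `generate_trajectory` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class State:
    """One HMM state (hypothetical, simplified representation)."""
    mean: List[float]  # mean of the Gaussian over the acoustic parameter vector
    duration: int      # number of frames assigned to this state


def generate_trajectory(states: List[State]) -> List[List[float]]:
    """Return one parameter vector per frame.

    With fixed durations and no dynamic features, the frame-wise
    likelihood is maximized by emitting each state's mean.
    """
    frames: List[List[float]] = []
    for state in states:
        for _ in range(state.duration):
            frames.append(list(state.mean))  # copy so frames stay independent
    return frames


if __name__ == "__main__":
    toy_states = [State([1.0, 2.0], 2), State([3.0, 4.0], 1)]
    print(generate_trajectory(toy_states))
    # prints [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]]
```

Practical systems typically add dynamic features so that the generated trajectory is smooth rather than piecewise constant; that step is omitted here.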
The increasingly popular HMM-based text-to-speech systems (HTSs) generate a series of acoustic parameters, such as Line Spectral Frequency (LSF) parameters, and synthesize waveforms based on these parameters. The acoustic parameters typically must satisfy certain constraints, but those constraints may be violated when the parameters are generated from the HMMs, which results in artifacts such as noise in the generated speech.
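The text does not say which constraints are meant. As one plausible example, assuming the parameters are line spectral frequencies expressed in radians, a well-formed frame is strictly increasing and lies in the open interval (0, pi). The sketch below, with the hypothetical helper `lsf_violations`, reports the positions in a generated frame where that ordering property is broken.

```python
import math
from typing import List, Sequence


def lsf_violations(frame: Sequence[float]) -> List[int]:
    """Return indices where 0 < w[0] < w[1] < ... < w[-1] < pi fails.

    Assumption: frequencies are in radians. Each value is compared
    with its immediate predecessor (or 0 for the first value).
    """
    bad: List[int] = []
    previous = 0.0
    for i, w in enumerate(frame):
        if not (previous < w < math.pi):
            bad.append(i)
        previous = w
    return bad


if __name__ == "__main__":
    print(lsf_violations([0.3, 0.9, 1.4]))       # prints []
    print(lsf_violations([0.3, 0.9, 0.8, 1.5]))  # prints [2]
```

A frame with a non-empty result is one where the generated values cross or leave the valid range, which is the kind of violation that can surface as audible artifacts.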