A text-to-speech system (TTS) is one of the human-machine interfaces using speech. TTSs, which can be implemented in software or hardware, convert normal language text into speech. TTSs are implemented in many applications such as car navigation systems, information retrieval over the telephone, voice mail, speech-to-speech translation systems, and comparable ones with a goal of synthesizing speech with natural human voice characteristics. Modern text to speech systems provide users access to multitude of services integrated in interactive voice response systems. Telephone customer service is one of the examples of rapidly proliferating text to speech functionality in interactive voice response systems.
Unit selection synthesis is one approach to speech synthesis, which uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some individual phonemes, diphones, half-phones, syllables, morphemes, words, phrases, and/or sentences. An index of the units in the speech database may then be created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phonemes. At runtime, the desired target utterance may be created by determining the best chain of candidate units from the database (unit selection).
In unit selection speech synthesis, concatenation cost is used to decide whether two speech segments can be concatenated without noise. However, computation of concatenation cost for complex speech patterns or high quality synthesis may be overly burdensome for real time calculations requiring extensive computation resources. One way to address this challenge is pre-saving concatenation cost data for each pair of possibly concatenated speech segments to avoid real time calculation. Still, this approach introduces large memory requirements possibly in the terabytes.