Converting text into voice output using speech synthesis techniques is nothing new. A variety of TTS systems are available today, and they are becoming increasingly natural and intelligent. However, conventional TTS systems based on formant synthesis and articulatory synthesis are not mature enough to produce synthetic speech of the same quality as one would obtain from a concatenative database approach.
For instance, rule-based synthesizers, in the form of formant synthesizers, model speech in terms of formant and anti-formant frequencies and bandwidths. Such rule-based synthesizers produce errors because formant frequencies and bandwidths are difficult to estimate from speech data. Rule-based synthesizers are useful for handling the articulatory aspects of changes in speaking style. In a rule-based system, the acoustic parameter values for the utterance are generated entirely by algorithmic means. A set of rules sensitive to the linguistic structure generates a collection of values, such as frequencies and bandwidths, that capture the perceptually important cues for reproducing the spoken utterance. A set of procedures modifies these cues in accordance with the values specified for a number of parameters to produce the desired voice quality. A synthesizer then generates the final speech waveform from the parameter values. Rule-based approaches require extensive knowledge and understanding of the sound patterns of speech. Rule-based synthesizers remain a long way from sounding natural in comparison to concatenative synthesizers, and their output is therefore less realistic.
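The rule-based pipeline described above (rules produce formant frequencies and bandwidths, and a synthesizer turns them into a waveform) can be illustrated with a minimal sketch. It assumes a cascade of second-order digital resonators of the kind commonly used in formant synthesizers, driven by a simple impulse-train source. The rule table `VOWEL_RULES`, its numeric values, and all function names are hypothetical and chosen only for illustration; they are not taken from any particular system.

```python
import math

# Hypothetical rule table: (formant frequency in Hz, bandwidth in Hz) per vowel.
# The numbers are illustrative assumptions, not values from the text.
VOWEL_RULES = {
    "a": [(730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)],
    "i": [(270.0, 60.0), (2290.0, 100.0), (3010.0, 200.0)],
}


def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Coefficients of a second-order resonator y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    return a, b, c


def resonate(signal, freq_hz, bw_hz, sample_rate):
    """Filter a list of samples through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Crude voiced source: one unit impulse per pitch period."""
    n = int(round(duration_s * sample_rate))
    period = max(1, int(round(sample_rate / f0_hz)))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(vowel, f0_hz=120.0, duration_s=0.2, sample_rate=16000):
    """Look up the rule-generated parameters and run the source through the cascade."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bw_hz in VOWEL_RULES[vowel]:
        signal = resonate(signal, freq_hz, bw_hz, sample_rate)
    return signal


if __name__ == "__main__":
    samples = synthesize_vowel("a")
    print(len(samples), "samples synthesized")
```

Because every value in the table must be estimated or hand-tuned, the sketch also hints at why such systems depend so heavily on expert knowledge of speech sound patterns.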
To achieve better speech quality, TTS systems using a concatenative speech database are currently very popular and widely used. Although a TTS system based on a concatenative database provides better speech quality than the conventional systems mentioned above, minimizing the database size without compromising speech quality is a major obstacle such systems face today. For instance, a TTS system based on a concatenative database approach employs, among other things, a diphone database to completely map the range of human speech production, which results in a very large effective database size (perhaps up to 6 MB). Thus, implementing a TTS system using a concatenative database in devices with limited memory, such as handheld devices, or in devices that rely upon Internet download of customizable speech databases (e.g., for character voices), is particularly difficult due to the large size of the speech database. Most conventional compression of speech databases in TTS systems is limited to mu-law and A-law compression, which are essentially forms of non-linear quantization. These methods produce only minimal compression.
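To make the "non-linear quantization" point concrete, the following is a minimal sketch of mu-law companding using the continuous mu-law formula with mu = 255 and an 8-bit code. It is not the segmented bit layout of any telephony codec, and the function names are hypothetical. Assuming the source samples are stored as 16-bit linear values, mapping each sample to 8 bits yields only about a 2:1 reduction, which is consistent with the minimal compression noted above.

```python
import math

MU = 255.0  # standard mu value for 8-bit mu-law companding


def mulaw_encode(x):
    """Map a sample in [-1.0, 1.0] to an 8-bit code (0..255) on a logarithmic scale."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1.0) / 2.0 * 255.0))


def mulaw_decode(code):
    """Invert mulaw_encode: map an 8-bit code back to an approximate sample value."""
    y = code / 255.0 * 2.0 - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


if __name__ == "__main__":
    for sample in (-1.0, -0.5, -0.01, 0.01, 0.5, 1.0):
        code = mulaw_encode(sample)
        print(sample, code, mulaw_decode(code))
```

The quantization steps are fine near zero and coarse near full scale, so small-amplitude detail survives the reduction to 8 bits better than it would with uniform quantization.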