Much work has been done on electronic speech synthesis, and it has generally resulted in systems that require huge amounts of memory, have only a small vocabulary, or both. It is preferable, on the other hand, to have an unlimited vocabulary and to generate good, intelligible speech with the smallest possible amount of hardware.
Artificial speech systems fall into two categories. The first is a vocal tract system, which attempts to emulate the human vocal tract by using variable filters to generate the basic sounds of speech. The second is a waveform system, which records speech samples and pieces them together as needed to reconstruct a given utterance. In either case, some type of input must be interpreted as words. Because of the wide variety of word sounds, it has been proposed to amass large electronic libraries of all words, or of the word portions that serve as building blocks of speech, and to draw upon them to construct an output. Such an approach is expensive in terms of the amount of hardware required for storage. Much has been written about data compression to reduce storage, but storing the required amount of compressed data for an unlimited vocabulary is still a formidable problem.
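By way of illustration only, the vocal tract approach can be sketched as a periodic excitation passed through a resonant filter. The sketch below is an assumption-laden example and not any particular prior system: the function names, the 8000 Hz sample rate, the 100 Hz pitch, and the single fixed 700 Hz resonance are all hypothetical, and a practical system would vary several such filters over time.

```python
import math

SAMPLE_RATE = 8000  # assumed sample rate in Hz (illustrative)

def impulse_train(pitch_hz, n_samples, sample_rate=SAMPLE_RATE):
    """Crude voiced excitation: one unit impulse per pitch period."""
    period = round(sample_rate / pitch_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Two-pole resonant filter: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius
    theta = 2.0 * math.pi * center_hz / sample_rate      # pole angle
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def magnitude_at(signal, freq_hz, sample_rate=SAMPLE_RATE):
    """Magnitude of a single discrete Fourier component of the signal."""
    re = sum(s * math.cos(2.0 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(signal))
    im = sum(s * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(signal))
    return math.hypot(re, im)

# Hypothetical parameters: 100 Hz pitch, one resonance at 700 Hz.
excitation = impulse_train(100, 800)
voiced = resonator(excitation, center_hz=700, bandwidth_hz=100)

near_resonance = magnitude_at(voiced, 700)
far_from_resonance = magnitude_at(voiced, 2100)
print(near_resonance > far_from_resonance)
```

The filter emphasizes the excitation harmonics near its center frequency, which is the sense in which a variable filter shapes a basic speech sound. Only the filter parameters need to be stored, in contrast to a waveform system, which stores the samples themselves.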
Speech can be divided into sound building blocks called phonemes. A word may be constructed of a few phonemes smoothly connected. Many phonemes, chiefly vowel sounds, are dependent on context and thus have a number of variations called allophones. Speech based on basic phonemes without these variations sounds strange due to poor articulation and is sometimes unintelligible. Proposals to store all possible allophones again result in major storage problems. Other schemes that use diphones to resolve contextual variations result in very large numbers of diphones and require even greater storage.
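The growth in storage can be illustrated with rough arithmetic. Every figure in the sketch below is a hypothetical assumption chosen only to show the trend: an inventory of 40 phonemes, 5 allophones per phoneme, one diphone for every ordered phoneme pair, a 0.1 second average unit duration, and uncompressed 8-bit samples at 8000 Hz.

```python
# All constants are illustrative assumptions, not figures from any real system.
N_PHONEMES = 40
ALLOPHONES_PER_PHONEME = 5
AVG_UNIT_DURATION_S = 0.1
SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 1

def inventory_bytes(unit_count):
    """Uncompressed storage needed for unit_count recorded units."""
    samples_per_unit = round(AVG_UNIT_DURATION_S * SAMPLE_RATE_HZ)
    return unit_count * samples_per_unit * BYTES_PER_SAMPLE

phoneme_units = N_PHONEMES
allophone_units = N_PHONEMES * ALLOPHONES_PER_PHONEME
diphone_units = N_PHONEMES ** 2  # upper bound: every ordered pair of phonemes

for name, count in [("phonemes", phoneme_units),
                    ("allophones", allophone_units),
                    ("diphones", diphone_units)]:
    print(name, count, "units,", inventory_bytes(count), "bytes")
```

Under these assumptions the allophone inventory grows linearly with the number of variations per phoneme, while the diphone inventory grows with the square of the phoneme count, which is why diphone schemes demand the most storage of the three.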
Whenever digitized waveforms are used, the prior processes for concatenating them generally result in highly objectionable discontinuities. While there have been attempts at smooth amplitude blending, they have been largely ineffective, at least for some transitions such as fricative boundaries.
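The discontinuity problem, and the kind of amplitude blending that has been attempted, can be sketched as follows. This is a minimal illustration under stated assumptions and not a reconstruction of any specific prior process: the function names, the linear fade weights, the 40-sample overlap, and the two 200 Hz test tones are all hypothetical.

```python
import math

def sine(freq_hz, n_samples, sample_rate=8000, phase=0.0):
    """Test tone standing in for a stored waveform segment."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate + phase)
            for i in range(n_samples)]

def concat_abrupt(a, b):
    """Butt-join two segments with no smoothing."""
    return a + b

def concat_crossfade(a, b, overlap):
    """Join two segments, linearly blending the last `overlap` samples
    of `a` with the first `overlap` samples of `b`."""
    if overlap <= 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must be between 1 and the shorter segment length")
    start = len(a) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight rises from near 0 to near 1
        blended.append((1.0 - w) * a[start + i] + w * b[i])
    return a[:start] + blended + b[overlap:]

def max_jump(signal):
    """Largest sample-to-sample step, a crude measure of discontinuity."""
    return max(abs(signal[i + 1] - signal[i]) for i in range(len(signal) - 1))

# Two segments whose phases do not line up at the joint (assumed test data).
seg_a = sine(200, 80)
seg_b = sine(200, 80, phase=math.pi / 2)

abrupt = concat_abrupt(seg_a, seg_b)
faded = concat_crossfade(seg_a, seg_b, overlap=40)

print(max_jump(abrupt), max_jump(faded))
```

For these simple tones the cross-fade removes the large step at the joint. Such blending acts only on amplitude, however, so it does not by itself reconcile segments whose spectral character differs sharply, which is consistent with its poor results at transitions such as fricative boundaries.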