Text-to-speech conversion has been the object of considerable study for many years. A number of devices of this type have been created and have enjoyed commercial success in limited applications. Basically, the limiting factors in the usefulness of prior art devices were the cost of the hardware, the extent of the vocabulary, the quality of the speech, and the ability of the device to operate in real time. With the advent and widespread use of microcomputers in both the personal and business markets, a need has arisen for a system of text-to-speech conversion which can produce highly natural-sounding speech from any text material, and which can do so in real time and at very small cost.
In recent times, the efforts of synthesizer designers have been directed mostly to improving frequency domain synthesizing methods, i.e. methods which are based upon analyzing the frequency spectrum of speech sound and deriving parameters for driving resonance filters. Although this approach is capable of producing good quality speech, particularly in limited-vocabulary applications, it has the drawback of requiring a substantial amount of hardware of a type not ordinarily included in the current generation of microcomputers.
An earlier approach was a time domain technique in which specific sounds or segments of sounds (stored in digital or analog form) were produced one after the other to form audible words. Prior art time domain techniques, however, had serious disadvantages: (1) they had too large a memory requirement; (2) they produced unnaturally rapid and discontinuous transitions from one phoneme to another; and (3) their pitch levels were inflexible. Consequently, prior art time domain techniques were impractical for high-quality, low-cost real-time applications.