The present invention relates generally to text-to-speech synthesis and more particularly to a system for synthesizing animated human quality speech having unlimited vocabulary from prerecorded utterances of basic speech segments.
It is well known in the prior art to provide synthetic speech from a machine. Early attempts to imitate human speech invariably took the form of mechanical devices. Modern efforts have invariably been electrical. Good synthetic speech from machines has been possible for at least the last twenty years, but only with the use of complex minicomputers costing tens of thousands of dollars. In recent years, however, both the cost and size of the electronic hardware involved have decreased steadily, and in the process have crossed various thresholds of feasibility for commercial applications of speech synthesis. These prior art systems typically have limited flexibility, being handcrafted and hardwired to synthesize a specific voice. Moreover, no prior art system provides mimicry of a particular person's voice.
Speech consists of a continuously changing complex sound wave resulting from constantly changing aerodynamic and resonant conditions in the human vocal tract appropriate to the generation of different sounds. Speech synthesis depends on the ability to break down the speech wave into component elements and combine these elements to create new messages. A speech synthesis system which is likely to provide human quality speech must be closely based on the human linguistic system underlying speech events.
The human vocal system is a relatively complex structure including the lungs, which supply an airflow through the vocal cords and glottis in the larynx, then through the oral cavity and out through the lips. The human vocal tract includes many different places at which it can change its cross-sectional area, either to alter its resonance characteristics or actually to produce acoustic energy. When one considers the variable degrees of narrowing at each of these articulation sites, and the possibilities for their simultaneous combination, it becomes apparent that the number of acoustically different sounds that can be produced is vast.
Sound can be generated in the vocal system in three ways. Voiced sounds are produced by elevating the air pressure in the lungs, forcing a flow through the glottis (the vocal cord orifice) and causing the vocal cords to vibrate. Fricative sounds of speech are generated by forming a constriction at some point in the vocal tract and forcing air through the constriction at a sufficiently high Reynolds number to produce turbulence. Plosive sounds result from making a complete closure, usually towards the front of the vocal tract, building up pressure behind the closure and abruptly releasing it.
Typically, speech synthesis involves a modeling of the human vocal tract. In such a model, recursive digital filters generate quantized samples of the speech signal. The control functions which specify the resonances, anti-resonances and excitation of the filter must be supplied externally. Generally a linear predictive coding (LPC) method is utilized to provide the necessary filter control functions. A basic model utilized in the LPC method has two major components: a flat spectrum excitation source and a spectral shaping filter. For speech synthesis, the parameters of the spectral shaping filter are set on a time varying basis such that its short term spectrum is the same as the short term speech spectral envelope desired. A prediction error function is derived from the difference between the desired speech signal and the actual synthetic speech signal and is used as the excitation signal for the model. A drawback to using the prediction error function as the excitation signal is the large storage requirement it imposes. An effective solution to the storage problem has been to model the excitation signal as coming from one of two sources: a pulse source or a noise source. However, the resulting speech is mechanical and tinny in quality and is not as natural as that obtained using the prediction error function.
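The two-component LPC model described above can be illustrated with a minimal sketch. It is not taken from any prior art system: the function names, the fixed second-order coefficients and the test signal are all assumptions chosen for illustration, and a practical system would update the filter coefficients frame by frame. The sketch shows a recursive all-pole filter acting as the spectral shaping filter, the prediction error obtained by inverse filtering a desired signal, and the two compact excitation substitutes (pulse source and noise source).

```python
import math
import random

def prediction_error(signal, coeffs):
    """Inverse filter: e[n] = s[n] - sum_k a_k * s[n-k] (hypothetical helper)."""
    error = []
    for n, s in enumerate(signal):
        predicted = sum(a * signal[n - k]
                        for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        error.append(s - predicted)
    return error

def synthesize(excitation, coeffs):
    """Recursive all-pole filter: s[n] = e[n] + sum_k a_k * s[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        feedback = sum(a * out[n - k]
                       for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(e + feedback)
    return out

def pulse_train(length, period):
    """Pulse source: one unit pulse every `period` samples (voiced sounds)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

def noise(length, seed=0):
    """Noise source: uniform random samples (unvoiced sounds)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(length)]

# Assumed, illustrative filter: a single stable resonance (pole radius < 1).
coeffs = [1.3, -0.8]

# Assumed stand-in for a "desired" speech signal.
desired = [math.sin(0.3 * n) + 0.5 * math.sin(0.9 * n) for n in range(200)]

# Exciting the filter with the prediction error reproduces the desired signal,
# but the error needs one stored value per sample.
error = prediction_error(desired, coeffs)
rebuilt = synthesize(error, coeffs)
max_diff = max(abs(a - b) for a, b in zip(desired, rebuilt))

# The pulse or noise model needs only a few parameters (here, a period or a
# seed), but it only approximates the desired signal.
voiced = synthesize(pulse_train(200, 40), coeffs)
unvoiced = synthesize(noise(200), coeffs)

print("stored error samples:", len(error))
print("max reconstruction difference:", max_diff)
```

In this sketch the stored prediction error grows with the length of the utterance, whereas the pulse source is fully described by its period. That is the storage trade-off noted above, and the approximation it introduces is the reason the resulting speech sounds less natural.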