The present invention relates generally to speech synthesizers and, more specifically, to a formant-based speech synthesizer.
The application of digital and analog network synthesis to the generation of artificial speech has been an area of active research interest for over two decades. Methods of implementing speech synthesizers range from digital algorithms in large-scale mainframe-based systems to VLSI components intended for commercial consumption. Analysis and synthesis techniques most commonly used for speech processing rely upon concepts such as LPC (Linear Predictive Coding), PARCOR (Partial Autocorrelation), CVSD (Continuously Variable-Slope Delta Modulation) and waveform compression. Generally, these methods share one or both of two deficiencies: (1) the speech quality is sufficiently coarse or mechanical to become annoying after repeated listening sessions, and (2) the bit rate of the associated encoding scheme is too high to permit memory-efficient realization of large-vocabulary systems. To date, these limitations have held back high-volume application of speech synthesizers in the consumer marketplace.
Multiple-path formant-based synthesizers have been developed to overcome the limitations of the other approaches, examples of which are described in:
(1) B. Gold and L. R. Rabiner, "Analysis of digital and analog formant synthesizers", IEEE Trans. Audio and Elect., AU-16 (1), pp. 81-94, Mar. 1968;
(2) L. R. Rabiner, "Digital-formant synthesizer for speech synthesis studies", J. Acoust. Soc. Am., Vol. 43, No. 4, pp. 822-828, 1968;
(3) L. R. Rabiner et al, "A hardware realization of a digital formant speech synthesizer", IEEE Trans. Comm. Tech., Vol. COM-19, No. 6, pp. 1016-1020, Dec. 1971;
(4) D. H. Klatt, "Software for a cascade/parallel formant synthesizer", J. Acoust. Soc. Am., Vol. 67, No. 3, pp. 971-995, March 1980; and
(5) L. McCready et al, "A monolithic formant-based speech synthesizer", Proc. 1981 Int. Symp. Circuits and Systems, pp. 986-988.
With the exception of the system in the second Rabiner article, the systems described are capable of generating all or substantially all of the seven basic sound classes of human speech, namely, vowels, aspirates, nasals, voice bar, fricatives, stops, voiced fricatives and pauses.
The earlier multiple-path formant-based synthesizers described by Rabiner and Klatt included a substantial number of elements, which made them difficult to implement on a single chip. In these systems, the output waveform is processed by a radiation network in addition to the initial shaping network. Similarly, the voiced and the fricative signal paths each included their own complete set of filters, some of which were duplicated. While the synthesizer described by McCready et al reduced the complexity, it also potentially limited the quality of the generated sound. For example, the pole and zero filters were deleted from the voiced signal path, and special programming of the first formant filter was required for nasal sounds. The modulation of the noise source by the voice source for voiced fricatives was also deleted.
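The modulation of the noise source by the voice source mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, a square-wave modulator at the pitch frequency that attenuates the noise during the second half of each pitch period. The function name `modulated_noise` and the parameters `depth` and `seed` are hypothetical and are not taken from any of the cited systems.

```python
import random


def modulated_noise(n_samples, fs_hz, f0_hz, depth=0.5, seed=0):
    """Return Gaussian noise amplitude-modulated at the pitch rate.

    Assumption: the modulator is a square wave with a 50% duty cycle.
    The noise keeps full amplitude in the first half of each pitch
    period and is scaled by (1 - depth) in the second half.
    """
    rng = random.Random(seed)          # seeded for repeatable output
    period = fs_hz / f0_hz             # pitch period in samples
    out = []
    for n in range(n_samples):
        phase = (n % period) / period  # position within the period, 0..1
        gain = 1.0 if phase < 0.5 else 1.0 - depth
        out.append(gain * rng.gauss(0.0, 1.0))
    return out


if __name__ == "__main__":
    samples = modulated_noise(1000, fs_hz=10000.0, f0_hz=100.0)
    print(len(samples), "samples generated")
```

With `depth=0.0` the noise is unmodulated, which corresponds to a fricative path in which the modulation has been deleted.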
All of the above formant-based synthesizers use second-order lowpass filters for all the formant filters. When realized with analog filters, the response of these filters produces an excess of spectral tilt in the resulting waveform. Because the response of a digital filter is symmetric about half the sampling frequency, the attenuation roll-off is generally much shallower when the filters are implemented digitally. However, excessive tilt may also be observed in speech spectra produced by digital lowpass filters for particular speakers and certain sounds. As described in the Rabiner article, higher pole compensation networks are typically needed for spectral correction in analog synthesizers.
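The difference in roll-off between analog and digital second-order formant filters can be checked numerically. The sketch below is an illustration under stated assumptions, not the circuit of any cited synthesizer. It assumes an analog resonator with poles at -πB ± j2πF and unity DC gain, and a digital two-pole resonator with the same pole frequency and bandwidth mapped to the z-plane at sampling rate `fs_hz`. The function names and the example values (F = 500 Hz, B = 60 Hz, 10 kHz sampling) are hypothetical.

```python
import cmath
import math


def analog_resonator_db(f, centre_hz, bandwidth_hz):
    """Magnitude in dB of an analog second-order lowpass resonator.

    Assumed form: H(s) = (sigma^2 + wc^2) / ((s + sigma)^2 + wc^2),
    with sigma = pi * B and wc = 2 * pi * F, so that H(0) = 1.
    """
    sigma = math.pi * bandwidth_hz
    omega_c = 2.0 * math.pi * centre_hz
    s = 1j * 2.0 * math.pi * f
    h = (sigma ** 2 + omega_c ** 2) / ((s + sigma) ** 2 + omega_c ** 2)
    return 20.0 * math.log10(abs(h))


def digital_resonator_db(f, centre_hz, bandwidth_hz, fs_hz):
    """Magnitude in dB of a digital two-pole resonator.

    Assumed form: y[n] = a*x[n] + b*y[n-1] + c*y[n-2], with the poles
    at radius exp(-pi*B*T) and angle 2*pi*F*T, and a chosen for unity
    DC gain.
    """
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * centre_hz * t))
    a = 1.0 - b - c
    z_inv = cmath.exp(-1j * 2.0 * math.pi * f * t)
    h = a / (1.0 - b * z_inv - c * z_inv ** 2)
    return 20.0 * math.log10(abs(h))


if __name__ == "__main__":
    for f in (1000.0, 2000.0, 4000.0):
        print(f,
              round(analog_resonator_db(f, 500.0, 60.0), 1),
              round(digital_resonator_db(f, 500.0, 60.0, 10000.0), 1))
```

Well above the resonance, the analog response falls at roughly 12 dB per octave, whereas the digital response flattens as it approaches half the sampling frequency, since its magnitude at `f` equals its magnitude at `fs_hz - f`.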