The present invention relates generally to speech synthesis, and more particularly, to a channel bank speech synthesizer operating without externally-generated voicing or pitch information.
Speech synthesizer networks generally accept digital data and translate it into acoustic speech signals representative of human voice. Various techniques are known in the art for synthesizing speech from such data. For example, pulse code modulation, linear predictive coding, delta modulation, channel bank synthesizers, and formant synthesizers are known synthesizing techniques. The particular type of synthesizer technology is typically chosen by comparing the size, cost, reliability, and voice quality requirements of the specific synthesis application.
The further development of present-day speech synthesis systems is hindered by the inherent problem that the complexity and storage requirements of the synthesizer system increase dramatically with vocabulary size. Additionally, the words spoken by the typical synthesizer are often of poor fidelity and difficult to understand. Nevertheless, the trade-off between vocabulary and voice intelligibility has all too often been decided in favor of a larger vocabulary for enhanced user features. This choice generally results in a harsh, robot-like "buzzy" sound in the synthesized speech.
Recently, several approaches have been taken to solve the problem of unnatural sounding synthesized speech. Obviously, the reverse trade-off--to maximize voice quality at the expense of speech synthesis system complexity--can be made. It is well known in the art that a high data rate digital computer, synthesizing speech from an infinite memory source, can create the ideal situation of unlimited vocabulary with negligible voice quality degradation. However, such devices tend to be much too bulky, very complicated, and prohibitively expensive for most modern applications.
Pitch-excited channel bank synthesizers have frequently been used as a simple, low cost means for synthesizing speech at a low data rate. The standard channel bank synthesizer consists of a number of gain-controlled bandpass filters, and a spectrally-flat excitation source made up of a pitch pulse generator for voiced excitation (buzz) and a noise generator for unvoiced excitation (hiss). The channel bank synthesizer utilizes externally-generated acoustic energy measurements (derived from human voice parameters) to adjust the gains of the individual filters. The excitation source is controlled by a known voiced/unvoiced control signal (prestored or provided from an external source) and a known pitch pulse rate.
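By way of illustration only, the standard pitch-excited channel bank synthesizer described above may be sketched as follows. The sketch is not taken from any cited reference. The two-pole resonator used as a bandpass filter, the 8000 Hz sample rate, the 200 Hz bandwidth, and all function and parameter names are assumptions chosen for brevity.

```python
import math
import random


def pulse_train(n_samples, pitch_period):
    """Voiced excitation (buzz): one unit pulse every pitch_period samples."""
    return [1.0 if i % pitch_period == 0 else 0.0 for i in range(n_samples)]


def white_noise(n_samples, seed=0):
    """Unvoiced excitation (hiss): uniform random samples in [-1, 1]."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator standing in for one bandpass filter (assumed design)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, below 1
    theta = 2.0 * math.pi * center_hz / sample_rate      # pole angle
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_frame(channel_gains, centers_hz, voiced, pitch_period,
                     n_samples, sample_rate=8000, bandwidth_hz=200.0):
    """Sum gain-weighted bandpass outputs driven by a single excitation source."""
    # The voiced/unvoiced control signal selects buzz or hiss.
    if voiced:
        excitation = pulse_train(n_samples, pitch_period)
    else:
        excitation = white_noise(n_samples)
    out = [0.0] * n_samples
    for gain, center in zip(channel_gains, centers_hz):
        band = resonator(excitation, center, bandwidth_hz, sample_rate)
        for i, sample in enumerate(band):
            out[i] += gain * sample
    return out


# Example: a 20 ms voiced frame at a 100 Hz pitch (period of 80 samples).
frame = synthesize_frame([0.5, 0.25], [500.0, 1500.0], True, 80, 160)
```

Note that the sketch requires both the voiced/unvoiced decision and the pitch period as inputs, in addition to the channel gains.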
A renewed interest in channel vocoders has led to a wide variety of proposals to improve the quality of low data rate synthesized speech. Fujimura, in an article entitled "An Approximation to Voice Aperiodicity", IEEE Transactions on Audio and Electroacoustics, vol. AU-16, no. 1, pp. 68-72 (March 1968), describes a technique called "partial devoicing"--partially replacing voiced excitation of the high-frequency ranges by random noise--to make the synthesized sound less mechanically "buzzy". On the other hand, Coulter, in U.S. Pat. No. 3,903,366, purports to improve the performance of channel vocoders by connecting the pitch pulse source to the lowest channel of the vocoder synthesizer at all times. Alternatively, the article entitled "The JSRU Channel Vocoder", IEE Proceedings, vol. 127, part F, no. 1, pp. 53-60 (February 1980), by J.N. Holmes describes a technique for reducing the "buzzy" quality of voiced sounds by varying the bandwidth of the high-order channel filter in response to the voiced/unvoiced decision.
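The general idea of partial devoicing may be sketched as follows, purely as an illustration. The sketch does not reproduce Fujimura's actual procedure: the hard cutoff frequency, the fixed noise fraction, and the names `channel_excitation`, `cutoff_hz`, and `noise_fraction` are all hypothetical.

```python
def channel_excitation(pulse, noise, center_hz, cutoff_hz, noise_fraction):
    """Return the excitation for one channel of a channel bank synthesizer.

    Channels centered below cutoff_hz keep the pure pulse (voiced) excitation.
    Channels at or above cutoff_hz have a fraction of the pulse excitation
    replaced by random noise (assumed linear blend).
    """
    if center_hz < cutoff_hz:
        return list(pulse)
    return [(1.0 - noise_fraction) * p + noise_fraction * n
            for p, n in zip(pulse, noise)]


# Example with hand-written samples: the low channel is untouched, while the
# high channel is half pulse, half noise.
pulse = [1.0, 0.0, 0.0, 0.0]
noise = [0.2, -0.4, 0.6, -0.8]
low = channel_excitation(pulse, noise, 300.0, 1000.0, 0.5)
high = channel_excitation(pulse, noise, 2000.0, 1000.0, 0.5)
```

Even in this simplified form, the blend is applied only while the frame is known to be voiced, so the technique still depends on a voicing decision.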
Several other approaches were taken to the "buzziness" problem in the context of LPC vocoders. "A Mixed-source Model for Speech Compression and Synthesis" by J. Makhoul, R. Viswanathan, R. Schwartz, and A.W.F. Huggins, 1978 International Conference on Acoustics, Speech, and Signal Processing, pp. 163-166, (Apr. 10-12, 1978), describes an excitation source model which permits varying degrees of voicing by mixing voiced (pulse) and unvoiced (noise) excitations in a frequency-selective manner. Yet another approach was taken by M. Sambur, A. Rosenberg, L. Rabiner, and C. McGonegal, in an article entitled "On Reducing the Buzz in LPC Synthesis", 1977 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 401-404, (May 9-11, 1977). Sambur et al. reported a reduction in buzziness by changing the pulse width of the excitation source to be proportional to the pitch period during voiced excitation. Still another approach, that of modulating the amplitude of the excitation signal (from a substantially 0 value to a constant value and then back to 0) was taken by Vogten et al. in U.S. Pat. No. 4,374,302.
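As a rough illustration of an excitation whose pulse width is proportional to the pitch period, consider the following sketch. It is not the pulse shape reported by Sambur et al.: the rectangular pulse, the unit-area normalization, and the name and default value of `width_ratio` are assumptions.

```python
def variable_width_pulse_train(n_samples, pitch_period, width_ratio=0.1):
    """Voiced excitation with a pulse width proportional to the pitch period.

    Each pitch period begins with a rectangular pulse (assumed shape) that is
    width_ratio * pitch_period samples wide, and at least one sample wide.
    The amplitude is scaled so that every pulse has unit area (assumed).
    """
    width = max(1, round(width_ratio * pitch_period))
    amplitude = 1.0 / width
    return [amplitude if (i % pitch_period) < width else 0.0
            for i in range(n_samples)]


# Example: a longer pitch period yields a proportionally wider pulse.
short_period = variable_width_pulse_train(20, 10, width_ratio=0.2)
long_period = variable_width_pulse_train(80, 40, width_ratio=0.2)
```

As with the other techniques discussed, the pitch period is a required input.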
All of the above prior art techniques are directed toward improving the voice quality of a low data rate speech synthesizer through modification of the voicing and pitch parameters. Under normal circumstances, this voicing and pitch information is readily accessible. However, none of the known prior art techniques are viable for speech synthesis applications in which voicing or pitch parameters are not available. For example, in the present application of synthesizing speech from speech recognition templates, voicing and pitch parameters are not stored, since they are not required for speech recognition. Hence, to accomplish speech synthesis from recognition templates, the synthesis must be performed without prestored voicing or pitch information.
It is believed that most practitioners skilled in the art of speech synthesis would predict that any computer-generated voice, created without externally accessible voicing and pitch information, would sound extremely robot-like and highly objectionable. To the contrary, the present invention teaches a method and apparatus for synthesizing natural-sounding speech for applications in which voicing or pitch cannot be provided.