This invention relates to a speech synthesis system capable of being implemented in an integrated circuit device wherein frames of speech data may be operated upon by a speech synthesizer at a variable frame rate in producing digital speech signals representative of human speech. More particularly, this invention relates to a speech synthesis system in which frames containing digital speech data representative of speech signal parameters and coded frame rate data are utilized, with the frame rate data being decoded to control both the rate at which the incoming frames of speech data are accepted by the speech synthesizer and the required number of interpolation calculations needed to define interpolated speech values between adjacent incoming frames of speech data.
Several techniques are known in the prior art for digitizing human speech. For example, pulse code modulation, differential pulse code modulation, adaptive predictive coding, delta modulation, channel vocoders, cepstrum vocoders, formant vocoders, voice excited vocoders, and linear predictive coding techniques of speech digitization are known. These techniques are briefly explained in "Voice Signals: Bit by Bit" on pages 28-34 of the October, 1973 issue of IEEE Spectrum.
In certain applications, particularly those in which digitized speech is to be stored in a memory, most researchers use the linear predictive coding technique because it produces very high quality speech at rather low data rates. An excellent example of a linear predictive coding system implementable with integrated circuit techniques may be seen in U.S. patent application Ser. No. 901,393, filed Apr. 28, 1978, now U.S. Pat. No. 4,209,836 issued June 24, 1980. The speech synthesis system described in the aforementioned U.S. Pat. No. 4,209,836 utilizes frames of data comprising digital representations of pitch, energy and certain linear predictive coefficients for controlling a digital filter. That system is capable of producing high quality synthetic human speech at bit rates as low as 1200 bits per second, utilizing a fixed rate of data frame entry. A more accurate representation of human speech may be obtained by increasing the frame rate to a level significantly higher than that described in U.S. Pat. No. 4,209,836; however, the number of bits which must be stored in memory to synthesize a given quantity of human speech increases correspondingly. Further, certain aspects of human speech are quite redundant and may be accurately synthesized utilizing a data rate significantly lower than that disclosed in the aforementioned U.S. Pat. No. 4,209,836. An ideal solution to the aforementioned problem would be a speech synthesis system capable of synthesizing human speech from frames of data which change rapidly during complex periods of human speech and slowly during redundant periods, thereby minimizing the required bit storage. Attempts to solve this problem were documented in two papers delivered at the 1977 IEEE Conference on Acoustics, Speech and Signal Processing and published in the record thereof.
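The storage trade-off described above reduces to simple arithmetic: a fixed-rate system stores one full frame per frame period, while a variable-rate system stores one frame (plus a short frame rate code) per interval over which that frame is held. The sketch below is illustrative only; the 48 bits per frame, 25 frames per second and 2-bit rate code are hypothetical figures chosen so that the fixed-rate case comes to 1200 bits per second, and are not taken from U.S. Pat. No. 4,209,836.

```python
from typing import Sequence

# Hypothetical figures, chosen only so the fixed-rate case comes to 1200 bits/s;
# they are not taken from U.S. Pat. No. 4,209,836.
BITS_PER_FRAME = 48      # pitch + energy + filter coefficients (assumed)
FRAMES_PER_SECOND = 25   # fixed frame rate, i.e. a 40 ms base period (assumed)
RATE_CODE_BITS = 2       # coded frame rate data added to each variable-rate frame (assumed)


def fixed_rate_bits(num_periods: int) -> int:
    """Bits stored when a new frame is entered every base period."""
    return num_periods * BITS_PER_FRAME


def variable_rate_bits(hold_periods: Sequence[int]) -> int:
    """Bits stored when frame i is held for hold_periods[i] base periods.

    Redundant stretches of speech use long holds; rapidly changing stretches
    use short ones. Only one frame (plus its rate code) is stored per hold.
    """
    return len(hold_periods) * (BITS_PER_FRAME + RATE_CODE_BITS)


# One second of speech: a rapidly changing transition (nine 1-period holds)
# followed by a steady vowel (four 4-period holds): 9 + 16 = 25 periods.
holds = [1] * 9 + [4] * 4
assert sum(holds) == FRAMES_PER_SECOND
print(fixed_rate_bits(sum(holds)))   # 1200
print(variable_rate_bits(holds))     # 650
```

Under these assumed figures the variable-rate encoding needs roughly half the storage, even after paying for a rate code on every frame; the actual saving depends entirely on how redundant the speech is.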
One attempted solution was suggested in "Variable-to-Fixed Rate Conversion of Narrowband LPC Speech" by E. Blackman, R. Viswanathan and J. Makhoul. That solution required transmission of pitch, gain and reflection coefficients at three separate variable rates, with separate transmission criteria and a three-bit header code to distinguish transmissions. Additionally, transmit and receive buffers were necessary in that system to convert the transmission back into a fixed rate signal. The second attempted solution was documented in a paper entitled "The Application of a Functional/Perceptual Model of Speech to Variable-Rate LPC Systems" by R. Viswanathan, J. Makhoul and R. Wicks. This second solution transmitted pitch and gain information at a fixed frame rate and utilized a variable frame rate for transmission of reflection coefficients.
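The three-bit header and receive buffer of the first prior-art scheme can be pictured as follows. This is a simplified sketch, not the method of the cited paper: it assumes one header bit per parameter group (pitch, gain, reflection coefficients) and one transmission per fixed-rate frame slot, and the names `PITCH_BIT`, `GAIN_BIT`, `COEFF_BIT` and `to_fixed_rate` are hypothetical. Groups absent from a transmission are repeated from the previous frame, which is how the receive side turns three variable-rate streams back into one fixed-rate frame sequence.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical bit assignment for the 3-bit header code (assumed, not from the paper).
PITCH_BIT = 0b100
GAIN_BIT = 0b010
COEFF_BIT = 0b001


@dataclass(frozen=True)
class ParamState:
    pitch: float
    gain: float
    coeffs: Tuple[float, ...]   # reflection coefficients


@dataclass(frozen=True)
class Transmission:
    header: int                             # which parameter groups are present
    pitch: Optional[float] = None
    gain: Optional[float] = None
    coeffs: Optional[Tuple[float, ...]] = None


def to_fixed_rate(initial: ParamState,
                  stream: Iterable[Transmission]) -> Iterator[ParamState]:
    """Receive-buffer sketch: emit one complete frame per fixed-rate slot.

    Each parameter group is updated only when its header bit is set;
    otherwise the previously received value is held.
    """
    state = initial
    for tx in stream:
        if tx.header & PITCH_BIT:
            state = replace(state, pitch=tx.pitch)
        if tx.header & GAIN_BIT:
            state = replace(state, gain=tx.gain)
        if tx.header & COEFF_BIT:
            state = replace(state, coeffs=tx.coeffs)
        yield state
```

In this sketch the need for buffering shows up as the held `state`: the receiver must keep every parameter group around because any of them may be absent from a given transmission.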
It is therefore one object of this invention to improve speech synthesis technology.
It is another object of this invention to provide a speech synthesis system capable of accurately synthesizing human speech at the lowest possible data rate.
It is still another object of this invention to provide a speech synthesis system capable of synthesizing human speech from frames of data which are utilized at varying rates.
In accordance with the present invention, a speech synthesis system which may be implemented in an integrated circuit device is provided for converting frames of speech data at a variable rate into analog signals representative of human speech. The rate at which frames of speech data are processed depends upon how rapidly the speech data change from frame to frame, as determined by the relative complexity of the human speech they represent, which is synthesized from the analog signals by a speaker forming a component of the system. The frames of speech data comprise digital representations of values of pitch, energy, filter coefficients and coded frame rate data. The frame rate data is decoded to control both the rate at which new frames of speech data are utilized by the speech synthesizer of the system and the number of interpolation calculations required to define interpolated speech values between adjacent incoming frames of speech data. A frame control circuit regulates the frame rate by providing for a variable number of interpolation calculations between adjacent speech frames, starting from the most recently implemented speech data, with the number of interpolation calculations in a given instance being determined by the frame rate data.
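The frame control behavior summarized above can be sketched in software. This is a minimal illustration under stated assumptions, not the circuit of the invention: the 2-bit rate code, the hypothetical `RATE_CODE_TO_STEPS` table, the choice to decode the incoming frame's code (rather than the previous frame's), and the use of linear interpolation are all assumptions. The point it shows is that a single decoded value sets both how many interpolation calculations are made and, therefore, how long the synthesizer waits before accepting the next frame.

```python
from dataclasses import dataclass
from typing import Iterator, Sequence, Tuple

# Hypothetical decode table: 2-bit frame rate code -> number of interpolation steps.
# More steps means a slower frame rate (redundant speech);
# fewer steps means a faster frame rate (rapidly changing speech).
RATE_CODE_TO_STEPS = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}


@dataclass(frozen=True)
class SpeechFrame:
    pitch: float
    energy: float
    coeffs: Tuple[float, ...]   # filter coefficients
    rate_code: int              # coded frame rate data


def lerp(a: float, b: float, t: float) -> float:
    """Linear interpolation from a (t = 0) to b (t = 1)."""
    return a + (b - a) * t


def interpolate(frames: Sequence[SpeechFrame]
                ) -> Iterator[Tuple[float, float, Tuple[float, ...]]]:
    """Yield (pitch, energy, coeffs) for every interpolation calculation.

    The incoming frame's rate_code is decoded to the number of steps used to
    move from the last implemented frame to it; the next frame is accepted
    only after those steps, so the code sets both frame rate and step count.
    """
    prev = frames[0]
    for new in frames[1:]:
        steps = RATE_CODE_TO_STEPS[new.rate_code]
        for i in range(1, steps + 1):
            t = i / steps
            yield (lerp(prev.pitch, new.pitch, t),
                   lerp(prev.energy, new.energy, t),
                   tuple(lerp(a, b, t) for a, b in zip(prev.coeffs, new.coeffs)))
        prev = new   # the new frame becomes the last implemented frame
```

With this assumed table, a code of `0b11` during a steady vowel lets one stored frame span eight interpolation periods, while `0b00` during a consonant transition accepts a new frame every period.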