This invention relates to a variable frame length data converter for speech synthesis circuits, and particularly for those speech synthesis circuits capable of being implemented on one, or a few, integrated circuit chips.
Several techniques are known in the prior art for digitizing human speech. For example, pulse code modulation, differential pulse code modulation, adaptive predictive coding, delta modulation, channel vocoders, cepstrum vocoders, formant vocoders, voice excited vocoders and linear predictive coding techniques of speech digitization are known. The techniques are briefly explained in "Voice Signals: Bit by Bit" on pages 28-34 of the October 1973 issue of IEEE Spectrum.
In certain applications, and particularly those in which the digitized speech is to be stored in a memory, most researchers tend to use the linear predictive coding technique because it produces very high quality speech using rather low data rates. Linear predictive coding systems usually make use of a multi-stage digital filter. In the past, the digital filter has typically been implemented by appropriately programming a large scale digital computer. However, in U.S. patent application Ser. No. 807,461, filed June 17, 1977, since abandoned in favor of continuation U.S. application Ser. No. 905,328 filed May 12, 1978, now U.S. Pat. No. 4,209,844 issued June 24, 1980, there is taught a particularly useful digital filter for a speech synthesis circuit, which digital filter may be implemented on an integrated circuit using standard MOS or equivalent technology. A theoretical discussion of linear predictive coding can be found in "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave" at Volume 50, number 2 (part 2) of The Journal of the Acoustical Society of America.
Disclosed herein is a talking learning aid which utilizes speech synthesis technology for producing human speech. A complete talking learning aid is disclosed, so, in addition to describing the speech synthesis circuits in detail, the details of the controller for the learning aid and the Read-Only-Memory devices used to store the digitized speech are also disclosed. Of course, those practicing the present invention may wish to do so in conjunction with a talking learning aid such as that described herein, with other learning aids, or in any other application wherein the generation of human speech from digital data is desirable. Using the techniques described in the aforementioned U.S. Pat. No. 4,209,844 and the teachings disclosed herein will permit those desiring to make use of digital speech technology to do so with one, or a small number of, relatively inexpensive integrated circuit devices.
As aforementioned, linear predictive coding permits the synthesis of human speech from digital data having a relatively low data rate. For example, using ten-bit parameters for speech energy, pitch and ten filter coefficients, and updating these twelve parameters fifty times per second, yields very high quality speech at a bit rate of only 6000 bits per second (bps). Thus, if this data were stored in a 131K bit Read-Only-Memory, only 21.8 seconds of human speech could be stored therein. However, if the bit rate could be dropped to an average of 1000-1200 bps, then that 131K bit Read-Only-Memory could store 109 to 131 seconds of speech. The advantages of being able to lower the bit rate while maintaining the quality of the resulting speech are evident, since less Read-Only-Memory capacity is required to store a given amount of speech. Thus, the primary object of this invention is to effect a reduction of the bit rate without unduly affecting speech quality.
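The storage figures given above may be checked with a short calculation. The following is a minimal sketch only; the function names are hypothetical, and the "131K bit" memory is assumed here to hold 131,072 bits (128 x 1024), a capacity which the text does not state exactly.

```python
# Assumption: "131K bit" is taken to mean 128 * 1024 = 131,072 bits.
ROM_BITS = 128 * 1024


def bit_rate(bits_per_parameter, parameters_per_frame, frames_per_second):
    """Bit rate, in bits per second, of a fixed frame length format."""
    return bits_per_parameter * parameters_per_frame * frames_per_second


def storage_seconds(rom_bits, bits_per_second):
    """Seconds of speech that fit in a memory at a given average bit rate."""
    return rom_bits / bits_per_second


# Energy, pitch and ten filter coefficients: twelve ten-bit parameters,
# updated fifty times per second (10 * 12 * 50 = 6000 bps).
full_rate = bit_rate(10, 12, 50)

for rate in (full_rate, 1200, 1000):
    print(rate, "bps:", round(storage_seconds(ROM_BITS, rate), 1), "seconds")
```

Under the stated assumption, the fixed format stores roughly 21.8 seconds, while an average rate of 1000-1200 bps stores roughly five to six times as much speech in the same memory.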
The foregoing objects are achieved as is now described. Variable frame length data is provided to the speech synthesis circuit. Preferably, a full length frame includes a pitch parameter, an energy parameter, a repeat bit and a plurality of speech coefficients. Each parameter is encoded according to a preselected coding scheme and has a preselected length, the encoded parameter being longer as its significance in speech synthesis increases. A particular encoded pitch parameter is used to signify that the speech is unvoiced. An unvoiced frame includes fewer speech coefficients than does a regular voiced frame. Thus, the converter detects this particular encoded pitch parameter and automatically sets the unsent coefficients to zero. The repeat bit is used to signify that the frame contains pitch and energy parameters, but no speech coefficients. Thus, the converter is also sensitive to the repeat bit for controlling the synthesizer to use the speech coefficients from the previous frame during the present frame. Since the particular encoded pitch parameter may be used during a repeat frame, the repeat bit circuitry preferably takes precedence over the particular encoded pitch parameter circuitry.
According to the foregoing, the frame may include all, a few or none of the speech coefficients. To further reduce the data rate, the converter is preferably also responsive to particular encoded energy parameters indicating that a frame is either a pause or the last frame sent, and controls the speech synthesizer accordingly.
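The frame handling summarized above may be illustrated in software. The following is a minimal sketch only and does not describe the disclosed circuitry. The field order, the bit widths, the number of coefficients in an unvoiced frame, and the particular code values for "pause", "last frame" and "unvoiced" are all assumptions chosen for illustration. It is further assumed that a pause or last frame carries only its energy parameter. The names BitReader and decode_frame are hypothetical.

```python
# Every width and code value below is an illustrative assumption,
# not a value taken from the text.
ENERGY_BITS = 4
PITCH_BITS = 5
COEFF_BITS = [5, 5, 4, 4, 4, 4, 4, 3, 3, 3]  # longer codes for more significant coefficients
UNVOICED_COEFFS = 4      # assumed number of coefficients sent in an unvoiced frame
ENERGY_PAUSE = 0         # assumed energy code signifying a pause
ENERGY_LAST = 15         # assumed energy code signifying the last frame
PITCH_UNVOICED = 0       # assumed pitch code signifying unvoiced speech


class BitReader:
    """Reads fixed-width unsigned fields from a string of '0' and '1'."""

    def __init__(self, bits):
        self.bits = bits
        self.pos = 0

    def read(self, width):
        chunk = self.bits[self.pos:self.pos + width]
        if len(chunk) < width:
            raise ValueError("bit stream ended inside a frame")
        self.pos += width
        return int(chunk, 2)


def decode_frame(reader, prev_coeffs):
    """Decode one variable length frame into encoded parameter values.

    Returns a dict with the frame kind, the energy and pitch codes, and
    a list of len(COEFF_BITS) coefficient codes. Mapping the codes to
    actual parameter values is omitted.
    """
    energy = reader.read(ENERGY_BITS)
    if energy == ENERGY_PAUSE or energy == ENERGY_LAST:
        # Assumed: pause and last frames consist of the energy code alone.
        kind = "pause" if energy == ENERGY_PAUSE else "last"
        return {"kind": kind, "energy": energy, "pitch": None,
                "coeffs": list(prev_coeffs)}

    repeat = reader.read(1)
    pitch = reader.read(PITCH_BITS)

    if repeat:
        # The repeat bit takes precedence over the unvoiced pitch code:
        # no coefficients follow, so the previous frame's are reused.
        return {"kind": "repeat", "energy": energy, "pitch": pitch,
                "coeffs": list(prev_coeffs)}

    if pitch == PITCH_UNVOICED:
        # Only the first few coefficients are sent; the rest are set to zero.
        sent = [reader.read(w) for w in COEFF_BITS[:UNVOICED_COEFFS]]
        coeffs = sent + [0] * (len(COEFF_BITS) - UNVOICED_COEFFS)
        return {"kind": "unvoiced", "energy": energy, "pitch": pitch,
                "coeffs": coeffs}

    coeffs = [reader.read(w) for w in COEFF_BITS]
    return {"kind": "voiced", "energy": energy, "pitch": pitch,
            "coeffs": coeffs}
```

With these assumed widths, a voiced frame occupies 49 bits, an unvoiced frame 28 bits, a repeat frame 10 bits, and a pause or last frame 4 bits, which shows how the average bit rate falls well below that of a fixed length format.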