The present invention relates generally to speech synthesizers and, more specifically, to a memory-efficient speech data encoding scheme.
The application of digital and analog network synthesis to the generation of artificial speech has been an area of active research interest for over two decades. Methods of implementing speech synthesizers range from digital algorithms in large-scale mainframe-based systems to VLSI components intended for commercial consumption. Analysis and synthesis techniques most commonly used for speech processing rely upon concepts such as LPC (Linear Predictive Coding), PARCOR (Partial Autocorrelation), CVSD (Continuously Variable Slope Delta Modulation) and waveform compression. Generally, these methods share either or both of two deficiencies: (1) the speech quality is sufficiently coarse or mechanical to become annoying after repeated listening sessions, and (2) the bit rate of the associated encoding scheme is too high to permit memory-efficient realization of large vocabulary systems. To date, these limitations have restricted high-volume application of speech synthesizers in the consumer marketplace.
Techniques for defining useful speech synthesizer parameters and extracting time-varying values from actual human speech are diverse. Such procedures fall under the general categories of "speech data extraction" and "speech parameter tracking." Such methods usually involve digitization of original human speech followed by successive application of many complex algorithms in order to produce useful parameter values. These algorithms must be implemented on digital computers and normally do not produce speech data in real time. In addition to computer speech analysis and parameterization from digitized human speech, other methods of deriving the synthesizer parameters may include visual analysis of speech waveforms on sonograph plots, artificial parameter generation by rule, and conversion from analysis data assembled by other synthesis methods.
Once the speech data has been generated, it is desirable to reduce it to some binary format which allows convenient and efficient storage in the memory space of the synthesizer. Methods for achieving this are often termed "speech data compression" or "speech data reduction" and the binary data formats they produce are generally referred to as "speech data coding schemes." The reduction methods are usually implemented as digital algorithms which operate on the output of the parameter tracking routines. To be properly and usefully implemented, a speech data encoding scheme must contain values for all synthesizer parameters necessary for high-quality speech reproduction and should permit storage of these values in significantly less memory space than that required by the output of the parameter tracking routine itself.
Most speech synthesizers and their associated data extraction and compression algorithms are "frame" oriented. A frame is defined as a small fixed time segment of the original speech waveform. The frame duration is short enough (usually on the order of 10 msec) so that the speech signal does not vary greatly during that interval. Thus, the analysis algorithms divide the original speech signal into successive, discrete time intervals, or frames, of uniform duration and extract sets of parameter values for each frame. The data reduction algorithms then condense these values into the encoding scheme which, in turn, is stored in memory. The encoded data are thus bit packets which are also oriented successively in time by frames.
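By way of illustration only, the division of a digitized speech signal into successive frames of uniform duration can be sketched as follows. The function name `split_into_frames`, the 8 kHz sample rate, and the choice to discard a trailing partial frame are assumptions made for this example and are not taken from any particular synthesizer.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Split a sequence of samples into consecutive fixed-duration frames.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame duration too short for this sample rate")
    # Discard a trailing partial frame so that every frame has the same length.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of placeholder samples at an assumed 8 kHz sample rate.
samples = list(range(8000))
frames = split_into_frames(samples, sample_rate_hz=8000, frame_ms=10)
```

Under these assumptions each frame holds 80 samples, and a parameter extraction routine would then be applied to each frame in turn.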
The synthesizer accesses the speech memory at the same frame rate used to analyze the original speech and code the data. During each frame, a single packet of encoded speech data is read into the synthesizer. Each bit packet must contain two general classes of information: (1) an instruction containing the type of sound or speech to be generated (synthesizer architecture configuration), and (2) the encoded speech parameter data required to produce the speech segment. The coding technique by which this is accomplished directly affects the size of the memory necessary to store all the data packets required for any given synthetic utterance.
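The two classes of information carried by each bit packet can be illustrated with a minimal sketch. The layout used here (a 2-bit instruction field followed by 5-bit parameter fields) and the names `pack_packet` and `unpack_packet` are hypothetical and chosen only for the example; they do not describe the format of any specific prior art scheme.

```python
def pack_packet(instruction, params, instr_bits=2, param_bits=5):
    """Pack an instruction code and parameter values into one integer packet.

    The field widths are illustrative assumptions, not a real format.
    """
    if not 0 <= instruction < (1 << instr_bits):
        raise ValueError("instruction does not fit in its field")
    packet = instruction
    for value in params:
        if not 0 <= value < (1 << param_bits):
            raise ValueError("parameter does not fit in its field")
        # Shift earlier fields left and append the next parameter field.
        packet = (packet << param_bits) | value
    return packet


def unpack_packet(packet, n_params, instr_bits=2, param_bits=5):
    """Recover the instruction code and parameter values from a packet."""
    mask = (1 << param_bits) - 1
    params = []
    for _ in range(n_params):
        # The last parameter packed occupies the lowest-order bits.
        params.append(packet & mask)
        packet >>= param_bits
    params.reverse()
    return packet & ((1 << instr_bits) - 1), params


packet = pack_packet(2, [17, 3, 30])
instruction, params = unpack_packet(packet, n_params=3)
```

With this assumed layout a packet occupies 2 + 3 x 5 = 17 bits, so the number of bits per frame, and hence the memory requirement, follows directly from the field widths chosen.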
A figure of merit, called the "bit rate," has been defined for data coding schemes as a measure of performance. The bit rate is the ratio of the memory size requirement (in bits) to the corresponding speech segment duration (in seconds). Given equivalent speech quality, a coding scheme with a low bit rate is considered to be more efficient than a scheme with a higher bit rate. There is, however, a rough correlation between bit rate and speech quality over wide ranges of bit rate when many different coding schemes are considered.
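The bit rate calculation can be sketched as follows. The function names and the example figures (12 bits per packet at 100 frames per second) are assumptions introduced for illustration only.

```python
import math


def bit_rate(memory_bits, duration_s):
    """Return the bit rate in bits per second for a stored speech segment."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return memory_bits / duration_s


def memory_bytes(bit_rate_bps, duration_s):
    """Return the whole number of bytes needed to store a segment."""
    return math.ceil(bit_rate_bps * duration_s / 8)


# Assumed example: 12-bit packets read at 100 frames per second for one second.
rate = bit_rate(memory_bits=12 * 100, duration_s=1.0)

# Memory needed to hold one minute of speech at that rate.
one_minute = memory_bytes(rate, duration_s=60)
```

Under these assumptions the rate is 1200 bits per second and one minute of speech occupies 9000 bytes, which shows how directly the bit rate determines the memory size for a given vocabulary.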
Phoneme synthesizers generally have a bit rate on the order of 100 bits per second and produce mechanical-sounding speech. Linear predictive coding and waveform compression achieve substantially better speech quality, but require a bit rate on the order of 1000 bits per second. Substantially optimum speech quality is achieved by CVSD and pulse code modulation at a bit rate at or above 16,000 bits per second. Formant synthesis is capable of producing speech quality between that of LPC and that of CVSD at a bit rate lower than that of LPC, which runs counter to the general relationship between speech quality and bit rate in prior art methods.
An example of data compression for linear predictive coding is described in U.S. Pat. No. 4,209,836 to Wiggins, Jr., et al., wherein a 6000 bits per second scheme is reduced to 1000 to 1200 bits per second. Recognizing that formant data can be stored more efficiently than the reflection coefficients of linear predictive coding, U.S. Pat. No. 4,304,965 to Blanton et al. stores formant data at an equivalent bit rate as low as 300 bits per second and converts it to LPC-type reflection coefficients for use in an LPC-based speech synthesizer.
There is a need for a data compression scheme for formant-based synthesizers that reduces memory requirements while maintaining speech quality.