Low data rate speech communications find use in vehicular systems, secure-communication systems, and the like. A prior-art system is the Federal Standard 1015 Linear Predictive Coding (LPC) algorithm. This LPC algorithm transmits speech at 2400 bits per second. When, as in an aircraft, the speech is likely to be accompanied by noise, or when the vocabulary is not limited, such prior-art systems may not produce understandable speech at the receiving end. Even when the vocabulary is fixed, and noise is not a problem, some such prior-art systems reproduce the speech in an artificial-sounding manner, which does not preserve the attributes of the speaker's voice, and does not allow the speaker to be recognized by voice alone. Also, those prior-art systems which use tree-type octave-band filter banks, such as that system described in "Improved Speech Compression Algorithms: Final Report on Contract F30602-89-C0118," improve the intelligibility of speech in noise. However, the octave-band filter structure introduces additional delay, which makes two-way conversations more difficult, and it adds processing complexity. Further, the octave-band filters restrict the choice of bands to be used.
In some prior-art systems, the excitation signal transmitted through the low-data-rate path is characterized by a single speech parameter, namely the average pitch, which corresponds to the rate at which the speaker's vocal folds are vibrating. The average pitch parameter can be specified either as a frequency, or as a time interval between closures of the vocal folds. It is well known that there are other, more subtle features which contribute to the unique character of an individual's voice, including the jitter (short-term variation in the pitch period), and shimmer (period-to-period variation in the power of the excitation). However, complexity has prevented their reproduction in prior-art systems. In some prior-art systems, the pulse sequence that is used to excite the speech synthesizer at the receive end of the system is assumed to vary slowly. Yet the pitch as measured in the transmitter often exhibits doubling or halving of the value from one frame to the next. These observed changes may be due either to actual doubling or halving of the speaker's pitch, or to tracking errors in the pitch tracker; in general, it is difficult to distinguish these two sources.
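By way of illustration only, the average pitch, jitter, and shimmer described above might be estimated from per-cycle measurements as sketched below. The normalization used here (mean absolute change between consecutive cycles, divided by the mean value) is one common convention and is an assumption, not a definition taken from the systems discussed; the function names and the sample values are hypothetical.

```python
from typing import Sequence


def mean_abs_successive_diff(values: Sequence[float]) -> float:
    """Average absolute change between consecutive values."""
    if len(values) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)


def relative_variation(values: Sequence[float]) -> float:
    """Mean successive difference normalized by the mean (values must be non-empty)."""
    mean = sum(values) / len(values)
    return mean_abs_successive_diff(values) / mean


# Hypothetical measurements for four consecutive pitch cycles.
pitch_periods_ms = [8.0, 12.0, 8.0, 12.0]  # time between vocal-fold closures
cycle_powers = [1.0, 1.0, 1.0, 1.0]        # excitation power in each cycle

# Average pitch expressed as a frequency rather than as a time interval.
average_pitch_hz = 1000.0 / (sum(pitch_periods_ms) / len(pitch_periods_ms))

jitter = relative_variation(pitch_periods_ms)  # short-term pitch-period variation
shimmer = relative_variation(cycle_powers)     # period-to-period power variation
```

A system that transmits only the average pitch would convey `average_pitch_hz` and discard the other two quantities.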
Prior-art speech encoders often represent the speech in a frame as a set or vector of a plurality of digital numbers representing line spectrum frequencies, such as ten digital numbers, each representing the frequency of one spectral line. The assumption is made that the particular number of spectral lines which is selected is sufficient to represent the speech within the frame, to the desired level of accuracy. In such an arrangement, the number of bits in the resulting vector equals the number of bits per digital number, times the number of digital numbers in the vector. For example, if each digital number were to be quantized to three bits, thirty total bits would be required to represent a 10-number vector representing one frame of speech, and if five-bit coding were used, fifty total bits would be required. These bits must eventually be transmitted to a remote receiver over the limited-bandwidth data path, so it is important to minimize the total number of bits required for the representation.
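The bit counts in the foregoing example follow from a single multiplication, which may be sketched as follows. The function name is hypothetical and is introduced only for illustration.

```python
def bits_per_frame(num_values: int, bits_per_value: int) -> int:
    """Bits needed when every number in the vector is quantized separately."""
    return num_values * bits_per_value


# The 10-number line spectrum vector from the example above.
three_bit_total = bits_per_frame(10, 3)  # three-bit quantization
five_bit_total = bits_per_frame(10, 5)   # five-bit quantization
```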
In prior-art vector quantization (VQ), a fixed codebook or library of vectors is established, which is intended to include approximations of all vectors which are likely to be encountered in speech. The line spectrum vector of the speech to be transmitted is compared with the library vectors to find the best match. Instead of transmitting the line spectrum vector itself over the data path, an index or codeword is transmitted which identifies the particular one of the library vectors which is the closest match. The index accesses a corresponding library vector at the receiver. High-quality speech reproduction which is independent of vocabulary and of speaker requires 2^22 vectors, which is about four million vectors. This number of vectors is so large that significant processing time is required for the comparison using currently available technology, and a substantial amount of memory is required at both the transmitter and at the receiver for vector storage.
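The codebook search described above may be sketched as an exhaustive nearest-match search. This is a minimal illustration under stated assumptions: squared error is assumed as the matching criterion, which the description does not specify, and the function names and the three-entry toy codebook are hypothetical.

```python
from typing import List, Sequence


def squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed distortion measure between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vq_encode(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Transmitter side: compare against every library vector, return the best index."""
    return min(range(len(codebook)),
               key=lambda i: squared_error(vector, codebook[i]))


def vq_decode(index: int, codebook: Sequence[Sequence[float]]) -> List[float]:
    """Receiver side: the index simply looks up the stored library vector."""
    return list(codebook[index])


def index_bits(codebook_size: int) -> int:
    """Bits needed to transmit an index into a codebook of the given size."""
    return max(1, (codebook_size - 1).bit_length())


# Toy two-number codebook; a practical codebook is vastly larger.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
index = vq_encode([0.9, 1.2], toy_codebook)
reconstructed = vq_decode(index, toy_codebook)
```

Because every library vector is visited once per frame, the search cost grows in direct proportion to the codebook size, which is why a codebook of about four million vectors is burdensome.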
Another type of quantization which has been described in the prior art for reduction of the data rate, in place of scalar quantization, is split vector quantization (split-VQ). In split-VQ, the line spectrum vector is split into plural portions, such as two portions, each of which is independently quantized using a separate codebook. In the above-mentioned 10-number example, the vector might be broken into two portions, a six-number portion and a four-number portion, each of which is quantized using a codebook of 2^12 vectors, corresponding to 4096 vectors. The size of the codebook is based upon experimental results reported in the literature. Using 4096 vectors in each codebook, for example, split-VQ uses 24 bits (12 bits for each of the two indices), which is a greater number of bits than the 22 bits of ordinary VQ, but split-VQ is more practical because far fewer vectors must be stored and searched.
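The six-number and four-number split described above may be sketched as follows, together with the arithmetic that makes split-VQ more practical. The squared-error criterion, the function names, and the two-entry toy codebooks are assumptions made only for illustration.

```python
from typing import List, Sequence, Tuple


def nearest_index(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Index of the codebook entry with the smallest squared error (assumed criterion)."""
    def err(entry: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: err(codebook[i]))


def split_vq_encode(vector: Sequence[float], split_at: int,
                    low_codebook: Sequence[Sequence[float]],
                    high_codebook: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Quantize the two portions independently, each with its own codebook."""
    return (nearest_index(vector[:split_at], low_codebook),
            nearest_index(vector[split_at:], high_codebook))


def split_vq_decode(indices: Tuple[int, int],
                    low_codebook: Sequence[Sequence[float]],
                    high_codebook: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the two looked-up portions back into one vector."""
    return list(low_codebook[indices[0]]) + list(high_codebook[indices[1]])


# Storage and bit cost for the sizes quoted in the text.
full_vq_stored_numbers = (2 ** 22) * 10                # one codebook of 10-number vectors
split_vq_stored_numbers = (2 ** 12) * 6 + (2 ** 12) * 4  # two smaller codebooks
split_vq_bits = 12 + 12                                # one 12-bit index per portion

# Toy codebooks with two entries each, standing in for the 4096-entry ones.
low_cb = [[0.0] * 6, [1.0] * 6]
high_cb = [[0.0] * 4, [1.0] * 4]
vector = [0.9] * 6 + [0.1] * 4
indices = split_vq_encode(vector, 6, low_cb, high_cb)
rebuilt = split_vq_decode(indices, low_cb, high_cb)
```

With the quoted sizes, the two split codebooks together hold 40,960 numbers, against 41,943,040 for the single large codebook, at a cost of two additional transmitted bits per frame.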
Some prior-art speech encoding systems for low data rate transmission further reduce the data rate by taking advantage of the relatively small changes in the speech from frame to frame. This is accomplished by transmitting only a subset of the frames, and by interpolating the missing frames at the receiver. This technique is known as "frame interpolation". Clearly, frame interpolation cannot work properly when its underlying assumption of slowly varying speech is not in fact the case. More particularly, for vowel sounds, the spectrum of the signal changes very slowly, so the speech content of the frames of the block will be very similar. As a result, the distortion due to frame interpolation will be low. For many consonants, however, the spectrum is changing rapidly so that three successive frames are widely different, and frame interpolation is less efficient. Even when speech does in fact change slowly, frame interpolation does not provide more than about a 50% data rate reduction.
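Frame interpolation as described above may be sketched as follows, assuming that every other frame is transmitted and that each missing frame is reconstructed as the average of its two transmitted neighbors. Both the alternate-frame pattern and the use of linear interpolation are assumptions made for illustration, as are the function names and the one-number toy frames.

```python
from typing import List, Sequence


def select_transmitted(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Transmitter side: keep only the even-numbered frames."""
    return [list(f) for f in frames[::2]]


def interpolate_missing(kept: Sequence[Sequence[float]], total: int) -> List[List[float]]:
    """Receiver side: rebuild all frames, averaging neighbors for the missing ones."""
    rebuilt: List[List[float]] = []
    for i in range(total):
        before = kept[i // 2]
        if i % 2 == 0:
            rebuilt.append(list(before))
        else:
            # Fall back to the previous frame when no later frame was received.
            after = kept[i // 2 + 1] if i // 2 + 1 < len(kept) else before
            rebuilt.append([(b + a) / 2 for b, a in zip(before, after)])
    return rebuilt


# Slowly varying frames (vowel-like), then one abrupt frame (consonant-like).
slow_frames = [[0.0], [1.0], [2.0], [3.0], [4.0]]
abrupt_frames = [[0.0], [1.0], [2.0], [10.0], [4.0]]

slow_rebuilt = interpolate_missing(select_transmitted(slow_frames), len(slow_frames))
abrupt_rebuilt = interpolate_missing(select_transmitted(abrupt_frames), len(abrupt_frames))
```

In this sketch the slowly varying sequence is recovered exactly, whereas the abrupt frame is replaced by the average of its neighbors and its content is lost. Transmitting every other frame also bounds the saving at roughly half of the frames.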
An improved speech encoder is desired.