This invention relates to speech compression using code-excited linear prediction (CELP), and has particular relation to CELP speech compression which uses a low bit rate.
CELP speech compression exploits the fact that, in the time domain, the human vocal tract produces a sequence of sounds, and that each sound is easily divided into a sequence of very similar pitch intervals. A CELP codec compresses and reconstructs each pitch interval in a two-step process: pitch prediction evaluation and innovation signal search.
The pitch prediction evaluation step exploits a characteristic of all pitch intervals: for each pitch interval of the sound, taken at its fundamental pitch, the instantaneous normalized amplitude correlates closely with the instantaneous normalized amplitude at the same part of the previous pitch interval. Normalization means multiplying by some scale factor, and time shifting by some lag (or lead) factor. The instantaneous amplitude of the previous pitch interval is known, or can be synthesized with satisfactory fidelity. Therefore, the instantaneous amplitude of the current pitch interval can be synthesized with satisfactory fidelity even if only the scale and lag factors are known.
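The pitch prediction evaluation described above can be illustrated with a minimal sketch. It assumes a sampled signal held in plain Python lists and an exhaustive least-squares search over candidate lags; the function name `find_pitch_lag_and_scale` and its parameters are hypothetical and are not taken from any particular codec.

```python
def find_pitch_lag_and_scale(history, current, min_lag, max_lag):
    """Return (lag, scale) so that scale * (history delayed by lag) best matches current.

    Assumption for this sketch: min_lag >= len(current) and
    max_lag <= len(history), so every delayed segment lies entirely
    inside the already-known history.
    """
    n = len(current)
    energy_current = sum(c * c for c in current)
    # Start from "no prediction": scale 0 leaves the full energy as error.
    best_lag, best_scale, best_error = min_lag, 0.0, energy_current
    for lag in range(min_lag, max_lag + 1):
        start = len(history) - lag
        segment = history[start:start + n]
        energy_segment = sum(s * s for s in segment)
        if energy_segment == 0.0:
            continue  # a silent segment cannot predict anything
        cross = sum(c * s for c, s in zip(current, segment))
        scale = cross / energy_segment          # least-squares scale for this lag
        error = energy_current - scale * cross  # residual energy after prediction
        if error < best_error:
            best_lag, best_scale, best_error = lag, scale, error
    return best_lag, best_scale
```

Only the returned lag and scale need to be conveyed; the delayed segment itself is already available wherever the previous pitch interval has been synthesized.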
In the innovation signal search step, a search is made among a collection of signals, called innovation signals, for the best signal. The signals in this library are generally entirely random. For each pitch interval of the sound, the innovation signal selected is the one which most closely approximates, moment to moment, a typical difference between the normalized amplitude of one pitch interval and the normalized amplitude of the previous pitch interval. The innovation signals are therefore inherently normalized. A suitable scale factor by which the innovation signal is to be multiplied must be established. It is often not necessary to further establish a lag factor for the innovation signal, but one can be provided if desired.
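As a hedged illustration of the search just described, the sketch below builds a small library of random signals and picks the entry, together with its scale factor, that leaves the smallest squared error against a target difference signal. The Gaussian random library, the squared-error criterion, and the names `make_random_codebook` and `search_innovation_codebook` are assumptions made for this example only.

```python
import random


def make_random_codebook(size, length, seed=0):
    """Build a hypothetical library of `size` random signals of `length` samples."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(size)]


def search_innovation_codebook(target, codebook):
    """Return (index, scale) of the library signal that best matches target."""
    best_index, best_scale, best_error = 0, 0.0, float("inf")
    for index, signal in enumerate(codebook):
        energy = sum(s * s for s in signal)
        if energy == 0.0:
            continue  # an all-zero entry cannot approximate anything
        cross = sum(t * s for t, s in zip(target, signal))
        scale = cross / energy  # least-squares scale factor for this entry
        error = sum((t - scale * s) ** 2 for t, s in zip(target, signal))
        if error < best_error:
            best_index, best_scale, best_error = index, scale, error
    return best_index, best_scale
```

No lag is searched here, consistent with the observation that a lag factor for the innovation signal is often unnecessary.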
The scale and lag factors from the pitch prediction step, and the scale factor and innovation signal from the innovation signal search step, could be transmitted directly on a telephone line. They could similarly be recorded directly on a tape or other recording medium; "transmit," as used herein, therefore includes "record," and "receive" therefore includes "play back." Regardless of whether transmission or recording is contemplated, however, direct transmission can be improved upon by coding. Each scale factor is coded in such a fashion that all scale factors falling within a particular range (or bin) are given a single code. A different code is provided for each range. Ranges of pitch lags are similarly coded. Range boundaries may be selected in any manner which the worker finds convenient. Good results may be obtained by selecting range boundaries which result in each code being transmitted about as often as any other code.
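One way to obtain ranges whose codes occur about equally often is to place the boundaries at quantiles of a set of representative values. The sketch below assumes such a training set is available. The function names are hypothetical, and the use of the bin mean as the value recovered from each code is an assumption of this example, not something the text prescribes.

```python
import bisect


def equal_frequency_boundaries(samples, num_codes):
    """Pick num_codes - 1 boundaries so each range holds about the same number of samples."""
    ordered = sorted(samples)
    n = len(ordered)
    return [ordered[(k * n) // num_codes] for k in range(1, num_codes)]


def encode(value, boundaries):
    """Map a value to the code (0 .. len(boundaries)) of the range it falls in."""
    return bisect.bisect_right(boundaries, value)


def range_representatives(samples, boundaries):
    """Assumed decoding table: the mean of the training samples in each range."""
    bins = [[] for _ in range(len(boundaries) + 1)]
    for s in samples:
        bins[encode(s, boundaries)].append(s)
    return [sum(b) / len(b) if b else None for b in bins]
```

With boundaries chosen this way, every value in a given range maps to the same code, and the receiving end needs only the table of representatives to turn a code back into a usable factor.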
A code is also transmitted indicating which innovation signal was selected. The collection or library of innovation signals therefore forms a codebook, and the "innovation signal search step" is therefore often called the "innovation codebook search step".
The codes may be transmitted using analog technology, but digital transmission is preferred.
At the receiving (or playback) end, CELP processing takes the innovation signal code and reverses it to produce the innovation signal. It takes the innovation scale factor code and reverses it to produce the innovation scale factor. It multiplies the innovation signal by the innovation scale factor to produce a synthesized scaled innovation signal. It takes the overall synthesized signal of the previous pitch interval, lags it by the pitch lag (reversed from the pitch lag code), and multiplies the result by the pitch scale factor (reversed from the pitch scale factor code) to produce a synthesized pitch signal. The synthesized pitch signal and the synthesized scaled innovation signal are added together to form the overall synthesized signal of the current pitch interval. This overall synthesized signal is applied to a linear predictive coding (LPC) synthesis filter. The coefficients of the LPC synthesis filter are adaptively selected at the transmitting (or recording) end, as is known in the art. These coefficients are coded, and the coefficient codes are transmitted with the other codes. The process is then repeated with the next set of codes: LPC filter coefficients, pitch lag, pitch scale factor, innovation index, and innovation scale factor.
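The receiving-end steps above can be sketched as follows, assuming the five codes have already been converted back into numeric values. The function names, the list-based buffers, and the sign convention of the all-pole filter (each output is the input plus a weighted sum of earlier outputs) are assumptions of this sketch.

```python
def synthesize_excitation(past, innovation, lag, pitch_scale, innovation_scale):
    """Add the scaled, lagged previous signal to the scaled innovation signal.

    Assumes 1 <= lag <= len(past).
    """
    buffer = list(past)
    frame = []
    for sample in innovation:
        # Sample located `lag` positions before the one being produced.
        pitch_part = pitch_scale * buffer[len(buffer) - lag]
        value = pitch_part + innovation_scale * sample
        buffer.append(value)  # lets lags shorter than the frame reuse new samples
        frame.append(value)
    return frame


def lpc_synthesis(excitation, coeffs, state):
    """All-pole filter; `state` holds earlier outputs, newest first."""
    state = list(state)
    output = []
    for e in excitation:
        y = e + sum(a * s for a, s in zip(coeffs, state))
        output.append(y)
        state = ([y] + state)[:len(coeffs)]
    return output, state
```

The returned filter state and the newly produced excitation would be carried over as the starting point for the next set of codes.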
At the transmitting (or recording) end, an approximate set of these five codes is selected, and the incoming actual speech is compared with speech from the synthesized signal produced from these five codes. The codes are then adaptively modified until the difference between the actual incoming speech and the speech from the synthesized signal (as determined by a perceptual weighting filter) reaches a minimum. The codes which produce this minimum difference are then transmitted (or recorded) to the receiving (or playback) end.
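The closed-loop selection described above can be sketched in a simplified form. The text does not say how the codes are adaptively modified, so this example substitutes a plain exhaustive search over every combination of candidate codes. The `synthesize` callback, the candidate lists, and the identity function standing in for the perceptual weighting filter are all hypothetical placeholders.

```python
from itertools import product


def analysis_by_synthesis(target, candidates, synthesize, weight=lambda err: err):
    """Return (codes, error) for the code combination whose synthesis best matches target.

    candidates: one list of possible values per code.
    synthesize: hypothetical callback mapping a tuple of codes to a signal.
    weight:     stand-in for a perceptual weighting filter (identity by default).
    """
    best_codes, best_error = None, float("inf")
    for codes in product(*candidates):
        synthesized = synthesize(codes)
        difference = [t - s for t, s in zip(target, synthesized)]
        error = sum(w * w for w in weight(difference))
        if error < best_error:
            best_codes, best_error = codes, error
    return best_codes, best_error
```

Exhaustive search grows quickly with the number of codes and candidates, which is one reason a practical encoder would be expected to search the codes in stages rather than jointly.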
The foregoing CELP process produces synthesized speech which is perceived by the human ear as intelligible, but not of high fidelity. Additional bits can be devoted to any or all of the five codes to obtain additional fidelity, but such bandwidth is expensive and not always available. What is needed is a way to get improved fidelity, as perceived by the human ear, without requiring additional bit bandwidth.