One type of voice communications system of interest to the teaching of this invention uses a code division, multiple access (CDMA) technique such as one originally defined by the EIA Interim Standard IS-95A, and in later revisions thereof and enhancements thereto. This CDMA system is based on a digital spread-spectrum technology which transmits multiple, independent user signals across a single 1.25 MHz segment of radio spectrum. In CDMA, each user signal includes a different orthogonal code and a pseudo-random binary sequence that modulates a carrier, spreading the spectrum of the waveform, and thus allowing a large number of user signals to share the same frequency spectrum. The user signals are separated in the receiver by a correlator that allows only the signal energy from the selected orthogonal code to be de-spread. The other users' signals, whose codes do not match, are not de-spread and, as such, contribute only to noise and thus represent a self-interference generated by the system. The signal-to-noise ratio (SNR) of the system is determined by the ratio of the desired signal power to the sum of the power of all interfering signals, enhanced by the system processing gain, that is, the ratio of the spread bandwidth to the baseband data rate.
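The separation of user signals by correlation, and the processing gain, can be illustrated with a minimal sketch. It is not taken from IS-95A: it assumes ideal chip-synchronous baseband signals, uses short Walsh codes of length 8, and omits the pseudo-random sequence and the carrier. All function names are hypothetical.

```python
import math

def walsh_matrix(order):
    """Build a 2**order x 2**order Walsh-Hadamard matrix with +1/-1 entries."""
    rows = [[1]]
    for _ in range(order):
        rows = [r + r for r in rows] + [r + [-c for c in r] for r in rows]
    return rows

def spread(bits, code):
    """Map each bit (0 -> +1, 1 -> -1) and multiply it by every chip of the code."""
    chips = []
    for b in bits:
        symbol = 1 if b == 0 else -1
        chips.extend(symbol * c for c in code)
    return chips

def despread(chips, code):
    """Correlate the received chips with one code and decide each bit by sign."""
    n = len(code)
    bits = []
    for i in range(0, len(chips), n):
        corr = sum(x * c for x, c in zip(chips[i:i + n], code))
        bits.append(0 if corr > 0 else 1)
    return bits

def processing_gain_db(spread_bandwidth_hz, data_rate_bps):
    """Processing gain as the ratio of spread bandwidth to data rate, in dB."""
    return 10.0 * math.log10(spread_bandwidth_hz / data_rate_bps)

codes = walsh_matrix(3)                     # eight mutually orthogonal codes
user_a_bits = [0, 1, 1, 0]
user_b_bits = [1, 1, 0, 0]
# Both users occupy the same channel at the same time; their chips simply add.
channel = [a + b for a, b in zip(spread(user_a_bits, codes[1]),
                                 spread(user_b_bits, codes[2]))]
recovered_a = despread(channel, codes[1])   # the correlator selects user A
recovered_b = despread(channel, codes[2])   # the correlator selects user B
gain_db = processing_gain_db(1.25e6, 9600)  # figures quoted in the text
```

Because the codes are orthogonal, the correlation of one user's chips with the other user's code sums to zero, so each correlator output depends only on the selected user. In a real system the orthogonality is imperfect and the residual appears as the self-interference described above.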
The CDMA system as defined in IS-95A uses a variable rate voice coding algorithm in which the data rate can change dynamically from one 20 millisecond frame to the next as a function of the speech pattern (voice activity). The Traffic Channel frames can be transmitted at full, 1/2, 1/4 or 1/8 rate (9600, 4800, 2400 and 1200 bps, respectively). With each lower data rate, the transmitted power (E_s) is lowered proportionally, thus enabling an increase in the number of user signals in the channel.
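The arithmetic implied by these rates can be sketched as follows. The sketch assumes that the transmitted power scales exactly in proportion to the data rate, which is a simplification of the statement above, and the bit counts are gross bits per 20 millisecond frame. The names `traffic_frame` and `rate_table` are hypothetical.

```python
import math

FRAME_SECONDS = 0.020   # 20 millisecond frame
FULL_RATE_BPS = 9600    # full-rate Traffic Channel data rate

def traffic_frame(rate_divisor):
    """Return the data rate, gross bits per frame and relative power for one rate."""
    if rate_divisor not in (1, 2, 4, 8):
        raise ValueError("rate divisor must be 1, 2, 4 or 8")
    bps = FULL_RATE_BPS // rate_divisor
    return {
        "bps": bps,
        "bits_per_frame": round(bps * FRAME_SECONDS),
        # Assumption: power falls in exact proportion to the data rate.
        "relative_power_db": 10.0 * math.log10(1.0 / rate_divisor),
    }

rate_table = {d: traffic_frame(d) for d in (1, 2, 4, 8)}
```

Under this assumption each halving of the data rate lowers the transmitted power by about 3 dB, which is the headroom that allows additional user signals in the channel.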
Toll quality speech reproduction at low bit rates (e.g., around 4,000 bits per second (4 kb/s) and lower, such as 4, 2 and 0.8 kb/s) has proven to be a difficult task. Despite efforts made by many speech researchers, the quality of speech that is coded at low bit rates is typically not adequate for wireless and network applications. In the conventional CELP algorithm, the excitation is not efficiently generated and the periodicity existing in the residual signal during voiced intervals is not appropriately exploited. Moreover, CELP coders and their derivatives have not shown satisfactory subjective performance at low bit rates.
In conventional analysis-by-synthesis ("AbS") coding of speech, the speech waveform is partitioned into a sequence of successive frames. Each frame has a fixed length and is partitioned into an integer number of equal length subframes. The encoder generates an excitation signal by a trial and error search process whereby each candidate excitation for a subframe is applied to a synthesis filter, and the resulting segment of synthesized speech is compared with a corresponding segment of target speech. A measure of distortion is computed and a search mechanism identifies the best (or nearly best) choice of excitation for each subframe among an allowed set of candidates. When the candidates are stored as vectors in a codebook, the coding method is referred to as code excited linear prediction (CELP). At other times, the candidates are generated as they are needed for the search by a predetermined generating mechanism. This case includes, in particular, multi-pulse linear predictive coding (MP-LPC) and algebraic code excited linear prediction (ACELP). The bits needed to specify the chosen excitation subframe are part of the package of data that is transmitted to the receiver in each frame.
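The trial and error search described above can be sketched as an exhaustive codebook search. This is an illustrative simplification, not the procedure of any particular coder: it assumes a zero-state all-pole synthesis filter, a plain squared-error distortion measure with no perceptual weighting, and an optimal gain computed per candidate. The function names and the toy codebook are hypothetical.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter 1/A(z) with zero initial state:
    s[n] = e[n] - sum_k lpc[k-1] * s[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out

def search_codebook(target, codebook, lpc):
    """Try every candidate excitation and return (index, gain, error) of the best."""
    best = None
    for index, vector in enumerate(codebook):
        synth = synthesize(vector, lpc)
        energy = sum(y * y for y in synth)
        if energy == 0.0:
            continue  # an all-zero candidate cannot match anything
        # Gain that minimises the squared error for this candidate.
        gain = sum(t * y for t, y in zip(target, synth)) / energy
        error = sum((t - gain * y) ** 2 for t, y in zip(target, synth))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

# Toy subframe of four samples and a three-entry codebook (illustrative values).
lpc = [-0.9]
codebook = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [1.0, -1.0, 1.0, -1.0]]
target = [0.5 * s for s in synthesize(codebook[2], lpc)]
best_index, best_gain, best_error = search_codebook(target, codebook, lpc)
```

Only the winning index and a quantized gain would need to be transmitted, since the decoder holds the same codebook and synthesis filter.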
Usually the excitation is formed in two stages, where a first approximation to the excitation subframe is selected from an adaptive codebook which contains past excitation vectors, and then a modified target signal is formed as the new target for a second AbS search operation which uses the above described procedure.
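The two-stage structure can be sketched as a sequential search in which the adaptive codebook contribution is removed from the target before the second search. The sketch makes several assumptions: integer lags only, a zero-state synthesis filter, no perceptual weighting, and unquantized gains. All names are hypothetical.

```python
def synthesize(excitation, lpc):
    """Zero-state all-pole synthesis filter: s[n] = e[n] - sum_k lpc[k-1] * s[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out

def best_match(target, candidates, lpc):
    """Return (index, gain, scaled filtered candidate) with the least squared error."""
    best = None
    for i, cand in enumerate(candidates):
        y = synthesize(cand, lpc)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        corr = sum(t * v for t, v in zip(target, y))
        score = corr * corr / energy  # maximising this minimises the error
        if best is None or score > best[0]:
            best = (score, i, corr / energy, y)
    _, i, gain, y = best
    return i, gain, [gain * v for v in y]

def adaptive_vector(past_excitation, lag, length):
    """Copy the excitation from `lag` samples back, repeating it if lag < length."""
    start = len(past_excitation) - lag
    return [past_excitation[start + (n % lag)] for n in range(length)]

def two_stage_search(target, past_excitation, lags, fixed_codebook, lpc):
    n = len(target)
    # Stage 1: adaptive codebook built from the past excitation.
    adaptive = [adaptive_vector(past_excitation, lag, n) for lag in lags]
    a_idx, a_gain, a_contrib = best_match(target, adaptive, lpc)
    # Stage 2: the modified target is what stage 1 left unexplained.
    new_target = [t - c for t, c in zip(target, a_contrib)]
    f_idx, f_gain, _ = best_match(new_target, fixed_codebook, lpc)
    return lags[a_idx], a_gain, f_idx, f_gain

past = [1.0, 0.0, -0.5, 0.0, 0.25, 0.0, 0.75, -1.0]
fixed = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
lpc = [-0.9]
target = synthesize([0.8 * v for v in adaptive_vector(past, 5, 4)], lpc)
lag, adaptive_gain, fixed_index, fixed_gain = two_stage_search(
    target, past, [4, 5, 6], fixed, lpc)
```

Searching the stages in sequence is generally suboptimal compared with a joint search, but it keeps the number of candidate evaluations additive rather than multiplicative.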
In Relaxation CELP (RCELP), as used in the Enhanced Variable Rate Coder (TIA/EIA/IS-127), the input speech signal is modified through a process of time warping to ensure that it conforms to a simplified (linear) pitch contour. The modification is performed as follows.
The speech signal is divided into frames and linear prediction is performed to generate a residual signal. A pitch analysis of the residual signal is then performed, and an integer pitch value, computed once per frame, is transmitted to the decoder. The transmitted pitch value is interpolated to obtain a sample-by-sample estimate of the pitch, defined as the pitch contour. Next, the residual signal is modified at the encoder to generate a modified residual signal, which is perceptually similar to the original residual. In addition, the modified residual signal exhibits a strong correlation between samples separated by one pitch period (as defined by the pitch contour). The modified residual signal is filtered through a synthesis filter derived from the linear prediction coefficients, to obtain the modified speech signal. The modification of the residual signal may be accomplished in a manner described in U.S. Pat. No. 5,704,003.
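The interpolation of the transmitted pitch value into a pitch contour can be sketched as follows. The sketch assumes straight-line interpolation between the pitch of the previous frame and that of the current frame, with the current value reached at the last sample of the frame. The exact end points and any special handling of large pitch changes in TIA/EIA/IS-127 are not reproduced here, and the name `pitch_contour` is hypothetical.

```python
def pitch_contour(prev_pitch, curr_pitch, frame_length):
    """Return one pitch (delay) estimate per sample of the current frame,
    moving linearly from the previous frame's pitch to the current one."""
    step = (curr_pitch - prev_pitch) / frame_length
    return [prev_pitch + step * (n + 1) for n in range(frame_length)]

# Toy example: pitch rises from 40 to 50 samples over a five-sample frame.
contour = pitch_contour(40, 50, 5)
```

Because only one integer pitch value is sent per frame, the decoder can rebuild the same contour from the current and previous values without any additional bits.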
The standard encoding (search) procedure for RCELP is similar to regular CELP except for two important differences. First, the RCELP adaptive excitation is obtained by time-warping the past encoded excitation signal using the pitch contour. Second, the analysis-by-synthesis objective in RCELP is to obtain the best possible match between the synthetic speech and the modified speech signal.
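The first difference, forming the adaptive excitation by time-warping the past excitation along the pitch contour, can be sketched as a sample-by-sample delayed copy. The sketch substitutes linear interpolation for fractional delays, which is an assumption made for brevity and is not the interpolation filter of the standard. The function name is hypothetical.

```python
import math

def warped_adaptive_excitation(past_excitation, contour):
    """Extend the excitation one sample at a time using e[n] = e[n - d(n)],
    where d(n) is the pitch contour value for sample n.
    Fractional delays are handled by linear interpolation (an assumption)."""
    buf = list(past_excitation)
    start = len(buf)
    for delay in contour:
        if delay < 1 or delay > len(buf):
            raise ValueError("delay must lie between 1 and the buffer length")
        pos = len(buf) - delay          # possibly fractional index into the past
        lo = int(math.floor(pos))
        frac = pos - lo
        hi = min(lo + 1, len(buf) - 1)
        buf.append((1.0 - frac) * buf[lo] + frac * buf[hi])
    return buf[start:]

# A past excitation with a period of four samples and a constant contour of 4
# simply continues the periodic pattern.
periodic = warped_adaptive_excitation([1, 2, 3, 4, 1, 2, 3, 4], [4] * 6)
# A fractional delay falls halfway between two past samples.
halfway = warped_adaptive_excitation([0.0, 10.0], [1.5])
```

With a varying contour, the same loop stretches or compresses the past excitation so that it follows the pitch track against which the modified speech signal was aligned.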