1. Technical Field
The present invention relates generally to speech encoding and decoding in mobile cellular communication networks and, more particularly, it relates to various techniques used with code-excited linear prediction coding to obtain high quality speech reproduction through a limited bit rate communication channel.
2. Related Art
Signal modeling and parameter estimation play significant roles in data compression, decompression, and coding. To model basic speech sounds, speech signals must be sampled as a discrete waveform to be digitally processed. In one type of signal coding technique, called linear predictive coding (LPC), the signal value at any particular time index is modeled as a linear function of previous values. A subsequent signal is thus linearly predictable according to an earlier value. As a result, efficient signal representations can be determined by estimating and applying certain prediction parameters to represent the signal.
For linear predictive analysis, neighboring speech samples are highly correlated. Coding efficiency can be improved by canceling redundancies by using a short term predictor to extract the formants of the signal. To compress speech data, it is desirable to extract only essential information to avoid transmitting redundancies. If desired, speech can be grouped into segments or short blocks, where various characteristics of the segments can be identified. "Good quality" speech may be characterized as speech that, when reproduced after having been encoded, is substantially perceptually indistinguishable from spoken speech. In order to generate good quality speech, a code excited linear predictive (CELP) speech coder must extract LPC parameters, pitch lag parameters (including lag and its associated coefficient), an optimal excitation (innovation) code-vector from a supplied codebook, and a corresponding gain parameter from the input speech. The encoder quantizes the LPC parameters by implementing appropriate coding schemes.
More particularly, the speech signal can be modeled as the output of a linear-prediction filter for the current speech coding segment, typically called a frame (with a typical duration of about 10-40 ms), where the filter is represented by the equation:
$$A(z) = 1 - a_1 z^{-1} - a_2 z^{-2} - \dots - a_{np} z^{-np}$$
and the nth sample can be predicted by
$$\hat{y}(n) = \sum_{k=1}^{np} a_k \, y(n-k)$$
where "np" is the LPC prediction order (usually approximately 10), y(n) is sampled speech data, and "n" represents the time index.
The LPC equations above describe the estimation of the current sample according to the linear combination of the past samples. The difference between them is called the LPC residual, where:
$$r(n) = y(n) - \hat{y}(n) = y(n) - \sum_{k=1}^{np} a_k \, y(n-k)$$
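The prediction and residual relations above can be illustrated with a minimal Python sketch. The function names `lpc_predict` and `lpc_residual`, the toy signal, and the choice to treat samples before the start of the buffer as zero are assumptions made for illustration only and are not taken from the described coder.

```python
def lpc_predict(y, a, n):
    """Estimate sample n as sum over k of a[k-1] * y(n-k), k = 1..np."""
    total = 0.0
    for k in range(1, len(a) + 1):
        if n - k >= 0:  # assumption: samples before the buffer start are zero
            total += a[k - 1] * y[n - k]
    return total


def lpc_residual(y, a):
    """Return r(n) = y(n) - y_hat(n) for every sample of y."""
    return [y[n] - lpc_predict(y, a, n) for n in range(len(y))]


# Toy example with prediction order np = 2.
# y_hat(n) = 2*y(n-1) - y(n-2) predicts a linear ramp exactly once
# two past samples are available.
y = [1.0, 2.0, 3.0, 4.0]
a = [2.0, -1.0]
r = lpc_residual(y, a)
print(r)  # [1.0, 0.0, 0.0, 0.0]
```

In this toy case the residual carries energy only in the first sample, where no past samples are available, which illustrates why the residual is cheaper to code than the signal itself.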
A perceptual weighting filter W(z), based on the LPC filter, that models the sensitivity of the human ear is then defined by:
$$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)}, \quad \text{where } 0 < \gamma_2 < \gamma_1 \le 1$$
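One way to realize such a weighting filter is sketched below in Python. Substituting z/γ for z in A(z) = 1 − Σ a_k z^(−k) scales each coefficient a_k by γ^k, so the filter can be run as a direct-form difference equation. The function names, the zero initial filter state, and the default values γ1 = 0.9 and γ2 = 0.6 are assumptions for illustration; the text only requires 0 < γ2 < γ1 ≤ 1.

```python
def weighted_coeffs(a, gamma):
    """Coefficients of A(z/gamma): a_k is scaled by gamma**k, k = 1..np."""
    return [a_k * gamma ** (k + 1) for k, a_k in enumerate(a)]


def perceptual_weighting(x, a, gamma1=0.9, gamma2=0.6):
    """Filter x through W(z) = A(z/gamma1) / A(z/gamma2).

    Difference equation (zero initial state assumed):
    out(n) = x(n) - sum_k num_k * x(n-k) + sum_k den_k * out(n-k)
    """
    num = weighted_coeffs(a, gamma1)  # taps of the numerator A(z/gamma1)
    den = weighted_coeffs(a, gamma2)  # taps of the denominator A(z/gamma2)
    out = []
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc -= num[k - 1] * x[n - k]
                acc += den[k - 1] * out[n - k]
        out.append(acc)
    return out


# Impulse response of W(z) for a first-order predictor with a_1 = 0.5.
w = perceptual_weighting([1.0, 0.0, 0.0], [0.5])
```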
The LPC prediction coefficients a1, a2, . . . , ap are quantized and used to predict the signal, where "p" represents the LPC order (denoted "np" in the equations above).
After removing the correlation between adjacent samples, the resulting signal is further filtered through a long term pitch predictor to extract the pitch information, and thus remove the correlation between adjacent pitch periods. The pitch data is quantized and used for predictive filtering of the speech signal. The information transmitted to the decoder includes the quantized filter parameters, gain terms, and the quantized LPC residual from the filters.
The LPC residual is modeled by samples from a stochastic codebook. Typically, the codebook comprises N excitation code-vectors, each vector having a length L. According to the analysis-by-synthesis procedure, a search of the codebook is performed to determine the best excitation code-vector which, when scaled by a gain factor and processed through the two filters (i.e., long and short term), most closely restores the pitch and voice information. The resultant signal is used to compute an optimal gain (the gain corresponding to the minimum distortion) for that particular excitation vector and an error value. This best excitation code-vector and its associated gain provide for the reproduction of "good speech" as described above. An index value associated with the code-vector, as well as the optimal gain, are then transmitted to the receiver end of the decoder. At that point, the selected excitation vector is multiplied by the appropriate gain, and the signal is passed through the two filters to generate the restored speech.
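The analysis-by-synthesis search described above can be sketched in Python as follows. For brevity, the sketch assumes that the long and short term filters have been combined into a single impulse response `h` and that the target signal is already available; the function names and the toy codebook are likewise hypothetical. For each code-vector, the gain that minimizes the squared error is the correlation between the target and the filtered vector divided by the energy of the filtered vector.

```python
def convolve_truncated(c, h):
    """Filter code-vector c with impulse response h, keeping len(c) samples."""
    out = []
    for n in range(len(c)):
        acc = 0.0
        for j in range(n + 1):
            if n - j < len(h):
                acc += c[j] * h[n - j]
        out.append(acc)
    return out


def search_codebook(target, codebook, h):
    """Return (index, gain, error) of the code-vector with minimum distortion."""
    best = None
    for i, c in enumerate(codebook):
        y = convolve_truncated(c, h)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # an all-zero filtered vector cannot match any target
        corr = sum(t * v for t, v in zip(target, y))
        gain = corr / energy  # gain that minimizes the squared error
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (i, gain, error)
    if best is None:
        raise ValueError("codebook contains no usable code-vector")
    return best


# Toy codebook of N = 3 code-vectors, each of length L = 3.
codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
index, gain, error = search_codebook([0.0, 2.0, 1.0], codebook, [1.0, 0.5])
```

Only `index` and a quantized form of `gain` would need to be sent to the decoder, which holds the same codebook.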
To extract desired pitch parameters, the pitch parameters that minimize the following weighted coding error energy "d" must be calculated for each coding subframe, where one coding frame may be divided into several coding subframes for analysis and coding:
$$d = \left| T - \beta P_{Lag} H - \alpha C_i H \right|^2$$
where T is the target signal that represents the perceptually filtered input signal, and H is the impulse response matrix of the filter W(z)/A(z). PLag is the pitch prediction contribution having pitch lag "Lag" and prediction coefficient, or gain, "β" which is uniquely defined for a given lag, and Ci is the codebook contribution associated with index "i" in the codebook and its corresponding gain "α". In addition, "i" takes values between 0 and Nc − 1, where Nc is the size of the excitation codebook.
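The weighted coding error energy defined above can be evaluated for one subframe with the following Python sketch. It assumes that T, PLag, and Ci are row vectors multiplied on the right by the matrix H, as the order of the factors in the equation suggests; the function names and the identity matrix used in the example are hypothetical.

```python
def vec_mat(v, H):
    """Row vector times matrix: (v H)[j] = sum over i of v[i] * H[i][j]."""
    return [sum(v[i] * H[i][j] for i in range(len(v)))
            for j in range(len(H[0]))]


def coding_error(T, P_lag, C_i, H, beta, alpha):
    """d = |T - beta * P_lag H - alpha * C_i H|^2 for one subframe."""
    pitch = vec_mat(P_lag, H)  # filtered pitch prediction contribution
    code = vec_mat(C_i, H)     # filtered codebook contribution
    diff = [t - beta * p - alpha * c for t, p, c in zip(T, pitch, code)]
    return sum(e * e for e in diff)


# Toy subframe of two samples with H set to the identity matrix.
H = [[1.0, 0.0], [0.0, 1.0]]
d = coding_error([1.0, 2.0], [1.0, 0.0], [0.0, 1.0], H, beta=1.0, alpha=1.0)
print(d)  # 1.0
```

An encoder would evaluate this quantity for candidate lags and codebook indices and keep the combination that gives the smallest value.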
Thus, given a particular pitch lag Lag and gain β, a pitch prediction contribution can be removed from the LPC residual r(n). The resulting signal
$$\varepsilon(n) = r(n) - \beta \, e(n - Lag)$$
is called the pitch residual. The coding of this signal determines the excitation signal. In a CELP codec, the pitch residual is vector quantized by selecting an optimum codebook entry (quantizer) that best matches:
$$\varepsilon(n) = \alpha \, c_i(n) + \delta(n)$$
where ci(n) is the nth element of the ith quantizer, α is the associated gain, and δ(n) is the quantization error signal.
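The pitch residual and its vector quantization, as described above, can be sketched in Python as follows. The sketch assumes that e(n) is a buffer of past excitation samples whose last element is e(−1), and that the lag is at least one subframe long so that only past samples are needed; the function names and toy values are hypothetical.

```python
def pitch_residual(r, e_past, lag, beta):
    """eps(n) = r(n) - beta * e(n - lag), with e_past[-1] holding e(-1)."""
    if not len(r) <= lag <= len(e_past):
        raise ValueError("this sketch assumes len(r) <= lag <= len(e_past)")
    # n - lag is negative here, so it indexes e_past from its end.
    return [r[n] - beta * e_past[n - lag] for n in range(len(r))]


def quantize(eps, codebook):
    """Return (i, alpha, delta) such that eps(n) = alpha * c_i(n) + delta(n)
    with the smallest quantization error energy over the codebook."""
    best = None
    for i, c in enumerate(codebook):
        energy = sum(v * v for v in c)
        if energy == 0.0:
            continue
        alpha = sum(x * v for x, v in zip(eps, c)) / energy
        delta = [x - alpha * v for x, v in zip(eps, c)]
        err = sum(q * q for q in delta)
        if best is None or err < best[3]:
            best = (i, alpha, delta, err)
    if best is None:
        raise ValueError("codebook contains no usable entry")
    return best[:3]


# Toy subframe of two samples with lag = 2 and beta = 0.5.
eps = pitch_residual([1.0, 2.0], [4.0, 2.0, 6.0], lag=2, beta=0.5)
i, alpha, delta = quantize(eps, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Here `eps` evaluates to [0.0, -1.0], which the second codebook entry reproduces exactly with a gain of -1.0, so the quantization error is zero in this toy case.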
The codebook may be populated randomly or trained by selecting codebook entries frequently used in coding training data. A randomly populated codebook, for example, requires no training, or knowledge of the quantization error vectors from the previous stage. Such random codebooks also provide good quality estimation, with little or no signal dependency. A random codebook is typically populated using a Gaussian distribution, with little or no bias or assumptions of input or output coding. Nevertheless, random codebooks require substantial complexity and a significant amount of memory. In addition, random code-vectors do not accommodate the pitch harmonic phenomena, particularly where a long subframe is used.
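A randomly populated codebook of the kind described above can be generated with a short Python sketch. The function name, the zero-mean unit-variance Gaussian, the fixed seed, and the 10-bit index with 40 samples per code-vector are assumptions chosen for illustration.

```python
import random


def random_codebook(num_vectors, length, seed=0):
    """Populate a codebook with zero-mean, unit-variance Gaussian samples."""
    rng = random.Random(seed)  # fixed seed so encoder and decoder can agree
    return [[rng.gauss(0.0, 1.0) for _ in range(length)]
            for _ in range(num_vectors)]


# A 10-bit index addresses 2**10 code-vectors; 40 samples per vector assumed.
codebook = random_codebook(2 ** 10, 40)
stored_values = len(codebook) * len(codebook[0])
print(stored_values)  # 40960
```

No training data is involved, but every one of the stored values must be kept in memory and searched, which reflects the complexity and memory cost noted above.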
One challenge in employing a random codebook is that a substantial amount of training is necessary to ensure "good" quality speech coding. For example, with a trained codebook, the code-vector distribution within the codebook is arranged to represent speech signal vectors. Conversely, a randomly populated codebook inherently has no such intelligent vector distribution. Thus, if the vectors happen to be distributed in an ineffective manner for encoding a given speech signal, undesirably large coding errors may result.
In a trained codebook, particular input vectors that represent the coded vector are selected. The vector having the shortest distance to other vectors within the grouping may be selected as an input vector. Upon partitioning the vector space into particular input vectors that represent each subspace, the coordinates of the representative vectors are input into the codebook. Although training avoids a codebook having disjoint and poorly organized vectors, there may be instances when the input vectors should represent very high or very low frequency speech (e.g., common female or male speech). In such cases, input vectors at opposite ends of the vector space may be desirable.
Another drawback to a trained codebook is that since the codebook is signal dependent, to develop a multi-lingual speech coder, training must accommodate a variety of different languages. Such codebook training would be intrinsically complex. In either case, whether using a conventional trained or untrained codebook, the memory storage requirements are significant. For example, in a typical 10-12 bit codebook that requires 30-40 samples, approximately 40,000 bits are necessary to store the codebook.
Various aspects of the present invention can be found in a codebook structure used in modeling and communicating speech. The codebook structure comprises an analog-to-digital (A/D) converter, speech processing circuitry for processing a digital signal received from the A/D converter, channel processing circuitry for processing the digital signal, speech memory, channel memory, additional speech processing circuitry and channel processing circuitry for further processing of the digital signal and a digital-to-analog converter (D/A). The speech memory comprises a fixed codebook and an adaptive codebook.
The speech processing circuitry comprises an adaptive codebook that receives a reconstructed speech signal, a gain that is multiplied by the output of the adaptive codebook, a fixed codebook that also receives the reconstructed speech signal, a gain that is multiplied by the output of the fixed codebook, a software control formula to sum the signals from the adaptive and fixed codebooks in order to generate an excitation signal and a synthesis filter that generates a new reconstructed speech signal from the excitation signal.
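The signal flow described above, in which the gain-scaled adaptive and fixed codebook outputs are summed into an excitation signal that drives a synthesis filter, can be sketched in Python as follows. The sketch assumes that the synthesis filter is 1/A(z) with A(z) = 1 − Σ a_k z^(−k) and a zero initial state; the function names and toy values are hypothetical.

```python
def build_excitation(adaptive_vec, fixed_vec, gain_adaptive, gain_fixed):
    """Sum the gain-scaled adaptive and fixed codebook contributions."""
    return [gain_adaptive * p + gain_fixed * c
            for p, c in zip(adaptive_vec, fixed_vec)]


def synthesize(excitation, a):
    """Synthesis filter 1/A(z): s(n) = e(n) + sum over k of a_k * s(n-k)."""
    out = []
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:  # assumption: filter memory starts at zero
                acc += a[k - 1] * out[n - k]
        out.append(acc)
    return out


excitation = build_excitation([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.5, 2.0)
speech = synthesize(excitation, [0.5])
print(excitation)  # [0.5, 2.0, 0.0]
print(speech)      # [0.5, 2.25, 1.125]
```

In a complete coder the reconstructed excitation would also be fed back to update the adaptive codebook for the next subframe, a step omitted here.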
The fixed codebooks comprise two or more sub-codebooks. Each of the sub-codebooks is populated in such a way that the corresponding code-vectors of each of the corresponding sub-codebooks are set to an energy level of one, that is, are orthogonal to each other.
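Setting a code-vector to an energy level of one can be illustrated with the following Python sketch, which scales a vector so that the sum of its squared samples equals one. The function name and the example vector are hypothetical, and the sketch covers only the energy scaling; it does not test or enforce any relationship between different code-vectors.

```python
import math


def normalize_energy(vec):
    """Scale vec so that the sum of its squared samples equals one."""
    energy = sum(v * v for v in vec)
    if energy == 0.0:
        raise ValueError("an all-zero vector cannot be normalized")
    scale = 1.0 / math.sqrt(energy)
    return [v * scale for v in vec]


unit = normalize_energy([3.0, 4.0])      # approximately [0.6, 0.8]
unit_energy = sum(v * v for v in unit)   # approximately 1.0
```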
The bits of the combination code-vectors are generally intertwined, but can also be combined sequentially, that is, retaining the bit order found in each of the original code-vectors prior to combination.
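The two ways of combining code-vectors mentioned above can be contrasted with a short Python sketch. It assumes two code-vectors of equal length represented as lists of bits, and it interprets "intertwined" as alternating elements from each vector; the function names and this interpretation are assumptions for illustration.

```python
def combine_interleaved(v1, v2):
    """Alternate elements of v1 and v2 (equal lengths assumed)."""
    if len(v1) != len(v2):
        raise ValueError("this sketch assumes code-vectors of equal length")
    out = []
    for x, y in zip(v1, v2):
        out.extend([x, y])
    return out


def combine_sequential(v1, v2):
    """Append v2 after v1, retaining the order within each code-vector."""
    return list(v1) + list(v2)


print(combine_interleaved([1, 0, 1], [0, 0, 1]))  # [1, 0, 0, 0, 1, 1]
print(combine_sequential([1, 0, 1], [0, 0, 1]))   # [1, 0, 1, 0, 0, 1]
```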