1. Technical Field
The present invention relates generally to speech encoding and decoding in mobile cellular communication networks and, more particularly, it relates to various techniques used with code-excited linear prediction coding to obtain high quality speech reproduction through a limited bit rate communication channel.
2. Related Art
Signal modeling and parameter estimation play significant roles in data compression, decompression, and coding. To model basic speech sounds, speech signals must be sampled as a discrete waveform to be digitally processed. In one type of signal coding technique, called linear predictive coding (LPC), the signal value at any particular time index is modeled as a linear function of previous values. A subsequent signal is thus linearly predictable according to an earlier value. As a result, efficient signal representations can be determined by estimating and applying certain prediction parameters to represent the signal.
For linear predictive analysis, neighboring speech samples are highly correlated. Coding efficiency can be improved by using a short term predictor to cancel these redundancies and extract the formants of the signal. To compress speech data, it is desirable to extract only essential information to avoid transmitting redundancies. If desired, speech can be grouped into segments or short blocks, where various characteristics of the segments can be identified. "Good quality" speech may be characterized as speech that, when reproduced after having been encoded, is substantially perceptually indistinguishable from spoken speech. In order to generate good quality speech, a code excited linear predictive (CELP) speech coder must extract, from the input speech, LPC parameters, pitch lag parameters (including lag and its associated coefficient), an optimal excitation (innovation) code-vector from a supplied codebook, and a corresponding gain parameter. The encoder quantizes the LPC parameters by implementing appropriate coding schemes.
More particularly, the speech signal can be modeled as the output of a linear-prediction filter for the current speech coding segment, typically called a frame (typical duration of about 10-40 ms), where the filter is represented by the equation: $$A(z) = 1 - a_1 z^{-1} - a_2 z^{-2} - \dots - a_{np} z^{-np}$$
and the n-th sample can be predicted by $$\hat{y}(n) = \sum_{i=1}^{np} a_i \, y(n-i)$$
where "np" is the LPC prediction order (usually approximately 10), y(n) is sampled speech data, and "n" represents the time index.
The LPC equations above describe the estimation of the current sample according to the linear combination of the past samples. The difference between the current sample and its estimate is called the LPC residual: $$r(n) = y(n) - \sum_{i=1}^{np} a_i \, y(n-i)$$
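By way of illustration only, the prediction and residual relations above can be sketched in a few lines of Python. The function names `lpc_predict` and `lpc_residual` are hypothetical, samples before the start of the signal are assumed to be zero, and the first-order coefficient in the example is chosen purely so that the result is easy to check by hand.

```python
def lpc_predict(y, a, n):
    """Predict sample n as sum_i a_i * y(n - i); samples before index 0 count as zero."""
    return sum(a[i - 1] * y[n - i] for i in range(1, len(a) + 1) if n - i >= 0)


def lpc_residual(y, a):
    """Return r(n) = y(n) - sum_i a_i * y(n - i) for every sample index n."""
    return [y[n] - lpc_predict(y, a, n) for n in range(len(y))]


# Illustrative first-order case: y(n) = 0.5 ** n is exactly predictable with a_1 = 0.5,
# so only the first sample (which has no history) leaves a non-zero residual.
samples = [0.5 ** n for n in range(6)]
residual = lpc_residual(samples, [0.5])
```

In this toy case the residual carries all of its energy in a single sample, which is the sense in which the short term predictor removes redundancy between neighboring samples.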
A perceptual weighting filter W(z), which is based on the LPC filter and models the sensitivity of the human ear, is then defined by: ##EQU3##
The LPC prediction coefficients $a_1, a_2, \dots, a_p$ are quantized and used to predict the signal, where "p" represents the LPC order (written "np" in the filter equation above).
After removing the correlation between adjacent signals, the resulting signal is further filtered through a long term pitch predictor to extract the pitch information, and thus remove the correlation between adjacent pitch periods. The pitch data is quantized and used for predictive filtering of the speech signal. The information transmitted to the decoder includes the quantized filter parameters, gain terms, and the quantized LPC residual from the filters.
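As a rough sketch of long term prediction, the following Python fragment searches a range of candidate lags and, for each one, computes the least-squares prediction coefficient. This simple open-loop search over the residual is an assumption made for illustration; the function name `best_pitch_lag`, the lag range, and the pulse-train test signal are all hypothetical.

```python
def best_pitch_lag(r, min_lag, max_lag):
    """Return (lag, beta) minimizing sum_n (r(n) - beta * r(n - lag))**2 for n >= max_lag."""
    window = range(max_lag, len(r))  # same window for every lag, so errors are comparable
    best = None
    for lag in range(min_lag, max_lag + 1):
        num = sum(r[n] * r[n - lag] for n in window)
        den = sum(r[n - lag] ** 2 for n in window)
        beta = num / den if den > 0.0 else 0.0  # least-squares gain for this lag
        err = sum((r[n] - beta * r[n - lag]) ** 2 for n in window)
        if best is None or err < best[0]:
            best = (err, lag, beta)
    return best[1], best[2]


# Illustrative input: a pulse train with a period of 7 samples.
pulses = [1.0 if n % 7 == 0 else 0.0 for n in range(40)]
lag, beta = best_pitch_lag(pulses, 3, 10)
```

For the pulse train, the search settles on the true period, and subtracting `beta * r(n - lag)` leaves nothing in the analysis window, which is the removal of correlation between adjacent pitch periods in its simplest form.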
The LPC residual is modeled by samples from a stochastic codebook. Typically, the codebook comprises N excitation code-vectors, each vector having a length L. According to the analysis-by-synthesis procedure, a search of the codebook is performed to determine the best excitation code-vector which, when scaled by a gain factor and processed through the two filters (i.e., long and short term), most closely restores the pitch and voice information. The resultant signal is used to compute an optimal gain (the gain corresponding to the minimum distortion) for that particular excitation vector and an error value. This best excitation code-vector and its associated gain provide for the reproduction of "good speech" as described above. An index value associated with the code-vector, as well as the optimal gain, are then transmitted to the receiver end of the decoder. At that point, the selected excitation vector is multiplied by the appropriate gain, and the signal is passed through the two filters to generate the restored speech.
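The decoder-side reconstruction just described can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive decoder: the function name `synthesize` and its argument layout are hypothetical, the long term filter is taken to be a single-tap predictor with gain `beta` at delay `lag`, and the short term filter is the all-pole filter 1/A(z).

```python
def synthesize(code_vector, gain, beta, lag, a, past_excitation, past_speech):
    """Rebuild one subframe: scale the code-vector, then apply the long and short term filters.

    past_excitation must hold at least `lag` samples and past_speech at least len(a) samples.
    """
    excitation = list(past_excitation)  # history needed by the long term (pitch) filter
    speech = list(past_speech)          # history needed by the short term (LPC) filter
    for c in code_vector:
        # Long term filter: add the excitation from one pitch period earlier.
        e = gain * c + beta * excitation[-lag]
        excitation.append(e)
        # Short term synthesis filter 1/A(z): s(n) = e(n) + sum_i a_i * s(n - i).
        s = e + sum(a[i] * speech[-1 - i] for i in range(len(a)))
        speech.append(s)
    return speech[len(past_speech):]


# Illustrative values only: a single pulse, zero filter memories, first-order LPC filter.
restored = synthesize([1.0, 0.0, 0.0, 0.0], gain=2.0, beta=0.5, lag=2, a=[0.5],
                      past_excitation=[0.0, 0.0], past_speech=[0.0])
```

The encoder runs the same filtering inside its search loop, which is why the procedure is called analysis-by-synthesis.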
To extract desired pitch parameters, the pitch parameters that minimize the following weighted coding error energy "d" must be calculated for each coding subframe, where one coding frame may be divided into several coding subframes for analysis and coding: $$d = \left| T - \beta P_{\mathrm{Lag}} H - \alpha C_i H \right|^2$$
where T is the target signal that represents the perceptually filtered input signal, and H is the impulse response matrix of the filter W(z)/A(z). $P_{\mathrm{Lag}}$ is the pitch prediction contribution having pitch lag "Lag" and prediction coefficient, or gain, $\beta$, which is uniquely defined for a given lag, and $C_i$ is the codebook contribution associated with index "i" in the codebook and its corresponding gain $\alpha$. In addition, "i" takes values between 0 and $N_c - 1$, where $N_c$ is the size of the excitation codebook.
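For a fixed pitch contribution, the minimization of "d" over the codebook index and gain can be sketched as an exhaustive search. The sketch assumes that multiplication by H is equivalent to zero-state filtering by a truncated impulse response `h`, and that the gain for each candidate is chosen by least squares; the function names `filter_vector` and `search_codebook` and all numeric values are hypothetical.

```python
def filter_vector(x, h):
    """Zero-state filtering of x by impulse response h, truncated to len(x) samples."""
    return [sum(x[k] * h[n - k] for k in range(n + 1) if n - k < len(h))
            for n in range(len(x))]


def search_codebook(target, pitch_vector, beta, codebook, h):
    """Return (index, alpha, d) minimizing d = |T - beta*P_Lag*H - alpha*C_i*H|**2."""
    filtered_pitch = filter_vector(pitch_vector, h)
    # Remove the pitch contribution from the target before searching the codebook.
    reduced = [t - beta * p for t, p in zip(target, filtered_pitch)]
    best = None
    for i, code_vector in enumerate(codebook):
        y = filter_vector(code_vector, h)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # an all-zero filtered vector cannot reduce the error
        alpha = sum(t * v for t, v in zip(reduced, y)) / energy  # least-squares gain
        d = sum((t - alpha * v) ** 2 for t, v in zip(reduced, y))
        if best is None or d < best[2]:
            best = (i, alpha, d)
    return best


# Illustrative data: the target is built from entry 2 with gain 2 plus a pitch term.
h = [1.0, 0.5]
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 0.0, -1.0, 0.0]]
result = search_codebook([2.0, 1.0, -2.0, -0.5], [0.0, 0.0, 0.0, 1.0], 0.5, codebook, h)
```

Because the example target was constructed from the third entry, the search recovers that index and its gain with zero error. A practical coder would also search over the lag and would usually avoid recomputing the filtered vectors for every subframe.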
Thus, given a particular pitch lag Lag and gain $\beta$, a pitch prediction contribution can be removed from the LPC residual r(n). The resulting signal $$\epsilon(n) = r(n) + \delta(n)$$
is called the pitch residual. The coding of this signal determines the excitation signal. In a CELP codec, the pitch residual is vector quantized by selecting an optimum codebook entry (quantizer) that best matches: $$\epsilon(n) = \alpha \, c_i(n) + \delta(n)$$
where $c_i(n)$ is the n-th element of the i-th quantizer, $\alpha$ is the associated gain, and $\delta(n)$ is the quantization error signal.
The codebook may be populated randomly or trained by selecting codebook entries frequently used in coding training data. A randomly populated codebook, for example, requires no training, or knowledge of the quantization error vectors from the previous stage. Such random codebooks also provide good quality estimation, with little or no signal dependency. A random codebook is typically populated using a Gaussian distribution, with little or no bias or assumptions of input or output coding. Nevertheless, random codebooks require substantial complexity and a significant amount of memory. In addition, random code-vectors do not accommodate the pitch harmonic phenomena, particularly where a long subframe is used.
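A randomly populated codebook of the kind described can be sketched as follows. The function name `random_codebook`, the zero-mean unit-variance Gaussian, the fixed seed, and the small sizes are illustrative assumptions.

```python
import random


def random_codebook(num_vectors, length, seed=0):
    """Return num_vectors code-vectors, each holding `length` Gaussian samples."""
    rng = random.Random(seed)  # fixed seed so encoder and decoder can build identical tables
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(num_vectors)]


# Illustrative sizes only; the codebooks discussed here are far larger.
table = random_codebook(num_vectors=8, length=5, seed=1)
```

Every one of the N x L values has to be stored and searched, which is one way to see the memory and complexity cost noted above.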
One challenge in employing a random codebook is that it lacks the substantial amount of training that is necessary to ensure "good" quality speech coding. For example, with a trained codebook, the code-vector distribution within the codebook is arranged to represent speech signal vectors. Conversely, a randomly populated codebook inherently has no such intelligent vector distribution. Thus, if the vectors happen to be distributed in an ineffective manner for encoding a given speech signal, undesirably large coding errors may result.
In a trained codebook, particular input vectors that represent the coded vector are selected. The vector having the shortest distance to other vectors within the grouping may be selected as an input vector. Upon partitioning the vector space into particular input vectors that represent each subspace, the coordinates of the representative vectors are input into the codebook. Although training avoids a codebook having disjoint and poorly organized vectors, there may be instances when the input vectors should represent very high or very low frequency speech (e.g., common female or male speech). In such cases, input vectors at opposite ends of the vector space may be desirable.
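One possible reading of this training procedure is sketched below: training vectors are partitioned by their nearest representative, and each group is then represented by the member with the smallest total distance to the other members. The function names, the use of squared Euclidean distance, the fixed iteration count, and the toy data are assumptions made for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def representative(group):
    """Member of the group with the smallest total distance to the other members."""
    return min(group, key=lambda v: sum(squared_distance(v, w) for w in group))


def train_codebook(training, initial, iterations=5):
    """Alternate between partitioning the training set and re-selecting representatives."""
    codebook = [list(v) for v in initial]
    for _ in range(iterations):
        groups = [[] for _ in codebook]
        for v in training:
            nearest = min(range(len(codebook)),
                          key=lambda i: squared_distance(v, codebook[i]))
            groups[nearest].append(v)
        # Keep the old entry when no training vector falls into its subspace.
        codebook = [list(representative(g)) if g else c
                    for g, c in zip(groups, codebook)]
    return codebook


# Toy training set with two well-separated groups of two-dimensional vectors.
training = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
trained = train_codebook(training, initial=[[1.0, 0.0], [11.0, 10.0]])
```

The result depends entirely on the training data, which is the signal dependency discussed in connection with trained codebooks.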
Another drawback to a trained codebook is that, because the codebook is signal dependent, training must accommodate a variety of different languages in order to develop a multi-lingual speech coder. Such codebook training would be intrinsically complex. In either case, whether using a conventional trained or untrained codebook, the memory storage requirements are significant. For example, in a typical 10-12 bit codebook in which each code-vector contains 30-40 samples, approximately 40,000 bits are necessary to store the codebook.
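The storage estimate can be checked with simple arithmetic. In the sketch below, `bits_per_sample` is a hypothetical parameter that is not given in the text; with a 10-bit index and 40 samples per code-vector the table holds 40,960 sample values, which corresponds to roughly 40,000 bits only if each sample occupies a single bit, and to proportionally more otherwise.

```python
def codebook_entries(index_bits, samples_per_vector):
    """Number of stored sample values: 2**index_bits vectors times samples per vector."""
    return (2 ** index_bits) * samples_per_vector


def codebook_storage_bits(index_bits, samples_per_vector, bits_per_sample):
    """Total storage in bits; bits_per_sample is an assumed, illustrative parameter."""
    return codebook_entries(index_bits, samples_per_vector) * bits_per_sample


smallest = codebook_entries(10, 40)             # 1024 vectors * 40 samples
largest = codebook_storage_bits(12, 40, 1)      # 4096 vectors * 40 samples * 1 bit
```

Each additional index bit doubles the requirement, so the upper end of the 10-12 bit range needs four times the storage of the lower end for the same vector length.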