This invention relates to digital speech encoders using code excited linear prediction coding, or CELP. More particularly, this invention relates to a method and apparatus for efficiently selecting a desired codevector used to reproduce an encoded speech segment at the decoder.
Direct quantization of analog speech signals is too inefficient for effective bandwidth utilization. A technique known as linear predictive coding, or LPC, which takes advantage of speech signal redundancies, requires far fewer bits to transmit or store speech signals. Originally, speech signals are produced as a result of acoustical excitation of the vocal tract. While the vocal cords produce the acoustical excitation, the vocal tract (e.g. mouth, tongue and lips) acts as a time varying filter of the vocal excitation. Thus, speech signals can be efficiently represented as a quasi-periodic excitation signal plus the time varying parameters of a digital filter. In addition, the periodic nature of the vocal excitation can further be represented by a linear filter excited by a noise-like Gaussian sequence. Thus, in CELP, a first long delay predictor corresponds to the pitch periodicity of the human vocal cords, and a second short delay predictor corresponds to the filtering action of the human vocal tract.
CELP reproduces the individual speaker's voice by processing the input speech to determine the desired excitation sequence and time varying digital filter parameters. At the encoder, a prediction filter forms an estimate for the current sample of the input signal based on the past reconstructed values of the signal at the receiver decoder, i.e. the transmitter encoder predicts the value that the receiver decoder will reconstruct. The difference between the current value and predicted value of the input signal is the prediction error. For each frame of speech, the prediction residual and filter parameters are communicated to the receiver. The prediction residual or prediction error is also known as the innovation sequence and is used at the receiver as the excitation input to the prediction filters to reconstruct the speech signal. Each sample of the reconstructed speech signal is produced by adding the received signal to the predicted estimate of the present sample. For each successive speech frame, the innovation sequence and updated filter parameters are communicated to the receiver decoder.
The innovation sequence is typically encoded using codebook encoding. In codebook encoding, each possible innovation sequence is stored as an entry in a codebook and each is represented by an index. The transmitter and receiver both have the same codebook contents. To communicate a given innovation sequence, the index for that innovation sequence in the transmitter codebook is transmitted to the receiver. At the receiver, the received index is used to look up the desired innovation sequence in the receiver codebook for use as the excitation sequence to the time varying digital filters.
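By way of illustration only, the index-based exchange described above may be sketched as follows. The codebook contents, the vector length, and the function names `encode` and `decode` are hypothetical and chosen for brevity; a practical codebook would be far larger.

```python
# Hypothetical toy codebook; transmitter and receiver hold identical copies.
CODEBOOK = [
    [1, 0, -1, 0],
    [0, 1, 0, -1],
    [1, 1, 0, 0],
    [0, 0, -1, -1],
]


def encode(innovation):
    """Transmitter side: return the index of the stored innovation sequence."""
    return CODEBOOK.index(innovation)


def decode(index):
    """Receiver side: look the sequence up in the identical codebook copy."""
    return CODEBOOK[index]


index = encode([1, 1, 0, 0])   # only this small integer is transmitted
excitation = decode(index)     # the receiver recovers the full sequence
```

In this sketch a four-entry codebook needs only a 2-bit index per frame, whatever the length of the stored sequences.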
The task of the CELP encoder is to generate the time varying filter coefficients and the innovation sequence in real time. The difficulty of rapidly selecting the best innovation sequence from a set of possible innovation sequences for each frame of speech is an impediment to commercial achievement of real time CELP based systems, such as cellular telephone, voice mail and the like.
Both random and deterministic codebooks are known. Random codebooks are used because the prediction error samples have been shown to be nearly white Gaussian random noise. However, random codebooks present a heavy computational burden to select an innovation sequence from the codebook at the encoder since the codebook must be exhaustively searched.
To select an innovation sequence from the codebook of stored innovation sequences, a given fidelity criterion is used. Each innovation sequence is filtered through time varying linear recursive filters to reconstruct (predict) the speech frame as it would be reconstructed at the receiver. The predicted speech frame using the candidate innovation sequence is compared with the desired target speech frame (filtered through a perceptual weighting filter) and the fidelity criterion is calculated. The process is repeated for each stored innovation sequence. The innovation sequence that maximizes the fidelity criterion function is selected as the optimum innovation sequence, and an index representing the selected optimum sequence is sent to the receiver, along with other filter parameters.
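The exhaustive selection procedure described above may be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the recursive prediction filters are stood in for by a truncated impulse response `h`, the target is assumed to be already perceptually weighted, and the fidelity criterion is assumed to be the squared correlation between target and candidate divided by the candidate energy. All names and values are hypothetical.

```python
def synthesize(excitation, h):
    """Zero-state filter response: convolve with a truncated impulse response h.

    Assumption: h stands in for the time varying recursive filters.
    """
    return [
        sum(h[j] * excitation[i - j] for j in range(min(i, len(h) - 1) + 1))
        for i in range(len(excitation))
    ]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fidelity(target, candidate):
    """Assumed criterion: squared correlation over candidate energy."""
    energy = dot(candidate, candidate)
    return dot(target, candidate) ** 2 / energy if energy > 0 else 0.0


def exhaustive_search(codebook, target, h):
    """Filter and score every stored sequence; return the best index and score."""
    best_index, best_score = -1, -1.0
    for index, codevector in enumerate(codebook):
        score = fidelity(target, synthesize(codevector, h))
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score


# Hypothetical data: the target is a scaled copy of the response to entry 2.
h = [1.0, 0.5, 0.25, 0.125]
codebook = [[1, 0, -1, 0], [0, 1, 0, -1], [1, 1, 0, 0], [0, 0, -1, -1]]
target = [0.7 * v for v in synthesize(codebook[2], h)]
best_index, best_score = exhaustive_search(codebook, target, h)
```

Every stored sequence is filtered once per frame in this sketch, which is the source of the computational burden discussed below.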
At the receiver, the index is used to access the selected innovation sequence, and, in conjunction with the other filter parameters, to reconstruct the desired speech.
The central problem is how to select an optimum innovation sequence from the codebook at the encoder within the constraints of real time speech encoding and acceptable transmission delay. In a random codebook, the innovation sequences are independently generated random white Gaussian sequences. The computational burden of performing an exhaustive search of all the innovation sequences in the random codebook is extremely high because each innovation sequence must be passed through the prediction filters.
One prior art solution to the problem of selecting an innovation sequence is found in U.S. Pat. No. 4,797,925 in which the adjacent codebook entries have a subset of elements in common. In particular, each succeeding code sequence may be generated from the previous code sequence by removing one or more elements from the beginning of the previous sequence and adding one or more elements to the end of the previous sequence. The filter response to each succeeding code sequence is then generated from the filter response to the preceding code sequence by subtracting the filter response to the first samples and appending the filter response to the added samples. Such overlapping codebook structure permits accelerated calculation of the fidelity criterion.
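The general idea of an overlapping codebook may be illustrated for the simplest case of a shift by one sample. The sketch below is an assumption-laden illustration and is not taken from the cited patent: the filter is modeled by a truncated impulse response `h` at least as long as the code sequence, and the names `zero_state_response` and `shifted_response` are hypothetical.

```python
def zero_state_response(vector, h):
    """Direct computation: convolve the vector with impulse response h."""
    return [
        sum(h[j] * vector[i - j] for j in range(i + 1))
        for i in range(len(vector))
    ]


def shifted_response(prev_vector, prev_response, new_sample, h):
    """Response to prev_vector with its first sample dropped and new_sample appended."""
    n = len(prev_vector)
    new_vector = prev_vector[1:] + [new_sample]
    # Subtract the contribution of the dropped first sample, then shift by one.
    response = [
        prev_response[i + 1] - h[i + 1] * prev_vector[0] for i in range(n - 1)
    ]
    # Only the final output sample requires a full sum over the new vector.
    response.append(sum(h[j] * new_vector[n - 1 - j] for j in range(n)))
    return new_vector, response


# Hypothetical overlapping codebook: entries are windows of one long sequence.
sequence = [1, -1, 0, 1, 1, -1]
h = [1.0, 0.5, 0.25, 0.125]
first = sequence[0:4]
first_response = zero_state_response(first, h)
second, second_response = shifted_response(first, first_response, sequence[4], h)
```

In this sketch, all but the last output sample of each succeeding response are obtained with one multiply and one subtract per sample rather than a full convolution.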
Another prior art solution to the problem of rapidly selecting an optimum innovation sequence is found in U.S. Pat. No. 4,817,157 in which the codebook of excitation vectors is derived from a set of M basis vectors which are used to generate a set of 2^M codebook excitation code vectors. The entire codebook of 2^M possible excitation vectors is searched using the knowledge of how the code vectors are generated from the basis vectors, without having to generate and evaluate each of the individual code vectors.
The present invention is embodied in a speech communication system using a ternary innovation codebook which is formed by the sum of two binary codebooks. The ternary codebook has code sequences Ck, constructed from the set of values {−1,0,1}. To form the ternary codebook, one binary codebook has the values {0,1}, and the other binary codebook has the values {−1,0}. The sum of one binary codevector from each binary codebook forms a ternary codevector. The codebook structure of the present invention permits several efficient search procedures and reduced storage. For example, a ternary codebook of 256 sequences may be formed from two binary codebooks of 16 each (32 total). Each of the 256 ternary sequences is formed as the sum of 1 of 16 binary sequences from the first binary codebook and 1 of 16 binary sequences from the second binary codebook.
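The construction of ternary codevectors from two binary codebooks may be sketched as follows. The codebook contents, the vector length of four, and the function name `build_ternary_codebook` are hypothetical and serve only to illustrate the element-by-element sum described above.

```python
from itertools import product


def build_ternary_codebook(positive_book, negative_book):
    """Sum every pair (one vector from each binary codebook) element by element."""
    return [
        [p + q for p, q in zip(pos, neg)]
        for pos, neg in product(positive_book, negative_book)
    ]


# Hypothetical toy binary codebooks with two entries each.
positive_book = [[1, 0, 0, 1], [0, 1, 0, 0]]      # values in {0, 1}
negative_book = [[0, -1, 0, 0], [0, 0, -1, -1]]   # values in {-1, 0}

ternary_book = build_ternary_codebook(positive_book, negative_book)
```

With two binary codebooks of 16 entries each, the same function would yield the 256 ternary sequences of the example above while only 32 binary sequences are stored. Under the ordering used in this sketch, the ternary entry at position `i * len(negative_book) + j` is the sum of entry `i` of the first codebook and entry `j` of the second.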
More important than reduced storage, the binary codebooks may be efficiently searched for optimum values of a given fidelity criterion function. The computational burden of searching for optimum sequences is eased because there are fewer sequences (32 versus 256 in the above example) to filter and correlate in computing the fidelity criterion function, even for an exhaustive search of all combinations of the two binary codebooks. Since the processing is linear, the principle of superposition may be used to obtain the result of ternary codevector processing by adding the results of binary codevector processing. In addition, as alternate embodiments to an exhaustive search of the binary codebooks, two sub-optimum searches are possible.
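The use of superposition may be illustrated by the following sketch, which checks that the filtered ternary codevector, its correlation with the target, and its energy can all be assembled from quantities computed for the two binary codevectors. The truncated impulse response `h`, the target values, and all names are assumptions made for illustration.

```python
def filter_response(vector, h):
    """Zero-state response: convolve the vector with impulse response h."""
    return [
        sum(h[j] * vector[i - j] for j in range(i + 1))
        for i in range(len(vector))
    ]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical filter, target, and one codevector from each binary codebook.
h = [1.0, 0.5, 0.25, 0.125]
target = [0.5, -1.0, 0.25, 2.0]
pos = [1, 0, 0, 1]      # from the {0, 1} codebook
neg = [0, -1, 0, 0]     # from the {-1, 0} codebook

# Each binary codevector is filtered once.
y_pos = filter_response(pos, h)
y_neg = filter_response(neg, h)

# Ternary results assembled by superposition.
y_super = [a + b for a, b in zip(y_pos, y_neg)]
corr_super = dot(target, y_pos) + dot(target, y_neg)
energy_super = dot(y_pos, y_pos) + dot(y_neg, y_neg) + 2 * dot(y_pos, y_neg)

# The same results computed directly from the ternary codevector, for comparison.
ternary = [p + q for p, q in zip(pos, neg)]
y_direct = filter_response(ternary, h)
```

In this sketch only the term `dot(y_pos, y_neg)` depends on the particular pairing; the filtering, the correlations with the target, and the individual energies are computed once per binary codevector.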
In the first sub-optimum search, each binary codebook is independently searched for a subset of optimum binary codevectors. For example, the 5 best binary codevectors of each codebook of 16 codevectors are found, forming two optimum codevector subsets of 5 codevectors each. Then an exhaustive search of all combinations (25 in this example) of the optimum codevector subsets is performed. For the subset exhaustive search calculation, the filtering and auto-correlation terms from the first calculation of the optimum codevector subsets are available for reuse in the subsequent exhaustive search. In addition, the number of cross-correlation calculations, also 25, is substantially reduced compared to the number of cross-correlation calculations required in an exhaustive search of the full codebook sets, i.e. 256.
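The first sub-optimum search may be sketched as follows. This is an illustration under stated assumptions rather than the claimed procedure: the filter is modeled by a truncated impulse response `h`, the fidelity criterion is assumed to be squared correlation over energy, the same criterion is assumed for ranking individual binary codevectors, and the toy codebooks hold 4 entries each with 2 kept per codebook instead of 16 and 5. All names are hypothetical.

```python
def filter_response(vector, h):
    return [sum(h[j] * vector[i - j] for j in range(i + 1))
            for i in range(len(vector))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score(corr, energy):
    """Assumed fidelity criterion: squared correlation over energy."""
    return corr * corr / energy if energy > 0 else 0.0


def precompute(book, target, h):
    """Filter each binary codevector once; keep response, correlation, energy."""
    terms = []
    for index, vector in enumerate(book):
        y = filter_response(vector, h)
        terms.append({"index": index, "y": y,
                      "corr": dot(target, y), "energy": dot(y, y)})
    return terms


def subset_search(book_a, book_b, target, h, keep):
    """Keep the best `keep` vectors per codebook, then search all their pairs."""
    def rank(t):
        return score(t["corr"], t["energy"])
    top_a = sorted(precompute(book_a, target, h), key=rank, reverse=True)[:keep]
    top_b = sorted(precompute(book_b, target, h), key=rank, reverse=True)[:keep]
    best_score, best_pair, cross_count = -1.0, None, 0
    for a in top_a:
        for b in top_b:
            cross = dot(a["y"], b["y"])   # the only new term per pair
            cross_count += 1
            s = score(a["corr"] + b["corr"],
                      a["energy"] + b["energy"] + 2 * cross)
            if s > best_score:
                best_score, best_pair = s, (a["index"], b["index"])
    return best_pair, best_score, cross_count


# Hypothetical data.
h = [1.0, 0.5, 0.25, 0.125]
target = [1.0, -0.5, 0.0, 0.5]
book_a = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]]
book_b = [[0, 0, 0, -1], [0, -1, 0, 0], [0, 0, -1, 0], [0, -1, 0, -1]]
best_pair, best_score, cross_count = subset_search(book_a, book_b, target, h, keep=2)
```

With `keep=2` the sketch computes 4 cross-correlation terms instead of the 16 an exhaustive search of these toy codebooks would need, mirroring the reduction from 256 to 25 in the example above.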
In a second sub-optimum search, the one best binary codevector is found from the set consisting of both the first and second binary codebooks. Then an exhaustive search is performed using the one best binary codevector in combination with each of the codevectors from the other binary codebook which did not contain the one best binary codevector. In the second sub-optimum search, the filtering and auto-correlation terms from the first calculation of the fidelity criterion function for the one best binary codevector are available for reuse in the subsequent exhaustive search of the other binary codebook. In addition, the number of cross-correlation calculations is further reduced to 16, which is less than the number of cross-correlation calculations required in an exhaustive search of the full codebook sets or using the optimum subsets.
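The second sub-optimum search may be sketched in the same manner. As before, the truncated impulse response `h`, the squared-correlation-over-energy criterion, the toy codebooks of 4 entries each, and all names are assumptions made for illustration only.

```python
def filter_response(vector, h):
    return [sum(h[j] * vector[i - j] for j in range(i + 1))
            for i in range(len(vector))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score(corr, energy):
    """Assumed fidelity criterion: squared correlation over energy."""
    return corr * corr / energy if energy > 0 else 0.0


def precompute(book, target, h):
    """Filter each binary codevector once; keep response, correlation, energy."""
    terms = []
    for index, vector in enumerate(book):
        y = filter_response(vector, h)
        terms.append({"index": index, "y": y,
                      "corr": dot(target, y), "energy": dot(y, y)})
    return terms


def single_best_search(book_a, book_b, target, h):
    """Find the one best binary codevector, then pair it with the other codebook."""
    terms = {"a": precompute(book_a, target, h),
             "b": precompute(book_b, target, h)}
    candidates = [(name, t) for name, ts in terms.items() for t in ts]
    best_book, anchor = max(
        candidates, key=lambda item: score(item[1]["corr"], item[1]["energy"]))
    other_book = "b" if best_book == "a" else "a"
    best_score, partner, cross_count = -1.0, None, 0
    for t in terms[other_book]:
        cross = dot(anchor["y"], t["y"])   # one cross term per candidate partner
        cross_count += 1
        s = score(anchor["corr"] + t["corr"],
                  anchor["energy"] + t["energy"] + 2 * cross)
        if s > best_score:
            best_score, partner = s, t["index"]
    return best_book, anchor["index"], partner, best_score, cross_count


# Hypothetical data.
h = [1.0, 0.5, 0.25, 0.125]
target = [1.0, -0.5, 0.0, 0.5]
book_a = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]]
book_b = [[0, 0, 0, -1], [0, -1, 0, 0], [0, 0, -1, 0], [0, -1, 0, -1]]
result = single_best_search(book_a, book_b, target, h)
best_book, anchor_index, partner_index, best_score, cross_count = result
```

In this sketch the number of cross-correlation terms equals the size of the codebook that did not contain the one best codevector, which corresponds to 16 for the codebooks of the example above.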