This invention relates to voice compression, and in particular, to code excited linear prediction (CELP) vocoding.
A voice encoder/decoder (vocoder) compresses speech signals in order to reduce the transmission bandwidth required in a communications channel. By reducing the transmission bandwidth required per call, it is possible to increase the number of calls carried over the same communication channel. Early speech coding techniques, such as the linear predictive coding (LPC) technique, use a filter to remove the signal redundancy and hence compress the speech signal. The LPC filter reproduces a spectral envelope that attempts to model the human voice. The LPC filter is excited by quasi-periodic inputs for nasal and vowel sounds, and by noise-like inputs for unvoiced sounds.
There exists a class of vocoders known as code excited linear prediction (CELP) vocoders. CELP vocoding is primarily a speech data compression technique that at 4-8 kbps can achieve speech quality comparable to other 32 kbps speech coding techniques. The CELP vocoder has two improvements over the earlier LPC techniques. First, the CELP vocoder attempts to capture more voice detail by extracting the pitch information using a pitch predictor. Second, the CELP vocoder excites the LPC filter with a noise-like signal derived from a residual signal created from the actual speech waveform.
CELP vocoders contain three main components: 1) a short term predictive filter, 2) a long term predictive filter, also known as a pitch predictor or adaptive codebook, and 3) a fixed codebook. Compression is achieved by assigning to each component a certain number of bits such that the total is less than the number of bits used to represent the original speech signal. The first component uses linear prediction to remove short term redundancies in the speech signal. The error, or residual, signal that results from the short term predictor becomes the target signal for the long term predictor.
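The short term prediction step described above can be illustrated with a minimal sketch. It assumes the autocorrelation method with a Levinson-Durbin recursion, which is one common way to obtain LPC coefficients; the function names, the predictor order, and the zero-history convention are illustrative assumptions and are not taken from this document.

```python
def autocorrelation(x, max_lag):
    """Return r[0..max_lag], where r[k] = sum over n of x[n] * x[n - k]."""
    n = len(x)
    return [sum(x[i] * x[i - k] for i in range(k, n)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a[1..order] (hypothetical helper).

    The predictor has the form x_hat[n] = sum over k of a[k] * x[n - k].
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or perfectly predictable input; stop early
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def lpc_residual(x, coeffs):
    """Residual e[n] = x[n] - prediction, assuming zero history before n = 0."""
    out = []
    for n in range(len(x)):
        pred = sum(c * x[n - k] for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(x[n] - pred)
    return out
```

For a strongly correlated input, the residual returned by `lpc_residual` carries much less energy than the input, which is why fewer bits suffice to represent it.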
Voiced speech has a quasi-periodic nature and the long term predictor extracts a pitch period from the residual and removes the information that can be predicted from the previous period. After the long term and short term filters, the residual signal is a mostly noise-like signal. Using analysis-by-synthesis, the fixed codebook search finds a best match to replace the noise-like residual with an entry from its library of vectors. The code representing the best matching vector is transmitted in place of the noisy residual. In algebraic CELP (ACELP) vocoders, the fixed codebook consists of a few non-zero pulses and is represented by the locations and signs (e.g. +1 or −1) of the pulses.
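The long term prediction step can be sketched as a search for the lag at which the residual best matches its own past. The sketch below is a simplified open-loop search over the residual itself; practical CELP coders typically run this search in closed loop against a weighted target. The function names and the lag range are illustrative assumptions.

```python
def find_pitch_lag(residual, min_lag, max_lag):
    """Return (lag, gain) that maximizes the normalized correlation with the past."""
    best_lag, best_score, best_gain = min_lag, float("-inf"), 0.0
    n = len(residual)
    for lag in range(min_lag, max_lag + 1):
        corr = sum(residual[i] * residual[i - lag] for i in range(lag, n))
        energy = sum(residual[i - lag] ** 2 for i in range(lag, n))
        if energy == 0.0 or corr <= 0.0:
            continue  # nothing useful to predict from at this lag
        score = corr * corr / energy
        if score > best_score:
            best_lag, best_score, best_gain = lag, score, corr / energy
    return best_lag, best_gain


def remove_pitch(residual, lag, gain):
    """Subtract the scaled, delayed residual; the first `lag` samples pass through."""
    out = list(residual[:lag])
    out += [residual[i] - gain * residual[i - lag] for i in range(lag, len(residual))]
    return out
```

Applied to a residual that repeats every pitch period, `remove_pitch` leaves only the first period and whatever could not be predicted from it, which is the mostly noise-like signal handed to the fixed codebook search.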
In a typical implementation, a CELP vocoder will block or divide the incoming speech signal into frames, updating the short term predictor's LPC coefficients once per frame. The LPC residual is then divided into subframes for the long term predictor and the fixed codebook search. For example, the input speech may be blocked into a 160 sample frame for the short term predictor. The resulting residual may then be broken up into subframes of 53 samples, 53 samples, and 54 samples. Each subframe is then processed by the long term predictor and the fixed codebook search.
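The blocking in this example can be sketched as follows. The function name and default argument are illustrative assumptions; only the 160 sample frame and the 53, 53, and 54 sample subframes come from the text.

```python
def split_frame(frame, subframe_lengths=(53, 53, 54)):
    """Split one frame of residual samples into consecutive subframes."""
    if sum(subframe_lengths) != len(frame):
        raise ValueError("subframe lengths must add up to the frame length")
    subframes, start = [], 0
    for length in subframe_lengths:
        subframes.append(frame[start:start + length])
        start += length
    return subframes
```

Each returned subframe would then be passed to the long term predictor and the fixed codebook search in turn.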
Referring to FIG. 1, an example of a single frame of a speech signal 100 is shown. The speech signal 100 is made up of voiced and unvoiced signals of different pitches. The speech signal 100 is received by a CELP vocoder having an LPC filter. The first step of the CELP vocoder is to remove short term redundancies in the speech signal. The resulting signal with the short term redundancies removed is the residual speech signal 200, FIG. 2.
The LPC filter is unable to remove all of the redundant information, and the remaining quasi-periodic peaks and valleys in the filtered speech signal 200 are referred to as pitch pulses. The short term predictive filter is then applied to speech signal 200, resulting in the short term filtered signal 300, FIG. 3. The long term predictor filter removes the quasi-periodic pitch pulses from the residual speech signal 300, FIG. 3, resulting in a mostly noise-like signal 400, FIG. 4, which becomes the target signal for the fixed codebook search. FIG. 4 is a plot of a 160 sample frame of a fixed codebook target signal 350 divided into three subframes 354, 356, 358. The code value is then transmitted across the communication network.
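The fixed codebook search against this target can be sketched as an exhaustive analysis-by-synthesis loop: each candidate vector is passed through a synthesis filter, and the entry whose scaled output lies closest to the target is kept. This is a minimal sketch under stated assumptions; the function names, the use of a truncated impulse response as the synthesis filter, and the omission of perceptual weighting are all simplifications that do not come from this document.

```python
def synthesize(code_vector, impulse_response):
    """Zero-state filtering: convolution with h, truncated to the subframe length."""
    n = len(code_vector)
    out = [0.0] * n
    for i, c in enumerate(code_vector):
        if c == 0:
            continue
        for j, h in enumerate(impulse_response):
            if i + j >= n:
                break
            out[i + j] += c * h
    return out


def search_codebook(target, codebook, impulse_response=(1.0,)):
    """Return (index, gain) of the entry whose synthesized output best fits target.

    Maximizing corr^2 / energy is equivalent to minimizing the squared error
    when each entry is scaled by its own optimal gain corr / energy.
    """
    best_index, best_score, best_gain = -1, float("-inf"), 0.0
    for index, vector in enumerate(codebook):
        y = synthesize(vector, impulse_response)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        corr = sum(t * v for t, v in zip(target, y))
        score = corr * corr / energy
        if score > best_score:
            best_index, best_score, best_gain = index, score, corr / energy
    return best_index, best_gain
```

Only the winning index (the code value) and a quantized gain would need to be sent, rather than the target samples themselves.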
In FIG. 5, a lookup table 400 that maps the positions of the pulses in a subframe is shown. The pulses within the subframe are constrained to lie in one of sixteen possible positions 402 within the lookup table. Because each track 404 has sixteen possible positions 402, only four bits are required to identify each pulse location. Each pulse mapping occurs in an individual track 404. Therefore, two tracks 406, 408 are required to represent the positions of two pulses in the subframe.
In the current example, the subframe 354, FIG. 4, has only 53 samples in the excitation, making positions 0-52 the only valid positions. Because of the way the tracks 406, 408, FIG. 5, are divided, the tracks 406, 408 contain positions that exceed the length of the original excitation. Positions 56 and 60 in track 1, and positions 57 and 61 in track 2, are invalid and unused. The locations of the first two pulses 310, 312, FIG. 4, correspond to sample thirteen and sample seventeen. By using the table 400, FIG. 5, it is determined that sample thirteen lies in position three 410 in the first track 406. The second pulse is in sample seventeen and lies in the second track 408 at position four 412. Therefore, the pulses can be represented and transmitted as four bits each. The other pulses 314, 316, 318, 320 and 322, FIG. 4, in the subframe 354 are ignored because the codebook has only two tracks.
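The four-bits-per-pulse representation can be illustrated with a minimal sketch that packs one position index per track into a single code word. It takes the indices three and four from the example at face value and ignores pulse signs and gains; the function names and the bit ordering (first track in the most significant bits) are illustrative assumptions.

```python
POSITIONS_PER_TRACK = 16
# Sixteen positions need four bits: indices 0..15.
BITS_PER_PULSE = (POSITIONS_PER_TRACK - 1).bit_length()


def pack_indices(indices):
    """Pack one position index per track into an integer code word."""
    code = 0
    for idx in indices:
        if not 0 <= idx < POSITIONS_PER_TRACK:
            raise ValueError("position index out of range for a track")
        code = (code << BITS_PER_PULSE) | idx
    return code


def unpack_indices(code, num_tracks):
    """Recover the per-track position indices from a packed code word."""
    mask = POSITIONS_PER_TRACK - 1
    out = []
    for _ in range(num_tracks):
        out.append(code & mask)
        code >>= BITS_PER_PULSE
    return out[::-1]
```

With two tracks, the pulse positions of a subframe occupy eight bits in total, and `unpack_indices(pack_indices([3, 4]), 2)` recovers the two indices on the decoder side.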
The only pulse position constraint is provided by the pulse positions in the tracks. Disadvantageously, the CELP vocoder tends to place pulses in adjacent positions in the tracks. By placing the pulses in adjacent positions in the tracks, the start of the speech sound is encoded rather than a more balanced encoding of the utterance. Additionally, as the bit rate for the vocoder decreases and fewer pulses are used, the voice quality is adversely affected by inefficient placement of pulses into tracks. What is needed is a method of further constraining the placement of pulses in tracks in order to achieve a more balanced encoding of an utterance.
The inefficiency of track position placement is eliminated by implementing additional constraints that restrict the valid placement of pulses in the pulse position tracks. Constraining the placement of pulses in tracks in this way during encoding of a signal results in an increase in the signal quality of the decoded signal.