A speech encoder converts a speech signal into a digital bit stream that is transmitted over a communication channel or stored in a storage medium. The speech signal is digitized, that is, sampled and quantized, usually with 16 bits per sample. The speech encoder has the role of representing these digital samples with a smaller number of bits while maintaining good subjective speech quality. The speech decoder or synthesizer operates on the transmitted or stored bit stream and converts it back to a speech signal.
Code-Excited Linear Prediction (CELP) coding is one of the best prior art techniques for achieving a good compromise between subjective quality and bit rate. This coding technique forms the basis of several speech coding standards in both wireless and wireline applications. In CELP coding, the sampled speech signal is processed in successive blocks of M samples, usually called frames, where M is a predetermined number typically corresponding to 10-30 ms. A linear prediction (LP) filter is computed and transmitted every frame. The computation of the LP filter typically needs a lookahead, that is, a 5-15 ms speech segment from the subsequent frame. The M-sample frame is divided into smaller blocks called subframes. Usually the number of subframes is three or four, resulting in 4-10 ms subframes. In each subframe, an excitation signal is usually obtained from two components: the past excitation and the innovative, fixed-codebook excitation. The component formed from the past excitation is often referred to as the adaptive codebook or pitch excitation. The parameters characterizing the excitation signal are coded and transmitted to the decoder, where the reconstructed excitation signal is used as the input of the LP filter.
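The frame and subframe arithmetic described above can be illustrated with a minimal sketch. The function names and the example values (8 kHz sampling, a 20 ms frame, four subframes, a 5 ms lookahead) are assumptions chosen for illustration and are not taken from any particular codec.

```python
def frame_layout(sample_rate_hz, frame_ms, num_subframes, lookahead_ms):
    """Return (M, subframe_len, lookahead_len), all in samples.

    Hypothetical helper: it only converts durations to sample counts.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # M samples per frame
    if frame_len % num_subframes != 0:
        raise ValueError("frame length must be a multiple of the subframe count")
    subframe_len = frame_len // num_subframes
    lookahead_len = sample_rate_hz * lookahead_ms // 1000
    return frame_len, subframe_len, lookahead_len


def split_frames(samples, frame_len, num_subframes):
    """Yield, for each complete frame, the list of its subframes."""
    subframe_len = frame_len // num_subframes
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        yield [frame[i:i + subframe_len] for i in range(0, frame_len, subframe_len)]


# Assumed example configuration: 8 kHz, 20 ms frames, 4 subframes, 5 ms lookahead.
M, L, lookahead = frame_layout(8000, 20, 4, 5)
frames = list(split_frames(list(range(400)), M, 4))
```

With these assumed values a frame holds 160 samples, each subframe holds 40 samples (5 ms), and the lookahead spans 40 samples. Trailing samples that do not fill a complete frame are left unprocessed in this sketch.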
CELP-type speech codecs rely heavily on prediction to achieve their high performance. The prediction used can be of different kinds but usually comprises the use of an adaptive codebook containing an excitation signal selected in past frames. A CELP encoder exploits the quasi-periodicity of the voiced speech signal by searching the past excitation for the segment most similar to the segment currently being encoded. The same past excitation signal is also maintained in the decoder. It is then sufficient for the encoder to send a delay parameter and a gain for the decoder to reconstruct the same excitation signal as is used in the encoder. The evolution (difference) between the previous speech segment and the currently encoded speech segment is further modeled using an innovation selected from a fixed codebook. The CELP technology will be described in more detail herein below.
The strong prediction inherent in CELP-based speech coders becomes a problem in the presence of transmission errors (erased frames or packets), when the states of the encoder and the decoder become desynchronized. Because of the prediction, the effect of an erased frame is not limited to the erased frame itself but continues to propagate after the erasure, often over several following frames. Naturally, the perceptual impact can be very annoying.
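The propagation of an erasure through the prediction memory can be illustrated with a toy model. In this sketch the excitation follows e[n] = gain * e[n - delay] + c[n], and a lost frame is concealed by reusing the last received delay with an attenuated gain and no innovation. The concealment rule, the attenuation factor and all numeric values are assumptions made for illustration, not a description of any specific codec.

```python
def decode_frame(memory, delay, gain, innovation):
    """Append one frame of excitation to `memory` and return that frame."""
    frame = []
    for c in innovation:
        sample = gain * memory[-delay] + c  # adaptive part plus innovation
        memory.append(sample)
        frame.append(sample)
    return frame


def run_decoder(frames, lost=frozenset(), memory_len=100, attenuation=0.5):
    """Decode all frames; frames whose index is in `lost` are concealed."""
    memory = [0.0] * memory_len
    last_delay, last_gain = frames[0][0], 0.0
    output = []
    for index, (delay, gain, innovation) in enumerate(frames):
        if index in lost:
            # Assumed concealment: last good delay, attenuated gain, no innovation.
            # Only the frame length is taken from the lost frame.
            silence = [0.0] * len(innovation)
            output.append(decode_frame(memory, last_delay, attenuation * last_gain, silence))
        else:
            output.append(decode_frame(memory, delay, gain, innovation))
            last_delay, last_gain = delay, gain
    return output


def frame_error_energy(reference, decoded):
    """Squared error per frame between two decoded excitations."""
    return [sum((a - b) ** 2 for a, b in zip(r, d)) for r, d in zip(reference, decoded)]


FRAME_LEN = 40
pulse = [1.0] + [0.0] * (FRAME_LEN - 1)      # one pulse per frame as innovation
frames = [(30, 0.9, pulse)] * 6              # (delay, gain, innovation) per frame
reference = run_decoder(frames)              # error-free decoding, matches the encoder state
damaged = run_decoder(frames, lost={2})      # frame 2 is erased
errors = frame_error_energy(reference, damaged)
```

In this toy model the error is zero before the erased frame and remains nonzero in every frame after it, even though those later frames are received correctly, because they are decoded from a memory that no longer matches the one used by the encoder.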
Transitions from an unvoiced speech segment to a voiced speech segment (e.g. a transition between a consonant or a period of inactive speech, and a vowel) or transitions between two different voiced segments (e.g. transitions between two vowels) are the most problematic cases for frame erasure concealment. When a transition from an unvoiced speech segment to a voiced speech segment (voiced onset) is lost, the frame right before the voiced onset frame is unvoiced or inactive, and thus no meaningful periodic excitation is found in the buffer of the past excitation (adaptive codebook). At the encoder, the past periodic excitation builds up in the adaptive codebook during the onset frame, and the following voiced frame is encoded using this past periodic excitation. Most frame error concealment techniques use the information from the last correctly received frame to conceal the missing frame. When the onset frame is lost, the decoder's past excitation buffer will thus be updated using the noise-like excitation of the previous frame (unvoiced or inactive frame). The periodic part of the excitation is therefore completely missing from the adaptive codebook at the decoder after a lost voiced onset, and it can take several frames for the decoder to recover from this loss.
A similar situation occurs in the case of a lost voiced-to-voiced transition. In that case, the excitation stored in the adaptive codebook before the transition frame typically has very different characteristics from the excitation stored in the adaptive codebook after the transition. Again, as the decoder usually conceals the lost frame using the past frame information, the states of the encoder and the decoder will be very different, and the synthesized signal can suffer from significant distortion.