In modern audio/speech digital signal communication system, a digital signal is compressed at encoder. The compressed information (bitstream) can be packetized and sent to decoder through a communication channel, frame by frame. The system of an encoder and a decoder together is called a codec. Speech/audio compression may be used to reduce the number of bits that represent the speech/audio signal thereby reducing the bit rate needed for transmission. Speech/audio compression technology can be generally classified into two types, namely time domain coding and frequency domain coding. Time domain coding is usually used for coding speech signal or for coding audio signal at low bit rates. Frequency domain coding is usually used for coding audio signal or for coding speech signal at high bit rates. Bandwidth Extension (BWE) can be a part of time domain coding or frequency domain coding in order to generate a high band signal at very low bit rate or at zero bit rate.
However, speech coders are lossy coders, i.e., the decoded signal is different from the original. Therefore, one of the goals in speech coding is to minimize the distortion (or perceptible loss) at a given bit rate, or minimize the bit rate to reach a given distortion.
Speech coding differs from other forms of audio coding in that speech is a much simpler signal than most other audio signals, and a lot more statistical information is available about the properties of speech. As a result, some auditory information which is relevant in audio coding can be unnecessary in the speech coding context. In speech coding, the most important criterion is preservation of intelligibility and “pleasantness” of speech, with a constrained amount of transmitted data.
The intelligibility of speech includes, besides the actual literal content, also speaker identity, emotions, intonation, timbre etc. that are all important for perfect intelligibility. The more abstract concept of pleasantness of degraded speech is a different property than intelligibility, since it is possible that degraded speech is completely intelligible, but subjectively annoying to the listener.
The redundancy of speech wave forms may be considered with respect to several different types of speech signal, such as voiced and unvoiced speech signals. Voiced sounds, e.g., ‘a’, ‘b’, are essentially due to vibrations of the vocal cords, and are oscillatory. Therefore, over short periods of time, they are well modeled by sums of periodic signals such as sinusoids. In other words, for voiced speech, the speech signal is essentially periodic. However, this periodicity may be variable over the duration of a speech segment and the shape of the periodic wave usually changes gradually from segment to segment. A low bit rate speech coding could greatly benefit from exploring such periodicity. The voiced speech period is also called pitch, and pitch prediction is often named Long-Term Prediction (LTP). In contrast, unvoiced sounds such as ‘s’, ‘sh’, are more noise-like. This is because unvoiced speech signal is more like a random noise and has a smaller amount of predictability.
Traditionally, all parametric speech coding methods such as time domain coding make use of the redundancy inherent in the speech signal to reduce the amount of information that must be sent and to estimate the parameters of speech samples of a signal at short intervals. This redundancy primarily arises from the repetition of speech wave shapes at a quasi-periodic rate, and the slow changing spectral envelop of speech signal.
The redundancy of speech wave forms may be considered with respect to several different types of speech signal, such as voiced and unvoiced. Although the speech signal is essentially periodic for voiced speech, this periodicity may be variable over the duration of a speech segment and the shape of the periodic wave usually changes gradually from segment to segment. A low bit rate speech coding could greatly benefit from exploring such periodicity. The voiced speech period is also called pitch, and pitch prediction is often named Long-Term Prediction (LTP). As for unvoiced speech, the signal is more like a random noise and has a smaller amount of predictability.
In either case, parametric coding may be used to reduce the redundancy of the speech segments by separating the excitation component of speech signal from the spectral envelop component. The slowly changing spectral envelope can be represented by Linear Prediction Coding (LPC) also called Short-Term Prediction (STP). A low bit rate speech coding could also benefit a lot from exploring such a Short-Term Prediction. The coding advantage arises from the slow rate at which the parameters change. Yet, it is rare for the parameters to be significantly different from the values held within a few milliseconds. Accordingly, at the sampling rate of 8 kHz, 12.8 kHz or 16 kHz, the speech coding algorithm is such that the nominal frame duration is in the range of ten to thirty milliseconds. A frame duration of twenty milliseconds is the most common choice.
Audio coding based on filter bank technology is widely used, e.g., in frequency domain coding. In signal processing, a filter bank is an array of band-pass filters that separates the input signal into multiple components, each one carrying a single frequency sub-band of the original signal. The process of decomposition performed by the filter bank is called analysis, and the output of filter bank analysis is referred to as a sub-band signal with as many sub-bands as there are filters in the filter bank. The reconstruction process is called filter bank synthesis. In digital signal processing, the term filter bank is also commonly applied to a bank of receivers. The difference is that receivers also down-convert the sub-bands to a low center frequency that can be re-sampled at a reduced rate. The same result can sometimes be achieved by undersampling the bandpass sub-bands. The output of filter bank analysis could be in a form of complex coefficients. Each complex coefficient contains real element and imaginary element respectively representing cosine term and sine term for each sub-band of filter bank.
In more recent well-known standards such as G.723.1, G.729, G.718, Enhanced Full Rate (EFR), Selectable Mode Vocoder (SMV), Adaptive Multi-Rate (AMR), Variable-Rate Multimode Wideband (VMR-WB), or Adaptive Multi-Rate Wideband (AMR-WB), Code Excited Linear Prediction Technique (“CELP”) has been adopted. CELP is commonly understood as a technical combination of Coded Excitation, Long-Term Prediction and Short-Term Prediction. CELP is mainly used to encode speech signal by benefiting from specific human voice characteristics or human vocal voice production model. CELP Speech Coding is a very popular algorithm principle in speech compression area although the details of CELP for different codecs could be significantly different. Owing to its popularity, CELP algorithm has been used in various ITU-T, MPEG, 3GPP, and 3GPP2 standards. Variants of CELP include algebraic CELP, relaxed CELP, low-delay CELP and vector sum excited linear prediction, and others. CELP is a generic term for a class of algorithms and not for a particular codec.
The CELP algorithm is based on four main ideas. First, a source-filter model of speech production through linear prediction (LP) is used. The source-filter model of speech production models speech as a combination of a sound source, such as the vocal cords, and a linear acoustic filter, the vocal tract (and radiation characteristic). In implementation of the source-filter model of speech production, the sound source, or excitation signal, is often modelled as a periodic impulse train, for voiced speech, or white noise for unvoiced speech. Second, an adaptive and a fixed codebook is used as the input (excitation) of the LP model. Third, a search is performed in closed-loop in a “perceptually weighted domain.” Fourth, vector quantization (VQ) is applied.