Speech coding refers to a process that reduces the bit rate of a speech file. Speech coding is an application of data compression of digital audio signals containing speech. Speech coding uses speech-specific parameter estimation using audio signal processing techniques to model the speech signal, combined with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream. The objective of speech coding is to achieve savings in the required memory storage space, transmission bandwidth and transmission power by reducing the number of bits per sample such that the decoded (decompressed) speech is perceptually indistinguishable from the original speech.
However, speech are lossy coders, i.e., the decoded signal is different from the original. Therefore, one of the goals in speech coding is to minimize the distortion (or perceptible loss) at a given bit rate, or minimize the bit rate to reach a given distortion.
Speech coding differs from other forms of audio coding in that speech is a much simpler signal than most other audio signals, and a lot more statistical information is available about the properties of speech. As a result, some auditory information which is relevant in audio coding can be unnecessary in the speech coding context. In speech coding, the most important criterion is preservation of intelligibility and “pleasantness” of speech, with a constrained amount of transmitted data.
The intelligibility of speech includes, besides the actual literal content, also speaker identity, emotions, intonation, timbre etc. that are all important for perfect intelligibility. The more abstract concept of pleasantness of degraded speech is a different property than intelligibility, since it is possible that degraded speech is completely intelligible, but subjectively annoying to the listener.
The redundancy of speech wave forms may be considered with respect to several different types of speech signal, such as voiced and unvoiced speech signals. Voiced sounds, e.g., ‘a’, ‘b’, are essentially due to vibrations of the vocal cords, and are oscillatory. Therefore, over short periods of time, they are well modeled by sums of periodic signals such as sinusoids. In other words, for voiced speech, the speech signal is essentially periodic. However, this periodicity may be variable over the duration of a speech segment and the shape of the periodic wave usually changes gradually from segment to segment. A low bit rate speech coding could greatly benefit from exploring such periodicity. The voiced speech period is also called pitch, and pitch prediction is often named Long-Term Prediction (LTP). In contrast, unvoiced sounds such as ‘s’, ‘sh’, are more noise-like. This is because unvoiced speech signal is more like a random noise and has a smaller amount of predictability.
Traditionally, all parametric speech coding methods make use of the redundancy inherent in the speech signal to reduce the amount of information that must be sent and to estimate the parameters of speech samples of a signal at short intervals. This redundancy primarily arises from the repetition of speech wave shapes at a quasi-periodic rate, and the slow changing spectral envelop of speech signal.
The redundancy of speech wave forms may be considered with respect to several different types of speech signal, such as voiced and unvoiced. Although the speech signal is essentially periodic for voiced speech, this periodicity may be variable over the duration of a speech segment and the shape of the periodic wave usually changes gradually from segment to segment. A low bit rate speech coding could greatly benefit from exploring such periodicity. The voiced speech period is also called pitch, and pitch prediction is often named Long-Term Prediction (LTP). As for unvoiced speech, the signal is more like a random noise and has a smaller amount of predictability.
In either case, parametric coding may be used to reduce the redundancy of the speech segments by separating the excitation component of speech signal from the spectral envelop component. The slowly changing spectral envelope can be represented by Linear Prediction Coding (LPC) also called Short-Term Prediction (STP). A low bit rate speech coding could also benefit a lot from exploring such a Short-Term Prediction. The coding advantage arises from the slow rate at which the parameters change. Yet, it is rare for the parameters to be significantly different from the values held within a few milliseconds. Accordingly, at the sampling rate of 8 kHz, 12.8 kHz or 16 kHz, the speech coding algorithm is such that the nominal frame duration is in the range of ten to thirty milliseconds. A frame duration of twenty milliseconds is the most common choice.
In more recent well-known standards such as G.723.1, G.729, G.718, Enhanced Full Rate (EFR), Selectable Mode Vocoder (SMV), Adaptive Multi-Rate (AMR), Variable-Rate Multimode Wideband (VMR-WB), or Adaptive Multi-Rate Wideband (AMR-WB), Code Excited Linear Prediction Technique (“CELP”) has been adopted. CELP is commonly understood as a technical combination of Coded Excitation, Long-Term Prediction and Short-Term Prediction. CELP is mainly used to encode speech signal by benefiting from specific human voice characteristics or human vocal voice production model. CELP Speech Coding is a very popular algorithm principle in speech compression area although the details of CELP for different codecs could be significantly different. Owing to its popularity, CELP algorithm has been used in various ITU-T, MPEG, 3GPP, and 3GPP2 standards. Variants of CELP include algebraic CELP, relaxed CELP, low-delay CELP and vector sum excited linear prediction, and others. CELP is a generic term for a class of algorithms and not for a particular codec.
The CELP algorithm is based on four main ideas. First, a source-filter model of speech production through linear prediction (LP) is used. The source-filter model of speech production models speech as a combination of a sound source, such as the vocal cords, and a linear acoustic filter, the vocal tract (and radiation characteristic). In implementation of the source-filter model of speech production, the sound source, or excitation signal, is often modelled as a periodic impulse train, for voiced speech, or white noise for unvoiced speech. Second, an adaptive and a fixed codebook is used as the input (excitation) of the LP model. Third, a search is performed in closed-loop in a “perceptually weighted domain.” Fourth, vector quantization (VQ) is applied.