Digital representation of information provides many advantages. In the case of sound signals, the information such as a speech or music signal is digitized using, for example, the PCM (Pulse Code Modulation) format. The signal is thus sampled and quantized with, for example, 16 or 20 bits per sample. Although simple, the PCM format requires a high bit rate (number of bits per second or bit/s). This limitation is the main motivation for designing efficient source coding techniques capable of reducing the source bit rate and meet with the specific constraints of many applications in terms of audio quality, coding delay, and complexity.
The function of a digital audio coder is to convert a sound signal into a bit stream which is, for example, transmitted over a communication channel or stored in a storage medium. Here lossy source coding, i.e. signal compression, is considered. More specifically, the role of a digital audio coder is to represent the samples, for example the PCM samples with a smaller number of bits while maintaining a good subjective audio quality. A decoder or synthesizer is responsive to the transmitted or stored bit stream to convert it back to a sound signal. Reference is made to [Jayant, 1984] and [Gersho, 1992] for an introduction to signal compression methods, and to the general chapters of [Kleijn, 1995] for an in-depth coverage of modern speech and audio coding techniques.
In high-quality audio coding, two classes of algorithms can be distinguished: Code-Excited Linear Prediction (CELP) coding which is designed to code primarily speech signals, and perceptual transform (or sub-band) coding which is well adapted to represent music signals. These techniques can achieve a good compromise between subjective quality and bit rate. CELP coding has been developed in the context of low-delay bidirectional applications such as telephony or conferencing, where the audio signal is typically sampled at, for example, 8 or 16 kHz. Perceptual transform coding has been applied mostly to wideband high-fidelity music signals sampled at, for example, 32, 44.1 or 48 kHz for streaming or storage applications.
CELP coding [Atal, 1985] is the core framework of most modern speech coding standards. According to this coding model, the speech signal is processed in successive blocks of N samples called frames, where N is a predetermined number of samples corresponding typically to, for example, 10-30 ms. The reduction of bit rate is achieved by removing the temporal correlation between successive speech samples through linear prediction and using efficient vector quantization (VQ). A linear prediction (LP) filter is computed and transmitted every frame. The computation of the LP filter typically requires a look-ahead, for example a 5-10 ms speech segment from the subsequent frame. In general, the N-sample frame is divided into smaller blocks called sub-frames, so as to apply pitch prediction. The sub-frame length can be set, for example, in the range 4-10 ms. In each subframe, an excitation signal is usually obtained from two components, a portion of the past excitation and an innovative or fixed-codebook excitation. The component formed from a portion of the past excitation is often referred to as the adaptive codebook or pitch excitation. The parameters characterizing the excitation signal are coded and transmitted to the decoder, where the excitation signal is reconstructed and used as the input of the LP filter. An instance of CELP coding is the ACELP (Algebraic CELP) coding model, wherein the innovative codebook consists of interleaved signed pulses.
The CELP model has been developed in the context of narrow-band speech coding, for which the input bandwidth is 300-3400 Hz. In the case of wideband speech signals defined in the 50-7000 Hz band, the CELP model is usually used in a split-band approach, where a lower band is coded by waveform matching (CELP coding) and a higher band is parametrically coded. This bandwidth splitting has several motivations:                Most of the bits of a frame can be allocated to the lower-band signal to maximize quality.        The computational complexity (of filtering, etc.) can be reduced compared to full-band coding.        Also, waveform matching is not very efficient for high-frequency components.This split-band approach is used for instance in the ETSI AMR-WB wideband speech coding standard. This coding standard is specified in [3GPP TS 26.190] and described in [Bessette, 2002]. The implementation of the AMR-WB standard is given in [3GPP TS 26.173]. The AMR-WB speech coding algorithm consists essentially of splitting the input wideband signal into a lower band (0-6400 Hz) and a higher band (6400-7000 Hz), and applying the ACELP algorithm to only the lower band and coding the higher band through bandwidth extension (BWE).        
The state-of-the-art audio coding techniques, for example MPEG-AAC or ITU-T G.722.1, are built upon perceptual transform (or sub-band) coding. In transform coding, the time-domain audio signal is processed by overlapping windows of appropriate length. The reduction of bit rate is achieved by the de-correlation and energy compaction property of a specific transform, as well as coding of only the perceptually relevant transform coefficients. The windowed signal is usually decomposed (analyzed) by a discrete Fourier transform (DFT), a discrete cosine transform (DCT) or a modified discrete cosine transform (MDCT). A frame length of, for example, 40-60 ms is normally needed to achieve good audio quality. However, to represent transients and avoid time spreading of coding noise before attacks (pre-echo), shorter frames of, for example, 5-10 ms are also used to describe non-stationary audio segments. Quantization noise shaping is achieved by normalizing the transform coefficients with scale factors prior to quantization. The normalized coefficients are typically coded by scalar quantization followed by Huffman coding. In parallel, a perceptual masking curve is computed to control the quantization process and optimize the subjective quality; this curve is used to code the most perceptually relevant transform coefficients.
To improve the coding efficiency (in particular at low bit rates), band splitting can also be used with transform coding. This approach is used for instance in the new High Efficiency MPEG-AAC standard also known as aacPlus. In aacPlus, the signal is split into two sub-bands, the lower-band signal is coded by perceptual transform coding (AAC), while the higher-band signal is described by so-called Spectral Band Replication (SBR) which is a kind of bandwidth extension (BWE).
In certain applications, such as audio/video conferencing, multimedia storage and internet audio streaming, the audio signal consists typically of speech, music and mixed content. As a consequence, in such applications, an audio coding technique which is robust to this type of input signal is used. In other words, the audio coding algorithm should achieve a good and consistent quality for a wide class of audio signals, including speech and music. Nonetheless, the CELP technique is known to be intrinsically speech-optimized but may present problems when used to code music signals. State-of-the art perceptual transform coding on the other hand has good performance for music signals, but is not appropriate for coding speech signals, especially at low bit rates.
Several approaches have then been considered to code general audio signals, including both speech and music, with a good and fairly constant quality. Transform predictive coding as described in [Moreau, 1992] [Lefebvre, 1994] [Chen, 1996] and [Chen, 1997], provides a good foundation for the inclusion of both speech and music coding techniques into a single framework. This approach combines linear prediction and transform coding. The technique of [Lefebvre, 1994), called TCX (Transform Coded eXcitation) coding, which is equivalent to those of [Moreau, 1992], [Chen, 1996] and [Chen, 1997] will be considered in the following-description.
Originally, two variants of TCX coding have been designed [Lefebvre, 1994]: one for speech signals using short frames and pitch prediction, another for music signals with long frames and no pitch prediction. In both cases, the processing involved in TCX coding can be decomposed in two steps:    1) The current frame of audio signal is processed by temporal filtering to obtain a so-called target signal, and then    2) The target signal is coded in transform domain.Transform coding of the target signal uses a DFT with rectangular windowing. Yet, to reduce blocking artifacts at frame boundaries, a windowing with small overlap has been used in [Jbira, 1998] before the DFT. In [Ramprashad, 2001], a MDCT with windowing switching is used instead; the MDCT has the advantage to provide a better frequency resolution than the DFT while being a maximally-decimated filter-bank. However, in the case of [Ramprashad, 2001], the coder does not operate in closed-loop, in particular for pitch analysis. In this respect, the coder of [Ramprashad, 2001] cannot be qualified as a variant of TCX.
The representation of the target signal not only plays a role in TCX coding but also controls part of the TCX audio quality, because it consumes most of the available bits in every coding frame. Reference is made here to transform coding in the DFT domain. Several methods have been proposed to code the target signal in this domain, see for instance [Lefebvre, 1994], [Xie, 1996], [Jbira, 1998], [Schnitzler, 1999] and [Bessette, 1999]. All these methods implement a form of gain-shape quantization, meaning that the spectrum of the target signal is first normalized by a factor or global gain g prior to the actual coding. In [Lefebvre, 1994], [Xie, 1996] and [Jbira, 1998], this factor g is set to the RMS (Root Mean Square) value of the spectrum. However, in general, it can be optimized in each frame by testing different values for the factor g, as disclosed for example in [Schnitzler, 1999] and [Bessette, 1999]. [Bessette, 1999] does not disclose actual optimisation of the factor g. To improve the quality of TCX coding, noise fill-in (i.e. the injection of comfort noise in lieu of unquantized coefficients) has been used in [Schnitzler, 1999] and [Bessette, 1999].
As explained in [Lefebvre, 1994], TCX coding can quite successfully code wideband signals, for example signals sampled at 16 kHz; the audio quality is good for speech at a sampling rate of 16 kbit/s and for music at a sampling rate of 24 kbit/s. However, TCX coding is not as efficient as ACELP for coding speech signals. For that reason, a switched ACELP/TCX coding strategy has been presented briefly in [Bessette, 1999]. The concept of ACELP/TCX coding is similar for instance to the ATCELP (Adaptive Transform and CELP) technique of [Combescure, 1999]. Obviously, the audio quality can be maximized by switching between different modes, which are actually specialized to code a certain type of signal. For instance, CELP coding is specialized for speech and transform coding is more adapted to music, so it is natural to combine these two techniques into a multi-mode framework in which each audio frame is coded adaptively with the most appropriate coding tool. In ATCELP coding, the switching between CELP and transform coding is not seamless; it requires transition modes. Furthermore, an open-loop mode decision is applied, i.e. the mode decision is made prior to coding based on the available audio signal. On the contrary, ACELP/TCX presents the advantage of using two homogeneous linear predictive modes (ACELP and TCX coding), which makes switching easier; moreover, the mode decision is closed-loop, meaning that all coding modes are tested and the best synthesis can be selected.
Although [Bessette, 1999] briefly presents a switched ACELP/TCX coding strategy, [Bessette, 1999] does not disclose the ACELP/TCX mode decision and details of the quantization of the TCX target signal in ACELP/TCX coding. The underlying quantization method is only known to be based on self-scalable multi-rate lattice vector quantization, as introduced by [Xie, 1996].
Reference is made to [Gibson, 1988] and [Gersho, 1992] for an introduction to lattice vector quantization. An N-dimensional lattice is a regular array of points in the N-dimensional (Euclidean) space. For instance, [Xie, 1996] uses an 8-dimensional lattice, known as the gosset lattice, which is defined as:RE8=2D8∪{2D8+(1, . . . ,1)}  (1)whereD8={(x1, . . . ,x8)εZ8|x1+ . . . +x8 is odd}  (2)andD8+(1, . . . ,1)={(x1+1, . . . ,x8+1)εZ8|(x1, . . . ,x8)εD8}  (3)
This mathematical structure enables the quantization of a block of eight (8) real numbers. RE8 can be also defined more intuitively as the set of points (x1, . . . , x8) verifying the properties:    i. The components xi are signed integers (for i=1, . . . , 8);    ii. The sum x1+ . . . +x8 is a multiple of 4; and    iii. The components xi have the same parity (for i=1, . . . , 8), i.e. they are either all even, or all odd.An 8-dimensional quantization codebook can then be obtained by selecting a finite subset of RE8. Usually the mean-square error is the codebook search criterion. In the technique of [Xie, 1996], six (6) different codebooks, called Q0, Q1, . . . , Q5, are defined based on the RE8 lattice. Each codebook Qn where n=0, 1, . . . , 5, comprises 24n points, which corresponds to a rate of 4n bits per 8-dimensional sub-vector or n/2 bits per sample. The spectrum of the TCX target signal, normalized by a scaled factor g, is then quantized by splitting it into 8-dimensional sub-vectors (or sub-bands). Each of these sub-vectors is coded into one of the codebooks Q0, Q1, . . . , Q5. As a consequence, the quantization of the TCX target signal, after normalization by the factor g produces for each 8-dimensional sub-vector a codebook number n indicating which codebook Qn has been used and an index i identifying a specific codevector in the codebook Qn. This quantization process is referred to as multi-rate lattice vector quantization, for the codebooks Qn having different rates. The TCX mode of [Bessette, 1999] follows the same principle, yet no details are provided on the computation of the normalization factor g nor on the multiplexing of quantization indices and codebooks numbers.
The lattice vector quantization technique of [Xie; 1996] based on RE8 has been extended in [Ragot, 2002] to improve efficiency and reduce complexity. However, the application of the concept described by [Ragot, 2002] to TCX coding has never been proposed.
In the device of [Ragot, 2002], an 8-dimensional vector is coded through a multi-rate quantizer incorporating a set of RE8 codebooks denoted as {Q0, Q2, Q3, . . . , Q36}. The codebook Q1 is not defined in the set in order to improve coding efficiency. All codebooks Qn are constructed as subsets of the same 8-dimensional RE8 lattice, Qn⊂RE8. The bit rate of the nth codebook defined as bits per dimension is 4n/8, i.e. each codebook Qn contains 24n codevectors. The construction of the multi-rate quantizer follows the teaching of [Ragot, 2002]. For a given 8-dimensional input vector, the coder of the multi-rate quantizer finds the nearest neighbor in RE8, and outputs a codebook number n and an index i in the corresponding codebook Qn. Coding efficiency is improved by applying an entropy coding technique for the quantization indices, i.e. codebook numbers n and indices i of the splits. In [Ragot, 2002], a codebook number n is coded prior to multiplexing to the bit stream with an unary code that comprises a number n−1 of 1's and a zero stop bit. The codebook number represented by the unary code is denoted by nE. No entropy coding is employed for codebook indices i. The unary code and bit allocation of nE and i is exemplified in the following Table 1.
TABLE 1The number of bits required to index the codebooks.Unary codeNumber ofCodebooknEk inNumber ofNumber ofbits pernumber nkbinary formbits for nEkbits for lksplit001012102810311031215411104162051111052025. . .. . .. . .. . .. . .
As illustrated in Table 1, one bit is required for coding the input vector when n=0 and otherwise 5n bits are required.
Furthermore, a practical issue in audio coding is the formatting of the bit stream and the handling of bad frames, also known as frame-erasure concealment. The bit stream is usually formatted at the coding side as successive frames (or blocks) of bits. Due to channel impairments (e.g. CRC (Cyclic Redundancy Check) violation, packet loss or delay, etc.), some frames may not be received correctly at the decoding side. In such a case, the decoder typically receives a flag declaring a frame erasure and the bad frame is “decoded” by extrapolation based on the past history of the decoder. A common procedure to handle bad frames in CELP decoding consists of reusing the past LP synthesis filter, and extrapolating the previous excitation.
To improve the robustness against frame losses, parameter repetition, also know as Forward Error Correction or FEC coding may be used.
The problem of frame-erasure concealment for TCX or switched ACELP/TCX coding has not been addressed yet in the current technology.