Switched audio coders, which select different encoding algorithms for different portions of the audio signal, are known. An example is the so-called extended adaptive multi-rate-wideband codec, or AMR-WB+ codec, defined in the technical specification 3GPP TS 26.290 V6.1.0 2004-12. This technical specification describes a coding concept that extends the ACELP (Algebraic Code Excited Linear Prediction) based AMR-WB codec by adding TCX (Transform Coded Excitation), bandwidth extension, and stereo. The AMR-WB+ audio codec processes input frames of 2048 samples at an internal sampling frequency FS. The internal sampling frequency is limited to the range of 12,800 to 38,400 Hz. Each 2048-sample frame is split into two critically sampled frequency bands of equal width. This results in two superframes of 1024 samples each, corresponding to the low-frequency (LF) and high-frequency (HF) bands. Each superframe is divided into four 256-sample frames. Sampling at the internal sampling rate is obtained by using a variable sampling conversion scheme, which re-samples the input signal. The LF and HF signals are then encoded using two different approaches. The LF signal is encoded and decoded using the “core” encoder/decoder, which is based on switched ACELP and TCX. In the ACELP mode, the standard AMR-WB codec is used. The HF signal is encoded with relatively few bits (16 bits/frame) using a bandwidth extension (BWE) method.
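The frame arithmetic described above can be checked with a short sketch. This is only an illustration of the numbers stated in the text, not part of the codec; the function name `frame_layout` and the dictionary keys are hypothetical and introduced here for the example.

```python
# Illustrative constants taken from the description above.
INPUT_FRAME_SAMPLES = 2048        # input frame at internal sampling frequency FS
NUM_BANDS = 2                     # critically sampled LF and HF bands
FRAMES_PER_SUPERFRAME = 4
FS_MIN_HZ, FS_MAX_HZ = 12_800, 38_400


def frame_layout(fs_hz):
    """Return superframe/frame sizes and the input frame duration for a given FS.

    Hypothetical helper: it only reproduces the arithmetic stated in the text.
    """
    if not FS_MIN_HZ <= fs_hz <= FS_MAX_HZ:
        raise ValueError("internal sampling frequency outside 12,800-38,400 Hz")
    # Critical sampling: each band carries half of the input samples.
    superframe = INPUT_FRAME_SAMPLES // NUM_BANDS
    frame = superframe // FRAMES_PER_SUPERFRAME
    return {
        "superframe_samples": superframe,
        "frame_samples": frame,
        "input_frame_ms": 1000.0 * INPUT_FRAME_SAMPLES / fs_hz,
    }


layout = frame_layout(25_600)
print(layout)
```

For an assumed internal sampling frequency of 25,600 Hz, the 2048-sample input frame lasts 80 ms, and each band carries one 1024-sample superframe made up of four 256-sample frames.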
The parameters transmitted from the encoder to the decoder are the mode-selection bits, the LF parameters and the HF signal parameters. The parameters for each 1024-sample superframe are decomposed into four packets of identical size. When the input signal is stereo, the left and right channels are combined into a mono signal for the ACELP/TCX encoding, whereas the stereo encoding receives both input channels. In the AMR-WB+ decoder structure, the LF and HF bands are decoded separately. Then, the bands are combined in a synthesis filterbank. If the output is restricted to mono only, the stereo parameters are omitted and the decoder operates in mono mode.
The AMR-WB+ codec applies LP (Linear Prediction) analysis for both the ACELP and TCX modes when encoding the LF signal. The LP coefficients are interpolated linearly at every 64-sample sub-frame. The LP analysis window is a half-cosine of length 384 samples. The coding mode is selected based on a closed-loop analysis-by-synthesis method. Only 256-sample frames are considered for ACELP frames, whereas frames of 256, 512 or 1024 samples are possible in TCX mode. The ACELP coding consists of long-term prediction (LTP) analysis and synthesis and algebraic codebook excitation. In the TCX mode, a perceptually weighted signal is processed in the transform domain. The Fourier transformed weighted signal is quantized using split multi-rate lattice quantization (algebraic vector quantization). The transform is calculated in 1024, 512 or 256 sample windows. The excitation signal is recovered by filtering the quantized weighted signal through the inverse weighting filter. In order to determine whether a certain portion of the audio signal is to be encoded using the ACELP mode or the TCX mode, a closed-loop mode selection or an open-loop mode selection is used. In a closed-loop mode selection, 11 successive trials are used. Subsequent to a trial, a mode selection is made between the two modes to be compared. The selection criterion is the average segmental SNR (Signal-to-Noise Ratio) between the weighted audio signal and the synthesized weighted audio signal. Hence, the encoder performs a complete encoding in both encoding algorithms and a complete decoding in accordance with both encoding algorithms and, subsequently, the results of both encoding/decoding operations are compared to the original signal.
Hence, for each encoding algorithm, i.e., ACELP on the one hand and TCX on the other hand, a segmental SNR value is obtained. The encoding algorithm having the better segmental SNR value, or the better average segmental SNR value determined over a frame by averaging the segmental SNR values of the individual sub-frames, is used.
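The closed-loop criterion described above can be illustrated with a minimal sketch. It assumes that the weighted original signal and the decoded (synthesized) weighted signal of each candidate mode are already available as plain lists of samples; the function names `segmental_snr_db` and `select_mode`, the 64-sample segment length, and the small `eps` guard are assumptions made for this example and are not taken from the specification.

```python
import math


def segmental_snr_db(reference, synthesized, segment_len=64):
    """Average of per-segment SNR values in dB (hypothetical helper)."""
    eps = 1e-12  # assumption: avoids log(0) and division by zero
    snrs = []
    for start in range(0, len(reference) - segment_len + 1, segment_len):
        ref = reference[start:start + segment_len]
        syn = synthesized[start:start + segment_len]
        signal_energy = sum(x * x for x in ref)
        noise_energy = sum((x - y) ** 2 for x, y in zip(ref, syn))
        snrs.append(10.0 * math.log10((signal_energy + eps) / (noise_energy + eps)))
    if not snrs:
        raise ValueError("signal shorter than one segment")
    return sum(snrs) / len(snrs)


def select_mode(weighted, candidates):
    """Pick the candidate whose synthesized signal yields the highest average segmental SNR."""
    scores = {mode: segmental_snr_db(weighted, syn) for mode, syn in candidates.items()}
    return max(scores, key=scores.get), scores


# Toy data: a 256-sample frame and two decoded versions with different error levels.
frame = [math.sin(2.0 * math.pi * 5.0 * n / 256.0) for n in range(256)]
decoded = {
    "ACELP": [x + (0.1 if n % 2 else -0.1) for n, x in enumerate(frame)],
    "TCX": [x + (0.01 if n % 2 else -0.01) for n, x in enumerate(frame)],
}
best_mode, scores = select_mode(frame, decoded)
print(best_mode, scores)
```

In this toy example the candidate labeled "TCX" has the smaller error and is therefore selected. Note that the sketch compares quality only, which is exactly the property of the segmental SNR criterion discussed in the text.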
An additional switched audio coding scheme is the so-called USAC coder (USAC=Unified Speech and Audio Coding). This coding algorithm is described in ISO/IEC 23003-3. The general structure can be described as follows. First, there is a common pre/post processing system consisting of an MPEG Surround functional unit to handle stereo or multi-channel processing and an enhanced SBR unit generating the parametric representation of the higher audio frequencies of the input signal. Then, there are two branches, one consisting of a modified advanced audio coding (AAC) tool path and the other consisting of a linear prediction coding (LP or LPC domain) based path, which in turn features either a frequency-domain representation or a time-domain representation of the LPC residual. All transmitted spectra for both AAC and LPC are represented in the MDCT domain following quantization and arithmetic coding. The time-domain representation uses an ACELP excitation coding scheme. The functions of the decoder are to find the description of the quantized audio spectra or time-domain representation in the bitstream payload and to decode the quantized values and other reconstruction information. Hence, the encoder makes two decisions. The first decision is a signal classification for the frequency domain versus linear prediction domain mode decision. The second decision is to determine, within the linear prediction domain (LPD), whether a signal portion is to be encoded using ACELP or TCX.
When a switched audio coding scheme is applied in scenarios where a very low delay is required, particular attention has to be paid to the transform-based coding parts, since these coding parts introduce a certain delay which depends on the transform length and the window design. Therefore, the USAC coding concept is not suitable for very low-delay applications, because the modified AAC coding branch has a considerable transform length and uses length adaptation (also known as block switching) involving transitional windows.
On the other hand, the AMR-WB+ coding concept was found to be problematic due to the encoder-side decision whether ACELP or TCX is to be used. ACELP provides a good coding gain, but may result in significant audio quality problems when a signal portion is not suitable for the ACELP coding mode. Hence, for quality reasons, one might be inclined to use TCX whenever the input signal does not contain speech. However, using TCX too much at low bitrates will result in bitrate problems, since TCX provides a relatively low coding gain. When one, therefore, focuses more on the coding gain, one might use ACELP whenever possible, but, as stated before, this can result in audio quality problems due to the fact that ACELP is not optimal, for example, for music and similar stationary signals.
The segmental SNR calculation is a quality measure which determines the better coding mode based only on the result, i.e., on whether the SNR between the original signal and the encoded/decoded signal is better, so that the encoding algorithm resulting in the better SNR is used. This selection, however, has to operate under bitrate constraints. Therefore, it has been found that using only a quality measure such as, for example, the segmental SNR measure does not always result in the best compromise between quality and bitrate.