The present invention relates to source coding and particularly to audio source coding, in which an audio signal is processed by two different audio coders having different coding algorithms.
In the context of low bitrate audio and speech coding technology, several different coding techniques have traditionally been employed in order to achieve low bitrate coding of such signals with best possible subjective quality at a given bitrate. Coders for general music/sound signals aim at optimizing the subjective quality by shaping a spectral (and temporal) shape of the quantization error according to a masking threshold curve which is estimated from the input signal by means of a perceptual model (“perceptual audio coding”). On the other hand, coding of speech at very low bitrates has been shown to work very efficiently when it is based on a production model of human speech, i.e. employing Linear Predictive Coding (LPC) to model the resonant effects of the human vocal tract together with an efficient coding of the residual excitation signal.
As a consequence of these two different approaches, general audio coders, like MPEG-1 Layer 3 (MPEG=Moving Pictures Expert Group), or MPEG-2/4 Advanced Audio Coding (AAC) usually do not perform as well for speech signals at very low data rates as dedicated LPC-based speech coders due to the lack of exploitation of a speech source model. Conversely, LPC-based speech coders usually do not achieve convincing results when applied to general music signals because of their inability to flexibly shape the spectral envelope of the coding distortion according to a masking threshold curve. In the following, concepts are described which combine the advantages of both LPC-based coding and perceptual audio coding into a single framework and thus describe unified audio coding that is efficient for both general audio and speech signals.
Traditionally, perceptual audio coders use a filterbank-based approach to efficiently code audio signals and shape the quantization distortion according to an estimate of the masking curve.
FIG. 16b shows the basic block diagram of a monophonic perceptual coding system. An analysis filterbank 1600 is used to map the time domain samples into subsampled spectral components. Dependent on the number of spectral components, the system is also referred to as a subband coder (small number of subbands, e.g. 32) or a transform coder (large number of frequency lines, e.g. 512). A perceptual (“psychoacoustic”) model 1602 is used to estimate the actual time dependent masking threshold. The spectral (“subband” or “frequency domain”) components are quantized and coded 1604 in such a way that the quantiza-tion noise is hidden under the actual transmitted signal, and is not perceptible after decoding. This is achieved by varying the granularity of quantization of the spectral values over time and frequency.
The quantized and entropy-encoded spectral coefficients or subband values are, in addition with side information, input into a bitstream formatter 1606, which provides an encoded audio signal which is suitable for being transmitted or stored. The output bitstream of block 1606 can be transmitted via the Internet or can be stored on any machine readable data carrier.
On the decoder-side, a decoder input interface 1610 receives the encoded bitstream. Block 1610 separates entropy-encoded and quantized spectral/subband values from side information. The encoded spectral values are input into an entropy-decoder such as a Huffman decoder, which is positioned between 1610 and 1620. The outputs of this entropy decoder are quantized spectral values. These quantized spectral values are input into a requantizer, which performs an “inverse” quantization as indicated at 1620 in FIG. 16. The output of block 1620 is input into a synthesis filterbank 1622, which performs a synthesis filtering including a frequency/time transform and, typically, a time domain aliasing cancellation operation such as overlap and add and/or a synthesis-side windowing operation to finally obtain the output audio signal.
Traditionally, efficient speech coding has been based on Linear Predictive Coding (LPC) to model the resonant effects of the human vocal tract together with an efficient coding of the residual excitation signal. Both LPC and excitation parameters are transmitted from the encoder to the decoder. This principle is illustrated in FIGS. 17a and 17b. 
FIG. 17a indicates the encoder-side of an encoding/decoding system based on linear predictive coding. The speech input is input into an LPC analyzer 1701, which provides, at its output, LPC filter coefficients. Based on these LPC filter coefficients, an LPC filter 1703 is adjusted. The LPC filter outputs a spectrally whitened audio signal, which is also termed “prediction error signal”. This spectrally whitened audio signal is input into a residual/excitation coder 1705, which generates excitation parameters. Thus, the speech input is encoded into excitation parameters on the one hand, and LPC coefficients on the other hand.
On the decoder-side illustrated in FIG. 17b, the excitation parameters are input into an excitation decoder 1707, which generates an excitation signal, which can be input into an LPC synthesis filter. The LPC synthesis filter is adjusted using the transmitted LPC filter coefficients. Thus, the LPC synthesis filter 1709 generates a reconstructed or synthesized speech output signal.
Over time, many methods have been proposed with respect to an efficient and perceptually convincing representation of the residual (excitation) signal, such as Multi-Pulse Excitation (MPE), Regular Pulse Excitation (RPE), and Code-Excited Linear Prediction (CELP).
Linear Predictive Coding attempts to produce an estimate of the current sample value of a sequence based on the observation of a certain number of past values as a linear combination of the past observations. In order to reduce redundancy in the input signal, the encoder LPC filter “whitens” the input signal in its spectral envelope, i.e. it is a model of the inverse of the signal's spectral envelope. Conversely, the decoder LPC synthesis filter is a model of the signal's spectral envelope. Specifically, the well-known auto-regressive (AR) linear predictive analysis is known to model the signal's spectral envelope by means of an all-pole approximation.
Typically, narrow band speech coders (i.e. speech coders with a sampling rate of 8 kHz) employ an LPC filter with an order between 8 and 12. Due to the nature of the LPC filter, a uniform frequency resolution is effective across the full frequency range. This does not correspond to a perceptual frequency scale.
In order to combine the strengths of traditional LPC/CELP-based coding (best quality for speech signals) and the traditional filterbank-based perceptual audio coding approach (best for music), a combined coding between these architectures has been proposed. In the AMR-WB+ (AMR-WB=Adaptive Multi-Rate WideBand) coder B. Bessette, R. Lefebvre, R. Salami, “UNIVERSAL SPEECH/AUDIO CODING USING HYBRID ACELP/TCX TECHNIQUES,” Proc. IEEE ICASSP 2005, pp. 301-304, 2005 two alternate coding kernels operate on an LPC residual signal. One is based on ACELP (ACELP=Algebraic Code Excited Linear Prediction) and thus is extremely efficient for coding of speech signals. The other coding kernel is based on TCX (TCX=Transform Coded Excitation), i.e. a filterbank based coding approach resembling the traditional audio coding techniques in order to achieve good quality for music signals. Depending on the characteristics of the input signal signals, one of the two coding modes is selected for a short period of time to transmit the LPC residual signal. In this way, frames of 80 ms duration can be split into subframes of 40 ms or 20 ms in which a decision between the two coding modes is made.
The AMR-WB+ (AMR-WB+=extended Adaptive Multi-Rate WideBand codec), cf. 3GPP (3GPP=Third Generation Partnership Project) technical specification number 26.290, version 6.3.0, June 2005, can switch between the two essentially different modes ACELP and TCX. In the ACELP mode a time domain signal is coded by algebraic code excitation. In the TCX mode a fast Fourier transform (FFT=fast Fourier transform) is used and the spectral values of the LPC weighted signal (from which the excitation signal is derived at the decoder) are coded based on vector quantization.
The decision, which modes to use, can be taken by trying and decoding both options and comparing the resulting signal-to-noise ratios (SNR=Signal-to-Noise Ratio).
This case is also called the closed loop decision, as there is a closed control loop, evaluating both coding performances and/or efficiencies, respectively, and then choosing the one with the better SNR by discarding the other.
It is well-known that for audio and speech coding applications a block transform without windowing is not feasible. Therefore, for the TCX mode the signal is windowed with a low overlap window with an overlap of ⅛th. This overlapping region is necessary, in order to fade-out a prior block or frame while fading-in the next, for example to suppress artifacts due to uncorrelated quantization noise in consecutive audio frames. This way the overhead compared to non-critical sampling is kept reasonably low and the decoding necessary for the closed-loop decision reconstructs at least ⅞th of the samples of the current frame.
The AMR-WB+ introduces ⅛th of overhead in a TCX mode, i.e. the number of spectral values to be coded is ⅛th higher than the number of input samples. This provides the disadvantage of an increased data overhead. Moreover, the frequency response of the corresponding band pass filters is disadvantageous, due to the steep overlap region of ⅛th of consecutive frames.
In order to elaborate more on the code overhead and overlap of consecutive frames, FIG. 18 illustrates a definition of window parameters. The window shown in FIG. 18 has a rising edge part on the left-hand side, which is denoted with “L” and also called left overlap region, a center region which is denoted by “1”, which is also called a region of 1 or bypass part, and a falling edge part, which is denoted by “R” and also called the right overlap region. Moreover, FIG. 18 shows an arrow indicating the region “PR” of perfect reconstruction within a frame. Furthermore, FIG. 18 shows an arrow indicating the length of the transform core, which is denoted by “T”.
FIG. 19 shows a view graph of a sequence of AMR-WB+windows and at the bottom a table of window parameters according to FIG. 18. The sequence of windows shown at the top of FIG. 19 is ACELP, TCX20 (for a frame of 20 ms duration), TCX20, TCX40 (for a frame of 40 ms duration), TCX80 (for a frame of 80 ms duration), TCX20, TCX20, ACELP, ACELP.
From the sequence of windows the varying overlapping regions can be seen, which overlap by exactly ⅛th of the center part M. The table at the bottom of FIG. 19 also shows that the transform length “T” is by ⅛th larger than the region of new perfectly reconstructed samples “PR”. Moreover, it is to be noted that this is not only the case for ACELP to TCX transitions, but also for TCXx to TCXx (where “x” indicates TCX frames of arbitrary length) transitions. Thus, in each block an overhead of ⅛th is introduced, i.e. critical sampling is never achieved.
When switching from TCX to ACELP the window samples are discarded from the FFT-TCX frame in the overlapping region, as for example indicated at the top of FIG. 19 by the region labeled with 1900. When switching from ACELP to TCX the windowed zero-input response (ZIR=zero-input response), which is also indicated by the dotted line 1910 at the top of FIG. 19, is removed at the encoder for windowing and added at the decoder for recovering. When switching from TCX to TCX frames the windowed samples are used for cross-fade. Since the TCX frames can be quantized differently quantization error or quantization noise between consecutive frames can be different and/or independent. Therewith, when switching from one frame to the next without cross-fade, noticeable artifacts may occur, and hence, cross-fade is necessary in order to achieve a certain quality.
From the table at the bottom of FIG. 19 it can be seen, that the cross-fade region grows with a growing length of the frame. FIG. 20 provides another table with illustrations of the different windows for the possible transitions in AMR-WB+. When transiting from TCX to ACELP the overlapping samples can be discarded. When transiting from ACELP to TCX, the zero-input response from the ACELP is removed at the encoder and added at the decoder for recovering.
It is a significant disadvantage of the AMR-WB+ that an overhead of ⅛th is introduced.