In the context of low bitrate audio and speech coding technology, several different coding techniques have traditionally been employed in order to achieve low bitrate coding of such signals with the best possible subjective quality at a given bitrate. Coders for general music/sound signals aim at optimizing the subjective quality by shaping the spectral (and temporal) distribution of the quantization error according to a masking threshold curve, which is estimated from the input signal by means of a perceptual model (“perceptual audio coding”). On the other hand, coding of speech at very low bitrates has been shown to work very efficiently when it is based on a production model of human speech, i.e. when it employs Linear Predictive Coding (LPC) to model the resonant effects of the human vocal tract together with an efficient coding of the residual excitation signal.
As a consequence of these two different approaches, general audio coders, like MPEG-1 Layer 3 (MPEG=Moving Picture Experts Group) or MPEG-2/4 Advanced Audio Coding (AAC), usually do not perform as well for speech signals at very low data rates as dedicated LPC-based speech coders, due to the lack of exploitation of a speech source model. Conversely, LPC-based speech coders usually do not achieve convincing results when applied to general music signals because of their inability to flexibly shape the spectral envelope of the coding distortion according to a masking threshold curve. In the following, concepts are described which combine the advantages of both LPC-based coding and perceptual audio coding into a single framework and thus enable unified audio coding that is efficient for both general audio and speech signals.
Traditionally, perceptual audio coders use a filterbank-based approach to efficiently code audio signals and shape the quantization distortion according to an estimate of the masking curve.
FIG. 16 shows the basic block diagram of a monophonic perceptual coding system. An analysis filterbank 1600 is used to map the time domain samples into subsampled spectral components. Depending on the number of spectral components, the system is also referred to as a subband coder (small number of subbands, e.g. 32) or a transform coder (large number of frequency lines, e.g. 512). A perceptual (“psychoacoustic”) model 1602 is used to estimate the actual time dependent masking threshold. The spectral (“subband” or “frequency domain”) components are quantized and coded 1604 in such a way that the quantization noise is hidden under the actual transmitted signal and is not perceptible after decoding. This is achieved by varying the granularity of quantization of the spectral values over time and frequency.
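The following minimal Python sketch illustrates this principle: a uniform quantizer whose step size per spectral line is derived from the masking threshold, so that coarser quantization (and thus fewer bits) is used wherever more noise is masked. The function names and the simple step-size rule are illustrative assumptions, not the scheme of any particular standard.

```python
import numpy as np

def quantize_spectrum(spectrum, masking_threshold):
    # A uniform quantizer with step size q produces a noise power of roughly
    # q^2 / 12, so choosing q = sqrt(12 * threshold) keeps the quantization
    # noise power per spectral line just below the masked noise power.
    step = np.sqrt(12.0 * masking_threshold)
    indices = np.round(spectrum / step).astype(int)  # integers to be entropy-coded
    return indices, step

def dequantize_spectrum(indices, step):
    return indices * step

# Toy usage: 512 spectral lines and a masking threshold rising with frequency
# (purely illustrative values), so high-frequency lines are coded more coarsely.
spectrum = np.random.randn(512)
threshold = np.linspace(1e-4, 1e-2, 512)
indices, step = quantize_spectrum(spectrum, threshold)
reconstructed = dequantize_spectrum(indices, step)
```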
The quantized and entropy-encoded spectral coefficients or subband values are input, together with side information, into a bitstream formatter 1606, which provides an encoded audio signal suitable for being transmitted or stored. The output bitstream of block 1606 can be transmitted via the Internet or can be stored on any machine readable data carrier.
On the decoder-side, a decoder input interface 1610 receives the encoded bitstream. Block 1610 separates the entropy-encoded and quantized spectral/subband values from the side information. The encoded spectral values are input into an entropy decoder, such as a Huffman decoder, which is positioned between blocks 1610 and 1620. The outputs of this entropy decoder are quantized spectral values. These quantized spectral values are input into a requantizer, which performs an “inverse” quantization as indicated at 1620 in FIG. 16. The output of block 1620 is input into a synthesis filterbank 1622, which performs a synthesis filtering including a frequency/time transform and, typically, a time domain aliasing cancellation operation such as overlap and add and/or a synthesis-side windowing operation to finally obtain the output audio signal.
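As a sketch of this final synthesis step, the following assumes the frequency/time transform has already produced 50%-overlapping time-domain blocks and shows only the synthesis-side windowing and overlap-add; the sine window and the block sizes are illustrative assumptions.

```python
import numpy as np

def overlap_add(blocks, hop):
    # Apply a synthesis window to each time-domain block and overlap-add
    # the windowed blocks; with 50% overlap (hop = block length // 2) the
    # sine window gives a smooth cross-fade between adjacent blocks.
    n = blocks.shape[1]
    window = np.sin(np.pi * (np.arange(n) + 0.5) / n)
    out = np.zeros(hop * (len(blocks) - 1) + n)
    for i, block in enumerate(blocks):
        out[i * hop : i * hop + n] += window * block
    return out

# Toy usage: four blocks of 256 samples with a 128-sample hop.
blocks = np.random.randn(4, 256)
signal = overlap_add(blocks, hop=128)
```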
Traditionally, efficient speech coding has been based on Linear Predictive Coding (LPC) to model the resonant effects of the human vocal tract together with an efficient coding of the residual excitation signal. Both LPC and excitation parameters are transmitted from the encoder to the decoder. This principle is illustrated in FIGS. 17a and 17b. 
FIG. 17a illustrates the encoder-side of an encoding/decoding system based on linear predictive coding. The speech input is input into an LPC analyzer 1701, which provides, at its output, LPC filter coefficients. Based on these LPC filter coefficients, an LPC filter 1703 is adjusted. The LPC filter outputs a spectrally whitened audio signal, which is also termed the “prediction error signal”. This spectrally whitened audio signal is input into a residual/excitation coder 1705, which generates excitation parameters. Thus, the speech input is encoded into excitation parameters on the one hand, and LPC coefficients on the other hand.
On the decoder-side illustrated in FIG. 17b, the excitation parameters are input into an excitation decoder 1707, which generates an excitation signal that is input into an LPC synthesis filter. The LPC synthesis filter 1709 is adjusted using the transmitted LPC filter coefficients and thus generates a reconstructed or synthesized speech output signal.
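A minimal Python sketch of this analysis/synthesis principle follows, using the autocorrelation method with a Levinson-Durbin recursion to obtain the LPC coefficients; the coding of the residual (block 1705) is omitted, and all names as well as the filter order of 10 are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_coeffs(x, order):
    # Autocorrelation method: r[k] for lags 0..order, then the
    # Levinson-Durbin recursion for the coefficients of A(z).
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coefficient
        prev = a.copy()
        for j in range(1, i + 1):
            a[j] = prev[j] + k * prev[i - j]
        err *= 1.0 - k * k
    return a

# Encoder side (cf. blocks 1701 and 1703): the analysis filter A(z)
# whitens the input, yielding the prediction error (excitation) signal.
speech = np.random.randn(8000)          # stand-in for a speech segment
a = lpc_coeffs(speech, order=10)
residual = lfilter(a, [1.0], speech)

# Decoder side (cf. block 1709): the all-pole synthesis filter 1/A(z)
# re-imposes the spectral envelope on the (decoded) excitation.
synthesized = lfilter([1.0], a, residual)
```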
Over time, many methods have been proposed with respect to an efficient and perceptually convincing representation of the residual (excitation) signal, such as Multi-Pulse Excitation (MPE), Regular Pulse Excitation (RPE), and Code-Excited Linear Prediction (CELP).
Linear Predictive Coding estimates the current sample of a sequence as a linear combination of a certain number of past samples. In order to reduce redundancy in the input signal, the encoder LPC filter “whitens” the input signal with respect to its spectral envelope, i.e. it is a model of the inverse of the signal's spectral envelope. Conversely, the decoder LPC synthesis filter is a model of the signal's spectral envelope. Specifically, auto-regressive (AR) linear predictive analysis models the signal's spectral envelope by means of an all-pole approximation.
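Expressed in formulas (using the standard convention with a_0 = 1, matching the sign convention of the sketch above):

```latex
\hat{x}(n) = -\sum_{k=1}^{p} a_k\, x(n-k), \qquad
e(n) = x(n) - \hat{x}(n) = \sum_{k=0}^{p} a_k\, x(n-k), \quad a_0 = 1 .
```

The analysis (whitening) filter is thus A(z) = 1 + Σ_{k=1..p} a_k z^{-k}, and the decoder's synthesis filter is its all-pole inverse 1/A(z).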
Typically, narrow band speech coders (i.e. speech coders with a sampling rate of 8 kHz) employ an LPC filter with an order between 8 and 12. Due to the nature of the LPC filter, the frequency resolution is uniform across the full frequency range, which does not correspond to a perceptual frequency scale.
In order to combine the strengths of traditional LPC/CELP-based coding (best quality for speech signals) and the traditional filterbank-based perceptual audio coding approach (best for music), a combined coding scheme between these architectures has been proposed. In the AMR-WB+ (AMR-WB=Adaptive Multi-Rate WideBand) coder (B. Bessette, R. Lefebvre, R. Salami, “Universal Speech/Audio Coding Using Hybrid ACELP/TCX Techniques”, Proc. IEEE ICASSP 2005, pp. 301-304, 2005), two alternative coding kernels operate on an LPC residual signal. One is based on ACELP (ACELP=Algebraic Code Excited Linear Prediction) and is thus extremely efficient for coding of speech signals. The other coding kernel is based on TCX (TCX=Transform Coded Excitation), i.e. a filterbank-based coding approach resembling traditional audio coding techniques, in order to achieve good quality for music signals. Depending on the characteristics of the input signal, one of the two coding modes is selected for a short period of time to transmit the LPC residual signal. In this way, frames of 80 ms duration can be split into subframes of 40 ms or 20 ms, in which a decision between the two coding modes is made.
The AMR-WB+ codec (AMR-WB+=extended Adaptive Multi-Rate WideBand codec), cf. 3GPP (3GPP=Third Generation Partnership Project) technical specification number 26.290, version 6.3.0, June 2005, can switch between the two essentially different modes ACELP and TCX. In the ACELP mode, a time domain signal is coded by algebraic code excitation. In the TCX mode, a fast Fourier transform (FFT) is used, and the spectral values of the LPC weighted signal (from which the LPC excitation can be derived) are coded based on vector quantization.
The decision which mode to use can be taken by encoding and decoding both options and comparing the resulting segmental signal-to-noise ratios (SNR=Signal-to-Noise Ratio).
This is also called a closed loop decision, as there is a closed control loop that evaluates both coding performances or efficiencies and then chooses the mode with the better SNR.
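A sketch of such a closed loop decision is given below; the segment length, the SNR averaging, and the two mode codecs (passed in as callables) are illustrative assumptions rather than the exact AMR-WB+ procedure.

```python
import numpy as np

def segmental_snr(original, decoded, seg_len=256):
    # Average the per-segment SNR (in dB) over the frame.
    snrs = []
    for s in range(0, len(original) - seg_len + 1, seg_len):
        sig = np.sum(original[s:s + seg_len] ** 2)
        err = np.sum((original[s:s + seg_len] - decoded[s:s + seg_len]) ** 2)
        snrs.append(10.0 * np.log10((sig + 1e-12) / (err + 1e-12)))
    return float(np.mean(snrs))

def closed_loop_decision(frame, run_acelp, run_tcx):
    # Encode AND decode the frame with both kernels, then keep the mode
    # whose reconstruction has the better segmental SNR.
    acelp_out = run_acelp(frame)
    tcx_out = run_tcx(frame)
    if segmental_snr(frame, acelp_out) >= segmental_snr(frame, tcx_out):
        return "ACELP", acelp_out
    return "TCX", tcx_out

# Toy usage with stand-in codecs that just add different noise levels:
frame = np.random.randn(1024)
mode, decoded = closed_loop_decision(
    frame,
    run_acelp=lambda f: f + 0.01 * np.random.randn(len(f)),
    run_tcx=lambda f: f + 0.10 * np.random.randn(len(f)),
)
```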
It is well-known that for audio and speech coding applications a block transform without windowing is not feasible. Therefore, for the TCX mode the signal is windowed with a low overlap window having an overlap of ⅛th. This overlapping region is useful in order to fade out the prior block or frame while fading in the next, for example to suppress artifacts due to uncorrelated quantization noise in consecutive audio frames. This way the overhead compared to non-critical sampling is kept reasonably low, and the decoding used for the closed-loop decision reconstructs at least ⅞th of the samples of the current frame.
The AMR-WB+ thus introduces ⅛th of overhead in TCX mode, i.e. the number of spectral values to be coded is ⅛th higher than the number of input samples. This entails the disadvantage of an increased data overhead. Moreover, the frequency response of the corresponding band pass filters is disadvantageous, due to the steep overlap region of ⅛th between consecutive frames.
In order to elaborate more on the coding overhead and the overlap of consecutive frames, FIG. 18 illustrates a definition of window parameters. The window shown in FIG. 18 has a rising edge part on the left-hand side, which is denoted by “L” and also called the left overlap region, a center region in which the window is equal to one, which is denoted by “M” and also called the bypass part, and a falling edge part, which is denoted by “R” and also called the right overlap region. Moreover, FIG. 18 shows an arrow indicating the region “PR” of perfect reconstruction within a frame. Furthermore, FIG. 18 shows an arrow indicating the length of the transform core, which is denoted by “T”.
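Using these parameters, such a low overlap window can be sketched as follows; the sine-shaped flanks and the concrete lengths are illustrative assumptions.

```python
import numpy as np

def low_overlap_window(L, M, R):
    # Rising edge of length L, flat bypass part of length M (window equal
    # to one), falling edge of length R (cf. the FIG. 18 parameters).
    rise = np.sin(np.pi * (np.arange(L) + 0.5) / (2 * L))
    fall = np.cos(np.pi * (np.arange(R) + 0.5) / (2 * R))
    return np.concatenate([rise, np.ones(M), fall])

# Illustrative only: flanks of 1/8th of the center part, as in the text.
M = 256
w = low_overlap_window(L=M // 8, M=M, R=M // 8)
```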
FIG. 19 shows, at the top, a graphical representation of a sequence of AMR-WB+ windows and, at the bottom, a table of window parameters according to FIG. 18. The sequence of windows shown at the top of FIG. 19 is ACELP, TCX20 (for a frame of 20 ms duration), TCX20, TCX40 (for a frame of 40 ms duration), TCX80 (for a frame of 80 ms duration), TCX20, TCX20, ACELP, ACELP.
From the sequence of windows the varying overlapping regions can be seen, which overlap by exactly ⅛th of the center part M. The table at the bottom of FIG. 19 also shows that the transform length “T” is ⅛th larger than the region of new perfectly reconstructed samples “PR”. Moreover, it is to be noted that this is not only the case for ACELP to TCX transitions, but also for TCXx to TCXx transitions (where “x” indicates TCX frames of arbitrary length). Thus, in each block an overhead of ⅛th is introduced, i.e. critical sampling is never achieved.
When switching from TCX to ACELP, the window samples of the FFT-TCX frame in the overlapping region are discarded, as for example indicated at the top of FIG. 19 by the region labeled 1900. When switching from ACELP to TCX, the zero-input response (ZIR=zero-input response), which is also indicated by the dotted line 1910 at the top of FIG. 19, is removed at the encoder before windowing and added again at the decoder for recovery. When switching from TCX to TCX frames, the windowed samples are used for cross-fade. Since the TCX frames can be quantized differently, the quantization error or quantization noise between consecutive frames can be different and/or independent. Consequently, when switching from one frame to the next without cross-fade, noticeable artifacts may occur; hence, cross-fade is useful in order to achieve a certain quality.
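The TCX-to-TCX cross-fade itself amounts to the following; a linear fade over the overlap region is assumed here purely for brevity (the actual windows have the flank shapes discussed above).

```python
import numpy as np

def cross_fade(prev_tail, next_head):
    # Blend the R overlap samples of the fading-out frame with the
    # L overlap samples of the fading-in frame (L == R assumed), so that
    # independently quantized noise does not switch abruptly.
    n = len(prev_tail)
    fade_in = (np.arange(n) + 0.5) / n
    return (1.0 - fade_in) * prev_tail + fade_in * next_head
```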
From the table at the bottom of FIG. 19 it can be seen that the cross-fade region grows with a growing length of the frame. FIG. 20 provides another table with illustrations of the different windows for the possible transitions in AMR-WB+. When transitioning from TCX to ACELP, the overlapping samples can be discarded. When transitioning from ACELP to TCX, the zero-input response from the ACELP can be removed at the encoder and added at the decoder for recovery.
In the following, audio coding will be considered which utilizes time-domain (TD=Time-Domain) and frequency-domain (FD=Frequency-Domain) coding and which can switch between the two coding domains. In FIG. 21, a timeline is shown during which a first frame 2101 is encoded by an FD-coder, followed by another frame 2103, which is encoded by a TD-coder and which overlaps in region 2102 with the first frame 2101. The time-domain encoded frame 2103 is followed by a frame 2105, which is encoded in the frequency-domain again and which overlaps in region 2104 with the preceding frame 2103. The overlap regions 2102 and 2104 occur whenever the coding domain is switched.
The purpose of these overlap regions is to smooth out the transitions. Nevertheless, overlap regions can be prone to a loss of coding efficiency and to artifacts. Therefore, overlap regions or transitions are often chosen as a compromise between some overhead of transmitted information, i.e. coding efficiency, and the quality of the transition, i.e. the audio quality of the decoded signal. In making this compromise, care should be taken when handling the transitions and designing the transition windows 2111, 2113 and 2115 indicated in FIG. 21.
Conventional concepts for managing transitions between frequency-domain and time-domain coding modes use, for example, cross-fade windows, i.e. they introduce an overhead as large as the overlap region. A cross-fade window fades out the preceding frame while simultaneously fading in the following frame. Due to its overhead, this approach reduces coding efficiency, since whenever a transition takes place the signal is no longer critically sampled. Critically sampled lapped transforms are for example disclosed in J. Princen, A. Bradley, “Analysis/Synthesis Filter Bank Design Based on Time Domain Aliasing Cancellation”, IEEE Trans. ASSP, vol. ASSP-34, no. 5, pp. 1153-1161, 1986, and are for example used in AAC (AAC=Advanced Audio Coding), cf. Generic Coding of Moving Pictures and Associated Audio: Advanced Audio Coding, International Standard 13818-7, ISO/IEC JTC1/SC29/WG11 Moving Picture Experts Group, 1997.
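To make the critical sampling argument concrete, the following sketch implements a direct-form MDCT/IMDCT pair with 50%-overlapping sine windows (which satisfy the Princen-Bradley condition); the time-domain aliasing left by each inverse transform cancels in the overlap-add, so only N coefficients are transmitted per N new samples. The direct matrix form is chosen for clarity; real codecs use fast algorithms.

```python
import numpy as np

def mdct(x):
    # 2N windowed time samples -> N coefficients (critically sampled).
    n2 = len(x); n = n2 // 2
    t, k = np.arange(n2), np.arange(n)
    basis = np.cos(np.pi / n * (t[:, None] + 0.5 + n / 2) * (k[None, :] + 0.5))
    return x @ basis

def imdct(X):
    # N coefficients -> 2N time samples containing time-domain aliasing,
    # which is cancelled by overlap-adding adjacent inverse transforms.
    n = len(X); n2 = 2 * n
    t, k = np.arange(n2), np.arange(n)
    basis = np.cos(np.pi / n * (t[:, None] + 0.5 + n / 2) * (k[None, :] + 0.5))
    return (2.0 / n) * (basis @ X)

N = 64
win = np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))  # Princen-Bradley window
x = np.random.randn(3 * N)
out = np.zeros_like(x)
for i in range(2):                       # two 50%-overlapping 2N frames
    frame = x[i * N : i * N + 2 * N]
    out[i * N : i * N + 2 * N] += win * imdct(mdct(win * frame))
# The fully overlapped middle N samples are perfectly reconstructed.
assert np.allclose(out[N : 2 * N], x[N : 2 * N])
```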
Moreover, non-aliased cross-fade transitions are disclosed in Fielder, Louis D., Todd, Craig C., “The Design of a Video Friendly Audio Coding System for Distribution Applications”, Paper Number 17-008, The AES 17th International Conference: High-Quality Audio Coding (August 1999), and in Fielder, Louis D., Davidson, Grant A., “Audio Coding Tools for Digital Television Distribution”, Preprint Number 5104, 108th Convention of the AES (January 2000).
WO 2008/071353 discloses a concept for switching between a time-domain and a frequency-domain encoder. The concept could be applied to any codec based on time-domain/frequency-domain switching. For example, the concept could be applied to time-domain encoding according to the ACELP mode of the AMR-WB+ codec and to the AAC as an example of a frequency-domain codec. FIG. 22 shows a block diagram of a conventional decoder utilizing a frequency-domain decoder in the top branch and a time-domain decoder in the bottom branch. The frequency decoding part is exemplified by an AAC decoder, comprising a re-quantization block 2202 and an inverse modified discrete cosine transform block 2204. In AAC, the modified discrete cosine transform (MDCT=Modified Discrete Cosine Transform) is used as the transform between the time-domain and the frequency-domain. In FIG. 22, the time-domain decoding path is exemplified by an AMR-WB+ decoder 2206 followed by an MDCT block 2208, in order to combine the output of the decoder 2206 with the output of the re-quantizer 2202 in the frequency-domain.
This enables a combination in the frequency-domain, whereby an overlap and add stage, which is not shown in FIG. 22, can be used after the inverse MDCT 2204 in order to combine and cross-fade adjacent blocks, without having to consider whether they were encoded in the time-domain or the frequency-domain.
Another conventional approach disclosed in WO 2008/071353 avoids the MDCT 2208 of FIG. 22, i.e. the DCT-IV and IDCT-IV for the case of time-domain decoding; instead, so-called time-domain aliasing cancellation (TDAC=Time-Domain Aliasing Cancellation) can be used. This is shown in FIG. 23. FIG. 23 shows another decoder having the frequency-domain decoder exemplified as an AAC decoder comprising a re-quantization block 2302 and an IMDCT block 2304. The time-domain path is again exemplified by an AMR-WB+ decoder 2306, here followed by the TDAC block 2308. The decoder shown in FIG. 23 allows a combination of the decoded blocks in the time-domain, i.e. after the IMDCT 2304, since the TDAC 2308 introduces the time aliasing needed for proper combination, i.e. for time aliasing cancellation, directly in the time-domain. To save computation, instead of applying the MDCT to every first and last superframe, i.e. to every 1024 samples, of each AMR-WB+ segment, TDAC may be applied only in the overlap zones or regions of 128 samples. The normal time domain aliasing introduced by the AAC processing is kept, while the corresponding inverse time-domain aliasing is introduced in the AMR-WB+ parts.
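The aliasing introduction itself can be sketched as a simple time-domain folding that reproduces what an MDCT/IMDCT round trip (as in the sketch after the Princen-Bradley reference above) would leave in the overlap region; the 128-sample overlap length and the function names are illustrative, and the exact AMR-WB+ integration differs in detail.

```python
import numpy as np

def introduce_tda(segment, window):
    # Fold a windowed 2N-sample overlap segment so that it carries the same
    # time-domain aliasing pattern an imdct(mdct(.)) round trip would produce:
    # first half minus its own reversal, second half plus its own reversal.
    n = len(segment) // 2
    x = window * segment
    aliased = np.concatenate([x[:n] - x[n - 1::-1], x[n:] + x[:n - 1:-1]])
    return window * aliased

# Overlap-adding such a segment with the neighboring MDCT-coded frame then
# cancels the aliasing of both (128-sample overlap assumed, i.e. 2N = 256).
N = 128
win = np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))
seg = np.random.randn(2 * N)
folded = introduce_tda(seg, win)
```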
Non-aliased cross-fade windows have the disadvantage that they are not coding efficient, because they generate non-critically sampled encoded coefficients and add an overhead of information to encode. Introducing TDA (TDA=Time Domain Aliasing) at the time domain decoder, as for example in WO 2008/071353, reduces this overhead, but can only be applied if the temporal framings of the two coders match each other. Otherwise, the coding efficiency is reduced again. Further, TDA at the decoder's side can be problematic, especially at the starting point of a time domain coder. After a potential reset, a time domain coder or decoder will usually produce a burst of quantization noise because the memories of the time domain coder or decoder, which uses for example LPC (LPC=Linear Predictive Coding), are empty. The decoder will then take a certain time before reaching a permanent or stable state in which it delivers a more uniform quantization noise over time. This noise burst is disadvantageous since it is usually audible.