The present invention relates to source coding, and particularly, to audio source coding, in which an audio signal is processed by at least two different audio coders having different coding algorithms.
In the context of low bitrate audio and speech coding technology, several different coding techniques have traditionally been employed in order to achieve low bitrate coding of such signals with the best possible subjective quality at a given bitrate. Coders for general music/sound signals aim at optimizing the subjective quality by shaping the spectral (and temporal) characteristics of the quantization error according to a masking threshold curve which is estimated from the input signal by means of a perceptual model (“perceptual audio coding”). On the other hand, coding of speech at very low bitrates has been shown to work very efficiently when it is based on a production model of human speech, i.e. employing Linear Predictive Coding (LPC) to model the resonant effects of the human vocal tract together with an efficient coding of the residual excitation signal.
As a consequence of these two different approaches, general audio coders (like MPEG-1 Layer 3, or MPEG-2/4 Advanced Audio Coding, AAC) usually do not perform as well for speech signals at very low data rates as dedicated LPC-based speech coders due to the lack of exploitation of a speech source model. Conversely, LPC-based speech coders usually do not achieve convincing results when applied to general music signals because of their inability to flexibly shape the spectral envelope of the coding distortion according to a masking threshold curve. In the following, embodiments are described which provide a concept that combines the advantages of both LPC-based coding and perceptual audio coding into a single framework and thus describe unified audio coding that is efficient for both general audio and speech signals.
Traditionally, perceptual audio coders use a filterbank-based approach to efficiently code audio signals and shape the quantization distortion according to an estimate of the masking curve.
FIG. 16a shows the basic block diagram of a monophonic perceptual coding system. An analysis filterbank 1600 is used to map the time domain samples into subsampled spectral components. Dependent on the number of spectral components, the system is also referred to as a subband coder (small number of subbands, e.g. 32) or a transform coder (large number of frequency lines, e.g. 512). A perceptual (“psychoacoustic”) model 1602 is used to estimate the actual time dependent masking threshold. The spectral (“subband” or “frequency domain”) components are quantized and coded 1604 in such a way that the quantization noise is hidden under the actual transmitted signal, and is not perceptible after decoding. This is achieved by varying the granularity of quantization of the spectral values over time and frequency.
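The idea of varying the quantization granularity so that the noise stays under the masking threshold can be illustrated with a minimal sketch. It assumes a plain uniform quantizer (noise power of roughly step²/12) and uses made-up band names, coefficients and masking powers; none of these values come from FIG. 16a.

```python
import math

def step_for_mask(mask_power):
    # Uniform quantization noise has a power of about step**2 / 12, so this
    # step size places the noise power at the (assumed) masking threshold.
    return math.sqrt(12.0 * mask_power)

def quantize(coeffs, step):
    return [round(c / step) for c in coeffs]

def dequantize(indices, step):
    return [i * step for i in indices]

# Hypothetical bands: (spectral coefficients, allowed noise power from the model).
bands = {
    "low": ([0.90, -0.42, 0.31], 1e-4),
    "high": ([0.12, -0.07, 0.05], 1e-2),
}

results = {}
for name, (coeffs, mask_power) in bands.items():
    step = step_for_mask(mask_power)
    indices = quantize(coeffs, step)
    decoded = dequantize(indices, step)
    max_err = max(abs(c - d) for c, d in zip(coeffs, decoded))
    results[name] = (step, indices, max_err)
    print(name, "step=%.4f" % step, indices, "max error=%.4f" % max_err)
```

A band with a higher masking threshold receives a coarser step and therefore fewer distinct index values to entropy-code, which is where the bitrate saving would come from in this simplified picture.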
The quantized and entropy-encoded spectral coefficients or subband values are, together with side information, input into a bitstream formatter 1606, which provides an encoded audio signal which is suitable for being transmitted or stored. The output bitstream of block 1606 can be transmitted via the Internet or can be stored on any machine readable data carrier.
On the decoder-side, a decoder input interface 1610 receives the encoded bitstream. Block 1610 separates entropy-encoded and quantized spectral/subband values from side information. The encoded spectral values are input into an entropy-decoder such as a Huffman decoder which is positioned between 1610 and 1620. The output of this entropy decoder is quantized spectral values. These quantized spectral values are input into a re-quantizer which performs an “inverse” quantization as indicated at 1620 in FIG. 16a. The output of block 1620 is input into a synthesis filterbank 1622, which performs a synthesis filtering including a frequency/time transform and, typically, a time domain aliasing cancellation operation such as overlap and add and/or a synthesis-side windowing operation to finally obtain the output audio signal.
FIGS. 16b, 16c indicate an alternative to the entire filterbank based perceptual coding concept of FIG. 16a, in which a pre-filtering approach on the encoder-side and a post-filtering approach on the decoder-side are implemented.
In [Edl00], a perceptual audio coder has been proposed which separates the aspects of irrelevance reduction (i.e. noise shaping according to perceptual criteria) and redundancy reduction (i.e. obtaining a mathematically more compact representation of information) by using a so-called pre-filter rather than a variable quantization of the spectral coefficients over frequency. The principle is illustrated in FIG. 16b. The input signal is analyzed by a perceptual model 1602 to compute an estimate of the masking threshold curve over frequency. The masking threshold is converted into a set of pre-filter coefficients such that the magnitude of the pre-filter's frequency response is inversely proportional to the masking threshold. The pre-filter operation applies this set of coefficients to the input signal, which produces an output signal in which all frequency components are represented according to their perceptual importance (“perceptual whitening”). This signal is subsequently coded by any kind of audio coder 1632 which produces a “white” quantization distortion, i.e. does not apply any perceptual noise shaping. The transmission/storage of the audio signal includes both the coder's bitstream and a coded version of the pre-filtering coefficients. In the decoder of FIG. 16c, the coder bitstream is decoded (1634) into the perceptually whitened audio signal which contains additive white quantization noise. This signal is then subjected to a post-filtering operation 1640 according to the transmitted filter coefficients. Since the post-filter performs the inverse filtering process relative to the pre-filter, it reconstructs the original audio input signal from the perceptually whitened signal. The additive white quantization noise is spectrally shaped like the masking curve by the post-filter and thus appears perceptually colored at the decoder output, as intended.
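The pre-/post-filter relationship can be sketched with a toy example. It assumes a first-order FIR pre-filter with a fixed, hypothetical coefficient A (not derived from any masking threshold) and Gaussian noise as a stand-in for the “white” distortion of coder 1632; it is meant only to show that the post-filter inverts the pre-filter and that the noise at the output is the post-filtered version of the white noise.

```python
import random

A = 0.8  # hypothetical pre-filter coefficient, not derived from a masking model

def pre_filter(x, a=A):
    # FIR pre-filter: y[n] = x[n] - a * x[n-1]
    y, prev = [], 0.0
    for s in x:
        y.append(s - a * prev)
        prev = s
    return y

def post_filter(y, a=A):
    # IIR post-filter, the exact inverse: x[n] = y[n] + a * x[n-1]
    x, prev = [], 0.0
    for s in y:
        prev = s + a * prev
        x.append(prev)
    return x

random.seed(1)
signal = [random.uniform(-1.0, 1.0) for _ in range(256)]
whitened = pre_filter(signal)

# Stand-in for the coder's spectrally flat quantization distortion.
noise = [random.gauss(0.0, 0.01) for _ in whitened]
decoded = post_filter([w + n for w, n in zip(whitened, noise)])

# The error at the decoder output is the noise shaped by the post-filter.
shaped_noise = [d - s for d, s in zip(decoded, signal)]
```

Because both filters are linear, `shaped_noise` equals `post_filter(noise)` up to rounding, so the spectral shape of the distortion is controlled entirely by the transmitted filter coefficients.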
Since in such a scheme perceptual noise shaping is achieved via the pre-/post-filtering step rather than frequency dependent quantization of spectral coefficients, the concept can be generalized to include non-filterbank-based coding mechanisms for representing the pre-filtered audio signal rather than a filterbank-based audio coder. In [Sch02] this is shown for a time domain coding kernel using predictive and entropy coding stages.
In order to enable appropriate spectral noise shaping by using pre-/post-filtering techniques, it is important to adapt the frequency resolution of the pre-/post-filter to that of the human auditory system. Ideally, the frequency resolution would follow well-known perceptual frequency scales, such as the BARK or ERB frequency scale [Zwi]. This is especially desirable in order to minimize the order of the pre-/post-filter model and thus the associated computational complexity and side information transmission rate.
The adaptation of the pre-/post-filter frequency resolution can be achieved by the well-known frequency warping concept [KHL97]. Essentially, the unit delays within a filter structure are replaced by (first or higher order) allpass filters which leads to a non-uniform deformation (“warping”) of the frequency response of the filter. It has been shown that even by using a first-order allpass filter, e.g.
$$\frac{z^{-1} - \lambda}{1 - \lambda z^{-1}},$$
a quite accurate approximation of perceptual frequency scales is possible by an appropriate choice of the allpass coefficients [SA99]. Thus, most known systems do not make use of higher-order allpass filters for frequency warping. A first-order allpass filter is fully determined by a single scalar parameter (which will be referred to as the “warping factor” λ, with −1 < λ < 1), which determines the deformation of the frequency scale. For example, for a warping factor of λ = 0, no deformation is effective, i.e. the filter operates on the regular frequency scale. The higher the warping factor is chosen, the more frequency resolution is focused on the lower frequency part of the spectrum (as it may be used to approximate a perceptual frequency scale), and taken away from the higher frequency part of the spectrum.
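The effect of the warping factor can be made concrete with a short sketch. It evaluates the negative phase response of the first-order allpass given above, which maps a frequency ω on the regular scale to a warped frequency ω + 2·arctan(λ·sin ω / (1 − λ·cos ω)). This closed form is the standard phase relation for this allpass and is not quoted from the cited references; the function name and the sample values of λ are illustrative assumptions.

```python
import math

def warp_frequency(omega, lam):
    """Map omega (radians/sample, 0..pi) through the first-order allpass.

    Returns the negative phase of (z^-1 - lam) / (1 - lam * z^-1) at
    z = exp(j * omega), for a warping factor -1 < lam < 1.
    """
    return omega + 2.0 * math.atan2(lam * math.sin(omega),
                                    1.0 - lam * math.cos(omega))

# Print the warped position (as a fraction of pi) of five equally spaced
# frequencies for a few illustrative warping factors.
for lam in (0.0, 0.5, 0.75):
    row = [warp_frequency(k * math.pi / 4, lam) / math.pi for k in range(5)]
    print("lambda=%.2f" % lam, ["%.3f" % v for v in row])
```

For λ = 0 the mapping is the identity. For positive λ the low frequencies are stretched over a larger part of the warped axis, which corresponds to the increased low-frequency resolution described above, while the end points 0 and π stay fixed.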
Using a warped pre-/post-filter, audio coders typically use a filter order between 8 and 20 at common sampling rates like 48 kHz or 44.1 kHz [WSKH05].
Several other applications of warped filtering have been described, e.g. modeling of room impulse responses [HKS00] and parametric modeling of a noise component in the audio signal (under the equivalent name Laguerre/Kautz filtering) [SOB03].
Traditionally, efficient speech coding has been based on Linear Predictive Coding (LPC) to model the resonant effects of the human vocal tract together with an efficient coding of the residual excitation signal [VM06]. Both LPC and excitation parameters are transmitted from the encoder to the decoder. This principle is illustrated in FIGS. 17a and 17b. 
FIG. 17a indicates the encoder-side of an encoding/decoding system based on linear predictive coding. The speech input is input into an LPC analyzer 1701 which provides, at its output, LPC filter coefficients. Based on these LPC filter coefficients, an LPC filter 1703 is adjusted. The LPC filter outputs a spectrally whitened audio signal which is also termed “prediction error signal”. This spectrally whitened audio signal is input into a residual/excitation coder 1705 which generates excitation parameters. Thus, the speech input is encoded into excitation parameters on the one hand, and LPC coefficients on the other hand.
On the decoder-side illustrated in FIG. 17b, the excitation parameters are input into an excitation decoder 1707 which generates an excitation signal which can be input into an inverse LPC filter. The inverse LPC filter is adjusted using the transmitted LPC filter coefficients. Thus, the inverse LPC filter 1709 generates a reconstructed or synthesized speech output signal.
Over time, many methods have been proposed with respect to an efficient and perceptually convincing representation of the residual (excitation) signal, such as Multi-Pulse Excitation (MPE), Regular Pulse Excitation (RPE), and Code-Excited Linear Prediction (CELP).
Linear Predictive Coding attempts to produce an estimate of the current sample value of a sequence based on the observation of a certain number of past values as a linear combination of the past observations. In order to reduce redundancy in the input signal, the encoder LPC filter “whitens” the input signal in its spectral envelope, i.e. it is a model of the inverse of the signal's spectral envelope. Conversely, the decoder LPC filter is a model of the signal's spectral envelope. Specifically, the well-known auto-regressive (AR) linear predictive analysis is known to model the signal's spectral envelope by means of an all-pole approximation.
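The analysis/synthesis principle of FIGS. 17a and 17b can be sketched as follows. The example assumes the autocorrelation method with the Levinson-Durbin recursion, a synthetic first-order autoregressive test signal in place of speech, and no quantization of the residual; the function names and the predictor order are illustrative choices, not taken from the figures.

```python
import random

def autocorrelation(x, max_lag):
    n = len(x)
    return [sum(x[i] * x[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return predictor coefficients a[0..order-1] for lags 1..order."""
    a = [0.0] * (order + 1)  # a[0] is unused; a[k] weights x[n-k]
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        k_m = acc / err  # reflection coefficient of stage m
        new_a = a[:]
        new_a[m] = k_m
        for k in range(1, m):
            new_a[k] = a[k] - k_m * a[m - k]
        a = new_a
        err *= 1.0 - k_m * k_m
    return a[1:]

def lpc_analysis_filter(x, a):
    # Encoder side: prediction error e[n] = x[n] - sum_k a[k] * x[n-1-k]
    return [x[n] - sum(a[k] * x[n - 1 - k]
                       for k in range(len(a)) if n - 1 - k >= 0)
            for n in range(len(x))]

def lpc_synthesis_filter(e, a):
    # Decoder side: y[n] = e[n] + sum_k a[k] * y[n-1-k]
    y = []
    for n in range(len(e)):
        pred = sum(a[k] * y[n - 1 - k]
                   for k in range(len(a)) if n - 1 - k >= 0)
        y.append(e[n] + pred)
    return y

# Synthetic, strongly correlated test signal (a stand-in for speech).
random.seed(0)
signal, prev = [], 0.0
for _ in range(2000):
    prev = 0.9 * prev + random.gauss(0.0, 1.0)
    signal.append(prev)

coeffs = levinson_durbin(autocorrelation(signal, 2), 2)
residual = lpc_analysis_filter(signal, coeffs)
rebuilt = lpc_synthesis_filter(residual, coeffs)

signal_energy = sum(s * s for s in signal)
residual_energy = sum(e * e for e in residual)
print("residual/signal energy: %.3f" % (residual_energy / signal_energy))
```

The residual carries far less energy than the input, which is the redundancy reduction the analysis filter is meant to provide, and the synthesis filter restores the input from the unquantized residual. A real coder would quantize the residual, for instance with one of the excitation models named above.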
Typically, narrow band speech coders (i.e. speech coders with a sampling rate of 8 kHz) employ an LPC filter with an order between 8 and 12. Due to the nature of the LPC filter, a uniform frequency resolution is effective across the full frequency range. This does not correspond to a perceptual frequency scale.
Noticing that a non-uniform frequency sensitivity, as it is offered by warping techniques, may offer advantages also for speech coding, there have been proposals to substitute the regular LPC analysis by warped predictive analysis, e.g. [TMK94] [KTK95]. Other combinations of warped LPC and CELP coding are known, e.g. from [HLM99].
In order to combine the strengths of traditional LPC/CELP-based coding (best quality for speech signals) and the traditional filterbank-based perceptual audio coding approach (best for music), a combined coding between these architectures has been proposed. In the AMR-WB+ coder [BLS05] two alternate coding kernels operate on an LPC residual signal. One is based on ACELP (Algebraic Code Excited Linear Prediction) and thus is extremely efficient for coding of speech signals. The other coding kernel is based on TCX (Transform Coded Excitation), i.e. a filterbank based coding approach resembling the traditional audio coding techniques in order to achieve good quality for music signals. Depending on the characteristics of the input signal, one of the two coding modes is selected for a short period of time to transmit the LPC residual signal. In this way, frames of 80 ms duration can be split into subframes of 40 or 20 ms in which a decision between the two coding modes is made.
A limitation of this approach is that the process is based on a hard switching decision between two coders/coding schemes which possess extremely different characteristics regarding the type of introduced coding distortion. This hard switching process may cause annoying discontinuities in perceived signal quality when switching from one mode to another. For example, when a speech signal is slowly cross-faded into a music signal (such as after an announcement in a broadcasting program), the point of switching may be detectable. Similarly, for speech over music (like for announcements with music background), the hard switching may become audible. With this architecture, it is thus hard to obtain a coder which can smoothly fade between the characteristics of the two component coders.
Recently, a combination of switched coding has also been described that permits the filterbank-based coding kernel to operate on a perceptually weighted frequency scale by fading the coder's filter between a traditional LPC mode (as it is appropriate for CELP-based speech coding) and a warped mode which resembles perceptual audio coding based on pre-/post-filtering, as discussed in EP 1873754.
Using a filter with variable frequency warping, it is possible to build a combined speech/audio coder which achieves both high speech and audio coding quality in the following way as indicated in FIG. 17c: 
The decision about the coding mode to be used (“Speech mode” or “Music mode”) is performed in a separate module 1726 by carrying out an analysis of the input signal and can be based on known techniques for discriminating speech signals from music. As a result, the decision module produces a decision about the coding mode and an associated optimum warping factor for the filter 1722. Furthermore, depending on this decision, it determines a set of suitable filter coefficients which are appropriate for the input signal at the chosen coding mode, i.e. for coding of speech, an LPC analysis is performed (with no warping, or a low warping factor) whereas for coding of music, a masking curve is estimated and its inverse is converted into warped spectral coefficients.
The filter 1722 with the time varying warping characteristics is used as a common encoder/decoder filter and is applied to the signal depending on the coding mode decision/warping factor and the set of filter coefficients produced by the decision module.
The output signal of the filtering stage is coded by either a speech coding kernel 1724 (e.g. CELP coder) or a generic audio coder kernel 1726 (e.g. a filterbank-based coder, or a predictive audio coder), or both, depending on the coding mode.
The information to be transmitted/stored comprises the coding mode decision (or an indication of the warping factor), the filter coefficients in some coded form, and the information delivered by the speech/excitation and the generic audio coder.
In the corresponding decoder, the outputs of the residual/excitation decoder and the generic audio decoder are added up and the output is filtered by the time varying warped synthesis filter, based on the coding mode, warping factor and filter coefficients.
Due to the hard switching decision between two coding modes, the scheme is, however, still subject to similar limitations as the switched CELP/filterbank-based coding as they were described previously. With this architecture, it is hard to obtain a coder which can smoothly fade between the characteristics of the two component coders.
Another way of combining a speech coding kernel with a generic perceptual audio coder is used for MPEG-4 Large-Step Scalable Audio Coding [Gri97] [Her02]. The idea of scalable coding is to provide coding/decoding schemes and bitstream formats that allow meaningful decoding of subsets of a full bitstream, resulting in a reduced quality output signal. In this, the transmitted/decoded data rate can be adapted to the instantaneous transmission channel capacity without a re-encoding of the input signal.
The structure of an MPEG-4 large-step scalable audio coder is depicted by FIG. 18 [Gri97]. This configuration comprises both a so-called core coder 1802 and several enhancement layers based on perceptual audio coding modules 1804. The core coder (typically a narrow band speech coder) operates at a lower sampling rate than the subsequent enhancement layers. The scalable combination of these components works as follows:
The input signal is down-sampled 1801 and encoded by the core coder 1802. The produced bitstream constitutes the core layer portion 1804 of the scalable bitstream. It is decoded locally 1806 and upsampled 1808 to match the sampling rate of the perceptual enhancement layers and passed through the analysis filterbank (MDCT) 1810.
In a second signal path, the delay (1812) compensated input signal is passed through the analysis filterbank 1814 and used to compute the residual coding error signal.
The residual signal is passed through a Frequency Selective Switch (FSS) tool 1816 which permits falling back to the original signal on a scalefactor band basis if this can be coded more efficiently than the residual signal.
The spectral coefficients are quantized/coded by an AAC coding kernel 1804, leading to an enhancement layer bitstream 1818.
Further stages of refinement (enhancement layers) by re-coding of the residual coding error signal can follow.
FIG. 19 illustrates the structure of the associated core-based scalable decoder. The composite bit-stream is decomposed 1902 into the individual coding layers. Decoding 1904 of the core coder bitstream (e.g. a speech coder bitstream) is then performed and its output signal may be presented via an optional post filter stage. In order to use the core decoder signal within the scalable decoding process, it is upsampled 1908 to the sampling rate of the scalable coder, delay compensated 1910 with respect to the other layers and de-composed by the coder analysis filterbank (MDCT) 1912.
Higher layer bitstreams are then decoded 1916 by applying the AAC noiseless decoding and inverse quantization, and summing up 1918 all spectral coefficient contributions. A Frequency Selective Switch tool 1920 combines the resulting spectral coefficients with the contribution from the core layer by selecting either the sum of them or only the coefficients originating from the enhancement layers as signaled from the encoder. Finally, the result is mapped back to a time domain representation by the synthesis filterbank (IMDCT) 1922.
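The interplay of the Frequency Selective Switch on the encoder and decoder side can be sketched as follows. The sketch assumes that band energy serves as a proxy for coding cost, which is a simplification chosen here and not the criterion of the standard, and it omits quantization, so the enhancement layer is lossless. All function names and spectral values are hypothetical.

```python
def energy(values):
    return sum(v * v for v in values)

def fss_encode(original_bands, core_bands):
    """Per band, send either the residual (original - core) or the original.

    Assumption: the candidate with the lower energy is cheaper to code.
    Returns one flag per band (True = residual sent) and the band payloads.
    """
    flags, payload = [], []
    for orig, core in zip(original_bands, core_bands):
        residual = [o - c for o, c in zip(orig, core)]
        use_residual = energy(residual) < energy(orig)
        flags.append(use_residual)
        payload.append(residual if use_residual else list(orig))
    return flags, payload

def fss_decode(flags, payload, core_bands):
    """Per band, add the core contribution only where the residual was sent."""
    out = []
    for use_residual, enh, core in zip(flags, payload, core_bands):
        if use_residual:
            out.append([e + c for e, c in zip(enh, core)])
        else:
            out.append(list(enh))
    return out

# Hypothetical MDCT coefficients for two scalefactor bands.
original = [[1.0, 0.5], [0.2, -0.1]]
core = [[0.9, 0.45], [-0.3, 0.4]]  # the core decoder output fits band 0 only

flags, payload = fss_encode(original, core)
decoded = fss_decode(flags, payload, core)
print(flags)    # band 0 uses the residual, band 1 falls back to the original
print(decoded)
```

In the second band the core contribution is ignored by the decoder, which corresponds to the case in which only the coefficients originating from the enhancement layer are selected.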
As a general characteristic, the speech coder (core coder) is used and decoded in this configuration. Only if a decoder has access not only to the core layer of the bitstream but also to one or more enhancement layers are contributions from the perceptual audio coders in the enhancement layers also transmitted, which can provide a good quality for non-speech/music signals.
Consequently, this scalable configuration includes an active layer containing a speech coder which leads to some drawbacks regarding its performance to provide best overall quality for both speech and audio signals:
If the input signal is a signal that predominantly consists of speech, the perceptual audio coder in the enhancement layer(s) codes a residual/difference signal that has properties that may be quite different from those of regular audio signals and that is thus hard to code for this type of coder. As one example, the residual signal may contain components which are impulsive in nature and therefore provoke pre-echoes when coded with a filterbank-based perceptual audio coder.
If the input signal is not predominantly speech, the residual signal frequently necessitates more bitrate to code than the input signal. In these cases, the FSS selects the original signal for coding by the enhancement layer rather than the residual signal. Consequently, the core layer does not contribute to the output signal and the bitrate of the core layer is spent in vain since it does not contribute to an improvement of the overall quality. In other words, in such cases the result sounds worse than if the entire bitrate had simply been allocated to a perceptual audio coder only.
In http://www.hitech-projects.com/euprojects/ardor/summary.htm
the ARDOR (Adaptive Rate-Distortion Optimised sound codeR) codec is described as follows:
Within the project, a codec is created that encodes generic audio with the most appropriate combination of signal models, given the imposed constraints as well as the available subcoders. The work can be divided into three parts corresponding to the three codec components as illustrated in FIG. 20.
A rate-distortion-theory based optimization mechanism 2004 that configures the ARDOR codec such that it operates most efficiently given the current, time-varying, constraints and type of input signal. For this purpose it controls:
A set of ‘subcoding’ strategies 2000, each of which is highly efficient for encoding a particular type of input-signal component, e.g., tonal, noisy, or transient signals. The appropriate rate and signal-component allocation for each particular subcoding strategy is based on:
An advanced, new perceptual distortion measure 2002 that provides a perceptual criterion for the rate-distortion optimization mechanism. In other words, a perceptual model, which is based on state-of-the-art knowledge about the human auditory system, provides the optimization mechanism with information about the perceptual relevance of different parts of the sound. The optimization algorithm could for example decide to leave out information that is perceptually irrelevant. Consequently, the original signal cannot be restored, but the auditory system will not be able to perceive the difference.
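A rate-distortion based choice between subcoders can be illustrated with a minimal sketch. It assumes the common Lagrangian formulation, in which the candidate minimizing D + λ·R is selected; the quoted project summary does not state that ARDOR uses this particular cost. The subcoder names, rates and distortion values below are invented for illustration.

```python
def select_subcoder(candidates, lam):
    """Pick the candidate with the lowest Lagrangian cost D + lam * R.

    candidates maps a subcoder name to (rate_in_bits, perceptual_distortion).
    lam trades distortion against rate; a larger lam favours cheaper coders.
    """
    return min(candidates,
               key=lambda name: candidates[name][1] + lam * candidates[name][0])

# Hypothetical measurements for one signal segment.
candidates = {
    "tonal": (40, 0.10),
    "noise": (24, 0.30),
    "transient": (64, 0.05),
}

for lam in (0.001, 0.01, 0.05):
    print("lambda=%.3f ->" % lam, select_subcoder(candidates, lam))
```

With a small λ (bits are cheap) the most accurate candidate is chosen, and with a large λ (bits are expensive) the cheapest one is chosen, which mirrors the adaptation to time-varying constraints described above.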
The above discussion of several known systems underlines that there does not yet exist an optimum encoding strategy which, on the one hand provides optimum quality for general audio signals as well as speech signals, and which on the other hand, provides a low bitrate for all kinds of signals. Particularly, the scalable approach as discussed in connection with FIG. 18 and FIG. 19 which has been standardized in MPEG-4 continuously processes the whole audio signal using a speech coder core without paying attention to the audio signal and, specifically, to the source of the audio signal. Therefore, when the audio signal is not speech-like, the core encoder will introduce heavy coding artifacts and, consequently, the frequency selective switch tool 1816 in FIG. 18 will make sure that the full audio signal is encoded using the AAC encoder core 1804. Thus, in this instance, the bitstream includes the useless output of the speech core coder, and additionally includes the perceptually encoded representation of the audio signal. This not only results in a waste of transmission bandwidth, but also results in a high and useless power consumption, which is particularly problematic when the encoding concept is to be implemented in mobile devices which are battery-powered and have limited resources of energy.
Generally stated, the transform-based perceptual encoder operates without paying attention to the source of the audio signal. As a result, for all available sources of signals, the perceptual audio encoder (when operating at a moderate bit rate) can generate an output without too many coding artifacts, but for non-stationary signal portions, the bitrate increases, since the masking threshold does not mask as efficiently as in stationary sounds. Furthermore, the inherent compromise between time resolution and frequency resolution in transform-based audio encoders renders this coding system problematic for transient or impulse-like signal components, since these signal components would necessitate a high time resolution and would not necessitate a high frequency resolution.
The speech coder, however, is a prominent example of a coding concept that is heavily based on a source model. Thus, a speech coder resembles a model of the speech source, and is, therefore, in the position to provide a highly efficient parametric representation for signals originating from a sound source similar to the source model represented by the coding algorithm. For sounds originating from sources which do not coincide with the speech coder source model, the output will include heavy artifacts or, when the bitrate is allowed to increase, will exhibit a bitrate which is drastically increased and substantially higher than the bitrate of a general audio coder.