In the context of low bitrate audio and speech coding technology, several different coding techniques have traditionally been employed in order to achieve low bitrate coding of such signals with the best possible subjective quality at a given bitrate. Coders for general music/sound signals aim at optimizing the subjective quality by shaping the spectral (and temporal) form of the quantization error according to a masking threshold curve which is estimated from the input signal by means of a perceptual model (“perceptual audio coding”). On the other hand, coding of speech at very low bitrates has been shown to work very efficiently when it is based on a production model of human speech, i.e. employing Linear Predictive Coding (LPC) to model the resonant effects of the human vocal tract together with an efficient coding of the residual excitation signal.
As a consequence of these two different approaches, general audio coders (like MPEG-1 Layer 3, or MPEG-2/4 Advanced Audio Coding, AAC) usually do not perform as well for speech signals at very low data rates as dedicated LPC-based speech coders due to the lack of exploitation of a speech source model. Conversely, LPC-based speech coders usually do not achieve convincing results when applied to general music signals because of their inability to flexibly shape the spectral envelope of the coding distortion according to a masking threshold curve. It is the object of the present invention to provide a concept that combines the advantages of both LPC-based coding and perceptual audio coding into a single framework and thus describes unified audio coding that is efficient for both general audio and speech signals.
The following section describes a set of relevant technologies which have been proposed for efficient coding of audio and speech signals.
Perceptual Audio Coding (FIG. 9)
Traditionally, perceptual audio coders use a filterbank-based approach to efficiently code audio signals and shape the quantization distortion according to an estimate of the masking curve.
FIG. 9 shows the basic block diagram of a monophonic perceptual coding system. An analysis filterbank is used to map the time domain samples into subsampled spectral components.
Dependent on the number of spectral components, the system is also referred to as a subband coder (small number of subbands, e.g. 32) or a filterbank-based coder (large number of frequency lines, e.g. 512). A perceptual (“psycho-acoustic”) model is used to estimate the actual time dependent masking threshold. The spectral (“subband” or “frequency domain”) components are quantized and coded in such a way that the quantization noise is hidden under the actual transmitted signal and is not perceptible after decoding. This is achieved by varying the granularity of quantization of the spectral values over time and frequency.
As an alternative to the entirely filterbank-based perceptual coding concept, coding based on the pre-/post-filtering approach has been proposed more recently, as shown in FIG. 10.
In [Edl00], a perceptual audio coder has been proposed which separates the aspects of irrelevance reduction (i.e. noise shaping according to perceptual criteria) and redundancy reduction (i.e. obtaining a mathematically more compact representation of information) by using a so-called pre-filter rather than a variable quantization of the spectral coefficients over frequency. The principle is illustrated in the following figure. The input signal is analyzed by a perceptual model to compute an estimate of the masking threshold curve over frequency. The masking threshold is converted into a set of pre-filter coefficients such that the magnitude of its frequency response is inversely proportional to the masking threshold. The pre-filter operation applies this set of coefficients to the input signal which produces an output signal wherein all frequency components are represented according to their perceptual importance (“perceptual whitening”). This signal is subsequently coded by any kind of audio coder which produces a “white” quantization distortion, i.e. does not apply any perceptual noise shaping. Thus, the transmission/storage of the audio signal includes both the coder's bit-stream and a coded version of the pre-filtering coefficients. In the decoder, the coder bit-stream is decoded into an intermediate audio signal which is then subjected to a post-filtering operation according to the transmitted filter coefficients. Since the post-filter performs the inverse filtering process relative to the pre-filter, it applies a spectral weighting to its input signal according to the masking curve. In this way, the spectrally flat (“white”) coding noise appears perceptually shaped at the decoder output, as intended.
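The pre-/post-filter principle described above can be illustrated with a minimal Python sketch. It is not the scheme of [Edl00] itself: the single filter coefficient is a hypothetical value (a real encoder would derive the coefficients from the masking threshold), the pre-filter is assumed to be an all-zero filter, and a step-size-1 rounding quantizer stands in for "any kind of audio coder" with white distortion. The sketch only shows that the post-filter inverts the pre-filter and that the noise at the decoder output equals the coding error passed through the post-filter.

```python
def pre_filter(x, a):
    """All-zero filter: y[n] = x[n] + sum_k a[k] * x[n-1-k]."""
    return [x[n] + sum(ak * x[n - 1 - k] for k, ak in enumerate(a) if n - 1 - k >= 0)
            for n in range(len(x))]


def post_filter(y, a):
    """Exact inverse (all-pole) filter: x[n] = y[n] - sum_k a[k] * x[n-1-k]."""
    x = []
    for n in range(len(y)):
        past = sum(ak * x[n - 1 - k] for k, ak in enumerate(a) if n - 1 - k >= 0)
        x.append(y[n] - past)
    return x


# Hypothetical coefficient set; assumed here, not derived from a perceptual model.
coeffs = [-0.9]
# Toy input: a sawtooth-like test signal.
signal = [10.0 * ((n % 8) - 3.5) / 3.5 for n in range(64)]

# Encoder side: pre-filter, then a coder that adds unshaped quantization error.
whitened = pre_filter(signal, coeffs)
coded = [float(round(v)) for v in whitened]
quant_error = [c - w for c, w in zip(coded, whitened)]

# Decoder side: the post-filter undoes the pre-filter and shapes the error.
decoded = post_filter(coded, coeffs)
output_noise = [d - s for d, s in zip(decoded, signal)]
shaped_error = post_filter(quant_error, coeffs)
```

Because both filters are linear, `output_noise` and `shaped_error` agree up to rounding: the flat coding error inherits the spectral shape of the post-filter, which in the real scheme follows the masking curve.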
Since in such a scheme perceptual noise shaping is achieved via the pre-/post-filtering step rather than by frequency-dependent quantization of spectral coefficients, the concept can be generalized to use a non-filterbank-based coding mechanism for representing the pre-filtered audio signal in place of a filterbank-based audio coder. In [Sch02] this is shown for a time domain coding kernel using predictive and entropy coding stages.
[Edl00] B. Edler, G. Schuller: “Audio coding using a psycho-acoustic pre- and post-filter”, ICASSP 2000, Volume 2, 5-9 Jun. 2000, pp. II881-II884, vol. 2
[Sch02] G. Schuller, B. Yu, D. Huang, and B. Edler, “Perceptual Audio Coding using Adaptive Pre- and Post-Filters and Lossless Compression”, IEEE Transactions on Speech and Audio Processing, September 2002, pp. 379-390
In order to enable appropriate spectral noise shaping by using pre-/post-filtering techniques, it is important to adapt the frequency resolution of the pre-/post-filter to that of the human auditory system. Ideally, the frequency resolution would follow well-known perceptual frequency scales, such as the BARK or ERB frequency scale [Zwi]. This is especially desirable in order to minimize the order of the pre-/post-filter model and thus the associated computational complexity and side information transmission rate.
The adaptation of the pre-/post-filter frequency resolution can be achieved by the well-known frequency warping concept [KHL97]. Essentially, the unit delays within a filter structure are replaced by (first or higher order) allpass filters, which leads to a non-uniform deformation (“warping”) of the frequency response of the filter. It has been shown that even by using a first-order allpass filter, e.g.

$$\frac{z^{-1}-\lambda}{1-\lambda z^{-1}},$$

a quite accurate approximation of perceptual frequency scales is possible by an appropriate choice of the allpass coefficients [SA99]. Thus, most known systems do not make use of higher-order allpass filters for frequency warping. A first-order allpass filter is fully determined by a single scalar parameter (which will be referred to as the “warping factor” λ, with −1<λ<1), which determines the deformation of the frequency scale. For example, for a warping factor of λ=0, no deformation is effective, i.e. the filter operates on the regular frequency scale. The higher the warping factor is chosen, the more frequency resolution is focused on the lower frequency part of the spectrum (as is necessary to approximate a perceptual frequency scale) and taken away from the higher frequency part of the spectrum. This is shown in FIG. 5 for both positive and negative warping coefficients.
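The effect of the warping factor can be made concrete with a short Python sketch. It evaluates the first-order allpass (z^-1 − λ)/(1 − λ z^-1) on the unit circle and compares its negative phase with the commonly used closed-form warping map ν(ω) = ω + 2·arctan(λ·sin ω / (1 − λ·cos ω)). The function names are illustrative assumptions and the closed form is quoted from the general warping literature, not from this text.

```python
import cmath
import math


def allpass_response(omega, lam):
    """Response of D(z) = (z^-1 - lam) / (1 - lam * z^-1) at z = exp(j * omega)."""
    z_inv = cmath.exp(-1j * omega)
    return (z_inv - lam) / (1.0 - lam * z_inv)


def warped_frequency(omega, lam):
    """Closed-form negative phase of the allpass: the frequency on the warped scale."""
    return omega + 2.0 * math.atan2(lam * math.sin(omega), 1.0 - lam * math.cos(omega))


if __name__ == "__main__":
    # Map five uniformly spaced frequencies (0 .. pi) for three warping factors.
    for lam in (-0.5, 0.0, 0.5):
        mapped = [warped_frequency(k * math.pi / 4, lam) / math.pi for k in range(5)]
        print(lam, [round(v, 3) for v in mapped])
```

For λ=0 the map is the identity. For positive λ, low frequencies are mapped to larger warped frequencies, i.e. the low-frequency region is stretched and receives more resolution, while negative λ does the opposite. The endpoints 0 and π stay fixed in every case.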
Audio coders that use a warped pre-/post-filter typically employ a filter order between 8 and 20 at common sampling rates like 48 kHz or 44.1 kHz [WSKH05].
Several other applications of warped filtering have been described, e.g. modeling of room impulse responses [HKS00] and parametric modeling of a noise component in the audio signal (under the equivalent name Laguerre/Kautz filtering) [SOB03].
[Zwi] Zwicker, E. and H. Fastl, “Psychoacoustics, Facts and Models”, Springer Verlag, Berlin
[KHL97] M. Karjalainen, A. Härmä, U. K. Laine, “Realizable warped IIR filters and their properties”, IEEE ICASSP 1997, pp. 2205-2208, vol. 3
[SA99] J. O. Smith, J. S. Abel, “BARK and ERB Bilinear Transforms”, IEEE Transactions on Speech and Audio Processing, Volume 7, Issue 6, November 1999, pp. 697-708
[HKS00] Härmä, Aki; Karjalainen, Matti; Savioja, Lauri; Välimäki, Vesa; Laine, Unto K.; Huopaniemi, Jyri, “Frequency-Warped Signal Processing for Audio Applications”, Journal of the AES, Volume 48 Number 11 pp. 1011-1031; November 2000
[SOB03] E. Schuijers, W. Oomen, B. den Brinker, J. Breebaart, “Advances in Parametric Coding for High-Quality Audio”, 114th Convention, Amsterdam, The Netherlands 2003, preprint 5852
[WSKH05] S. Wabnik, G. Schuller, U. Krämer, J. Hirschfeld, “Frequency Warping in Low Delay Audio Coding”, IEEE International Conference on Acoustics, Speech, and Signal Processing, Mar. 18-23, 2005, Philadelphia, Pa., USA
LPC-Based Speech Coding
Traditionally, efficient speech coding has been based on Linear Predictive Coding (LPC) to model the resonant effects of the human vocal tract together with an efficient coding of the residual excitation signal [VM06]. Both LPC and excitation parameters are transmitted from the encoder to the decoder. This principle is illustrated in the following figure (encoder and decoder).
Over time, many methods have been proposed with respect to an efficient and perceptually convincing representation of the residual (excitation) signal, such as Multi-Pulse Excitation (MPE), Regular Pulse Excitation (RPE), and Code-Excited Linear Prediction (CELP).
Linear Predictive Coding attempts to produce an estimate of the current sample value of a sequence based on the observation of a certain number of past values as a linear combination of the past observations. In order to reduce redundancy in the input signal, the encoder LPC filter “whitens” the input signal in its spectral envelope, i.e. its frequency response is a model of the inverse of the signal's spectral envelope. Conversely, the frequency response of the decoder LPC filter is a model of the signal's spectral envelope. Specifically, the well-known auto-regressive (AR) linear predictive analysis is known to model the signal's spectral envelope by means of an all-pole approximation.
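The whitening and synthesis roles of the LPC filter can be sketched in plain Python. The example below assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common way to obtain the predictor and is not prescribed by the text. The helper names and the toy input (a decaying exponential, i.e. the impulse response of a one-pole resonator) are likewise illustrative.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite sequence."""
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return coefficients a[0..order] of A(z) = 1 + a1 z^-1 + ... and the residual energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient of stage i
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def whiten(x, a):
    """Encoder-side filter A(z): residual e[n] = sum_k a[k] * x[n-k]."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]


def synthesize(e, a):
    """Decoder-side all-pole filter 1/A(z): rebuilds the signal from the residual."""
    x = []
    for n in range(len(e)):
        x.append(e[n] - sum(a[k] * x[n - k] for k in range(1, len(a)) if n - k >= 0))
    return x


# Toy input with a single resonance at pole position 0.9.
signal = [0.9 ** n for n in range(200)]
r = autocorrelation(signal, 2)
a, residual_energy = levinson_durbin(r, 2)
residual = whiten(signal, a)      # spectrally flattened excitation
rebuilt = synthesize(residual, a)  # all-pole model re-imposes the envelope
```

For this input the analysis recovers the pole (the first predictor coefficient is close to −0.9 and the second close to zero), the residual collapses to the exciting impulse, and the synthesis filter restores the original samples.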
Typically, narrow band speech coders (i.e. speech coders with a sampling rate of 8 kHz) employ an LPC filter with an order between 8 and 12. Due to the nature of the LPC filter, a uniform frequency resolution is effective across the full frequency range. This does not correspond to a perceptual frequency scale.
Warped LPC Coding
Since a non-uniform frequency sensitivity, as offered by warping techniques, may also offer advantages for speech coding, there have been proposals to replace the regular LPC analysis with warped predictive analysis. Specifically, [TMK94] proposes a speech coder that models the speech spectral envelope by cepstral coefficients c(m) which are updated sample by sample according to the time-varying input signal. The frequency scale of the model is adapted to approximate the perceptual MEL scale [Zwi] by using a first-order allpass filter instead of the usual unit delay. A fixed value of 0.31 for the warping coefficient is used at the coder sampling rate of 8 kHz. The approach has been developed further in [KTK95] to include a CELP coding core for representing the excitation signal, again using a fixed value of 0.31 for the warping coefficient at the coder sampling rate of 8 kHz.
Even though the authors claim good performance of the proposed scheme, state-of-the-art speech coding did not adopt the warped predictive coding techniques.
Other combinations of warped LPC and CELP coding are known, e.g. [HLM99], for which a warping factor of 0.723 is used at a sampling rate of 44.1 kHz.
[TMK94] K. Tokuda, H. Matsumura, T. Kobayashi and S. Imai, “Speech coding based on adaptive mel-cepstral analysis,” Proc. IEEE ICASSP'94, pp. 197-200, Apr. 1994.
[KTK95] K. Koishida, K. Tokuda, T. Kobayashi and S. Imai, “CELP coding based on mel-cepstral analysis,” Proc. IEEE ICASSP'95, pp. 33-36, 1995.
[HLM99] Aki Härmä, Unto K. Laine, Matti Karjalainen, “Warped low-delay CELP for wideband audio coding”, 17th International AES Conference, Florence, Italy, 1999
[VM06] Peter Vary, Rainer Martin, “Digital Speech Transmission: Enhancement, Coding and Error Concealment”, published by John Wiley & Sons, LTD, 2006, ISBN 0-471-56018-9
Generalized Warped LPC Coding
The idea of performing speech coding on a warped frequency scale was developed further over the following years. Specifically, it was noticed that a full conventional warping of the spectral analysis according to a perceptual frequency scale may not be appropriate to achieve the best possible quality for coding speech signals. Therefore, a Mel-generalized cepstral analysis was proposed in [KTK96] which allows fading the characteristics of the spectral model between that of the previously proposed mel-cepstral analysis (with a fully warped frequency scale and a cepstral analysis), and the characteristics of a traditional LPC model (with a uniform frequency scale and an all-pole model of the signal's spectral envelope). Specifically, the proposed generalized analysis has two parameters that control these characteristics:

    The parameter γ, −1≦γ≦0, continuously fades between a cepstral-type and an LPC-type of analysis, where γ=0 corresponds to a cepstral-type analysis and γ=−1 corresponds to an LPC-type analysis.

    The parameter α, |α|<1, is the warping factor. A value of α=0 corresponds to a fully uniform frequency scale (like in standard LPC), and a value of α=0.31 corresponds to a full perceptual frequency warping.
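As a reading aid, the role of the two parameters can be summarized by the spectral model usually associated with mel-generalized cepstral analysis. The notation below (model order M, coefficients c(m), warped delay \tilde{z}^{-1}) is assumed for illustration and is not taken from this text or verified against [KTK96]:

```latex
H(z) =
\begin{cases}
\left( 1 + \gamma \sum_{m=0}^{M} c(m)\, \tilde{z}^{-m} \right)^{1/\gamma}, & -1 \le \gamma < 0, \\[6pt]
\exp\!\left( \sum_{m=0}^{M} c(m)\, \tilde{z}^{-m} \right), & \gamma = 0,
\end{cases}
\qquad
\tilde{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}} .
```

Under this assumed form, γ=−1 gives H(z) = 1 / (1 − Σ c(m) \tilde{z}^{-m}), i.e. an all-pole (LPC-type) model, γ=0 gives the exponential (cepstral-type) model, and α=0 reduces \tilde{z}^{-1} to the ordinary unit delay z^{-1}, i.e. a uniform frequency scale.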
The same concept was applied to coding of wideband speech (at a sampling rate of 16 kHz) in [KHT98]. It should be noted that the operating point (γ; α) for such a generalized analysis is chosen a priori and not varied over time.
[KTK96] K. Koishida, K. Tokuda, T. Kobayashi and S. Imai, “CELP coding system based on mel-generalized cepstral analysis,” Proc. ICSLP'96, pp. 318-321, 1996.
[KHT98] K. Koishida, G. Hirabayashi, K. Tokuda, and T. Kobayashi, “A wideband CELP speech coder at 16 kbit/s based on mel-generalized cepstral analysis,” Proc. IEEE ICASSP'98, pp. 161-164, 1998.
A structure comprising both an encoding filter and two alternate coding kernels has been described previously in the literature (“WB-AMR+ Coder” [BLS05]). That structure, however, does not include any notion of using a warped filter, let alone a filter with time-varying warping characteristics.
[BLS05] B. Bessette, R. Lefebvre, R. Salami, “Universal speech/audio coding using hybrid ACELP/TCX techniques,” Proc. IEEE ICASSP 2005, pp. 301-304, 2005.
The disadvantage of all those prior art techniques is that each of them is dedicated to a specific audio coding algorithm. Any speech coder using warping filters is optimally adapted for speech signals, but makes compromises when it comes to encoding general audio signals such as music signals.
On the other hand, general audio coders are optimized to perfectly hide the quantization noise below the masking threshold, i.e., they are optimally adapted to perform an irrelevance reduction. To this end, they have a functionality for accounting for the non-uniform frequency resolution of the human hearing mechanism. However, because they are general audio encoders, they cannot specifically make use of any a priori knowledge about a specific kind of signal pattern, and it is such knowledge that makes possible the very low bitrates known from, e.g., speech coders.
Furthermore, many speech coders are time-domain encoders using fixed and variable codebooks, while most general audio coders are filterbank-based encoders, because the masking threshold is a frequency-domain measure. It is therefore highly problematic to combine both coders in a single encoding/decoding framework in an efficient manner, although time-domain based general audio encoders also exist.