The present invention is related to audio coding and, particularly, to audio coding relying on switched audio encoders and correspondingly controlled audio decoders, particularly suitable for low-delay applications.
Several audio coding concepts relying on switched codecs are known. One well-known audio coding concept is the so-called Extended Adaptive Multi-Rate-Wideband (AMR-WB+) codec, as described in 3GPP TS 26.290 V10.0.0 (2011-03). The AMR-WB+ audio codec contains all the AMR-WB speech codec modes 1 to 9 as well as AMR-WB VAD and DTX. AMR-WB+ extends the AMR-WB codec by adding TCX, bandwidth extension, and stereo.
The AMR-WB+ audio codec processes input frames of 2048 samples at an internal sampling frequency Fs. The internal sampling frequency is limited to the range of 12800 to 38400 Hz. The 2048-sample frames are split into two critically sampled frequency bands of equal width. This results in two super-frames of 1024 samples each, corresponding to the low frequency (LF) and high frequency (HF) bands. Each super-frame is divided into four 256-sample frames. Sampling at the internal sampling rate is obtained by using a variable sampling conversion scheme, which re-samples the input signal.
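The frame layout described above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the function name is hypothetical, and the internal sampling frequency of 25600 Hz used in the example is an assumed value picked from within the stated range of 12800 to 38400 Hz.

```python
def amr_wb_plus_layout(fs_hz):
    """Frame sizes and durations for the AMR-WB+ layout described above."""
    input_frame = 2048               # samples per input frame at Fs
    band_len = input_frame // 2      # critically sampled split into LF and HF
    frame_len = band_len // 4        # four frames per 1024-sample super-frame
    band_rate_hz = fs_hz / 2.0       # each band runs at half the internal rate
    return {
        "superframe_samples_per_band": band_len,
        "frame_samples": frame_len,
        "superframe_ms": 1000.0 * input_frame / fs_hz,
        "frame_ms": 1000.0 * frame_len / band_rate_hz,
    }


# Assumed example value for Fs; any value in the stated range could be used.
layout = amr_wb_plus_layout(25600)
print(layout)
```

With the assumed Fs of 25600 Hz, each 256-sample frame spans 20 ms, which matches the 20 ms frame length used in the figures discussed below.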
The LF and HF signals are then encoded using two different approaches: the LF signal is encoded and decoded using the “core” encoder/decoder based on switched ACELP and transform coded excitation (TCX). In ACELP mode, the standard AMR-WB codec is used. The HF signal is encoded with relatively few bits (16 bits/frame) using a bandwidth extension (BWE) method. The parameters transmitted from encoder to decoder are the mode selection bits, the LF parameters and the HF parameters. The parameters for each 1024-sample super-frame are decomposed into four packets of identical size. When the input signal is stereo, the left and right channels are combined into a mono signal for ACELP/TCX encoding, whereas the stereo encoding receives both input channels. On the decoder side, the LF and HF bands are decoded separately, after which they are combined in a synthesis filterbank. If the output is restricted to mono only, the stereo parameters are omitted and the decoder operates in mono mode.
The AMR-WB+ codec applies LP analysis for both the ACELP and TCX modes when encoding the LF signal. The LP coefficients are interpolated linearly at every 64-sample subframe. The LP analysis window is a half-cosine of length 384 samples. To encode the core mono signal, either ACELP or TCX coding is used for each frame. The coding mode is selected based on a closed-loop analysis-by-synthesis method. Only 256-sample frames are considered for ACELP, whereas frames of 256, 512 or 1024 samples are possible in TCX mode.
The window used for LPC analysis in AMR-WB+ is illustrated in FIG. 5b. A symmetric LPC analysis window with a look-ahead of 20 ms is used. Look-ahead means that, as illustrated in FIG. 5b, the LPC analysis window for the current frame, indicated at 500, not only extends within the current frame, indicated at 502 between 0 and 20 ms, but also extends into the future frame between 20 and 40 ms. This means that this LPC analysis window necessitates an additional delay of 20 ms, i.e., a whole future frame. Therefore, the look-ahead portion indicated at 504 in FIG. 5b contributes to the systematic delay associated with the AMR-WB+ encoder. In other words, a future frame must be fully available before the LPC analysis coefficients for the current frame 502 can be calculated.
FIG. 5a illustrates a further encoder, the so-called AMR-WB coder and, particularly, the LPC analysis window used for calculating the analysis coefficients for the current frame. Once again, the current frame extends between 0 and 20 ms and the future frame extends between 20 and 40 ms. In contrast to FIG. 5b, the LPC analysis window of AMR-WB indicated at 506 has a look-ahead portion 508 of 5 ms only, i.e., the time distance between 20 ms and 25 ms. Hence, the delay introduced by the LPC analysis is reduced substantially with respect to FIG. 5b. On the other hand, however, it has been found that a larger look-ahead portion of the LPC analysis window results in better LPC coefficients and, therefore, in a smaller energy of the residual signal and a lower bitrate, since the LPC prediction better fits the original signal.
While FIGS. 5a and 5b relate to encoders having only a single analysis window for determining the LPC coefficients for one frame, FIG. 5c illustrates the situation for the G.718 speech coder. The G.718 (06-2008) specification is related to transmission systems and media, digital systems and networks and, in particular, describes digital terminal equipment and the coding of voice and audio signals for such equipment. Specifically, this standard is related to robust narrow-band and wideband embedded variable bitrate coding of speech and audio from 8-32 kbit/s as defined in recommendation ITU-T G.718. The input signal is processed using 20 ms frames. The codec delay depends on the sampling rate of input and output. For a wideband input and wideband output, the overall algorithmic delay of this coding is 42.875 ms. It consists of one 20 ms frame, 1.875 ms delay of the input and output re-sampling filters, 10 ms for the encoder look-ahead, 1 ms of post-filtering delay and 10 ms at the decoder to allow for the overlap-add operation of higher layer transform coding. For a narrow-band input and a narrow-band output, higher layers are not used, but the 10 ms decoder delay is used to improve the coding performance in the presence of frame erasures and for music signals. If the output is limited to layer 2, the codec delay can be reduced by 10 ms. The description of the encoder is as follows. The lower two layers are applied to a pre-emphasized signal sampled at 12.8 kHz, and the upper three layers operate in the input signal domain sampled at 16 kHz. The core layer is based on the code-excited linear prediction (CELP) technology, where the speech signal is modeled by an excitation signal passed through a linear prediction (LP) synthesis filter representing the spectral envelope. The LP filter is quantized in the immittance spectral frequency (ISF) domain using a switched-predictive approach and multi-stage vector quantization.
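The wideband delay figure quoted in the preceding paragraph can be verified by summing its stated contributions. The short Python sketch below does only that; the dictionary keys are hypothetical labels chosen for illustration, while the numbers are the ones given in the text.

```python
# Delay contributions for wideband input and output, in milliseconds,
# exactly as listed in the description above.
g718_wideband_delay_ms = {
    "frame": 20.0,
    "resampling_filters": 1.875,
    "encoder_lookahead": 10.0,
    "post_filtering": 1.0,
    "decoder_overlap_add": 10.0,
}

total_delay_ms = sum(g718_wideband_delay_ms.values())

# When the output is limited to layer 2, the text states that the delay
# can be reduced by 10 ms.
layer2_delay_ms = total_delay_ms - 10.0

print(total_delay_ms, layer2_delay_ms)
```

Of these contributions, only the 10 ms encoder look-ahead stems from the LPC analysis window discussed in connection with FIG. 5c.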
The open-loop pitch analysis is performed by a pitch-tracking algorithm to ensure a smooth pitch contour. Two concurrent pitch evolution contours are compared and the track that yields the smoother contour is selected in order to make the pitch estimation more robust. The frame level pre-processing comprises a high-pass filtering, a sampling conversion to 12800 samples per second, a pre-emphasis, a spectral analysis, a detection of narrow-band inputs, a voice activity detection, a noise estimation, noise reduction, linear prediction analysis, an LP to ISF conversion, and an interpolation, a computation of a weighted speech signal, an open-loop pitch analysis, a background noise update, a signal classification for a coding mode selection and frame erasure concealment. The layer 1 encoding using the selected encoding type comprises an unvoiced coding mode, a voiced coding mode, a transition coding mode, a generic coding mode, and a discontinuous transmission and comfort noise generation (DTX/CNG).
A long-term prediction or linear prediction (LP) analysis using the auto-correlation approach determines the coefficients of the synthesis filter of the CELP model. In CELP, however, the long-term prediction is usually the “adaptive codebook” and so is different from the linear prediction. The linear prediction can, therefore, be regarded rather as a short-term prediction. The auto-correlation of windowed speech is converted to the LP coefficients using the Levinson-Durbin algorithm. Then, the LPC coefficients are transformed to the immittance spectral pairs (ISP) and subsequently to immittance spectral frequencies (ISF) for quantization and interpolation purposes. The interpolated quantized and unquantized coefficients are converted back to the LP domain to construct synthesis and weighting filters for each subframe. In case of encoding of an active signal frame, two sets of LP coefficients are estimated in each frame using the two LPC analysis windows indicated at 510 and 512 in FIG. 5c. Window 512 is called the “mid-frame LPC window”, and window 510 is called the “end-frame LPC window”. A look-ahead portion 514 of 10 ms is used for the frame-end auto-correlation calculation. The frame structure is illustrated in FIG. 5c. The frame is divided into four subframes, each subframe having a length of 5 ms corresponding to 64 samples at a sampling rate of 12.8 kHz. The windows for frame-end analysis and for mid-frame analysis are centered at the fourth subframe and the second subframe, respectively, as illustrated in FIG. 5c. A Hamming window with a length of 320 samples is used for windowing. The coefficients are defined in G.718, Section 6.4.1. The auto-correlation computation is described in Section 6.4.2. The Levinson-Durbin algorithm is described in Section 6.4.3, the LP to ISP conversion is described in Section 6.4.4, and the ISP to LP conversion is described in Section 6.4.5.
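The chain of windowing, auto-correlation and Levinson-Durbin recursion described above can be sketched as follows. This is a minimal illustration and not the G.718 reference procedure: it assumes a plain symmetric Hamming window rather than the coefficients defined in G.718, Section 6.4.1, it assumes a predictor order of 16, it omits any conditioning of the auto-correlation, and it runs on a synthetic test signal. All function and variable names are hypothetical.

```python
import math
import random

LP_ORDER = 16      # assumed predictor order (not stated in the text above)
WINDOW_LEN = 320   # window length quoted above (25 ms at 12.8 kHz)


def hamming(n):
    """Plain symmetric Hamming window (an assumption, see lead-in)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def autocorrelation(x, order):
    """Auto-correlation r[0..order] of an already windowed signal."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.

    Returns the coefficient list a (with a[0] == 1) and the final
    prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


# Synthetic noisy two-tone signal standing in for one analysis segment.
rng = random.Random(0)
signal = [math.sin(0.1 * n) + 0.5 * math.sin(0.37 * n) + 0.05 * rng.gauss(0.0, 1.0)
          for n in range(WINDOW_LEN)]
windowed = [s * w for s, w in zip(signal, hamming(WINDOW_LEN))]
r = autocorrelation(windowed, LP_ORDER)
lpc, residual_energy = levinson_durbin(r, LP_ORDER)
print(len(lpc), residual_energy < r[0])
```

In an encoder following the two-window scheme of FIG. 5c, such an analysis would be run once per window, i.e., twice per frame, before the conversion to ISP and ISF.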
The speech encoding parameters such as adaptive codebook delay and gain, algebraic codebook index and gain are searched by minimizing the error between the input signal and the synthesized signal in the perceptually weighted domain. Perceptual weighting is performed by filtering the signal through a perceptual weighting filter derived from the LP filter coefficients. The perceptually weighted signal is also used in open-loop pitch analysis.
The G.718 encoder is a pure speech coder having only a single speech coding mode. The G.718 encoder is therefore not a switched encoder, which is disadvantageous in that only a single speech coding mode is available within the core layer. Hence, quality problems will occur when this coder is applied to signals other than speech signals, i.e., to general audio signals, for which the model behind CELP encoding is not appropriate.
An additional switched codec is the so-called USAC codec, i.e., the unified speech and audio codec as defined in ISO/IEC CD 23003-3 dated Sep. 24, 2010. The LPC analysis window used for this switched codec is indicated in FIG. 5d at 516. Again, a current frame extending between 0 and 20 ms is assumed and, therefore, it appears that the look-ahead portion 518 of this codec is 20 ms, i.e., is significantly higher than the look-ahead portion of G.718. Hence, although the USAC encoder provides a good audio quality due to its switched nature, the delay is considerable due to the LPC analysis window look-ahead portion 518 in FIG. 5d. The general structure of USAC is as follows. First, there is a common pre/postprocessing consisting of an MPEG surround (MPEGS) functional unit to handle stereo or multi-channel processing and an enhanced SBR (eSBR) unit which handles the parametric representation of the higher audio frequencies in the input signal. Then, there are two branches, one consisting of a modified advanced audio coding (AAC) tool path and the other consisting of a linear prediction coding (LP or LPC domain) based path, which in turn features either a frequency domain representation or a time-domain representation of the LPC residual. All transmitted spectra for both AAC and LPC are represented in the MDCT domain following quantization and arithmetic coding. The time-domain representation uses an ACELP excitation coding scheme. The ACELP tool provides a way to efficiently represent a time domain excitation signal by combining a long-term predictor (adaptive codeword) with a pulse-like sequence (innovation codeword). The reconstructed excitation is sent through an LP synthesis filter to form a time domain signal. The input to the ACELP tool comprises adaptive and innovation codebook indices, adaptive and innovation code gain values, other control data and inversely quantized and interpolated LPC filter coefficients.
The output of the ACELP tool is the time-domain reconstructed audio signal.
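The ACELP reconstruction described above, i.e., a gain-scaled sum of an adaptive codeword and an innovation codeword passed through an LP synthesis filter, can be illustrated with the following simplified Python sketch. It is not the USAC decoding procedure: codebook lookup, gain decoding and coefficient interpolation are omitted, the filter state handling is simplified, and the function and parameter names are hypothetical.

```python
def acelp_synthesize(adaptive_cw, innovation_cw, gain_pitch, gain_code, lpc, state=None):
    """Build the excitation and run it through the synthesis filter 1/A(z).

    lpc holds a[1..p] of A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p.
    state holds the p most recent output samples, newest first.
    """
    order = len(lpc)
    memory = list(state) if state is not None else [0.0] * order

    # Excitation: long-term (adaptive) part plus pulse-like (innovation) part.
    excitation = [gain_pitch * v + gain_code * c
                  for v, c in zip(adaptive_cw, innovation_cw)]

    output = []
    for e in excitation:
        # All-pole filtering: y[n] = e[n] - sum_k a[k] * y[n - k]
        y = e - sum(a * m for a, m in zip(lpc, memory))
        output.append(y)
        memory = ([y] + memory)[:order]
    return output


# Toy example: a single unit pulse through a first-order filter.
samples = acelp_synthesize(
    adaptive_cw=[0.0, 0.0, 0.0, 0.0],
    innovation_cw=[1.0, 0.0, 0.0, 0.0],
    gain_pitch=0.0,
    gain_code=1.0,
    lpc=[-0.5],
)
print(samples)
```

In the toy example the output is the decaying impulse response of the first-order filter, which makes the role of the synthesis filter as a spectral shaping stage easy to see.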
The MDCT-based TCX decoding tool is used to turn the weighted LP residual representation from an MDCT domain back into a time domain signal and outputs the weighted time-domain signal including weighted LP synthesis filtering. The IMDCT can be configured to support 256, 512 or 1024 spectral coefficients. The input to the TCX tool comprises the (inversely quantized) MDCT spectra, and inversely quantized and interpolated LPC filter coefficients. The output of the TCX tool is the time-domain reconstructed audio signal.
FIG. 6 illustrates a situation in USAC, where the LPC analysis windows 516 for the current frame and 520 for the past or last frame are drawn, and where, in addition, a TCX window 522 is illustrated. The TCX window 522 is centered at the center of the current frame extending between 0 and 20 ms and extends 10 ms into the past frame and 10 ms into the future frame extending between 20 and 40 ms. Hence, the LPC analysis window 516 necessitates an LPC look-ahead portion between 20 and 40 ms, i.e., 20 ms, while the TCX analysis window additionally has a look-ahead portion extending between 20 and 30 ms into the future frame. This means that the delay introduced by the USAC analysis window 516 is 20 ms, while the delay introduced into the encoder by the TCX window is 10 ms. Hence, it becomes clear that the look-ahead portions of both kinds of windows are not aligned to each other. Therefore, even though the TCX window 522 only introduces a delay of 10 ms, the whole delay of the encoder is nevertheless 20 ms due to the LPC analysis window 516. In other words, the quite small look-ahead portion of the TCX window does not reduce the overall algorithmic delay of the encoder, since the total delay is determined by the highest contribution, i.e., the 20 ms by which the LPC analysis window 516 extends into the future frame.
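The observation that the larger of the two look-ahead portions governs the encoder delay can be expressed in a few lines. The Python sketch below merely collects the look-ahead values quoted in this section and takes the maximum per codec; the data layout and function name are hypothetical and serve only as an illustration.

```python
# Look-ahead portions in milliseconds as quoted in this section.
lookahead_ms = {
    "AMR-WB": {"LPC window 506": 5.0},
    "AMR-WB+": {"LPC window 500": 20.0},
    "G.718": {"end-frame LPC window 510": 10.0},
    "USAC": {"LPC window 516": 20.0, "TCX window 522": 10.0},
}


def encoder_lookahead_ms(windows):
    """The encoder must wait for the window reaching furthest into the future."""
    return max(windows.values())


for codec, windows in lookahead_ms.items():
    print(codec, encoder_lookahead_ms(windows))
```

For USAC, the result is 20 ms even though the TCX window alone would need only 10 ms, which is the misalignment pointed out above.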
It is an object of the present invention to provide an improved coding concept for audio coding or decoding which, on the one hand, provides a good audio quality and which, on the other hand, results in a reduced delay.