The present invention relates to audio signal encoding and decoding and, in particular, to audio signal processing using parallel frequency domain and time domain encoder/decoder processors.
The perceptual coding of audio signals for the purpose of data reduction for efficient storage or transmission of these signals is a widely used practice. In particular, when the lowest bit rates are to be achieved, the employed coding leads to a reduction of audio quality that is often primarily caused by an encoder-side limitation of the audio signal bandwidth to be transmitted. Here, the audio signal is typically low-pass filtered such that no spectral waveform content remains above a certain pre-determined cut-off frequency.
In contemporary codecs, well-known methods exist for decoder-side signal restoration through audio signal Bandwidth Extension (BWE), e.g., Spectral Band Replication (SBR), which operates in the frequency domain, or so-called Time Domain Bandwidth Extension (TD-BWE), which is a post-processor in speech coders that operates in the time domain.
Additionally, several combined time domain/frequency domain coding concepts exist such as concepts known under the term AMR-WB+ or USAC.
All these combined time domain/frequency domain coding concepts have in common that the frequency domain coder relies on bandwidth extension technologies which impose a band limitation on the input audio signal, and the portion above a cross-over frequency or border frequency is encoded with a low resolution coding concept and synthesized on the decoder-side. Hence, such concepts mainly rely on a pre-processor technology on the encoder side and a corresponding post-processing functionality on the decoder-side.
Typically, the time domain encoder is selected for signals that are usefully encoded in the time domain, such as speech signals, and the frequency domain encoder is selected for non-speech signals, music signals, etc. However, specifically for non-speech signals having prominent harmonics in the high frequency band, the known frequency domain encoders have a reduced accuracy and, therefore, a reduced audio quality due to the fact that such prominent harmonics can only be separately parametrically encoded or are eliminated altogether in the encoding/decoding process.
Furthermore, concepts exist in which the time domain encoding/decoding branch additionally relies on the bandwidth extension which also parametrically encodes an upper frequency range while a lower frequency range is typically encoded using an ACELP or any other CELP related coder, for example a speech coder. This bandwidth extension functionality increases the bitrate efficiency but, on the other hand, introduces further inflexibility due to the fact that both encoding branches, i.e., the frequency domain encoding branch and the time domain encoding branch are band limited due to the bandwidth extension procedure or spectral band replication procedure operating above a certain crossover frequency substantially lower than the maximum frequency included in the input audio signal.
Relevant topics in the state of the art comprise: SBR as a post-processor to waveform decoding [1-3]; MPEG-D USAC core switching [4]; and MPEG-H 3D IGF [5].
The following papers and patents describe methods that are considered to constitute conventional technology for the application:
[1] M. Dietz, L. Liljeryd, K. Kjörling and O. Kunz, “Spectral Band Replication, a novel approach in audio coding,” in 112th AES Convention, Munich, Germany, 2002.
[2] S. Meltzer, R. Böhm and F. Henn, “SBR enhanced audio codecs for digital broadcasting such as ‘Digital Radio Mondiale’ (DRM),” in 112th AES Convention, Munich, Germany, 2002.
[3] T. Ziegler, A. Ehret, P. Ekstrand and M. Lutzky, “Enhancing mp3 with SBR: Features and Capabilities of the new mp3PRO Algorithm,” in 112th AES Convention, Munich, Germany, 2002.
[4] MPEG-D USAC Standard.
[5] PCT/EP2014/065109.
In MPEG-D USAC, a switchable core coder is described. However, in USAC, the band-limited core is restricted to transmitting a low-pass filtered signal at all times. Therefore, certain music signals that contain prominent high frequency content, e.g., full-band sweeps or triangle sounds, cannot be reproduced faithfully.
According to an embodiment, an audio encoder for encoding an audio signal may have: a first encoding processor for encoding a first audio signal portion in a frequency domain, wherein the first encoding processor has: a time frequency converter for converting the first audio signal portion into a frequency domain representation having spectral lines up to a maximum frequency of the first audio signal portion; a spectral encoder for encoding the frequency domain representation; a second encoding processor for encoding a second different audio signal portion in the time domain, wherein the second encoding processor has an associated second sampling rate, wherein the first encoding processor has associated therewith a first sampling rate being different from the second sampling rate; a cross-processor for calculating, from the encoded spectral representation of the first audio signal portion, initialization data of the second encoding processor, so that the second encoding processing is initialized to encode the second audio signal portion immediately following the first audio signal portion in time in the audio signal; wherein the cross-processor has a frequency-time converter for generating a time domain signal at the second sampling rate, wherein the frequency time converter has: a selector for selecting a portion of a spectrum input into the frequency time converter in accordance with a ratio of the first sampling rate and the second sampling rate, a transform processor having a transform length being different from a transform length of the time-frequency converter; and a synthesis windower for windowing using a window having a different number of window coefficients compared to a window used by the time frequency converter; a controller configured for analyzing the audio signal and for determining, which portion of the audio signal is the first audio signal portion encoded in the frequency domain and which portion of the audio signal is the second audio signal 
portion encoded in the time domain; and an encoded signal former for forming an encoded audio signal having a first encoded signal portion for the first audio signal portion and a second encoded signal portion for the second audio signal portion.
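Purely by way of illustration, the selector, the shorter transform and the shorter synthesis window of the cross-processor may be pictured as follows. The sketch assumes that the second sampling rate does not exceed the first one and that an MDCT-like lapped transform is used, so that N spectral lines correspond to a window of 2N coefficients. The function names and the example figures (960 lines at 48 kHz, 12.8 kHz in the time domain branch) are hypothetical and are not taken from the embodiment.

```python
from fractions import Fraction


def select_spectral_lines(spectrum, first_rate, second_rate):
    """Keep the low-frequency lines that fit into the second sampling rate.

    Hypothetical helper: names and error handling are assumptions.
    """
    ratio = Fraction(second_rate, first_rate)
    if ratio > 1:
        raise ValueError("sketch only covers second_rate <= first_rate")
    kept = len(spectrum) * ratio
    if kept.denominator != 1:
        raise ValueError("frame length and rate ratio do not give whole lines")
    return spectrum[:int(kept)]


def transform_sizes(num_lines):
    """Assume an MDCT-like lapped transform: N lines, window of 2N coefficients."""
    return {"transform_length": num_lines, "window_length": 2 * num_lines}


# Example figures (assumed): 960 lines at 48 kHz, time domain branch at 12.8 kHz.
full_band = [float(k) for k in range(960)]
low_band = select_spectral_lines(full_band, 48000, 12800)
sizes = transform_sizes(len(low_band))
```

With these assumed figures, 256 of the 960 lines are kept, so the inverse transform and the synthesis window in the cross-processor are shorter than those of the full-rate time-frequency converter.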
According to another embodiment, an audio decoder for decoding an encoded audio signal may have: a first decoding processor for decoding a first encoded audio signal portion in a frequency domain, the first decoding processor having a frequency-time converter for converting a decoded spectral representation into a time domain to obtain a decoded first audio signal portion; a second decoding processor for decoding a second encoded audio signal portion in the time domain to obtain a decoded second audio signal portion; a cross-processor for calculating, from the decoded spectral representation of the first encoded audio signal portion, initialization data of the second decoding processor, so that the second decoding processor is initialized to decode the encoded second audio signal portion following in time the first audio signal portion in the encoded audio signal; and a combiner for combining the decoded first spectral portion and the decoded second spectral portion to obtain a decoded audio signal, wherein the cross-processor further has a further frequency-time converter operating at a first effective sampling rate being different from a second effective sampling rate associated with the frequency-time converter of the first decoding processor to obtain a further decoded first signal portion in the time domain, wherein the signal output by the further frequency-time converter has the second sampling rate being different from the first sampling rate associated with an output of the frequency-time converter of the first decoding processor, wherein the further frequency-time converter has a selector for selecting a portion of a spectrum input into the further frequency-time converter in accordance with a ratio of the first sampling rate and the second sampling rate; a transform processor having a transform length being different from a transform length of the time-frequency converter of the first decoding processor; and a synthesis windower using a window having a 
different number of coefficients compared to a window used by the frequency-time converter of the first decoding processor.
According to another embodiment, a method of encoding an audio signal may have the steps of: encoding a first audio signal portion in a frequency domain, having the steps of: converting the first audio signal portion into a frequency domain representation having spectral lines up to a maximum frequency of the first audio signal portion; encoding the frequency domain representation; encoding a second different audio signal portion in the time domain, wherein the encoding the second audio signal portion has an associated second sampling rate, wherein the encoding the first audio signal portion has associated therewith a first sampling rate being different from the second sampling rate; calculating, from the encoded spectral representation of the first audio signal portion, initialization data for the step of encoding the second different audio signal portion, so that the step of encoding the second different audio signal portion is initialized to encode the second audio signal portion immediately following the first audio signal portion in time in the audio signal, wherein the calculating has the step of generating, by a frequency-time converter, a time domain signal at the second sampling rate, wherein the generating has the steps of: selecting a portion of a spectrum input into the frequency-time converter in accordance with a ratio of the first sampling rate and the second sampling rate, processing using a transform processor having a transform length being different from a transform length of a time-frequency converter used in the converting the first audio signal portion; and synthesis windowing using a window having a different number of window coefficients compared to a window used by the time frequency converter used in the converting the first audio signal portion; analyzing the audio signal and determining, which portion of the audio signal is the first audio signal portion encoded in the frequency domain and which portion of the audio signal is the second audio signal portion encoded in the time domain; and forming an encoded audio signal having a first encoded signal portion for the first audio signal portion and a second encoded signal portion for the second audio signal portion.
According to another embodiment, a method of decoding an encoded audio signal may have the steps of: decoding, by a first decoding processor, a first encoded audio signal portion in a frequency domain, the decoding having the steps of: converting, by a frequency-time converter, a decoded spectral representation into a time domain to obtain a decoded first audio signal portion; decoding a second encoded audio signal portion in the time domain to obtain a decoded second audio signal portion; calculating, from the decoded spectral representation of the first encoded audio signal portion, initialization data of the step of decoding the second encoded audio signal portion, so that the step of decoding the second encoded audio signal portion is initialized to decode the encoded second audio signal portion following in time the first audio signal portion in the encoded audio signal; and combining the decoded first spectral portion and the decoded second spectral portion to obtain a decoded audio signal, wherein the calculating further has the step of using a further frequency-time converter operating at a first effective sampling rate being different from a second effective sampling rate associated with the frequency-time converter of the first decoding processor to obtain a further decoded first signal portion in the time domain, wherein the signal output by the further frequency-time converter has the second sampling rate being different from the first sampling rate associated with an output of the frequency-time converter of the first decoding processor, wherein the using the further frequency-time converter has the steps of: selecting a portion of a spectrum input into the further frequency-time converter in accordance with a ratio of the first sampling rate and the second sampling rate; using a transform processor having a transform length being different from a transform length of the time-frequency converter of the first decoding processor; and using a synthesis 
windower using a window having a different number of coefficients compared to a window used by the frequency-time converter of the first decoding processor.
Another embodiment may have a non-transitory digital storage medium having a computer program stored thereon to perform the method of encoding an audio signal, having the steps of: encoding a first audio signal portion in a frequency domain, having the steps of: converting the first audio signal portion into a frequency domain representation having spectral lines up to a maximum frequency of the first audio signal portion; encoding the frequency domain representation; encoding a second different audio signal portion in the time domain, wherein the encoding the second audio signal portion has an associated second sampling rate, wherein the encoding the first audio signal portion has associated therewith a first sampling rate being different from the second sampling rate; calculating, from the encoded spectral representation of the first audio signal portion, initialization data for the step of encoding the second different audio signal portion, so that the step of encoding the second different audio signal portion is initialized to encode the second audio signal portion immediately following the first audio signal portion in time in the audio signal, wherein the calculating has the step of generating, by a frequency-time converter, a time domain signal at the second sampling rate, wherein the generating has the steps of: selecting a portion of a spectrum input into the frequency-time converter in accordance with a ratio of the first sampling rate and the second sampling rate, processing using a transform processor having a transform length being different from a transform length of a time-frequency converter used in the converting the first audio signal portion; and synthesis windowing using a window having a different number of window coefficients compared to a window used by the time frequency converter used in the converting the first audio signal portion; analyzing the audio signal and determining, which portion of the audio signal is the first audio signal portion encoded in the frequency domain and which portion of the audio signal is the second audio signal portion encoded in the time domain; and forming an encoded audio signal having a first encoded signal portion for the first audio signal portion and a second encoded signal portion for the second audio signal portion.
Another embodiment may have a non-transitory digital storage medium having a computer program stored thereon to perform the method of decoding an encoded audio signal, having the steps of: decoding, by a first decoding processor, a first encoded audio signal portion in a frequency domain, the decoding having the steps of: converting, by a frequency-time converter, a decoded spectral representation into a time domain to obtain a decoded first audio signal portion; decoding a second encoded audio signal portion in the time domain to obtain a decoded second audio signal portion; calculating, from the decoded spectral representation of the first encoded audio signal portion, initialization data of the step of decoding the second encoded audio signal portion, so that the step of decoding the second encoded audio signal portion is initialized to decode the encoded second audio signal portion following in time the first audio signal portion in the encoded audio signal; and combining the decoded first spectral portion and the decoded second spectral portion to obtain a decoded audio signal, wherein the calculating further has the step of using a further frequency-time converter operating at a first effective sampling rate being different from a second effective sampling rate associated with the frequency-time converter of the first decoding processor to obtain a further decoded first signal portion in the time domain, wherein the signal output by the further frequency-time converter has the second sampling rate being different from the first sampling rate associated with an output of the frequency-time converter of the first decoding processor, wherein the using the further frequency-time converter has the steps of: selecting a portion of a spectrum input into the further frequency-time converter in accordance with a ratio of the first sampling rate and the second sampling rate; using a transform processor having a transform length being different from a transform length 
of the time-frequency converter of the first decoding processor; and using a synthesis windower using a window having a different number of coefficients compared to a window used by the frequency-time converter of the first decoding processor.
The present invention is based on the finding that a time domain encoding/decoding processor can be combined with a frequency domain encoding/decoding processor having a gap filling functionality, where this gap filling functionality for filling spectral holes is operated over the whole band of the audio signal or at least above a certain gap filling frequency. Importantly, the frequency domain encoding/decoding processor is in a position to perform accurate, i.e., waveform or spectral value, encoding/decoding up to the maximum frequency and not only up to a crossover frequency. Furthermore, the full-band capability of the frequency domain encoder for encoding with high resolution allows an integration of the gap filling functionality into the frequency domain encoder.
In one aspect, full band gap filling is combined with a time-domain encoding/decoding processor. In embodiments, the sampling rates in both branches are equal or the sampling rate in the time-domain encoder branch is lower than in the frequency domain branch.
In another aspect, a frequency domain encoder/decoder operating without gap filling but performing a full band core encoding/decoding is combined with a time-domain encoding processor and a cross processor is provided for continuous initialization of the time-domain encoding/decoding processor. In this aspect, the sampling rates can be as in the other aspect, or the sampling rates in the frequency domain branch are even lower than in the time-domain branch.
Hence, in accordance with the present invention, by using the full-band spectral encoder/decoder processor, the problems related to the separation of the bandwidth extension on the one hand and the core coding on the other hand can be addressed and overcome by performing the bandwidth extension in the same spectral domain in which the core decoder operates. Therefore, a full rate core encoder/decoder is provided which encodes and decodes the full audio signal range. This removes the need for a downsampler on the encoder side and an upsampler on the decoder side. Instead, the whole processing is performed in the full sampling rate or full-bandwidth domain. In order to obtain a high coding gain, the audio signal is analyzed in order to find a first set of first spectral portions which has to be encoded with a high resolution, where this first set of first spectral portions may include, in an embodiment, tonal portions of the audio signal. On the other hand, non-tonal or noisy components in the audio signal constituting a second set of second spectral portions are parametrically encoded with low spectral resolution. The encoded audio signal then only necessitates the first set of first spectral portions encoded in a waveform-preserving manner with a high spectral resolution and, additionally, the second set of second spectral portions encoded parametrically with a low resolution using frequency “tiles” sourced from the first set. On the decoder side, the core decoder, which is a full-band decoder, reconstructs the first set of first spectral portions in a waveform-preserving manner, i.e., without any knowledge that there is any additional frequency regeneration. However, the spectrum generated in this way contains many spectral gaps.
These gaps are subsequently filled with the Intelligent Gap Filling (IGF) technology by using a frequency regeneration applying parametric data on the one hand and using a source spectral range, i.e., first spectral portions reconstructed by the full rate audio decoder on the other hand.
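Purely by way of illustration, the encoder-side split into waveform-coded first spectral portions and parametrically coded second spectral portions may be sketched for a single band as follows. The magnitude threshold used to decide which lines survive is an assumption standing in for the tonality analysis, which is not specified here, and all names are hypothetical.

```python
def igf_encode_band(spectrum, start, stop, keep_threshold):
    """Split one band into surviving lines and a low-resolution energy value.

    Assumption: a simple magnitude threshold replaces the real signal analysis.
    """
    band = spectrum[start:stop]
    # First spectral portions: kept for waveform-preserving coding.
    # Second spectral portions: set to zero and described by the energy only.
    kept = [x if abs(x) >= keep_threshold else 0.0 for x in band]
    # Energy over the whole band, covering both kept and zeroed lines.
    band_energy = sum(x * x for x in band)
    return kept, band_energy


spectrum = [0.5, 6.0, -0.5, 1.0]
kept_lines, energy = igf_encode_band(spectrum, 0, 4, keep_threshold=2.0)
```

In this example only the line with value 6.0 survives, and the single energy value 37.5 is what would accompany the band as parametric data.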
In further embodiments, spectral portions, which are reconstructed by noise filling only rather than bandwidth replication or frequency tile filling, constitute a third set of third spectral portions. Due to the fact that the coding concept operates in a single domain for the core coding/decoding on the one hand and the frequency regeneration on the other hand, the IGF is not only restricted to fill up a higher frequency range but can fill up lower frequency ranges, either by noise filling without frequency regeneration or by frequency regeneration using a frequency tile at a different frequency range.
Furthermore, it is emphasized that information on spectral energies, information on individual energies (individual energy information), information on a survive energy (survive energy information), information on a tile energy (tile energy information), or information on a missing energy (missing energy information) may comprise not only an energy value, but also an (e.g., absolute) amplitude value, a level value or any other value from which a final energy value can be derived. Hence, the information on an energy may, e.g., comprise the energy value itself, and/or a value of a level and/or of an amplitude and/or of an absolute amplitude.
A further aspect is based on the finding that the correlation situation is not only important for the source range but is also important for the target range. Furthermore, the present invention acknowledges the situation that different correlation situations can occur in the source range and the target range. When, for example, a speech signal with high frequency noise is considered, the situation can be that the low frequency band comprising the speech signal with a small number of overtones is highly correlated in the left channel and the right channel, when the speaker is placed in the middle. The high frequency portion, however, can be strongly uncorrelated due to the fact that there might be a different high frequency noise on the left side compared to another high frequency noise or no high frequency noise on the right side. Thus, when a straightforward gap filling operation would be performed that ignores this situation, then the high frequency portion would be correlated as well, and this might generate serious spatial segregation artifacts in the reconstructed signal. In order to address this issue, parametric data for a reconstruction band or, generally, for the second set of second spectral portions which have to be reconstructed using a first set of first spectral portions is calculated to identify either a first or a second different two-channel representation for the second spectral portion or, stated differently, for the reconstruction band. On the encoder side, a two-channel identification is, therefore calculated for the second spectral portions, i.e., for the portions, for which, additionally, energy information for reconstruction bands is calculated. 
A frequency regenerator on the decoder side then regenerates a second spectral portion depending on a first portion of the first set of first spectral portions, i.e., the source range, and on parametric data for the second portion, such as spectral envelope energy information or any other spectral envelope data, and, additionally, dependent on the two-channel identification for the second portion, i.e., for the reconstruction band under consideration.
The two-channel identification is advantageously transmitted as a flag for each reconstruction band and this data is transmitted from an encoder to a decoder and the decoder then decodes the core signal as indicated by advantageously calculated flags for the core bands. Then, in an implementation, the core signal is stored in both stereo representations (e.g. left/right and mid/side) and, for the IGF frequency tile filling, the source tile representation is chosen to fit the target tile representation as indicated by the two-channel identification flags for the intelligent gap filling or reconstruction bands, i.e., for the target range.
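Purely by way of illustration, choosing the source tile representation to match the two-channel identification of the target band may be sketched as follows. The mid/side convention with a factor of 0.5 and all names are assumptions made for this sketch only.

```python
def to_mid_side(left, right):
    """Convert a left/right pair to mid/side (0.5 scaling is an assumption)."""
    mid = [0.5 * (l + r) for l, r in zip(left, right)]
    side = [0.5 * (l - r) for l, r in zip(left, right)]
    return mid, side


def pick_source_tile(left, right, src_start, width, target_is_mid_side):
    """Return the source tile pair in the representation flagged for the target."""
    if target_is_mid_side:
        first, second = to_mid_side(left, right)
    else:
        first, second = left, right
    stop = src_start + width
    return first[src_start:stop], second[src_start:stop]


left = [2.0, 4.0, 6.0, 8.0]
right = [2.0, 0.0, -6.0, 4.0]
ms_tile = pick_source_tile(left, right, 1, 2, target_is_mid_side=True)
lr_tile = pick_source_tile(left, right, 1, 2, target_is_mid_side=False)
```

Keeping the core signal available in both representations means that the flag of the target band alone decides which pair of tiles is copied.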
It is emphasized that this procedure not only works for stereo signals, i.e., for a left channel and a right channel, but also operates for multi-channel signals. In the case of multi-channel signals, several pairs of different channels can be processed in that way, such as a left and a right channel as a first pair, a left surround channel and a right surround channel as a second pair, and a center channel and an LFE channel as a third pair. Other pairings can be determined for higher output channel formats such as 7.1, 11.1 and so on.
A further aspect is based on the finding that the audio quality of the reconstructed signal can be improved through IGF since the whole spectrum is accessible to the core encoder so that, for example, perceptually important tonal portions in a high spectral range can still be encoded by the core coder rather than parametric substitution. Additionally, a gap filling operation using frequency tiles from a first set of first spectral portions which is, for example, a set of tonal portions typically from a lower frequency range, but also from a higher frequency range if available, is performed. For the spectral envelope adjustment on the decoder side, however, the spectral portions from the first set of spectral portions located in the reconstruction band are not further post-processed by e.g. the spectral envelope adjustment. Only the remaining spectral values in the reconstruction band which do not originate from the core decoder are to be envelope adjusted using envelope information. Advantageously, the envelope information is a full-band envelope information accounting for the energy of the first set of first spectral portions in the reconstruction band and the second set of second spectral portions in the same reconstruction band, where the latter spectral values in the second set of second spectral portions are indicated to be zero and are, therefore, not encoded by the core encoder, but are parametrically coded with low resolution energy information.
It has been found that absolute energy values, either normalized with respect to the bandwidth of the corresponding band or not normalized, are useful and very efficient in an application on the decoder side. This especially applies when gain factors have to be calculated based on a residual energy in the reconstruction band, the missing energy in the reconstruction band and frequency tile information in the reconstruction band.
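Purely by way of illustration, a gain factor for one reconstruction band may be derived from absolute energies as sketched below. The sketch assumes that the missing energy is the transmitted band energy minus the energy of the surviving core lines, that the gain is the square root of the missing energy divided by the tile energy, and that zero-valued core lines mark the gaps. These choices and all names are assumptions, not the definitive procedure.

```python
import math


def igf_gain(band_energy, survive_energy, tile_energy):
    """Gain that lets the inserted tile lines carry the missing energy."""
    missing_energy = max(band_energy - survive_energy, 0.0)
    if tile_energy <= 0.0:
        return 0.0
    return math.sqrt(missing_energy / tile_energy)


def fill_band(core_band, tile, band_energy):
    """Fill zero lines of the core band with scaled tile lines.

    Lines decoded by the core decoder are left untouched.
    """
    survive_energy = sum(x * x for x in core_band)
    gaps = [i for i, x in enumerate(core_band) if x == 0.0]
    tile_energy = sum(tile[i] ** 2 for i in gaps)
    gain = igf_gain(band_energy, survive_energy, tile_energy)
    return [gain * tile[i] if core_band[i] == 0.0 else core_band[i]
            for i in range(len(core_band))]


core_band = [0.0, 6.0, 0.0, 0.0]
tile = [1.0, 9.0, -2.0, 2.0]
filled = fill_band(core_band, tile, band_energy=38.25)
```

In this example the surviving line 6.0 is kept as decoded, and the three gap lines are scaled by 0.5 so that the band energy again equals the transmitted value.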
Furthermore, it is advantageous that the encoded bitstream not only covers energy information for the reconstruction bands but, additionally, scale factors for scale factor bands extending up to the maximum frequency. This ensures that for each reconstruction band for which a certain tonal portion, i.e., a first spectral portion, is available, this first spectral portion can actually be decoded with the right amplitude. Furthermore, in addition to the scale factor for each reconstruction band, an energy for this reconstruction band is generated in an encoder and transmitted to a decoder. Furthermore, it is advantageous that the reconstruction bands coincide with the scale factor bands or, in case of energy grouping, that at least the borders of a reconstruction band coincide with borders of scale factor bands.
A further implementation of this invention applies a tile whitening operation. Whitening of a spectrum removes the coarse spectral envelope information and emphasizes the spectral fine structure which is of foremost interest for evaluating tile similarity. Therefore, a frequency tile on the one hand and/or the source signal on the other hand are whitened before calculating a cross correlation measure. When only the tile is whitened using a predefined procedure, a whitening flag is transmitted indicating to the decoder that the same predefined whitening process shall be applied to the frequency tile within IGF.
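Purely by way of illustration, a whitening step that removes the coarse spectral envelope may be sketched as follows. The predefined whitening procedure is not specified here; dividing each line by the local root mean square over a sliding window is an assumption made for this sketch, as are the window size and all names.

```python
import math


def whiten(spectrum, half_width=2, eps=1e-12):
    """Divide each line by the local RMS to flatten the coarse envelope.

    Assumption: sliding-window RMS normalisation stands in for the
    predefined whitening procedure.
    """
    n = len(spectrum)
    out = []
    for k in range(n):
        lo = max(0, k - half_width)
        hi = min(n, k + half_width + 1)
        rms = math.sqrt(sum(x * x for x in spectrum[lo:hi]) / (hi - lo))
        out.append(spectrum[k] / (rms + eps))
    return out


# A tile whose level drops by a factor of 8 half way through.
tilted = [8.0, -8.0, 8.0, -8.0, 1.0, -1.0, 1.0, -1.0]
flattened = whiten(tilted, half_width=1)
```

After whitening, the sign pattern, i.e., the fine structure, is retained while the level difference between the two halves of the tile is reduced, which is the property that matters for a subsequent cross correlation measure.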
Regarding the tile selection, it is advantageous to use the lag of the correlation to spectrally shift the regenerated spectrum by an integer number of transform bins. Depending on the underlying transform, the spectral shifting may necessitate additional corrections. In case of odd lags, the tile is additionally modulated through multiplication by an alternating temporal sequence of −1/1 to compensate for the frequency-reversed representation of every other band within the MDCT. Furthermore, the sign of the correlation result is applied when generating the frequency tile.
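Purely by way of illustration, the search for the correlation lag and the application of the correlation sign may be sketched as follows. The sketch uses a plain, non-normalised cross correlation over integer bin lags and does not model the additional modulation for odd lags; all names and the search range are assumptions.

```python
def best_lag(source, target, start, max_lag):
    """Return (lag, correlation) with the largest absolute correlation."""
    width = len(target)
    best = (0, 0.0)
    for lag in range(-max_lag, max_lag + 1):
        lo = start + lag
        if lo < 0 or lo + width > len(source):
            continue  # shifted tile would leave the source spectrum
        corr = sum(s * t for s, t in zip(source[lo:lo + width], target))
        if abs(corr) > abs(best[1]):
            best = (lag, corr)
    return best


def make_tile(source, start, width, lag, corr):
    """Shift the source by the lag and apply the sign of the correlation."""
    sign = -1.0 if corr < 0 else 1.0
    lo = start + lag
    return [sign * x for x in source[lo:lo + width]]


source = [0.0, 1.0, -3.0, 2.0, 0.5, 0.0]
target = [3.0, -2.0]
lag, corr = best_lag(source, target, start=1, max_lag=2)
tile = make_tile(source, 1, len(target), lag, corr)
```

In this example the best match is found one bin above the nominal start with a negative correlation, so the shifted tile is inverted before use.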
Furthermore, it is advantageous to use tile pruning and stabilization in order to make sure that artifacts created by fast changing source regions for the same reconstruction region or target region are avoided. To this end, a similarity analysis among the different identified source regions is performed and when a source tile is similar to other source tiles with a similarity above a threshold, then this source tile can be dropped from the set of potential source tiles since it is highly correlated with other source tiles. Furthermore, as a kind of tile selection stabilization, it is advantageous to keep the tile order from the previous frame if none of the source tiles in the current frame correlate (better than a given threshold) with the target tiles in the current frame.
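Purely by way of illustration, tile pruning and tile selection stabilization may be sketched as follows. The normalised correlation used as similarity measure, the thresholds and all names are assumptions made for this sketch.

```python
import math


def similarity(a, b):
    """Absolute normalised correlation between two tiles (assumed measure)."""
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if den == 0.0:
        return 0.0
    return abs(sum(x * y for x, y in zip(a, b))) / den


def prune_tiles(tiles, threshold):
    """Drop every source tile that is too similar to an already kept tile."""
    kept = []
    for idx, tile in enumerate(tiles):
        if all(similarity(tile, tiles[j]) <= threshold for j in kept):
            kept.append(idx)
    return kept


def choose_order(scores, previous_order, threshold):
    """Keep the previous tile order unless some tile correlates well enough."""
    if max(scores) < threshold:
        return previous_order
    return sorted(range(len(scores)), key=lambda i: -scores[i])


tiles = [[1.0, 0.0, 1.0, 0.0], [2.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
kept = prune_tiles(tiles, threshold=0.9)
order = choose_order([0.1, 0.2, 0.3], previous_order=[2, 1, 0], threshold=0.5)
```

Here the second tile is a scaled copy of the first and is therefore dropped, and the previous order is retained because no score reaches the threshold.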
A further aspect is based on the finding that an improved quality and reduced bitrate specifically for signals comprising transient portions as they occur very often in audio signals is obtained by combining the Temporal Noise Shaping (TNS) or Temporal Tile Shaping (TTS) technology with high frequency reconstruction. The TNS/TTS processing on the encoder-side being implemented by a prediction over frequency reconstructs the time envelope of the audio signal. Depending on the implementation, i.e., when the temporal noise shaping filter is determined within a frequency range not only covering the source frequency range but also the target frequency range to be reconstructed in a frequency regeneration decoder, the temporal envelope is not only applied to the core audio signal up to a gap filling start frequency, but the temporal envelope is also applied to the spectral ranges of reconstructed second spectral portions. Thus, pre-echoes or post-echoes that would occur without temporal tile shaping are reduced or eliminated. This is accomplished by applying an inverse prediction over frequency not only within the core frequency range up to a certain gap filling start frequency but also within a frequency range above the core frequency range. To this end, the frequency regeneration or frequency tile generation is performed on the decoder-side before applying a prediction over frequency. However, the prediction over frequency can either be applied before or subsequent to spectral envelope shaping depending on whether the energy information calculation has been performed on the spectral residual values subsequent to filtering or to the (full) spectral values before envelope shaping.
The TTS processing over one or more frequency tiles additionally establishes a continuity of correlation between the source range and the reconstruction range or in two adjacent reconstruction ranges or frequency tiles.
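The prediction over frequency and its inverse can be sketched as follows. This is a simplified illustration under stated assumptions, not the filter of any particular codec: it uses a real-valued spectrum, an autocorrelation-based linear predictor solved by the Levinson-Durbin recursion, and hypothetical function names. The last function shows the point made above, namely that the inverse prediction runs over the core range and the regenerated range together, after the frequency tiles have been generated.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation of the spectral coefficients for lags 0..max_lag."""
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Prediction coefficients a[1..order] with x[k] predicted as sum a[j]*x[k-j]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def tns_analysis(spectrum, coeffs):
    """Encoder side: prediction over frequency, returns the spectral residual."""
    residual = []
    for k, value in enumerate(spectrum):
        prediction = sum(c * spectrum[k - j]
                         for j, c in enumerate(coeffs, start=1) if k - j >= 0)
        residual.append(value - prediction)
    return residual


def tns_synthesis(residual, coeffs):
    """Decoder side: inverse prediction over frequency (all-pole filter)."""
    shaped = []
    for k, value in enumerate(residual):
        prediction = sum(c * shaped[k - j]
                         for j, c in enumerate(coeffs, start=1) if k - j >= 0)
        shaped.append(value + prediction)
    return shaped


def decode_with_tile_shaping(core_residual, regenerated_residual, coeffs):
    """Run the inverse filter across core and regenerated ranges in one pass."""
    return tns_synthesis(list(core_residual) + list(regenerated_residual), coeffs)
```

Because the synthesis filter carries its state across the border between the core range and the regenerated range, the same temporal envelope is imposed on both, which is the continuity property described above.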
In an implementation, it is advantageous to use complex TNS/TTS filtering. Thereby, the (temporal) aliasing artifacts of a critically sampled real representation, like MDCT, are avoided. A complex TNS filter can be calculated on the encoder-side by applying not only a modified discrete cosine transform but also a modified discrete sine transform to obtain a complex modified transform. Nevertheless, only the modified discrete cosine transform values, i.e., the real part of the complex transform, are transmitted. On the decoder-side, however, it is possible to estimate the imaginary part of the transform using MDCT spectra of preceding or subsequent frames so that, on the decoder-side, the complex filter can again be applied in the inverse prediction over frequency and, specifically, in the prediction over the border between the source range and the reconstruction range and also over the border between frequency-adjacent frequency tiles within the reconstruction range.
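How a complex modified transform can be formed from an MDCT and an MDST on the encoder-side may be sketched as follows. The sketch makes several assumptions: no analysis window is applied, the transforms are unscaled, and the sign of the imaginary part is a free convention that differs between implementations. The decoder-side estimation of the imaginary part from neighboring frames is not shown.

```python
import math


def mdct(frame):
    """Unwindowed, unscaled MDCT of a frame of 2N samples, giving N real values."""
    n_half = len(frame) // 2
    offset = 0.5 + n_half / 2.0
    return [sum(x * math.cos(math.pi / n_half * (n + offset) * (k + 0.5))
                for n, x in enumerate(frame))
            for k in range(n_half)]


def mdst(frame):
    """Unwindowed, unscaled MDST of a frame of 2N samples, giving N real values."""
    n_half = len(frame) // 2
    offset = 0.5 + n_half / 2.0
    return [sum(x * math.sin(math.pi / n_half * (n + offset) * (k + 0.5))
                for n, x in enumerate(frame))
            for k in range(n_half)]


def complex_spectrum(frame):
    """Complex modified transform: MDCT as real part, MDST as imaginary part."""
    return [complex(c, s) for c, s in zip(mdct(frame), mdst(frame))]


def transmitted_values(frame):
    """Only the real part (the MDCT values) would be transmitted."""
    return [value.real for value in complex_spectrum(frame)]
```

In this sketch the complex spectrum is used only for the filter calculation, while `transmitted_values` returns exactly the MDCT coefficients, so the transmitted representation remains critically sampled.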
The inventive audio coding system efficiently codes arbitrary audio signals at a wide range of bitrates. Whereas, for high bitrates, the inventive system converges to transparency, for low bitrates perceptual annoyance is minimized. Therefore, the main share of available bitrate is used to waveform code just the perceptually most relevant structure of the signal in the encoder, and the resulting spectral gaps are filled in the decoder with signal content that roughly approximates the original spectrum. A very limited bit budget is consumed to control the parameter driven so-called spectral Intelligent Gap Filling (IGF) by dedicated side information transmitted from the encoder to the decoder.
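The decoder-side filling of a spectral gap can be illustrated by the following minimal sketch. It assumes that the side information consists of a single energy value per gap and that the gap is filled by copying a source tile of equal width. The function name and parameters are hypothetical and greatly simplified compared to an actual IGF implementation.

```python
import math


def fill_gap(spectrum, source_start, target_start, width, target_energy):
    """Fill zeroed lines of a spectral gap with a scaled copy of a source tile.

    target_energy stands for the (assumed) side information transmitted for
    the gap. Lines in the gap that were waveform coded (non-zero) are kept;
    this simple gain does not compensate for their energy.
    """
    source = spectrum[source_start:source_start + width]
    source_energy = sum(v * v for v in source)
    gain = math.sqrt(target_energy / source_energy) if source_energy > 0.0 else 0.0
    filled = list(spectrum)
    for i, value in enumerate(source):
        if filled[target_start + i] == 0.0:
            filled[target_start + i] = gain * value
    return filled
```

For example, a source tile with energy 25 and a transmitted gap energy of 1 yields a gain of 0.2, so the filled region roughly approximates the original spectrum in energy while costing only one parameter of side information.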
In further embodiments, the time domain encoding/decoding processor relies on a lower sampling rate and the corresponding bandwidth extension functionality.
In further embodiments, a cross-processor is provided in order to initialize the time domain encoder/decoder with initialization data derived from the currently processed frequency domain encoder/decoder signal. This ensures that, while the currently processed audio signal portion is processed by the frequency domain encoder, the parallel time domain encoder is initialized, so that when a switch from the frequency domain encoder to the time domain encoder takes place, the time domain encoder can immediately start processing, since all the initialization data relating to earlier signals are already available due to the cross-processor. This cross-processor is advantageously applied on the encoder-side and, additionally, on the decoder-side and advantageously uses a frequency-time transform which additionally performs a very efficient downsampling from the higher output or input sampling rate into the lower time domain core coder sampling rate by only selecting a certain low band portion of the domain signal together with a certain reduced transform size. Thus, a sample rate conversion from the high sampling rate to the low sampling rate is performed very efficiently, and the signal obtained by the transform with the reduced transform size can then be used for initializing the time domain encoder/decoder, so that the time domain encoder/decoder is ready to immediately perform time domain encoding when this situation is signaled by a controller and the immediately preceding audio signal portion was encoded in the frequency domain.
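The principle of downsampling by selecting only a low band portion of the spectrum together with a reduced transform size can be sketched as follows. For simplicity, this illustration uses a block-wise orthonormal DCT-II and its inverse instead of the overlapped transform of an actual codec; this substitution, the function names, and the block lengths are assumptions made for illustration only.

```python
import math


def dct2(x):
    """Orthonormal DCT-II of a block of samples."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(v * math.cos(math.pi / n * (i + 0.5) * k)
                               for i, v in enumerate(x)))
    return out


def idct2(coeffs):
    """Inverse of the orthonormal DCT-II; output length equals len(coeffs)."""
    m = len(coeffs)
    out = []
    for i in range(m):
        acc = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / m) if k == 0 else math.sqrt(2.0 / m)
            acc += scale * c * math.cos(math.pi / m * (i + 0.5) * k)
        out.append(acc)
    return out


def downsample_by_truncation(block, out_len):
    """Keep the lowest out_len coefficients and invert with the smaller size.

    The gain sqrt(out_len / len(block)) preserves the signal amplitude.
    """
    low_band = dct2(block)[:out_len]
    gain = math.sqrt(out_len / len(block))
    return [gain * v for v in idct2(low_band)]
```

With a ratio of 20 to 8 samples per block, the sketch corresponds to a conversion from 32 kHz to 12.8 kHz. No separate resampling filter is needed, since the band limitation and the rate change both follow from discarding the upper coefficients and shortening the inverse transform.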
As outlined, the cross-processor embodiment may rely on gap filling in the frequency domain or not. Hence, a time- and frequency domain encoder/decoder are combined via the cross-processor, and the frequency domain encoder/decoder may rely on gap filling or not. Specifically, certain embodiments as outlined are advantageous:
These embodiments employ gap filling in the frequency domain and have the following sampling rate figures and may or may not rely on the cross-processor technology:
    Input SR=8 kHz, ACELP (time domain) SR=12.8 kHz.
    Input SR=16 kHz, ACELP SR=12.8 kHz.
    Input SR=16 kHz, ACELP SR=16.0 kHz.
    Input SR=32.0 kHz, ACELP SR=16.0 kHz.
    Input SR=48 kHz, ACELP SR=16 kHz.
These embodiments may or may not employ gap filling in the frequency domain and have the following sampling rate figures and rely on the cross-processor technology:
TCX SR is lower than the ACELP SR (8 kHz vs. 12.8 kHz), or TCX and ACELP both run at 16.0 kHz, and where no gap filling is used.
Hence, embodiments of the present invention allow a seamless switching of a perceptual audio coder comprising spectral gap filling and a time domain encoder with or without bandwidth extension.
Hence, the present invention relies on methods that are not restricted to removing the high frequency content above a cut-off frequency in the frequency domain encoder from the audio signal but rather signal-adaptively remove spectral band-pass regions, leaving spectral gaps in the encoder, and subsequently reconstruct these spectral gaps in the decoder. Advantageously, an integrated solution such as intelligent gap filling is used that efficiently combines full-bandwidth audio coding and spectral gap filling, particularly in the MDCT transform domain.
Hence, the present invention provides an improved concept for combining speech coding and a subsequent time domain bandwidth extension with a full-band wave form decoding comprising spectral gap filling into a switchable perceptual encoder/decoder.
Hence, in contrast to already existing methods, the new concept utilizes full-band audio signal wave form coding in the transform domain coder and at the same time allows a seamless switching to a speech coder advantageously followed by a time domain bandwidth extension.
Further embodiments of the present invention avoid the explained problems that occur due to a fixed band limitation. The concept enables the switchable combination of a full-band wave form coder in the frequency domain equipped with a spectral gap filling and a lower sampling rate speech coder and a time domain bandwidth extension. Such a coder is capable of wave form coding the aforementioned problematic signals providing full audio bandwidth up to the Nyquist frequency of the audio input signal. Nevertheless, seamless instant switching between both coding strategies is guaranteed particularly by the embodiments having the cross-processor. For this seamless switching, the cross-processor represents a cross connection at both encoder and decoder between the full-band capable full-rate (input sampling rate) frequency domain encoder and the low-rate ACELP coder having a lower sampling rate to properly initialize the ACELP parameters and buffers particularly within the adaptive codebook, the LPC filter or the resampling stage, when switching from the frequency domain coder such as TCX to the time domain encoder such as ACELP.