In recent years, the distribution and storage of A/V content in digital form have increased substantially. Accordingly, a large number of coding standards and protocols have been developed.
Audio coding and compression techniques provide very efficient audio encoding, which allows high-quality audio files of relatively small data size to be conveniently distributed through data networks including, for example, the Internet.
An example of a coding standard is the Moving Picture Experts Group-4 (MPEG-4) coding standard, which provides decoder specifications for both video and audio coding. Further details of the MPEG-4 coding standard may be found in “Coding of Audio-Visual Objects”, MPEG-4: ISO/IEC 14496.
A technique which may be applied to an audio signal to alter its playback speed and duration without altering its perceived pitch is known as time scaling or tempo scaling. There are a number of interesting applications for time scaling, including for example audio/video synchronization, language learning, tools for people with impaired hearing, answering machines, spoken books, etc.
In general, time scaling is applied as a post-processing technique. Therefore, for conventional waveform coded material, an additional amount of complexity is introduced, as both regular decoding and complex time scaling processing must be performed. Furthermore, time scale processing typically introduces artefacts into the decoded signal and therefore degrades the quality of the time scaled signal. In order to achieve an acceptable quality it is necessary to use very complex time scaling algorithms resulting in increased computational requirements.
An advantage of parametric audio coding in comparison to waveform coding is that the parametric representation of an audio signal facilitates effects processing, such as time and/or pitch scaling, at relatively low complexity. An example of parametric audio coding may be found in “Advances in Parametric Coding for High-Quality Audio” by Erik Schuijers, Werner Oomen, Bert den Brinker and Jeroen Breebaart, Preprint 5852, 114th AES Convention, Amsterdam, The Netherlands, 22-25 Mar. 2003.
This parametric coding scheme is currently under standardization. It is described in MPEG-4 Extension 2, “Coding of Moving Pictures and Audio, Parametric coding for High Quality Audio”, ISO/IEC 14496-3:2001/FPDAM2, JTC1/SC29/WG11 and is to be formally standardized in ISO/IEC 14496-3:2001/AMD2. For convenience, the term MPEG-4 Extension 2 will be used in this specification. In accordance with MPEG-4 Extension 2 a stereo audio signal may be represented by the following parameter data:
Transient parameter data which represents the non-stationary part of the audio signal.
Sinusoid parameter data which represents the tonal part of the audio signal.
Noise parameter data which represents the non-tonal (or stochastic) part of the audio signal.
Stereo imaging data.
MPEG-4 Extension 2 provides for stereo signals to be encoded by a Parametric Stereo (PS) algorithm. In PS, stereo audio encoding is achieved by coding a stereo audio signal as a mono signal and a small number of stereo imaging parameters. The resulting mono signal can then be encoded by a (parametric) mono encoder. At the decoder, the decoded mono signal is expanded into stereo channels by applying the stereo imaging parameters to it. The stereo parameters consist of Inter-channel Intensity Differences (IID), Inter-channel Time or Phase Differences (ITD or IPD) and Inter-channel Coherence (ICC) (or Inter-channel Cross-Correlations).
FIG. 1 illustrates an example of an MPEG-4 Extension 2 parametric stereo decoder in accordance with prior art.
The decoder 100 comprises a receiver 101 which receives an incoming MPEG-4 Extension 2 bitstream and de-multiplexes it. The receiver 101 is coupled to a decoding unit 103 to which the transient, sinusoid and noise parameter data is fed. In response, the decoding unit 103 generates a mono signal.
The decoding unit 103 is coupled to a stereo processor 105 which is further coupled to the receiver 101. The stereo processor 105 receives the mono signal from the decoding unit 103 and the stereo imaging data from the receiver 101 and in response generates a stereo signal in accordance with the MPEG-4 Extension 2 parametric stereo decoding algorithm.
Parametric audio coding permits a relatively low complexity time scaling to be performed in the decoder. FIG. 2 illustrates an example of an MPEG-4 Ext. 2 time and/or pitch scaling parametric stereo decoder 200 in accordance with prior art. The decoder 200 is identical to the decoder 100 of FIG. 1 except that it further comprises a time/pitch scale unit 201. Corresponding blocks of the decoder 200 and decoder 100 have the same reference signs in FIGS. 1 and 2.
The time/pitch scale unit 201 is coupled between the receiver 101 and the decoding unit 103. The time/pitch scale unit 201 is operable to modify the parameter data before it is used to generate the decoded signal. Thus the parameters may be modified to achieve a desired tempo and pitch.
FIG. 3 illustrates a parametric stereo decoder 300 in accordance with prior art. The parametric stereo decoder 300 receives the time domain mono signal from the decoding unit 103 and in response generates a de-correlated signal in a decorrelator 301. The mono signal is further fed to a first domain transform processor 303 which generates a frequency domain representation of the mono signal. Similarly, the de-correlated signal is fed to a second domain transform processor 305 which generates a frequency domain representation of the de-correlated signal.
The first and second domain transform processors 303, 305 are coupled to a parametric stereo decoder unit 307 wherein the signals are processed to generate left and right frequency domain channels. Specifically, the stereo imaging parameters of MPEG-4 Ext. 2 are time varying frequency dependent parameters. Accordingly, the frequency domain samples are modified by:
scaling (representing the Inter-channel Intensity Difference parameters),
rotation (representing the Inter-channel Phase Difference parameters) and
mixing (representing the Inter-channel Coherence parameters).
As a result, the frequency domain representations for the left and right signals are generated.
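The three operations listed above can be illustrated for a single frequency domain sample. The following is a minimal sketch only and is not the normative MPEG-4 Extension 2 upmix: the function name `upmix_sample`, the gain normalization and the choice of mixing angle are assumptions made for illustration.

```python
import cmath
import math

def upmix_sample(m, d, iid_db, ipd, icc):
    """Illustrative (non-normative) stereo upmix of one frequency domain sample.

    m: mono sample, d: de-correlated sample (both complex)
    iid_db: intensity difference left/right in dB
    ipd: phase difference left/right in radians
    icc: target coherence between left and right, in [-1, 1]
    """
    # Scaling: split the power so that |left| / |right| follows the IID.
    c = 10.0 ** (iid_db / 20.0)
    g_left = math.sqrt(2.0) * c / math.sqrt(1.0 + c * c)
    g_right = math.sqrt(2.0) / math.sqrt(1.0 + c * c)

    # Mixing: add and subtract the de-correlated sample. For uncorrelated,
    # equal-power m and d this yields a coherence of cos(2 * alpha) = icc.
    alpha = 0.5 * math.acos(max(-1.0, min(1.0, icc)))
    left = math.cos(alpha) * m + math.sin(alpha) * d
    right = math.cos(alpha) * m - math.sin(alpha) * d

    # Rotation: distribute the phase difference symmetrically.
    rot = cmath.exp(0.5j * ipd)
    return g_left * left * rot, g_right * right * rot.conjugate()
```

With `icc` equal to 1 the de-correlated sample does not contribute, and with `iid_db` and `ipd` equal to 0 both outputs equal the mono sample.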
The parametric stereo decoder unit 307 is coupled to a first inverse transform processor 309 and a second inverse transform processor 311, which are fed the frequency domain left and right channels respectively and in response generate the time domain left and right channels.
Conventionally, the time domain to frequency domain transforms are performed by (analysis) windowing followed by a Fast Fourier Transform (FFT). The frequency domain to time domain transforms are performed by an inverse Fast Fourier Transform (iFFT) followed by (synthesis) windowing and a subsequent overlap-and-add operation which combines data from successive blocks.
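The conventional transform chain described above can be sketched as follows. This is a minimal illustration under stated assumptions: a naive DFT stands in for the FFT to keep the example free of dependencies, and the sine window, the 50% overlap and the names `dft`, `sine_window` and `analysis_synthesis` are choices made here, not details taken from the prior art decoder.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) DFT; stands in for the FFT/iFFT in this sketch."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def sine_window(n):
    """Sine window; its square sums to 1 across blocks at 50% overlap."""
    return [math.sin(math.pi * (i + 0.5) / n) for i in range(n)]

def analysis_synthesis(signal, n=8):
    """Window, transform, inverse transform, window again, overlap and add."""
    hop = n // 2
    win = sine_window(n)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - n + 1, hop):
        block = [signal[start + i] * win[i] for i in range(n)]  # analysis window
        spectrum = dft(block)                                   # time -> frequency
        # Frequency domain processing (e.g. stereo upmix) would go here.
        rec = dft(spectrum, inverse=True)                       # frequency -> time
        for i in range(n):
            out[start + i] += rec[i].real * win[i]              # synthesis + overlap-add
    return out
```

Without any frequency domain processing, the output reproduces the input for every sample covered by two overlapping blocks; only the first and last half block are attenuated by the window.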
It will be appreciated that when applying time scaling, it is essential that suitable synchronization is maintained between the time scaled mono signal (and the de-correlated signal) and the stereo image parameters, in order to ensure that the appropriate stereo image parameters are applied to the correct samples in the parametric stereo decoder unit 307.
Conventionally, the synchronization is achieved by adjusting the window sizes applied in both time-to-frequency and frequency-to-time transform. For example, if the time scaling of the mono signal is such that the tempo is increased, fewer time domain samples need to be generated between consecutive stereo parameter values. As a result, shorter analysis and synthesis windows are applied in (inverse) domain transform processors 303, 305, 309 and 311. However, in view of computational complexity, the (inverse) transform length is preferably kept constant. Hence, zero padding of the analysis and synthesis windows up to the pre-determined transform length is applied.
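The window adjustment described above can be illustrated with a short sketch. The transform length of 16, the sine window shape and the name `padded_window` are hypothetical values chosen for illustration; the tempo factor is assumed to be greater than 1 for faster playback.

```python
import math

TRANSFORM_LEN = 16  # hypothetical fixed transform length

def padded_window(nominal_len, tempo, transform_len=TRANSFORM_LEN):
    """Scale the window length by 1/tempo and zero-pad to the transform length.

    nominal_len: window length used when no time scaling is applied
    tempo: tempo factor (assumed > 1 for faster playback)
    """
    win_len = round(nominal_len / tempo)
    if not 0 < win_len <= transform_len:
        raise ValueError("scaled window does not fit the transform length")
    window = [math.sin(math.pi * (i + 0.5) / win_len) for i in range(win_len)]
    # The transform length stays constant; the unused tail is zero.
    return window + [0.0] * (transform_len - win_len)
```

For example, `padded_window(8, 2.0)` returns 4 non-zero window coefficients followed by 12 zeros, whereas `padded_window(8, 1.0)` returns 8 non-zero coefficients followed by 8 zeros.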
In the conventional approach, the stereo parameters are taken directly from the bitstream and used for the processing by the parametric stereo decoder unit 307. Accordingly, the stereo parameters and block processing of the parametric stereo decoder unit 307 may be considered to be synchronized with the original non-time scaled signal. In order to compensate for this, the block times of the FFT and iFFTs are modified accordingly by use of windowing techniques. This approach allows a very flexible and accurate time scaling with high granularity.
The complexity associated with windowing and FFTs is very high, especially in terms of memory requirements. In order to reduce the complexity of the parametric stereo decoding tools, it is desirable to replace the time-to-frequency and frequency-to-time transforms in the parametric stereo decoder by down-sampled complex-exponential modulated filter banks. The complex-valued sub-band domain samples are generated by convolution (filtering) of the input signal with a complex-exponential modulated prototype filter. By application of decomposition techniques, the number of multiplications and additions required for performing this filtering is minimized. Further description of down-sampled complex-exponential modulated filter banks may be found in “Bandwidth extension of audio signals by spectral band replication” by P. Ekstrand, Proc. 1st IEEE Benelux Workshop on Model based Processing and Coding of Audio (MPCA-2002), Leuven, Belgium, Nov. 15, 2002.
In contrast to the flexibility of the analysis/synthesis windowing in the FFT-based approach, use of the complex modulated filter banks results in a fixed block based conversion and processing. In the case of a typical 64-band complex-modulated filter bank, 64 complex-valued sub-band domain samples are effectively generated for each block of 64 input samples, as illustrated in FIG. 4. (It should be noted that the lower three bands are divided further in frequency to provide the increased frequency resolution required for the stereo reconstruction.) The time interval associated with each of these blocks is fixed. However, as the time intervals for the time scaled signals are constant, the length of the corresponding time intervals of the non-time scaled signal varies depending on the time scaling applied. For example, for an increased tempo, 64 samples of the time scaled mono signal will correspond to more than 64 samples of the originally encoded non-time scaled time signal. As the stereo imaging parameter values of the bitstream are inherently synchronized with the originally encoded non-time scaled time signal, and as the time to frequency domain transforms cannot compensate for the time scaling, the stereo imaging parameters will generally not be synchronized with the frequency domain samples in the stereo decoding unit.
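The loss of synchronization described above can be made concrete with a little arithmetic. In the following sketch the block size of 64 is taken from the text, while the parameter update interval of 256 original samples and the function names are hypothetical values used only for illustration; the tempo factor is assumed to be greater than 1 for faster playback.

```python
BANDS = 64  # samples per filter bank block, as in the text

def original_span(tempo, bands=BANDS):
    """Original (non-time scaled) samples covered by one time scaled block."""
    return bands * tempo

def parameter_positions(update_interval, tempo, count, bands=BANDS):
    """Positions of the first `count` stereo parameter updates, measured in
    filter bank blocks of the time scaled signal.

    update_interval: spacing of parameter updates in original samples
    (a hypothetical value; it is not specified in the text).
    """
    return [k * update_interval / (tempo * bands) for k in range(count)]

aligned = parameter_positions(256, 1.0, 4)  # [0.0, 4.0, 8.0, 12.0]
drifted = parameter_positions(256, 1.5, 4)  # second and third values are fractional
```

Without time scaling every update falls on a block boundary. At a tempo factor of 1.5 each block spans 96 original samples and the updates fall at fractional block positions, which the fixed block structure of the filter bank cannot accommodate.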
Hence, an improved system for time scaling would be advantageous, and in particular a system allowing for increased flexibility, lower complexity, improved performance and/or improved signal quality would be advantageous. More specifically, an improved system for time scaling of an MPEG-4 stereo signal having reduced complexity and/or improved synchronization would be an advantage.