The present invention relates to the coding of audio signals and in particular to the coding of audio signals which exhibit transients (or “attacks”).
In hearing-adjusted coding for the data reduction of audio signals the coding of the audio signals usually takes place in the frequency domain. This means that output values of a time-frequency transform are quantized and are then written into a bit stream, which can be stored or transmitted. A psychoacoustic model, which is implemented in the coder, calculates an instantaneous masked hearing or masking threshold and controls the quantization of the output values of the time-frequency transform in such a way that the coding error, i.e. the quantization error, is spectrally shaped and lies below this threshold so that the error is inaudible. As a result of this measure, however, the coding error is constant in time over the number of sampled values corresponding to the length of the transform window. The masked hearing or masking threshold is described in M. Zollner, E. Zwicker, Elektroakustik, Springer-Verlag, Berlin, Heidelberg, New York, 3rd edn, 1993.
To enable the calculation of the masked hearing threshold in the frequency domain to be performed as exactly as possible, a high frequency resolution of the time-frequency transform is necessary. In practical applications, transform lengths typically lie in the range from 20 to 40 ms. If transient audio signals, i.e. audio signals with transients, are processed, the quantization noise may distribute itself “before” the maximum of the signal envelope curve, depending on the temporal position of the transient in the transform window. The nature of human perception is such that these so-called “pre-echos” can become audible if they occur more than 2 ms before the actual transient of the audio signal to be coded. This is the reason why, in many transform coders, the transform length of the time-frequency transform can be switched over to shorter windows, i.e. shorter block lengths, having a time length of typically 5 to 8 ms and consequently a higher time resolution. This enables a finer temporal shaping of the quantization noise and thus a suppression of these pre-echos, whereby these are no longer, or only very slightly, audible when the coded signal is decoded again in a decoder.
Devices for detecting a transient in an audio signal are thus used to match the transform length of the time-frequency transform to the properties, and in particular to the transient properties, of the audio signal as required by the human ear.
FIG. 3 shows a known transform coder 100, which is in general implemented according to the Standard MPEG 1-2 Layer 3 (ISO/IEC IS 11172-3, Coding of Moving Pictures and Associated Audio, Part 3: Audio). A time signal arrives via an input 102 at a block Time/frequency transform 104. The time signal at input 102, which is typically a discrete-time audio signal obtained from a continuous-time time signal by means of a sampling device (not shown), is transformed by the block Time/frequency transform 104 into consecutive blocks of spectral values, which are passed to a block Quantization/coding 106, the output signal of the block Quantization/coding consisting of quantized and redundancy-coded digital signals which, in a block Bit stream formatting 108, are, together with necessary side information, formed into a bit stream, which appears at the output of the bit stream formatter 108 and which can be stored or transmitted.
The discrete-time audio signals at the input 102 are windowed in the block Time/frequency transform 104 so as to generate consecutive blocks with discrete-time windowed audio signals. The blocks of windowed discrete-time audio signals are subsequently, as already mentioned, transformed into the frequency domain. As is known from the field of telecommunications, the frequency resolution of the time-frequency transform is determined by the length of a block. To achieve sufficient time resolution for discrete-time audio signals with transient parts, the window length and thus the time length of a block of discrete-time sampled values must be shortened when coding these signals in order to avoid the pre-echos.
The known coder shown in FIG. 3 performs the following method for detecting transients in an audio signal. From the block Time/frequency transform 104 the spectral components are fed into a block Psychoacoustic model 110, the block 110 establishing on the one hand, as already mentioned at the outset, the masking or masked hearing threshold for the block Quantization/coding 106 and, on the other, from the signal energy characteristic of the discrete-time audio signal in the frequency domain and the calculated energy characteristic of the masked hearing threshold, an estimated value for the bit demand for coding the spectrum. The estimated bit demand, which experts also refer to as “perceptual entropy” (“pe” for short), is calculated from the following relationship:

$$pe = \sum_{k=1}^{N} \frac{1}{2} \log_2\left(\frac{e(k)}{n(k)} + 1\right) \qquad (1)$$
In equation (1) N is the number of spectral lines of a block, e(k) is the signal energy of the spectral component or spectral line k and n(k) is the permitted interference energy of the line k. A rise in this perceptual entropy from one transform window to the next which exceeds a certain threshold value, designated as “switch_pe”, serves here to indicate a transient. If the threshold value switch_pe is exceeded, a switchover from a long window to a short window is effected in the block 104 so as to generate temporally shorter blocks of discrete-time audio signals in order to increase the time resolution of the transform coder 100. The calculation rule depicted in equation (1) and the specification of the threshold value switch_pe are stipulated in a block Bit demand estimation 112. The result of the bit demand estimation 112 is communicated to the time/frequency transform 104 and to the psychoacoustic model 110, as is indicated in FIG. 3.
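The bit demand estimate of equation (1) and the threshold comparison can be illustrated with a minimal Python sketch. The function names and the reading of the "rise" as a simple difference between consecutive windows are assumptions made for illustration; they are not taken from the MPEG standard text.

```python
import math


def perceptual_entropy(signal_energy, allowed_noise):
    """Equation (1): pe = sum over k of 0.5 * log2(e(k) / n(k) + 1).

    signal_energy[k] is e(k), allowed_noise[k] is n(k) (must be > 0).
    """
    if len(signal_energy) != len(allowed_noise):
        raise ValueError("e(k) and n(k) must cover the same spectral lines")
    return sum(
        0.5 * math.log2(e / n + 1.0)
        for e, n in zip(signal_energy, allowed_noise)
    )


def use_short_window(pe_previous, pe_current, switch_pe):
    """Assumed decision rule: flag a transient when the rise in pe from one
    transform window to the next exceeds the threshold switch_pe."""
    return (pe_current - pe_previous) > switch_pe
```

Note that both e(k) and n(k) are only available once the psychoacoustic model has run, which is the dependency the following paragraph criticizes.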
A disadvantage of this known method is that the information on a possible transient or “attack” is not available until after the psychoacoustic model has been calculated. This has a particularly adverse effect on the temporal sequence structure of the coder, since the window information has to be fed back to the psychoacoustic model. Furthermore, changes in the parameters for calculating the masked hearing threshold always affect the value of the perceptual entropy. Changes in these parameters thus always entail changes in the window sequence, i.e. the sequence of long and short windows, of the transform.
FIG. 4 shows another known transform coder 150, which is essentially similar in design to the transform coder 100. In particular the same also has the input 102 for discrete-time audio signals, which are windowed and transformed into the frequency domain in the block 104. Taking account of the psychoacoustic model 110, the spectral output values of the block 104 are quantized and then coded in the block 106 and are written, together with side information, into an output bit stream by the bit stream formatter 108.
The transform coder 150 shown in FIG. 4 differs from the transform coder 100 shown in FIG. 3 in the detection of transients in the audio signal. The detection of transients in the audio signal at input 102 which is shown in FIG. 4 is described in the standard MPEG 2 AAC (see ISO/IEC IS 13818-7, Annex B, 2.1, MPEG-2 Advanced Audio Coding (AAC)). The block FFT transform and detection from the spectrum 152 performs detection of transients by means of a spectral energy rise. In particular, the discrete-time audio signal at input 102 is first transformed into the frequency domain by means of an FFT transform, the length of the FFT transform corresponding here to the transform length of the short windows. Then the FFT energies in the so-called “critical bands” are calculated. The “critical bands” constitute a frequency grouping which corresponds to the resolution of the psychoacoustic model. A threshold value comparison of the individual band energies over one or more consecutive windows now provides an indication of a transient.
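The spectral energy rise detection described above might be sketched as follows. This is only an illustration: the band edges, the ratio threshold and the function names are hypothetical, and a naive DFT stands in for the FFT so that the example needs nothing beyond the Python standard library.

```python
import cmath


def band_energies(samples, band_edges):
    """Sum the DFT power of one short window into frequency bands.

    band_edges lists bin indices; band i covers bins
    band_edges[i] .. band_edges[i + 1] - 1 (hypothetical grouping).
    """
    n = len(samples)
    power = []
    for k in range(n // 2):
        acc = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        power.append(abs(acc) ** 2)
    return [sum(power[lo:hi]) for lo, hi in zip(band_edges[:-1], band_edges[1:])]


def band_energy_rise(previous, current, ratio_threshold, floor=1e-12):
    """Assumed rule: indicate a transient if any band energy grows by more
    than ratio_threshold relative to the preceding window."""
    return any(
        cur > ratio_threshold * max(prev, floor)
        for prev, cur in zip(previous, current)
    )
```

Even in this reduced form, a full transform is computed for every short window solely for detection purposes.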
In contrast to the known method shown in FIG. 3, the known method shown in FIG. 4 avoids the disadvantage of feeding back the window information to the psychoacoustic model 110. The method shown in FIG. 4 could, in principle, be used independently of the psychoacoustic model prior to its calculation. The method shown in FIG. 4 normally employs an FFT transform which is adapted to the transform length of the coder for calculating the energies in the individual frequency groups. Furthermore, if a real-time implementation of the coder is required, the Fourier transform performed specially for transient detection is too costly, i.e. it requires too high a computational effort in a digital signal processor (DSP), an effort which would be better exploited elsewhere in the coder, e.g. for quantization, for windowing or in the psychoacoustic model.
It is the object of the present invention to provide a method and a device for detecting a transient in a discrete-time audio signal and a method and a device for coding audio signals which enable reliable detection of transients, and thus simple suppression of pre-echos, in an efficient and simple way.
In accordance with a first aspect of the present invention, this object is achieved by a method for detecting a transient in a discrete-time audio signal, comprising the following steps:
(a) segmenting the discrete-time audio signal so as to generate consecutive segments of the same length with unfiltered discrete-time audio signals;
(b) filtering the discrete-time audio signal in a current segment, so as to obtain a filtered discrete-time audio signal wherein lower frequency spectral components are attenuated;
(c) comparing the energy of the filtered discrete-time audio signal in the current segment with the energy of the filtered discrete-time audio signal in a preceding segment; and/or
(d) determining a current relationship between the energy of the filtered discrete-time audio signal in the current segment and the energy of the unfiltered discrete-time audio signal in the current segment and comparing the current relationship with a corresponding preceding relationship; and
(e) detecting a transient on the basis of the comparison performed in step (c) and/or (d).
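Steps (a) to (e) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the first-order difference used as the filter, the multiplicative thresholds `rise_factor` and `spectral_factor`, the `floor` guard against division by zero and the logical OR of both comparisons are all choices made here for the example.

```python
def segment(signal, length):
    """Step (a): consecutive, equally long, unfiltered segments."""
    return [signal[i:i + length] for i in range(0, len(signal) - length + 1, length)]


def high_pass(samples, previous_sample=0.0):
    """Step (b): first-order difference y[n] = x[n] - x[n-1] (assumed filter),
    which attenuates lower frequency components."""
    out, prev = [], previous_sample
    for x in samples:
        out.append(x - prev)
        prev = x
    return out


def energy(samples):
    return sum(x * x for x in samples)


def detect_transients(signal, segment_length, rise_factor, spectral_factor, floor=1e-12):
    """Return one flag per segment; the first segment has no predecessor."""
    flags = []
    prev_filtered_energy = prev_ratio = None
    last_sample = 0.0
    for seg in segment(signal, segment_length):
        filtered = high_pass(seg, last_sample)
        last_sample = seg[-1]
        e_filtered, e_unfiltered = energy(filtered), energy(seg)
        ratio = e_filtered / max(e_unfiltered, floor)  # relationship of step (d)
        if prev_filtered_energy is None:
            flags.append(False)
        else:
            # Step (c): rise of the filtered energy over the preceding segment.
            rise = e_filtered > rise_factor * max(prev_filtered_energy, floor)
            # Step (d): rise of the filtered/unfiltered energy relationship.
            spectral = ratio > spectral_factor * max(prev_ratio, floor)
            # Step (e): here either comparison suffices (assumption).
            flags.append(rise or spectral)
        prev_filtered_energy, prev_ratio = e_filtered, ratio
    return flags
```

For example, two constant segments followed by a rapidly alternating one, `detect_transients([1.0] * 8 + [1.0, -1.0, 1.0, -1.0], 4, 4.0, 4.0)`, flags only the third segment. Only additions and multiplications in the time domain are needed; no transform is computed.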
In accordance with a second aspect of the present invention, this object is achieved by a device for detecting a transient in a discrete-time audio signal, comprising:
(a) a segment generator for segmenting the discrete-time audio signal so as to generate consecutive segments of the same length with unfiltered discrete-time audio signals;
(b) a filter for filtering the discrete-time audio signal in a current segment, so as to obtain a filtered discrete-time audio signal wherein lower frequency spectral components are attenuated;
(c) a rise detector for comparing the energy of the filtered discrete-time audio signal in the current segment with the energy of the filtered discrete-time audio signal in a preceding segment; and/or
(d) a spectral detector for determining a current relationship between the energy of the filtered discrete-time audio signal in the current segment and the energy of the unfiltered discrete-time audio signal in the current segment and comparing the current relationship with a preceding corresponding relationship; and
(e) a transient detector for detecting a transient on the basis of the comparison performed by the rise detector and/or by the spectral detector.
In accordance with a third aspect of the present invention, this object is achieved by a device for coding a discrete-time audio signal, comprising:
(a) a transient detector for detecting a transient in the discrete-time audio signal comprising:
a segment generator for segmenting the discrete-time audio signal so as to generate consecutive segments of the same length with unfiltered discrete-time audio signals;
a filter for filtering the discrete-time audio signal in a current segment, so as to obtain a filtered discrete-time audio signal wherein lower frequency spectral components are attenuated;
a rise detector for comparing the energy of the filtered discrete-time audio signal in the current segment with the energy of the filtered discrete-time audio signal in a preceding segment; and/or
a spectral detector for determining a current relationship between the energy of the filtered discrete-time audio signal in the current segment and the energy of the unfiltered discrete-time audio signal in the current segment and comparing the current relationship with a preceding corresponding relationship; and
a transient detector for detecting a transient on the basis of the comparison performed by the rise detector and/or by the spectral detector;
(b) a block generator for windowing the discrete-time audio signal so as to generate blocks of discrete-time audio signals which responds to the transient detector so as to use a short window for windowing when the transient detector detects a transient;
(c) a time/frequency transformer for time/frequency transforming the blocks of the discrete-time audio signal so as to generate blocks of spectral components; and
(d) a quantizer and coder for quantizing and coding the blocks of spectral components.
In accordance with a fourth aspect of the present invention, this object is achieved by a method for coding a discrete-time audio signal, comprising the following steps:
(a) detecting a transient by
segmenting the discrete-time audio signal so as to generate consecutive segments of the same length with unfiltered discrete-time audio signals;
filtering the discrete-time audio signal in a current segment so as to obtain a filtered discrete-time audio signal wherein lower frequency spectral components are attenuated;
comparing the energy of the filtered discrete-time audio signal in the current segment with the energy of the filtered discrete-time audio signal in a preceding segment; and/or
determining a current relationship between the energy of the filtered discrete-time audio signal in the current segment and the energy of the unfiltered discrete-time audio signal in the current segment and comparing the current relationship with a corresponding preceding relationship; and
detecting a transient on the basis of the comparison performed in the step of determining and/or the comparison performed in the step of comparing;
(b) windowing the discrete-time audio signal with a short window when a transient has been detected and with a long window when no transient has been detected so as to generate blocks of discrete-time audio signals;
(c) transforming the blocks of the discrete-time audio signal from the time domain into the frequency domain so as to generate blocks with spectral components; and
(d) quantizing and coding the blocks of spectral components so as to obtain a coded audio signal.
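Step (b) of the coding method, the choice between a long and a short window, might look like the following Python sketch. The sine window shape, the function names and the splitting of a frame into non-overlapping short blocks are assumptions for illustration; practical coders additionally use overlapping blocks and transition windows, which are omitted here.

```python
import math


def sine_window(length):
    """Sine window, a common choice in transform coders (assumed shape)."""
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]


def window_frame(frame, transient, short_length):
    """Return the windowed blocks for one frame.

    Without a transient the whole frame forms one long block; with a
    transient the frame is split into short blocks of short_length samples.
    """
    if len(frame) % short_length != 0:
        raise ValueError("frame length must be a multiple of short_length")
    if not transient:
        w = sine_window(len(frame))
        return [[x * c for x, c in zip(frame, w)]]
    w = sine_window(short_length)
    return [
        [x * c for x, c in zip(frame[i:i + short_length], w)]
        for i in range(0, len(frame), short_length)
    ]
```

Each returned block would then be passed to the time/frequency transform of step (c).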
The present invention is based on the finding that a transient in an audio signal is accompanied by a temporal rise in the signal energy of the audio signal. Furthermore, a transient leads to a rise in the energy of higher frequency signal components in the audio signal, since a transient is typically characterized by rapid temporal changes of the audio signal.
In a preferred embodiment the filtering is performed by means of a high-pass filter; other forms of filtering are possible, however, e.g. by means of a bandpass filter, a differentiator of the first or higher order or similar, provided the filtered discrete-time audio signal differs from the unfiltered discrete-time audio signal in respect of its spectral properties.
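As a worked illustration of why a first-order differentiator qualifies as such a filter, its magnitude response can be written out. The choice of the simple difference y[n] = x[n] - x[n-1] is an assumption made for this example; the symbol ω denotes the normalized angular frequency.

```latex
H(z) = 1 - z^{-1}
\quad\Longrightarrow\quad
\left| H\!\left(e^{j\omega}\right) \right|
  = \sqrt{(1 - \cos\omega)^2 + \sin^2\omega}
  = \sqrt{2 - 2\cos\omega}
  = 2\left|\sin\frac{\omega}{2}\right|,
\qquad 0 \le \omega \le \pi
```

The response vanishes at ω = 0 and reaches its maximum of 2 at ω = π, so lower frequency spectral components are attenuated relative to higher ones.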
The comparison carried out in step (c) of the method in accordance with the first aspect of the present invention serves to detect a temporal rise in the signal energy, i.e. for rise detection, whereas the comparison carried out in step (d) of the method in accordance with the first aspect of the present invention serves to detect the rise of signal components of a particular frequency range, i.e. for spectral detection.
The comparison performed in step (d) of the method in accordance with the first aspect of the present invention serves to take frequency-dependent effects of the temporal masking into account.
It should be pointed out here that the time resolution of the human ear is frequency dependent. Roughly speaking, the time resolution is relatively small at very low frequencies and grows as the frequency increases. In the case of a pre-echo this means that noise introduced by the quantization and causing a pre-echo at a certain time interval prior to a transient will scarcely be detected at low frequencies since the ear has here a time resolution which is coarser than the particular time interval of the pre-echo. The situation is different in the case where a transient occurs in the higher frequency range. Here the time resolution of the human ear is finer, so that a pre-echo at the particular time interval may be audible since the time resolution of the ear may already be finer than the time interval between the pre-echo and the transient. It should be noted therefore that the spectral detection, in contrast to the rise detection, duplicates the frequency-dependent time resolution of the ear, with the result that a more precise transient detection is possible than with the rise detection alone. In some cases the rise detection on its own can, of course, also produce results which are already satisfactory.
It should be noted here that a transient can be detected either on the basis of the comparison made in step (c) of the method in accordance with the first aspect of the present invention or on the basis of the comparison made in step (d) of the method in accordance with the first aspect of the present invention or on the basis of both comparisons.