For compression of audio signals as well as for automatic audio quality assessment, perceptual models are typically employed to estimate the audibility of signal distortions. (See, e.g., U.S. Pat. No. RE36714, “Perceptual Coding of Audio Signals”, issued to K. Brandenburg et al. U.S. Pat. No. RE36714, which is commonly assigned to the assignee of the present invention, is hereby incorporated by reference as if fully set forth herein.) Typical realizations of such a perceptual model are also described, for example, in various standards for audio coding (see, e.g., ISO/IEC JTC1/SC29/WG11, “Coding of Moving Pictures and Audio—MPEG-2 Advanced Audio Coding AAC”, ISO/IEC 13818-7 International Standard, 1997) and in certain standards for audio quality assessment (see, e.g., ITU-R, “Method for Objective Measurement of Perceived Audio Quality,” Rec. ITU-R BS.1387, Geneva, 1998), each of which is fully familiar to those of ordinary skill in the art.
A crucial part of these perceptual models is the spectral decomposition of the acoustic signal into band-pass signals. In perceptual audio coding applications, for example, the audio signal is treated as a masker for distortions introduced by lossy data compression. For this purpose, the masked thresholds are approximated by a perceptual model. As a first processing step, a spectral decomposition of the acoustic signal is performed so that a set of masked thresholds corresponding to the various frequency ranges may be derived.
In particular, a spectral decomposition used for this purpose should advantageously mimic the corresponding properties of the human auditory system, specifically the frequency selectivity and temporal resolution that result from the spectral decomposition performed as part of the signal processing inside the human cochlea. The cochlea provides band-pass filtered versions of the input signal that are subsequently transduced into neural signals by the inner hair cells. The associated band-pass filters have increasing bandwidth with increasing center frequency and an asymmetric frequency response. However, currently used spectral decomposition schemes for masking modeling in audio coding or audio quality assessment, for example, generally do not achieve the non-uniform time and frequency resolution provided by the cochlea. Instead, these applications take advantage of the computational efficiency of uniform filter banks or transforms at the expense of coding gain.
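By way of illustration only, the growth of auditory filter bandwidth with center frequency may be contrasted with the constant bandwidth of a uniform transform as follows. This is a minimal sketch which assumes the Zwicker and Terhardt analytical approximation of the critical bandwidth (not taken from the present text) and an illustrative 1024-point transform at a 48 kHz sampling rate; the function name critical_bandwidth_hz is hypothetical.

```python
import numpy as np

def critical_bandwidth_hz(center_hz):
    """Approximate critical bandwidth in Hz at a given center frequency.

    Assumption: Zwicker-Terhardt analytical approximation, used here only
    to illustrate bandwidth that grows with center frequency.
    """
    f_khz = np.asarray(center_hz, dtype=float) / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

# Illustrative center frequencies in Hz.
centers = np.array([100.0, 500.0, 1000.0, 4000.0, 10000.0])

# Bandwidth of the (approximated) auditory filters: increases with frequency.
auditory_bw = critical_bandwidth_hz(centers)

# Bin spacing of a uniform transform: identical at every center frequency.
# The 48 kHz rate and 1024-point size are assumptions for this example.
uniform_bw = np.full_like(centers, 48000.0 / 1024)

for fc, a_bw, u_bw in zip(centers, auditory_bw, uniform_bw):
    print(f"{fc:8.0f} Hz  auditory ~{a_bw:7.1f} Hz  uniform {u_bw:6.2f} Hz")
```

Under these assumptions, the uniform decomposition is finer than needed at high frequencies and comparatively coarse at low frequencies, which is the mismatch described above.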
As is well known to those of ordinary skill in the art, a time-to-frequency transform is one very efficient way to compute a spectral decomposition. For example, the perceptual models in both the above-referenced MPEG-2 audio coding standard and the basic version of the above-referenced quality assessment standard each use the Fast Fourier Transform (FFT), which is fully familiar to those of ordinary skill in the art. The FFT provides constant spectral and temporal resolution over frequency. However, the auditory filters of the cochlea have increasing bandwidth and temporal resolution with increasing center frequency. This non-uniform spectral resolution of the auditory system is usually taken into account by summing up the energies of an appropriate number of neighboring FFT frequency bands. However, the phase relation between spectral components within an auditory filter band is not taken into account by such a summation of energies. Moreover, the temporal resolution of the spectral decomposition is determined by the transform size and is thus constant across all auditory bands. This results in a significantly lower temporal resolution at high center frequencies in comparison with the corresponding auditory filters. These deviations lead to inaccurate modeling of masking and sub-optimal coding gain.
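By way of illustration only, the summation of neighboring FFT bin energies into auditory bands may be sketched as follows. The sketch assumes the Zwicker and Terhardt approximation of the Bark scale, one-Bark-wide bands, a 1024-sample frame and a 48 kHz sampling rate, none of which is taken from the above-referenced standards; the function names hz_to_bark and band_energies are hypothetical.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Map frequency in Hz to Bark (assumed Zwicker-Terhardt approximation)."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def band_energies(frame, sample_rate):
    """Sum FFT bin energies into one-Bark-wide bands (illustrative grouping)."""
    spectrum = np.fft.rfft(frame)
    power = np.abs(spectrum) ** 2  # the phase of each bin is discarded here
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    band_index = np.floor(hz_to_bark(freqs)).astype(int)
    return np.bincount(band_index, weights=power)

sample_rate = 48000  # assumed sampling rate in Hz
n = 1024             # assumed transform size
t = np.arange(n) / sample_rate

# Two tones placed on exact bin centers (bins 100 and 102) to avoid leakage.
f1 = 100 * sample_rate / n
f2 = 102 * sample_rate / n
in_phase = np.cos(2 * np.pi * f1 * t) + np.cos(2 * np.pi * f2 * t)
shifted = np.cos(2 * np.pi * f1 * t) + np.cos(2 * np.pi * f2 * t + np.pi / 2)

e_a = band_energies(in_phase, sample_rate)
e_b = band_energies(shifted, sample_rate)

# The waveforms differ, yet the band energies coincide.
print(np.allclose(in_phase, shifted), np.allclose(e_a, e_b))

# Every band shares the same analysis window, whatever its center frequency.
frame_ms = 1000.0 * n / sample_rate
```

In this sketch the two frames differ only in the relative phase of their components, and the band energies cannot distinguish them. The frame duration frame_ms is likewise the same for every band, since it is fixed by the transform size.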
The “Advanced Model” of the above-referenced quality assessment standard, on the other hand, replaces the FFT with a bank of band-pass filters which have a larger bandwidth at higher center frequencies. More specifically, each of a set of 40 critical band filter pairs is realized as a Finite Impulse Response (FIR) filter, wherein the output of each filter pair is a critical band signal and its (90 degree phase shifted) Hilbert transform, each of which is advantageously downsampled by a factor of 32. (FIR filters and Hilbert transforms are both fully familiar to those of ordinary skill in the art.) The appropriate auditory filter slopes are created by spectral convolution with a spreading function. This complex convolution advantageously increases the temporal resolution of the original filters, but the filter bank is computationally complex and its linear phase response is inconsistent with that of the auditory system. Furthermore, the downsampling can create aliasing distortions in the high frequency bands.
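By way of illustration only, a single FIR filter pair of the general kind described above may be sketched as follows. The sketch does not reproduce the filter coefficients of the above-referenced standard; it assumes a Hann-windowed cosine and sine pair, whose sine branch only approximates the Hilbert transform of the cosine branch for a narrow band well above 0 Hz. The function names, the 257-tap length, the 4 kHz center frequency and the 48 kHz sampling rate are all hypothetical.

```python
import numpy as np

def quadrature_fir_pair(center_hz, length, sample_rate):
    """Return (in-phase, quadrature) FIR impulse responses for one band."""
    n = np.arange(length) - (length - 1) / 2.0  # centered tap index
    window = np.hanning(length)
    window = window / window.sum()              # unit gain at the band center
    phase = 2.0 * np.pi * center_hz * n / sample_rate
    # Factor 2 compensates for a real tone splitting into +/- frequencies.
    return 2.0 * window * np.cos(phase), 2.0 * window * np.sin(phase)

def band_envelope(signal, center_hz, length, sample_rate, decimation=32):
    """Filter with both branches, keep every 32nd sample, return the envelope."""
    h_re, h_im = quadrature_fir_pair(center_hz, length, sample_rate)
    re = np.convolve(signal, h_re, mode="valid")[::decimation]
    im = np.convolve(signal, h_im, mode="valid")[::decimation]
    return np.hypot(re, im)

sample_rate = 48000  # assumed sampling rate in Hz
t = np.arange(4800) / sample_rate

# A unit-amplitude tone at the band center and one far outside the band.
tone_in_band = np.cos(2 * np.pi * 4000.0 * t)
tone_out_of_band = np.cos(2 * np.pi * 8000.0 * t)

env_in = band_envelope(tone_in_band, 4000.0, 257, sample_rate)
env_out = band_envelope(tone_out_of_band, 4000.0, 257, sample_rate)
```

In this sketch the decimated outputs are sampled at sample_rate / 32, i.e., 1500 Hz under the assumed rate, so a band wider than that output rate can support would be subject to aliasing of the kind noted above.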
For the above reasons, it would be highly desirable to provide a spectral decomposition scheme which provides improved masking modeling for, by way of example, perceptual audio coding applications, and which does so at relatively low computational cost. In particular, it would be desirable to provide a method and apparatus for performing a spectral decomposition which is suitable for achieving the time and frequency resolution necessary to simulate psychophysical data closely related to cochlear spectral decomposition properties, and which overcomes the drawbacks of prior art approaches.