Hearing-adapted digital coding methods have been standardized for some years (Kh. Brandenburg and G. Stoll, The ISO/MPEG-audio codec: A generic standard for coding of high quality digital audio, 92nd AES-Convention, Vienna, 1992, Preprint 3336) and are being employed to an increasing extent. Examples are the digital compact cassette (DCC), the MiniDisc, digital terrestrial broadcasting (DAB; DAB=Digital Audio Broadcasting) and the digital video disk (DVD). The disturbances known from analog transmission are, as a rule, no longer present when audio signals are transmitted digitally without coding. If no coding of the audio signals is carried out, measurement technology can be confined to the transitions from analog to digital and vice versa.
In case of coding by means of hearing-adapted coding methods, however, audible artificial products or artifacts may occur that have not occurred in analog audio signal processing.
Known measurement values, such as the harmonic distortion factor or the signal-to-noise ratio, cannot be employed for hearing-adapted coding methods. Many hearing-adapted coded music signals have a signal-to-noise ratio below 15 dB without any audible difference from the uncoded original signal being perceivable. Conversely, a signal-to-noise ratio of more than 40 dB may already lead to clearly audible disturbances.
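For orientation, the conventional signal-to-noise ratio referred to above can be sketched as follows. This is a minimal illustration only; the function name `snr_db` and the choice of treating the sample-wise difference between processed and reference signal as the noise are assumptions made here, not part of the cited methods.

```python
import math


def snr_db(reference, processed):
    """Signal-to-noise ratio in dB.

    ASSUMPTION: the noise is taken to be the sample-wise difference
    between the processed signal and the reference signal.
    """
    signal_power = sum(r * r for r in reference)
    noise_power = sum((p - r) ** 2 for p, r in zip(processed, reference))
    if noise_power == 0.0:
        return float("inf")  # identical signals: no noise at all
    return 10.0 * math.log10(signal_power / noise_power)
```

Such a value weights every error component equally, whether or not it is masked, which is why it says little about the audibility of a coding error.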
In recent years, various hearing-adapted measuring methods were introduced, of which the NMR method (NMR=Noise to Mask Ratio) is to be mentioned (Kh. Brandenburg and Th. Sporer. "NMR" and "Masking Flag": Evaluation of quality using perceptual criteria. In Proceedings of the 11th International Conference of the AES, Portland, 1992).
In an implementation of the NMR method, a discrete Fourier transform of length 1024, using a Hann window with an advance of 512 sampling values, is calculated both for an original signal and for a differential signal formed between the original signal and a processed signal. The spectral coefficients obtained are combined into frequency bands whose width corresponds approximately to the frequency groups suggested by Zwicker in E. Zwicker, Psychoacoustics, publisher Springer-Verlag, Berlin Heidelberg N.Y., 1982, whereupon the energy density of each frequency band is determined. From the energy densities of the original signal, a current masking threshold is determined for each frequency band, taking into account the masking within the respective frequency group, the masking between the frequency groups and the post-masking; this masking threshold is then compared with the energy density of the differential signal. The resting threshold of the human ear is not fully considered, since the input signals of the measuring method cannot be assigned fixed listening loudnesses: a listener usually has control over the playback loudness of the piece of music or audio piece he wants to listen to.
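The processing chain just described can be sketched in strongly simplified form as follows. Only the DFT length of 1024, the Hann window and the advance of 512 sampling values are taken from the text. The band edges (commonly tabulated critical-band edges after Zwicker), the flat masking offset `MASK_OFFSET_DB` and all function names are assumptions introduced for illustration; masking between frequency groups and post-masking are deliberately omitted, and band energy sums are used instead of densities, which does not change the ratio within a band. A practical implementation would use an FFT library rather than the small recursive transform shown here.

```python
import bisect
import cmath
import math

FRAME = 1024  # DFT length given in the text
HOP = 512     # window advance given in the text
# ASSUMPTION: commonly tabulated critical-band edges in Hz (after Zwicker).
BAND_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700, 9500, 12000, 15500]
# ASSUMPTION: hypothetical flat offset standing in for the empirical curves.
MASK_OFFSET_DB = -10.0


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def band_energies(frame, sample_rate):
    """Hann-window one frame and sum |X[k]|^2 within each band."""
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, s in enumerate(frame)]
    spectrum = fft(windowed)
    energies = [0.0] * (len(BAND_EDGES_HZ) - 1)
    for k in range(n // 2 + 1):
        band = bisect.bisect_right(BAND_EDGES_HZ, k * sample_rate / n) - 1
        if band < len(energies):  # bins above the last edge are ignored
            energies[band] += abs(spectrum[k]) ** 2
    return energies


def nmr_per_frame_db(reference, processed, sample_rate=44100):
    """Mean noise-to-mask ratio over all bands, in dB, for each frame."""
    mask_gain = 10.0 ** (MASK_OFFSET_DB / 10.0)
    result = []
    for start in range(0, len(reference) - FRAME + 1, HOP):
        ref = reference[start:start + FRAME]
        diff = [p - r for p, r in zip(processed[start:start + FRAME], ref)]
        thresholds = [mask_gain * e for e in band_energies(ref, sample_rate)]
        noise = band_energies(diff, sample_rate)
        ratios = [n / t for n, t in zip(noise, thresholds) if t > 0.0]
        mean = sum(ratios) / len(ratios) if ratios else 0.0
        result.append(10.0 * math.log10(mean) if mean > 0.0 else float("-inf"))
    return result
```

A negative value indicates that, under the assumed threshold, the differential signal lies below the masking threshold on average; a positive value indicates a potentially audible disturbance.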
For a typical sampling rate of 44.1 kHz, the NMR method has a frequency resolution of about 43 Hz and a time resolution of about 23 ms. The frequency resolution is too low at low frequencies, whereas the time resolution is too low at high frequencies. Nevertheless, the NMR method responds well to many temporal effects. When a sequence of beats, such as drum beats, is sufficiently slow, the block prior to the beat still has very low energy, so that a possibly occurring pre-echo can be recognized exactly. The advance of 11.6 ms for the analysis window permits the recognition of many pre-echoes. However, when the analysis window has an unfavorable position, a pre-echo may remain unrecognized.
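The resolution figures quoted above follow directly from the sampling rate, the transform length and the window advance, as the following short calculation shows. The variable names are chosen here for illustration only.

```python
SAMPLE_RATE_HZ = 44100  # typical sampling rate named in the text
DFT_LENGTH = 1024       # transform length of the NMR implementation
HOP = 512               # advance of the analysis window in sampling values

# Spacing of adjacent DFT bins.
frequency_resolution_hz = SAMPLE_RATE_HZ / DFT_LENGTH
# Duration covered by one analysis block.
time_resolution_ms = 1000.0 * DFT_LENGTH / SAMPLE_RATE_HZ
# Time between the starts of consecutive analysis blocks.
window_advance_ms = 1000.0 * HOP / SAMPLE_RATE_HZ

print(round(frequency_resolution_hz, 1))  # 43.1
print(round(time_resolution_ms, 1))       # 23.2
print(round(window_advance_ms, 1))        # 11.6
```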
The difference between masking by tonal signals and masking by noise is not taken into consideration in the NMR method. The masking curves employed are empirical values obtained from subjective hearing tests. In the method, the frequency groups are located at fixed positions within the frequency spectrum, whereas the ear forms the frequency groups dynamically around particularly prominent sound events in the spectrum. A dynamic arrangement around the centers of the energy densities would therefore be more correct. Due to the width of the fixed frequency groups, it is not possible to distinguish, for example, whether a sinusoidal signal is located in the center or at an edge of a frequency group. The masking curve is thus based on the most critical case, i.e. the lowest masking effect. The NMR method therefore sometimes indicates disturbances that cannot be heard by a human being.
The already mentioned low frequency resolution of only 43 Hz limits a hearing-adapted quality assessment of audio signals by means of the NMR method, in particular in the lower frequency range. This has a particularly disadvantageous effect in the assessment of low-pitched voice signals, as produced for example by a male speaker, or of sounds of very low-pitched instruments, such as a bass trombone.
To provide a better understanding of the present invention, some important psychoacoustic and cognitive fundamentals for the hearing-adapted quality assessment of audio signals are set out in the following. The most important term in the field of hearing-adapted coding and measuring technology is "Verdeckung" (=masking), which by analogy with the English term "masking" is often also referred to as "Maskierung". A sound event of low loudness that is perceivable when it occurs on its own is masked by a louder sound event, i.e. it is no longer perceived in the presence of the second, louder sound event. The masking effect depends on both the time structure and the spectral structure of the masker (i.e. the masking signal) and of the masked signal.
FIG. 1 illustrates the masking of sounds by narrow-band noise signals 1, 2, 3 at 250 Hz, 1,000 Hz and 4,000 Hz and a sound pressure level of 60 dB. FIG. 1 is taken from E. Zwicker and H. Fastl, Concerning the dependency of post-masking on disturbance pulse duration, in Acustica, Vol. 26, pages 78 to 82, 1982.
The human ear in this respect can be regarded as a bank of filters consisting of a large number of mutually overlapping band-pass filters. The distribution of these filters over frequency is not uniform. In particular, the frequency resolution is clearly better at low frequencies than at high frequencies. The smallest perceivable frequency difference is about 3 Hz at frequencies below about 500 Hz and, above 500 Hz, increases in proportion to the frequency or center frequency of the frequency groups. When the smallest perceivable differences are juxtaposed on the frequency scale, 640 perceivable steps are obtained. A frequency scale adapted to the frequency sensation of human beings is the Bark scale. It subdivides the entire audible range up to about 15.5 kHz into 24 sections.
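The non-uniform frequency scale described above can be illustrated with a widely used closed-form approximation of the Bark scale. The formula below (commonly attributed to Zwicker and Terhardt) and the function name `hz_to_bark` are assumptions introduced here for illustration; they are not taken from the methods discussed in this text.

```python
import math


def hz_to_bark(frequency_hz):
    """Approximate critical-band rate in Bark for a frequency in Hz.

    ASSUMPTION: closed-form approximation commonly attributed to
    Zwicker and Terhardt, used here purely as an illustration.
    """
    return (13.0 * math.atan(0.00076 * frequency_hz)
            + 3.5 * math.atan((frequency_hz / 7500.0) ** 2))


# The same 500 Hz span covers far more of the Bark scale at low
# frequencies than at high frequencies.
low_span = hz_to_bark(500.0) - hz_to_bark(0.0)
high_span = hz_to_bark(10500.0) - hz_to_bark(10000.0)
print(low_span, high_span)
print(hz_to_bark(15500.0))  # close to 24 under this approximation
```

Under this approximation, the upper limit of about 15.5 kHz maps to roughly 24 Bark, in agreement with the 24 sections mentioned above.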
Due to the overlapping of filters of finite steepness, audio signals of low loudness in the vicinity of loud audio signals are masked. Thus, in FIG. 1 all sinusoidal audio signals, each represented in the spectrum as an individual line, that lie below the illustrated narrow-band noise curves 1, 2, 3 are masked and are therefore not audible.
The edge steepness of the individual masking filters of the bank of filters in the human ear, as assumed in the model, furthermore depends on the sound pressure level of the signal heard and, to a lesser extent, on the center frequency of the respective band filter. The maximum masking depends on the structure of the masker and is about -5 dB in case of masking by noise. In case of masking by sinusoidal sounds, the maximum masking is considerably lower and, depending on the center frequency, is -14 to -35 dB (cf. M. R. Schroeder, B. S. Atal and J. L. Hall, Optimizing digital speech coders by exploiting masking properties of the human ear, The Journal of the Acoustical Society of America, Vol. 66 (No. 6), pages 1647 to 1652, December 1979).
The second important effect is masking in terms of time, which is explained with the aid of FIG. 2. Immediately after, but also immediately prior to, a loud sound event, sound events of lower loudness are not perceived. The masking in terms of time is highly dependent on the structure and the duration of the masker (cf. H. Fastl, Thresholds of masking as a measure for the resolution capacity of the human ear in terms of time and spectrum. Dissertation, faculty for mechanical and electrotechnical engineering of the Technical University of Munich, Munich, May 1974). Post-masking in particular may last up to 100 ms. The greatest sensitivity, and thus the shortest masking effect, occurs in the masking of noise by Gaussian pulses. In this case, pre-masking and post-masking last only about 2 ms.
With a sufficiently great distance from the masker or from 4 in FIG. 1, the masking curves change into a resting threshold 5. At the beginning and at the end of a masking signal, the masking curves during pre-masking 6 and post-masking 7, respectively, change into simultaneous masking 8. FIG. 2 is taken in essence from E. Zwicker, Psychoacoustics, publisher Springer-Verlag, Berlin Heidelberg N.Y., 1982.
The pre-masking effect is explained by the different speeds at which signals are processed on their way from the ear to the brain and in the brain itself. Large stimuli, i.e. sound events of great loudness or with a high sound pressure level (SPL), are passed on faster than small ones. A loud sound event can therefore, so to speak, "overtake" and thus mask a sound event of lower loudness that precedes it in time.
Post-masking corresponds to a "recovery time" of the sound receptors and of the stimulus transmission, for which in particular the breakdown of messenger substances at the nerve synapses should be mentioned.
The extent or degree of masking depends on the structure of the masker, i.e. the masking signal, both in terms of time and spectrum. Pre-masking is shortest (about 1.5 ms) with pulse-like maskers and considerably longer (up to 15 ms) in case of noise signals. After 100 ms, post-masking reaches the resting threshold. As regards the exact shape of the post-masking curve, the literature makes differing statements. Thus, in a particular case, post-masking in case of noise signals may vary between 15 and 40 ms. The values indicated above each constitute minimum values for noise. Recent investigations with Gaussian pulses as maskers show that for such signals post-masking also lies in the range of 1.5 ms (J. Spille, Measurement of pre- and post-masking in pulses under critical conditions, Internal Report, Thomson Consumer Electronics, Hannover, 1992). If both maskers and disturbance signals are band-limited by means of a low-pass filter, both pre-masking and post-masking become longer.
Masking in time plays an important role in the assessment of audio coding methods. When coding operates block by block, which holds for most cases, and when there are actions in the block, disturbances may be caused prior to the action which are above the level of the useful signal. These disturbances may be masked by the pre-masking effect. However, in case such a disturbance is not masked, the resulting effect is referred to as "pre-echo". Pre-echoes as a rule are not perceived separately from the action, but as a sound coloration of the action.
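The origin of such a disturbance can be demonstrated with a toy block coder. In the following sketch, a block that is silent up to a sudden onset (the action referred to above) is transformed, its spectral coefficients are coarsely quantized, and the block is transformed back. The block length of 64 sampling values, the quantizer step, the test signal and all names are assumptions chosen for illustration and do not correspond to any particular coding method; a naive DFT is used to keep the example self-contained.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform, or its inverse."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    scale = 1.0 / n if inverse else 1.0
    return [
        scale * sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                    for t, v in enumerate(values))
        for k in range(n)
    ]


def code_block(block, step):
    """Quantize the DFT coefficients of one block and transform back."""
    spectrum = dft(block)
    quantized = [complex(step * round(c.real / step),
                         step * round(c.imag / step))
                 for c in spectrum]
    return [c.real for c in dft(quantized, inverse=True)]


ATTACK = 40  # hypothetical onset position within the 64-sample block
block = [0.0] * ATTACK + [0.9 ** i * math.cos(1.3 * i) for i in range(24)]
decoded = code_block(block, step=0.5)

# The original is silent before the onset, so any energy found there
# in the decoded block is coding error preceding the action.
pre_attack_error_energy = sum(
    (d - b) ** 2 for d, b in zip(decoded[:ATTACK], block[:ATTACK]))
print(pre_attack_error_energy)
```

Because the quantization error is introduced in the frequency domain, it is spread over the entire block after the inverse transform and therefore also appears ahead of the onset, where the useful signal is zero.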
The resting threshold (4 in FIG. 1) results from the frequency response of the external and middle ear and from the superimposition of the sound signals reaching the inner ear with the basic noise caused, for example, by the blood flow. This basic noise and the resting threshold, which is not constant over frequency, thus mask sound events of very low loudness. FIG. 1 reveals in particular that a good sense of hearing may perceive a frequency range from 20 Hz to 18 kHz.
The subjectively perceived loudness of a signal depends very much on its spectral composition and its composition in time. Portions of a signal may mask other portions of the same signal in such a manner that they no longer contribute to the hearing impression. Signals close to the listening threshold (i.e. signals that are only just perceivable) are perceived to be less loud than would correspond to their actual sound pressure level. This effect is referred to as "choking" (E. Zwicker and R. Feldtkeller, The ear as recipient of messages, publisher Hirzel-Verlag, Stuttgart, 1967).
Furthermore, there are cognitive effects playing a role in the assessment of audio signals. In particular, a five-stage so-called "impairment scale" (impairment=deterioration) has established itself. It is the task of human test persons to make, in a double blind test, assessments for two signals, one thereof being the original signal that has not been coded and decoded, whereas the other signal is a signal obtained after coding and subsequent decoding. The hearing test uses three stimuli A, B, C, in which signal A always is the reference signal. A person performing the hearing test always compares the signals B and C to A. In this respect, the uncoded signal is referred to as reference signal, whereas the signal derived by coding and decoding from the reference signal is referred to as test signal. In the assessment of clearly audible disturbances, there are thus not only psychoacoustic effects playing a role, but also cognitive or subjective effects.
In the assessment of audio signals by human listeners, cognitive effects have considerable influence on the assessment by means of the impairment scale. Discrete, very strong disturbances often are perceived by many test persons as less disturbing than permanently present disturbances. However, starting from a specific number of such strong disturbances, they dominate the quality impression. Systematic investigations in this respect are not known from the literature.
Although the perception thresholds of different listeners hardly differ in psychoacoustic tests, different test persons perceive various artifacts as differently severe. While some test persons perceive restrictions in bandwidth to be less disturbing than noise modulations at high frequencies, other test persons feel exactly the opposite.
The assessment scales of various test persons differ clearly from each other. Many listeners tend to rate clearly audible disturbances as grade 1 ("very disturbing"), while they hardly ever assign average grades. Other listeners often assign average grades (Thomas Sporer, Evaluating small impairments with the mean opinion scale--reliable or just a guess? In 101st AES-Convention, Los Angeles, 1996, Preprint).
DE 44 37 287 C2 discloses a method of measuring the preservation of stereophonic audio signals and a method of recognizing jointly coded stereophonic audio signals. A signal to be tested, having two stereo channels, is formed by coding and subsequent decoding of a reference signal. Both the signal to be tested and the reference signal are transformed to the frequency domain. For each partial band of the reference signal and for each partial band of the signal to be tested, signal characteristics are formed. The signal characteristics belonging to the same partial band are compared with each other. From this comparison, conclusions are drawn with respect to the preservation of stereophonic audio signal properties or the disturbance of the stereo sound impression by the coding technique used. Subjective influences on the reference signal and the signal to be tested, due to the transmission properties of the human ear, are not taken into consideration in this publication.
DE 4345171 discloses a method of determining the coding type to be selected for coding at least two signals. A signal having two stereo channels is coded by intensity stereo coding and decoded again in order to be compared with the original stereo signal. The intensity stereo coding is to be used for the actual audio coding of the stereo signal when the left-hand and right-hand channels are very similar to each other. The coded/decoded stereo signal and the original stereo signal are transformed from the time domain to the frequency domain by a transformation method with unlike time resolution and frequency resolution. This transformation method comprises a hybrid/polyphase filter bank through which similar spectral lines are generated, for example, by means of an FFT or MDCT. By selecting a scale factor bandwidth that increases from a specific limit frequency upward, the frequency group width and the related time resolution of the human sense of hearing are to be simulated. Subsequently, the short-time energies in the respective frequency group bands are formed by squaring and summation, both for the original stereo signal and for the coded/decoded stereo signal. The short-time energy values thus obtained are assessed using the psychoacoustic listening threshold, so that only the audible short-time energy values are considered further and the psychoacoustic masking effects are thereby taken into account in assessing whether intensity stereo coding makes sense. This assessment of the short-time energy values of the frequency group bands can furthermore be extended by modelling the human inner ear, so as to consider the non-linearities of the human inner ear as well.