1. Field of the Invention
In general, the invention relates to sound quality assessment of processed audio files, and, more particularly, to evaluation of the sound quality of multi-channel audio files.
2. Description of the Related Art
In recent years, there has been a proliferation of digital media players (e.g., media players capable of playing digital audio files). Typically, these digital media players play digitally encoded audio or video files that have been “compressed” using any of a number of digital compression methods. Digital audio compression can be classified as ‘lossless’ or ‘lossy’. Lossless data compression allows the exact original data to be recovered, while lossy data compression yields data that differ from the source files but are close enough to be useful for the intended purpose. Typically, lossless compression is used for data files, such as computer programs, text files, and other files that must remain unaltered in order to be useful at a later time. Conversely, lossy data compression is commonly used for multimedia data, including audio, video, and picture files. Lossy compression is useful in multimedia applications such as streaming audio and/or video, music storage, and internet telephony.
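The difference between the two classes can be illustrated with a minimal Python sketch. It uses zlib from the Python standard library (DEFLATE, a general-purpose lossless compressor) as the lossless case, and a hypothetical toy quantizer that zeroes the low-order bits of each 16-bit sample as the lossy case. The helper names are illustrative only, and the toy quantizer is not any real audio codec.

```python
import struct
import zlib


def pack_pcm16(samples):
    """Pack signed 16-bit PCM samples as little-endian bytes."""
    return struct.pack(f"<{len(samples)}h", *samples)


def lossless_roundtrip(samples):
    """Compress with zlib (DEFLATE, a lossless method), then decompress."""
    compressed = zlib.compress(pack_pcm16(samples))
    restored = zlib.decompress(compressed)
    return list(struct.unpack(f"<{len(samples)}h", restored))


def lossy_roundtrip(samples, dropped_bits=8):
    """Hypothetical toy lossy scheme: zero the low-order bits of each sample.

    Only the upper 16 - dropped_bits bits would need to be stored, but the
    discarded detail cannot be recovered by the decoder."""
    return [(s >> dropped_bits) << dropped_bits for s in samples]


samples = [0, 1000, -1000, 12345, -32768, 32767, 7]
exact = lossless_roundtrip(samples)   # identical to samples
approx = lossy_roundtrip(samples)     # close to, but not equal to, samples
```

The lossless path returns the input bit for bit, while the lossy path returns values that differ from the originals by less than 256 quantization steps: "close enough to be useful", but not identical.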
The advantage of lossy compression over lossless compression is that a lossy method typically produces a much smaller file than a lossless method would for the same source. This is advantageous because storing or streaming digital media is most efficient with smaller file sizes and/or lower bit rates. However, files that have been compressed using lossy methods suffer from a variety of distortions, which may or may not be perceivable to the human ear or eye. Lossy methods often achieve compression by exploiting the limitations of human perception, removing data that the average person cannot perceive.
In the case of audio compression, lossy methods can ignore or downplay sound frequencies that are known to be inaudible to the typical human ear. In order to model the human ear, for example, a psychoacoustic model can be used to determine how to compress audio without degrading the perceived quality of sound.
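One ingredient of such a model is the threshold in quiet (the absolute threshold of hearing). The Python sketch below evaluates it with Terhardt's widely used approximation. The function names are hypothetical, and the formula is an assumption chosen for illustration rather than the model of any codec named in this section; a full psychoacoustic model would also account for masking between signal components.

```python
import math


def threshold_in_quiet_db(freq_hz):
    """Approximate absolute threshold of hearing in dB SPL (Terhardt's formula).

    Intended for roughly the audible range, about 20 Hz to 20 kHz."""
    f = freq_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)


def is_audible(freq_hz, level_db_spl):
    """Treat a pure tone as audible if its level exceeds the threshold in quiet."""
    return level_db_spl > threshold_in_quiet_db(freq_hz)
```

With this approximation the threshold dips slightly below 0 dB SPL around 3 to 4 kHz and rises steeply toward low frequencies. A 10 dB SPL tone at 100 Hz therefore falls below the threshold and could be discarded by an encoder, whereas the same level at 3.3 kHz is audible.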
Audio files can typically be compressed at ratios of about 10:1 without perceptible loss of quality. Examples of lossy compression schemes used to encode digital audio files include MPEG-1 Layer 2, MPEG-1 Layer 3 (MP3), MPEG-AAC, WMA, Dolby AC-3, Ogg Vorbis, and others.
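As a worked example of what a 10:1 ratio means in practice, the sketch below assumes CD-quality input (44.1 kHz sampling, 16 bits per sample, two channels). The input format is an assumption, since the text does not fix one, and the helper names are hypothetical.

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample, channels):
    """Bit rate of uncompressed PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def compressed_bitrate(pcm_bps, ratio):
    """Bit rate after compressing by the given ratio (e.g. 10 for 10:1)."""
    return pcm_bps / ratio


cd = pcm_bitrate(44_100, 16, 2)      # 1,411,200 bit/s for CD-quality stereo
coded = compressed_bitrate(cd, 10)   # 141,120 bit/s at 10:1
```

The result, roughly 141 kbit/s, is in the range of commonly used MP3 bit rates such as 128 kbit/s.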
Objective audio quality assessment aims at replacing expensive subjective listening tests (e.g., panels of human listeners) for audio quality evaluation. Objective assessment methods are generally fully automated, i.e. implemented on a computer with software. The interest in objective measures is driven by the demand for accurate audio quality evaluations, for instance to compare different audio coders or other audio processing devices. Commonly, in a testing scenario, the audio coder or other processing device is called a “device under test” (DUT). FIG. 1 is a block diagram of an audio quality testing setup 100. Reference audio signal 101 is input into the DUT 103. The DUT 103 outputs a processed audio signal 105 (e.g., a digitally compressed audio file or stream that has been restored so that it can be heard). The processed audio signal 105 is then fed into the audio quality tester 107, along with the original reference audio signal 101. In the audio quality tester 107, the processed audio signal 105 is compared to the reference audio signal 101 in order to determine the quality of the processed audio signal 105 output by the DUT 103. A measure of output quality 109 is output by the audio quality tester 107.
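The data flow of FIG. 1 can be sketched in a few lines of Python. In this sketch the reference signal 101 is a sine tone, the DUT 103 is a hypothetical uniform quantizer that mimics coding noise, and the audio quality tester 107 returns a plain signal-to-noise ratio (SNR) in dB as the measure of output quality 109. All three functions are illustrative assumptions; SNR ignores human perception entirely and merely stands in for a perceptual measure such as the one discussed below.

```python
import math


def reference_signal(n=480, freq_hz=1000.0, rate_hz=48000.0):
    """Reference audio signal (101): a sine tone as a list of floats."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]


def dut(signal, step=1 / 64):
    """Hypothetical device under test (103): a uniform quantizer adding coding noise."""
    return [step * round(x / step) for x in signal]


def quality_tester(reference, processed):
    """Hypothetical audio quality tester (107): SNR in dB as the output measure (109).

    Returns infinity when the processed signal equals the reference exactly."""
    noise = sum((r - p) ** 2 for r, p in zip(reference, processed))
    power = sum(r * r for r in reference)
    return math.inf if noise == 0 else 10 * math.log10(power / noise)


ref = reference_signal()
quality = quality_tester(ref, dut(ref))  # finite: the DUT introduced distortion
```

A coarser quantizer step (for example `dut(ref, step=1 / 8)`) lowers the reported quality, and an identical signal yields an unbounded score, the analogue of transparent quality.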
Transparent quality, i.e. best quality, is achieved if the processed audio signal 105 is indistinguishable from the reference audio signal 101 by any listener. The quality may be degraded if the processed audio signal 105 has audible distortions produced by the DUT 103.
Conventional approaches to objective audio quality assessment are described in ITU-R, “Rec. ITU-R BS.1387 Method for Objective Measurements of Perceived Audio Quality,” 1998, whose method is known as Perceptual Evaluation of Audio Quality and is hereafter referred to as “PEAQ”; this recommendation is hereby incorporated by reference in its entirety.
PEAQ takes into account properties of the human auditory system. For example, if the difference between the processed audio signal 105 and reference signal 101 falls below the human hearing threshold, it will not degrade the audio quality. Fundamental properties of hearing that have been considered include the auditory masking effect.
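The masking principle can be expressed as a noise-to-mask ratio (NMR) per frequency band: a difference between the signals counts as distortion only in bands where its energy exceeds the masking threshold. PEAQ includes NMR-based MOVs, but the Python sketch below is only a simplified illustration. The band energies and thresholds are hypothetical inputs, the function names are invented for illustration, and computing the masking threshold itself (the job of the psychoacoustic model) is not shown.

```python
import math


def noise_to_mask_db(error_energy, mask_energy):
    """Per-band noise-to-mask ratio in dB; a band with zero error gives -inf."""
    return [10 * math.log10(e / m) if e > 0 else -math.inf
            for e, m in zip(error_energy, mask_energy)]


def audible_bands(error_energy, mask_energy):
    """Indices of bands whose error energy exceeds the masking threshold (NMR > 0 dB)."""
    return [i for i, nmr in enumerate(noise_to_mask_db(error_energy, mask_energy))
            if nmr > 0]
```

For example, with error energies `[1e-6, 2.0, 0.5, 0.0]` and masking thresholds `[1e-4, 1.0, 1.0, 1.0]`, only band 1 (NMR of about +3 dB) is flagged; the other differences lie below their thresholds and do not degrade the estimated quality.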
However, conventional objective assessment techniques do not employ appropriate measures to estimate deviations in the auditory spatial image evoked by a multi-channel audio signal (e.g., 2-channel stereo, 5.1-channel surround sound, etc.). Spatial image distortions are commonly introduced by low-bit-rate audio coders, such as MPEG-AAC or MPEG-Surround. MPEG-AAC, for instance, provides joint-channel coding tools such as “intensity stereo coding” and “sum/difference coding”. The coding distortions that joint-channel coding techniques may cause cannot be appropriately estimated by conventional assessment tools such as PEAQ, because each audio channel is processed separately and properties of the spatial image are not taken into account.
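Why joint-channel coding can escape per-channel analysis is shown by the following Python sketch of the two tools in heavily simplified form. The sum/difference (mid/side) transform is shown without quantization, and intensity stereo is reduced to one full-band, time-domain sum signal plus per-channel gains. Real MPEG-AAC applies these tools per scale-factor band in the frequency domain, so the function names and details here are illustrative assumptions.

```python
import math


def ms_encode(left, right):
    """Sum/difference (mid/side) coding: M = (L + R) / 2, S = (L - R) / 2."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Inverse of ms_encode: L = M + S, R = M - S."""
    return [m + s for m, s in zip(mid, side)], [m - s for m, s in zip(mid, side)]


def energy(x):
    return sum(v * v for v in x)


def intensity_stereo(left, right):
    """Simplified intensity stereo: one sum signal plus per-channel gains.

    The gains preserve each channel's energy, but both outputs become scaled
    copies of the same waveform, so inter-channel differences are lost."""
    s = [l + r for l, r in zip(left, right)]
    g_l = math.sqrt(energy(left) / energy(s))
    g_r = math.sqrt(energy(right) / energy(s))
    return [g_l * v for v in s], [g_r * v for v in s]


def correlation(x, y):
    """Normalized inter-channel cross-correlation at lag zero."""
    return sum(a * b for a, b in zip(x, y)) / math.sqrt(energy(x) * energy(y))


n = 1000
left = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
right = [math.cos(2 * math.pi * 5 * i / n) for i in range(n)]
l_is, r_is = intensity_stereo(left, right)
```

In this example each intensity-coded channel keeps the frequency and energy of its input, so a per-channel comparison of magnitude spectra sees no change. The inter-channel correlation, however, jumps from 0 to 1, which listeners would typically perceive as a narrower, more point-like image.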
FIG. 2 is a block diagram of the PEAQ quality assessment tool, which supports only 1-channel (mono) or 2-channel (stereo) audio; signals with more than two channels are not supported.
The objective quality assessment tool 200 that implements PEAQ is divided into two main functional blocks, as shown in FIG. 2. The first block 201 is a psychoacoustic model, which acts as a distortion analyzer. This block compares corresponding monaural or stereophonic channels of a reference signal 203 and a test signal 205 and produces a number of Model Output Variables (MOVs) 207. Both the reference signal 203 and the test signal 205 can have any number of channels, from monaural to multi-channel surround sound. The MOVs 207 are specific distortion measures; each quantifies a certain type of distortion with one value per channel. These values are then averaged over all channels and passed to the second major block, a neural network 209, which combines all MOVs 207 to derive an objective audio quality 211.
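The two-stage structure of FIG. 2 after the psychoacoustic model can be sketched as follows: per-channel MOVs are averaged over channels, and the averages are fed to a small feed-forward network with one sigmoid hidden layer. This is a sketch under stated assumptions. The function names are hypothetical, the weights passed in are placeholders rather than the trained PEAQ coefficients, and the input scaling and output mapping specified in the recommendation are omitted.

```python
import math


def average_movs(per_channel_movs):
    """Average each MOV over channels; per_channel_movs[c][k] is MOV k of channel c."""
    n_channels = len(per_channel_movs)
    n_movs = len(per_channel_movs[0])
    return [sum(ch[k] for ch in per_channel_movs) / n_channels for k in range(n_movs)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def combine_movs(movs, hidden_weights, hidden_biases, out_weights, out_bias):
    """One-hidden-layer network mapping averaged MOVs to a single quality value.

    All weights and biases are caller-supplied placeholders (hypothetical)."""
    hidden = [sigmoid(sum(w * m for w, m in zip(row, movs)) + b)
              for row, b in zip(hidden_weights, hidden_biases)]
    return out_bias + sum(w * h for w, h in zip(out_weights, hidden))
```

Because averaging is symmetric in the channels, a distortion confined to the left channel and the same distortion confined to the right channel produce identical averaged MOVs; the averaged values carry no information about which channel, or which part of the spatial image, was affected.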
In PEAQ, since distortions are analyzed independently in each audio channel, there is no explicit evaluation of auditory spatial image distortion. For many types of audio signals, this lack of spatial image distortion analysis can cause inaccurate objective quality estimations, leading to unsatisfactory quality assessments. Thus, an audio signal may receive a high quality rating according to the PEAQ standard yet have severe spatial image distortions. This is highly undesirable for high-fidelity or high-definition sound recordings in which spatial cues are crucial, such as recordings intended for multi-channel (i.e., two or more channels) sound systems.
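One way such spatial cues could be quantified is to compare inter-channel level difference (ICLD) and inter-channel coherence (ICC) between a reference channel pair and a test channel pair. Cues of this kind are used, for example, by parametric coders such as MPEG Surround. The Python sketch below is illustrative only: it is not part of PEAQ, it is not presented as the method of the invention, and its function names and the zero-lag simplification of coherence are assumptions.

```python
import math


def energy(x):
    return sum(v * v for v in x)


def icld_db(left, right):
    """Inter-channel level difference in dB (positive when left is louder)."""
    return 10 * math.log10(energy(left) / energy(right))


def icc(left, right):
    """Inter-channel coherence, simplified here to normalized correlation at lag zero."""
    return sum(a * b for a, b in zip(left, right)) / math.sqrt(energy(left) * energy(right))


def spatial_cue_deviation(ref_pair, test_pair):
    """Absolute change in ICLD (dB) and in ICC between reference and test channel pairs."""
    (ref_l, ref_r), (test_l, test_r) = ref_pair, test_pair
    d_level = abs(icld_db(ref_l, ref_r) - icld_db(test_l, test_r))
    d_coherence = abs(icc(ref_l, ref_r) - icc(test_l, test_r))
    return d_level, d_coherence
```

For a reference pair of equal-level, uncorrelated channels and a test pair in which both channels carry the same equal-level waveform, the level deviation is 0 dB while the coherence deviation is 1: every channel still has its original level, yet the spatial image has changed substantially.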
Accordingly, there is a demand for objective audio quality assessment techniques capable of evaluating spatial as well as other audio distortions in a multi-channel audio signal.