1. Field of the Invention
The present invention relates generally to the field of digital audio and, more specifically, to the field of perceptual coding of digital audio.
2. Background
Perceptual coders analyze the frequency and amplitude content of an input signal and compare it to a model of human auditory perception. Using the model, the encoder removes the perceptually irrelevant portions of the audio signal. In theory, although the method is lossy, the human perceiver will not hear degradation in the decoded signal. Considerable data reduction is possible. A well-designed perceptually coded recording, with a conservative level of reduction, can rival the sound quality of a conventional recording because the data is coded in a much more intelligent fashion, and because the listener does not hear all of what is recorded to begin with. In other words, perceptual coders require only a fraction of the data needed by a conventional system.
Data reduction coders attempt to represent the audio signal at a reduced bit rate while minimizing quantization error. Time-domain coding methods such as delta modulation can be considered to be data-reduction coders. They use prediction methods on samples representing the full bandwidth of the audio signal and yield a quantization error spectrum that spans the audio band. Frequency-domain encoders take a different approach. The signal is analyzed in the frequency domain and coded so that quantization error can be assigned and masked based on psychoacoustic characteristics of the ear. However, coder complexity is greatly increased.
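As an illustration of the time-domain approach described above, the following is a minimal sketch of linear delta modulation, in which each sample is predicted from the previous reconstructed value and only the sign of the prediction error is sent. The function names `dm_encode` and `dm_decode` and the fixed step size of 0.1 are assumptions chosen for illustration and are not taken from the present description.

```python
def dm_encode(samples, step=0.1):
    """Encode samples as one bit each: 1 = step up, 0 = step down.

    The step size is a hypothetical, fixed value chosen for illustration.
    """
    bits = []
    estimate = 0.0  # running prediction of the input signal
    for x in samples:
        bit = 1 if x >= estimate else 0
        estimate += step if bit else -step
        bits.append(bit)
    return bits


def dm_decode(bits, step=0.1):
    """Rebuild the staircase approximation by repeating the encoder's updates."""
    out = []
    estimate = 0.0
    for bit in bits:
        estimate += step if bit else -step
        out.append(estimate)
    return out


samples = [0.05, 0.15, 0.25, 0.2, 0.1]
bits = dm_encode(samples)
decoded = dm_decode(bits)
```

Because the decoder applies the same fixed step at every sample regardless of frequency content, the reconstruction error is not shaped to follow any masking curve, which is consistent with the broadband error spectrum noted above.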
Most low-bit-rate codecs use psychoacoustic models to adaptively quantize only the perceptually significant parts of the signal. Parts of the signal that are below the minimum hearing threshold, or masked by more significant signals, are judged to be inaudible and are not coded.
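The minimum-threshold test can be sketched as follows. The sketch assumes a widely used closed-form approximation of the threshold in quiet (commonly attributed to Terhardt), which is not part of the present description. The names `threshold_in_quiet_db` and `audible_components` and the example levels are hypothetical.

```python
import math


def threshold_in_quiet_db(freq_hz):
    """Approximate absolute hearing threshold in dB SPL.

    Assumption: Terhardt-style closed-form fit, used here only as an example.
    """
    khz = freq_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)


def audible_components(components):
    """Keep (freq_hz, level_db_spl) pairs that lie above the threshold in quiet."""
    return [(f, lvl) for f, lvl in components if lvl > threshold_in_quiet_db(f)]


# Hypothetical spectral components: (frequency in Hz, level in dB SPL).
components = [(100, 10), (1000, 10), (16000, 40)]
kept = audible_components(components)
```

Under this approximation the threshold is roughly 3 dB SPL at 1 kHz and rises steeply toward both ends of the audio band, so only the 1 kHz component in the example would be retained for coding.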
Amplitude masking occurs when a tone shifts the threshold curve upward in a frequency region surrounding the tone. The masking threshold describes the level where a tone is barely audible. When tones are sounded simultaneously, masking occurs in which louder tones can completely obscure softer tones. For example, a tone of 500 Hz can mask a concurrent softer tone of 600 Hz. The strong sound is called the masker and the softer sound is called the maskee. Masking theory argues that the softer tone is just detectable when its energy equals the energy of the part of the louder masking signal in the critical band; this is a linear relationship with respect to amplitude. Generally, depending on relative amplitude, soft (but otherwise audible) audio tones are masked by louder tones at a similar frequency (within 100 Hz at low frequencies).
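The notion of tones lying within the same critical band can be made concrete with a small sketch. It assumes the Zwicker and Terhardt approximation of the Bark (critical-band rate) scale, which is not part of the present description. The helper names `hz_to_bark` and `bark_distance` are hypothetical.

```python
import math


def hz_to_bark(freq_hz):
    """Map frequency in Hz to critical-band rate in Bark.

    Assumption: Zwicker and Terhardt closed-form approximation.
    """
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))


def bark_distance(f1_hz, f2_hz):
    """Separation of two tones in critical bands (Bark)."""
    return abs(hz_to_bark(f1_hz) - hz_to_bark(f2_hz))


near = bark_distance(500, 600)   # the 500 Hz / 600 Hz example from the text
far = bark_distance(500, 2000)   # a widely separated pair for comparison
```

Under this approximation the 500 Hz and 600 Hz tones are less than one Bark apart, which is consistent with the 500 Hz tone being able to mask the softer 600 Hz tone, whereas the 500 Hz and 2000 Hz tones are several critical bands apart.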
Temporal masking occurs when tones are sounded close in time, but not simultaneously. A signal can be masked by a noise or another signal that occurs later. This premasking is sometimes called backward masking. In addition, a signal can be masked by a noise or another signal that ends before the signal begins. This is post masking, sometimes called forward masking. In other words, a louder tone appearing just after (premasking) or just before (post masking) a softer tone overcomes the softer tone. Just as simultaneous masking increases as frequency differences are reduced, temporal masking increases as time differences are reduced.
Temporal masking decreases as the duration of the masker decreases. In addition, a tone is post masked by an earlier tone when they are close in frequency or when the earlier tone is lower in frequency. Post masking is slight when the masker has a higher frequency. Logically, simultaneous masking is stronger than either pre- or post masking because the sounds occur at the same time.
Temporal masking is important in frequency domain coding. These coders have limited time resolution because they operate on blocks of samples, thus spreading error over time. Temporal masking can overcome audibility of artifacts caused by transient signals. Ideally, filter banks should provide a time resolution of 2 to 4 ms. Acting together, amplitude and temporal masking form a contour that can be mapped in the time-frequency domain.
In subband coding, blocks of consecutive time-domain samples representing the broadband signal are collected over a short period and applied to a digital filter bank. The filter bank divides the signal into multiple bandlimited channels to approximate the critical band response of the human ear.
Each subband is coded independently with greater or fewer bits allocated to the samples in the subband. In any case, quantization noise is increased in each subband. However, when the signal is reconstructed, the quantization noise in a subband will be limited to that subband, where it is masked by the audio signal in that subband. Bit allocation is determined by a psychoacoustic model and analysis of the signal itself. These operations are recalculated for every subband in every new block of data. Samples are dynamically quantized according to the audibility of signals and noise. There is great flexibility in the psychoacoustic models and bit allocation algorithms used in coders that are otherwise compatible. The decoder uses the quantized data to re-form the samples in each block. A synthesis filter bank, the inverse of the analysis filter bank, sums the subband signals to reconstruct the output broadband signal.
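One possible form of the bit allocation step is sketched below. It is a greedy scheme, not the allocation algorithm of any particular coder, and assumes that each added bit lowers quantization noise by about 6.02 dB (the usual rule of thumb for a uniform quantizer). The function name `allocate_bits`, the `max_bits` limit, and the example signal-to-mask ratios are hypothetical.

```python
DB_PER_BIT = 6.02  # assumed noise reduction per added bit (uniform quantizer)


def allocate_bits(smr_db, bit_pool, max_bits=15):
    """Greedily hand out bits to the subband whose noise is most audible.

    smr_db:   signal-to-mask ratio of each subband in dB (from the model).
    bit_pool: number of bits available for this block (hypothetical budget).
    """
    bits = [0] * len(smr_db)
    for _ in range(bit_pool):
        # Noise-to-mask ratio after the bits granted so far.
        nmr = [s - DB_PER_BIT * b if b < max_bits else float("-inf")
               for s, b in zip(smr_db, bits)]
        worst = max(range(len(nmr)), key=nmr.__getitem__)
        if nmr[worst] <= 0:
            break  # all quantization noise is already below the mask
        bits[worst] += 1
    return bits


smr_db = [20.0, 5.0, -3.0, 12.0]           # hypothetical per-subband values
generous = allocate_bits(smr_db, bit_pool=10)
tight = allocate_bits(smr_db, bit_pool=4)
```

With the generous budget the loop stops early once the noise in every subband is below its masking level, and the subband with a negative signal-to-mask ratio receives no bits at all. With the tight budget the bits go to the subbands where noise would be most audible.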
A subband perceptual coder uses a digital filter bank to split a short duration of the audio signal into multiple bands. In some designs, a side-chain processor applies the signal to a transform such as an FFT to analyze the energy in each subband. These values are applied to a psychoacoustic model to determine the combined masking curve that applies to the signals in that block. This permits more efficient coding of the time-domain samples. Specifically, the encoder analyzes the energy in each subband to determine which subbands contain audible information. A calculation is made to determine the average power level of each subband over the block. This average level is used to calculate the masking level due to masking of signals in each subband, as well as masking from signals in adjacent subbands. Finally, minimum hearing threshold values are applied to each subband to derive its final masking level. Peak power levels present in each subband are calculated and compared to the masking level. Subbands that do not contain audible information are not coded, and in some cases entire subbands can mask nearby subbands, which thus need not be coded.
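The sequence of steps just described (average power, in-band and adjacent-band masking, minimum hearing threshold, peak comparison) can be sketched as follows. The two masking offsets, the dB reference (power relative to a full-scale amplitude of 1.0), and the names `power_db` and `subbands_to_code` are hypothetical values chosen for illustration and are not taken from any actual psychoacoustic model.

```python
import math

SELF_OFFSET_DB = 15.0      # hypothetical: mask sits this far below in-band average
NEIGHBOR_OFFSET_DB = 30.0  # hypothetical: weaker masking from adjacent subbands


def power_db(power):
    """Convert linear power to dB, guarding against log of zero."""
    return 10.0 * math.log10(max(power, 1e-12))


def subbands_to_code(blocks, quiet_db):
    """Return one True/False per subband: does it hold audible information?

    blocks:   list of per-subband sample lists for the current block.
    quiet_db: minimum hearing threshold per subband, in the same dB reference.
    """
    avg_db = [power_db(sum(s * s for s in b) / len(b)) for b in blocks]
    peak_db = [power_db(max(s * s for s in b)) for b in blocks]
    decisions = []
    for i in range(len(blocks)):
        candidates = [avg_db[i] - SELF_OFFSET_DB, quiet_db[i]]
        for j in (i - 1, i + 1):  # masking contributed by adjacent subbands
            if 0 <= j < len(blocks):
                candidates.append(avg_db[j] - NEIGHBOR_OFFSET_DB)
        masking_level_db = max(candidates)
        decisions.append(peak_db[i] > masking_level_db)
    return decisions


blocks = [
    [0.5, -0.5, 0.5, -0.5],          # loud subband
    [0.001, -0.001, 0.001, -0.001],  # very soft subband next to the loud one
    [0.05, -0.05, 0.05, -0.05],      # moderate subband
]
coded = subbands_to_code(blocks, quiet_db=[-80.0, -80.0, -80.0])
```

In this example the soft middle subband lies above the minimum hearing threshold but below the masking level contributed by its loud neighbor, so it is marked as not needing to be coded.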