A perceptual audio coder is an apparatus that takes a series of audio samples as input and compresses them to save disk space or bandwidth. The perceptual audio coder exploits properties of the human ear to achieve this compression.
The technique of compressing audio signals involves recording an audio signal through a microphone and then converting the recorded analog audio signal to a digital audio signal using an A/D converter. The analog signal is sampled at a specific rate (called the sampling frequency), and this results in a series of audio samples, so the digital audio signal is simply a series of numbers. The audio coder divides the digital audio signal into large frames of fixed length, generally around 1024 samples each. The audio coder processes one frame, that is, 1024 samples, at a time. The audio coder then transforms each fixed-length frame (1024 samples) into the frequency domain. The transformation is accomplished by using an algorithm whose output is another set of 1024 samples representing the spectrum of the input frame. In this spectrum, each sample corresponds to a frequency. The audio coder then computes masking thresholds from the spectrum. Masking thresholds are another set of numbers that are used in compressing the audio signal. The computation of the masking thresholds is described below.
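The framing and transform steps described above can be sketched as follows. This is an illustration only: the text does not name the transform, so a plain (and slow) discrete Fourier transform is assumed here, whereas practical coders often use a different transform such as the MDCT. The function names and the zero-padding of the last frame are assumptions.

```python
import cmath
import math


def split_into_frames(samples, frame_len=1024):
    """Split samples into consecutive fixed-length frames.

    The last frame is zero-padded if the input does not fill it (assumption).
    """
    frames = []
    for start in range(0, len(samples), frame_len):
        frame = list(samples[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))
        frames.append(frame)
    return frames


def dft_magnitudes(frame):
    """Naive O(N^2) DFT of one frame; returns one magnitude per frequency bin."""
    n = len(frame)
    magnitudes = []
    for k in range(n):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        magnitudes.append(abs(acc))
    return magnitudes
```

For a frame of N samples the function returns N values, matching the statement that a 1024-sample frame yields a 1024-sample spectrum.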
The audio coder computes an energy spectrum by squaring each of the 1024 spectrum samples. The samples are then divided into a series of bands. For example, the first 10 samples can form one band, the next 10 samples can form the next band, and so on. Note that the number of samples (width) in each band varies. The widths of the bands are designed to suit the way the human ear perceives the frequencies of sound. The energy values within each band are then summed separately to produce a grouped energy spectrum.
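A minimal sketch of the grouping step follows. The band boundaries are passed in as a list of edge indices; the name band_edges and the example edges are hypothetical and do not correspond to any particular band table.

```python
def grouped_energy(spectrum, band_edges):
    """Square each spectrum sample and sum the energies within each band.

    band_edges holds the start index of every band plus the end index of the
    last band, so bands may have different widths.
    """
    energy = [abs(x) ** 2 for x in spectrum]
    return [sum(energy[lo:hi]) for lo, hi in zip(band_edges, band_edges[1:])]
```

For example, grouped_energy([1, 2, 3, 4, 5, 6], [0, 2, 3, 6]) groups six samples into bands of width 2, 1, and 3.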
The audio coder applies a spreading function to the grouped energy spectrum to obtain an excitation pattern. This operation simulates how sound in one band (called a critical band) affects the neighboring critical bands. Generally this step involves convolving the grouped energy spectrum with the spreading function, which results in another fixed-size set of numbers.
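The convolution with a spreading function can be illustrated with the sketch below. The shape of the spreading function is an assumption made for illustration: energy is taken to leak into higher bands by a factor up_factor per band and into lower bands by down_factor per band. Both parameter names and their default values are hypothetical.

```python
def spread_energy(band_energy, up_factor=0.5, down_factor=0.25):
    """Convolve grouped band energies with a simple two-sided spreading function.

    The weight applied to a source band decays geometrically with its distance
    from the target band (assumed shape, not taken from any standard).
    """
    excitation = []
    for target in range(len(band_energy)):
        total = 0.0
        for source, energy in enumerate(band_energy):
            distance = target - source
            if distance >= 0:
                weight = up_factor ** distance      # source at or below target
            else:
                weight = down_factor ** (-distance)  # source above target
            total += energy * weight
        excitation.append(total)
    return excitation
```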
Then, based on the tonal or noise-like nature of the spectrum in each critical band, a certain amount of frequency-dependent attenuation is applied to the excitation pattern to obtain initial masking threshold values. The absolute threshold of hearing is a set of amplitude values below which the human ear cannot hear a sound.
The audio coder then combines the initial masking threshold values with the absolute threshold values to obtain the final masked threshold values. A masked threshold value is a level below which a sound is not audible to the human ear (i.e., an estimate of the maximum allowable noise that can be introduced during quantization).
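The two threshold steps can be sketched as follows. Several details are assumptions: tonality is represented as a value between 0 (noise-like) and 1 (tonal), the attenuation is expressed in decibels with hypothetical offsets, and the initial threshold is combined with the absolute threshold by taking the larger of the two.

```python
def masked_thresholds(excitation, tonality, absolute_threshold,
                      tonal_offset_db=18.0, noise_offset_db=6.0):
    """Attenuate the excitation per band, then apply the absolute threshold.

    Offsets are hypothetical; taking the maximum is one assumed way to combine
    the initial threshold with the absolute threshold of hearing.
    """
    thresholds = []
    for exc, tone, quiet in zip(excitation, tonality, absolute_threshold):
        # Interpolate the attenuation between the noise-like and tonal cases.
        offset_db = tone * tonal_offset_db + (1.0 - tone) * noise_offset_db
        initial = exc * 10.0 ** (-offset_db / 10.0)
        thresholds.append(max(initial, quiet))
    return thresholds
```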
Using the masked threshold values, the audio coder computes the perceptual entropy (PE) of the current frame. The perceptual entropy is a measure of the minimum number of bits required to code the current frame of audio samples. In other words, the PE indicates how much the current frame of audio samples can be compressed. Various types of algorithms are currently used to compute the PE.
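Since the text notes that various algorithms are used to compute the PE, the sketch below shows only one simplified estimate and should not be read as the formula of any particular coder. It assumes that each sample in a band needs about half a bit for every factor of two by which the band energy exceeds its masked threshold, and that bands whose energy lies below the threshold need no bits.

```python
import math


def perceptual_entropy(band_energy, thresholds, band_widths):
    """Simplified PE estimate in bits for one frame (assumed formula)."""
    total_bits = 0.0
    for energy, threshold, width in zip(band_energy, thresholds, band_widths):
        # Bands masked entirely (energy <= threshold) contribute nothing.
        if energy > threshold > 0.0:
            total_bits += width * 0.5 * math.log2(energy / threshold)
    return total_bits
```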
The audio coder receives the grouped energy spectrum, the computed masking threshold values, and the PE, and quantizes (compresses) the audio signal. The audio coder has only a restricted number of bits allocated for each frame, depending on the bit rate. It distributes these bits across the spectrum based on the masking threshold values. If the masking threshold value is high, more quantization noise can be tolerated, so that part of the audio signal is less important and is represented using a smaller number of bits. Similarly, if the masking threshold value is low, that part of the audio signal is important and is represented using a larger number of bits. The audio coder also checks that the number of bits allocated for the frame is not exceeded. The audio coder generally applies a two-loop strategy to allocate and monitor the bits given to the spectrum. The loops are generally nested and are called the Rate Control Loop and the Distortion Control Loop. The Rate Control Loop ensures that the distribution of bits does not exceed the allocated number of bits, and the Distortion Control Loop distributes the bits across the received spectrum. Quantization is a major part of the perceptual audio coder. The performance of the audio coder can be significantly improved by reducing the number of calculations performed in the control loops. Current quantization algorithms are computationally intensive and hence slow.
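The nested two-loop strategy can be illustrated with the toy sketch below. It is not the quantizer of any real coder: the bit counter, the step-size growth factor, the per-band amplification factor, and all function names are assumptions chosen to keep the example short. The inner function plays the role of the Rate Control Loop, and the outer function plays the role of the Distortion Control Loop.

```python
def count_bits(quantized):
    """Toy bit estimate (assumption): 1 bit per zero, sign plus magnitude bits otherwise."""
    return sum(1 if q == 0 else 1 + abs(q).bit_length() for q in quantized)


def rate_loop(values, bit_budget, step=1.0):
    """Inner loop: enlarge the quantizer step until the frame fits the budget."""
    while True:
        quantized = [int(round(v / step)) for v in values]
        if count_bits(quantized) <= bit_budget:
            return quantized, step
        step *= 2 ** 0.25


def two_loop_quantize(bands, thresholds, bit_budget, max_iters=16):
    """Outer loop: amplify bands whose noise exceeds the masked threshold.

    bands is a list of lists of spectral values; thresholds gives the allowed
    noise energy per band.
    """
    if bit_budget < sum(len(band) for band in bands):
        raise ValueError("bit budget too small for this toy bit counter")
    scale = [1.0] * len(bands)
    for _ in range(max_iters):
        flat = [v * s for band, s in zip(bands, scale) for v in band]
        quantized, step = rate_loop(flat, bit_budget)
        violated, pos = [], 0
        for i, band in enumerate(bands):
            # Noise is measured after undoing the band amplification.
            noise = sum((v - quantized[pos + j] * step / scale[i]) ** 2
                        for j, v in enumerate(band))
            pos += len(band)
            if noise > thresholds[i]:
                violated.append(i)
        if not violated or len(violated) == len(bands):
            break
        for i in violated:
            scale[i] *= 2 ** 0.5
    return quantized, step, scale
```

Each pass of the outer loop reruns the inner loop, which shows why the number of calculations in the control loops dominates the cost of quantization.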
As described above, the audio coder receives one frame of samples (1024 samples in length) as input, converts the frame into a spectrum, and then quantizes the spectrum using the masking thresholds. Sometimes the input audio signal varies quickly, that is, the properties of the signal change abruptly. For example, if there is a sudden heavy beat in the audio signal and the audio coder processes a full frame of 1024 samples that includes the heavy beat, a problem called pre-echo can occur because temporal masking is inadequate for a signal with abrupt changes. The signal contains error after quantization, and this error is spread over the whole frame, so it can be heard as noise before the onset of the heavy beat; hence the name pre-echo. Heavy beats are also called ‘attacks.’ A signal is said to have an attack if it exhibits a significant amount of non-stationarity within the duration of the frame under analysis. For example, a sudden increase in the amplitude of a time signal within a typical analysis duration is an attack. To avoid pre-echo, the audio signal is coded with shorter frames instead of the long 1024-sample frame. To keep the number of input samples the same, usually 8 smaller blocks of 128 samples each are coded (8×128 samples=1024 samples). This restricts the heavy beat to one block of 128 samples among the 8 smaller blocks, and hence the noise introduced does not spread to the neighboring smaller blocks as pre-echo. The disadvantage of coding 8 smaller blocks of 128 samples is that they require more bits than one larger block of 1024 samples, so the compression efficiency of the audio coder is significantly reduced. To improve the compression efficiency, the heavy beats have to be detected accurately so that the smaller blocks are applied only around the heavy beats. If a heavy beat is missed, pre-echo can occur.
Conversely, a false detection of heavy beats can result in significantly reduced compression efficiency, because the smaller blocks are then used where they are not needed. Current methods to detect the heavy beats use the PE. Calculating the PE is computationally very intensive, and detection based on the PE is not very accurate.
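To make the definition of an attack concrete, the sketch below flags a frame when the energy of one 128-sample sub-block jumps sharply relative to the sub-block before it. This is not the PE-based detection the text attributes to current methods; it is a plain energy-ratio test substituted purely for illustration, and the ratio of 8 and the function names are hypothetical.

```python
def detect_attack(frame, sub_len=128, ratio=8.0, floor=1e-9):
    """Return True if any sub-block has much more energy than its predecessor."""
    energies = [sum(x * x for x in frame[i:i + sub_len])
                for i in range(0, len(frame), sub_len)]
    for previous, current in zip(energies, energies[1:]):
        # The floor avoids flagging numerical noise after digital silence.
        if current > ratio * max(previous, floor):
            return True
    return False


def choose_block_type(frame):
    """Code frames with an attack as short blocks, all others as one long block."""
    return "short" if detect_attack(frame) else "long"
```

Note that this simple test compares sub-blocks only within one frame, so an attack located in the first sub-block would go undetected.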
As described above, blocks that contain attacks should be coded as smaller blocks of 128 samples and the others as larger blocks of 1024 samples. The frames of 128 samples are called ‘short-blocks,’ and the frames of 1024 samples are called ‘long-blocks.’ Short-blocks require more bits to code than long-blocks, yet a fixed number of bits is allocated for each large frame. If some bits can be saved while coding a long-block and used later in a short-block, the compression efficiency of the audio coder can be significantly increased. To hold the saved bits, a ‘Bit Reservoir mechanism’ is needed. Since long-blocks do not need a large number of bits, the unused bits from the long-blocks can be saved in the bit reservoir and used later for a short-block. Currently there are no efficient techniques to save and allocate bits between long-blocks and short-blocks to improve the compression efficiency of the audio coder.
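The idea of a bit reservoir can be sketched as a small bookkeeping object. The class name, the capacity limit, and the grant method are hypothetical; the sketch only shows how bits left unused by one frame become available to a later frame.

```python
class BitReservoir:
    """Toy bit reservoir: stores unused bits up to a fixed capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.level = 0

    def grant(self, frame_budget, requested):
        """Return the bits a frame may use and bank whatever is left over."""
        available = frame_budget + self.level
        granted = min(requested, available)
        # Leftover bits are saved, but never beyond the reservoir capacity.
        self.level = min(self.capacity, available - granted)
        return granted
```

For example, with a frame budget of 600 bits, a long-block that requests 400 bits leaves 200 bits in the reservoir, and a following short-block that requests 900 bits can then receive 800.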
The audio signal can be of two types: (i) a single-channel or mono signal and (ii) a multi-channel or stereo signal, which produces spatial effects. The stereo signal is a multi-channel signal composed of two channels, namely the left and right channels. Generally the audio signals in the two channels are highly correlated. By using this correlation, the stereo channels can be coded more efficiently. If, instead of directly coding the stereo channels, their sum and difference signals are coded and transmitted wherever the correlation is high, a better sound quality is achieved at the same bit rate. When the audio signal is a stereo signal, the audio coder can operate in two modes: (a) normal mode and (b) M-S mode. In the M-S mode, the sum and difference of the left and right channels are encoded. Currently the decision to switch between the normal and M-S modes is based on the PE. As explained above, computing the PE is computationally intensive, and decisions based on the PE are inconsistent.
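The sum and difference coding used in the M-S mode can be sketched as follows. The scaling by one half is a common convention assumed here, since the text only says "sum and difference", and the function names are hypothetical. The sketch covers only the conversion itself, not the decision of when to switch modes.

```python
def ms_encode(left, right):
    """Convert left/right channels to mid (scaled sum) and side (scaled difference)."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Recover the left/right channels from mid and side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

When the two channels are highly correlated, the side signal is close to zero and therefore cheap to code, which is the source of the efficiency gain described above.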
Therefore, there is a need in the art for a computationally efficient quantization technique. There is also a need in the art for an improved attack detection technique that is computationally less intensive and more accurate, to improve the compression efficiency of the audio coder. In addition, there is a need in the art for a technique to allocate bits between the long-blocks and short-blocks to improve the compression efficiency of the audio coder. Furthermore, there is a need in the art for a technique that is computationally efficient and more accurate in switching between the normal and M-S modes when the audio signal is a stereo signal.