(1) Field of the Invention
The present invention relates to an apparatus and method for encoding audio signals. More particularly, the present invention relates to an apparatus and method for encoding audio signals for use in the fields of data communications such as mobile phone networks and the Internet, digital televisions and other broadcasting services, and audio/video recording and storage devices using MD, DVD, and other media.
(2) Description of the Related Art
Recent years have seen a growing need for audio coding techniques enabling efficient compression of audio signals, as a result of rapid proliferation of Internet communications and digital terrestrial broadcasting services, as well as widespread use of DVD, digital audio players, and other audio/video appliances.
Adaptive transform coding is used as a mainstream method for audio coding. This technique exploits the characteristics of the human hearing system to compress data by reducing redundancy of acoustic information and suppressing imperceptible sound components.
The basic process flow of adaptive transform coding includes the following steps:
    transforming an audio signal from time domain to frequency domain;
    partitioning the frequency-domain signals into multiple frequency bands according to the frequency resolution of human hearing;
    calculating an optimal data bandwidth for encoding signal components in each frequency band, based on the human hearing characteristics; and
    quantizing the frequency-domain signals according to the data bandwidth assigned to each frequency band.
Among the available techniques of adaptive transform coding, MPEG2 AAC has been of particular interest in recent years, where MPEG2 stands for “Moving Picture Experts Group-2” and AAC for “Advanced Audio Coding.” MPEG2 AAC is used, for example, in terrestrial digital broadcasting systems. The International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) has standardized the MPEG2 AAC technology (hereafter simply “AAC”) as ISO/IEC 13818-7, titled “Advanced Audio Coding” (AAC).
The AAC encoder samples a given analog audio signal in the time domain and partitions the resulting series of digital values into frames each consisting of a predetermined number of samples.
One frame may be processed as a single LONG block with a length of 1024 samples or as a series of SHORT blocks with a length of 128 samples. The selection of which block length to use is made in an adaptive manner, depending on the nature of audio signals. Audio signals are encoded on an individual block basis.
FIG. 8 shows the relationship between LONG blocks and SHORT blocks. One frame contains 1024 samples. A LONG block is the entire span of such a frame. A SHORT block is one eighth of the frame, thus containing 128 samples.
Accordingly, the encoder processes audio signals in units of frames in the case where LONG block is selected, and in units of eighth frames in the case where SHORT block is selected.
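The frame and block relationship described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name split_into_blocks is hypothetical, and the sketch ignores the overlapping transform windows that an actual AAC encoder applies.

```python
FRAME_LEN = 1024  # samples per frame, as stated in the text
SHORT_LEN = 128   # samples per SHORT block, as stated in the text


def split_into_blocks(frame, use_short):
    """Return the blocks processed for one frame (hypothetical helper).

    A LONG block spans the whole frame; SHORT blocks are eighths of it.
    """
    if len(frame) != FRAME_LEN:
        raise ValueError("a frame must hold exactly 1024 samples")
    if not use_short:
        return [list(frame)]
    return [list(frame[i:i + SHORT_LEN])
            for i in range(0, FRAME_LEN, SHORT_LEN)]


frame = list(range(FRAME_LEN))
long_blocks = split_into_blocks(frame, use_short=False)
short_blocks = split_into_blocks(frame, use_short=True)
print(len(long_blocks), len(short_blocks))
```

With LONG block selected the frame yields one 1024-sample unit; with SHORT block selected it yields eight 128-sample units.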
FIG. 9 shows an overview of a conventional AAC encoder. This AAC encoder 100 is formed from an acoustic analyzer 101, a block length selector 102, and a coder 103.
The acoustic analyzer 101 subjects an input signal to a Fast Fourier Transform (FFT) analysis to obtain an FFT spectrum. Then the acoustic analyzer 101 calculates perceptual entropy from the FFT spectrum and passes it to the block length selector 102. Perceptual entropy is a parameter indicating the number of bits required for quantization.
The block length selector 102 selects SHORT block if the received perceptual entropy exceeds a predetermined threshold (constant), and it selects LONG block if the perceptual entropy does not exceed the threshold.
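The selection rule of the block length selector 102 amounts to a single comparison, sketched below. The function name and the threshold value of 1000.0 are assumptions chosen for illustration; the text states only that the threshold is a predetermined constant.

```python
PE_THRESHOLD = 1000.0  # hypothetical constant; the text gives no value


def select_block_length(perceptual_entropy, threshold=PE_THRESHOLD):
    """Select SHORT when perceptual entropy exceeds the threshold,
    and LONG otherwise (including when it equals the threshold)."""
    return "SHORT" if perceptual_entropy > threshold else "LONG"


stationary_choice = select_block_length(400.0)   # low perceptual entropy
attack_choice = select_block_length(2500.0)      # high perceptual entropy
print(stationary_choice, attack_choice)
```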
In the case where the block length selector 102 has selected LONG block for coding a frame of the input signal, the coder 103 encodes that frame on a LONG block basis. In the case where SHORT block is selected, the coder 103 encodes the frame on a SHORT block basis.
The coding process applies an orthogonal transform to each single frame on a LONG block basis or a SHORT block basis. The resulting orthogonal transform coefficients are then quantized for each frequency band, within a limit of an allocated number of bits, thus producing an output bitstream for transmission.
In the case where the input frame is a stationary signal having little variation in its amplitude and frequency, as in the case of sine waves, it is advantageous to encode the frame as a LONG block (i.e., encode the entire frame as a single unit of data) since such a signal does not require a large data bandwidth. That is, a series of signal sections can be encoded efficiently by processing them as a single section if their amplitude and frequency do not vary much.
Since the number of quantized bits will not be large in stationary sections, a frame carrying such stationary signals has a small perceptual entropy (parameter indicating the number of bits required for quantization) falling below the threshold. The coding process thus decides to encode the frame as a LONG block.
In contrast to the above, there may be a frame carrying a signal with a steep change in its amplitude or frequency. If a frame containing such a signal (referred to hereafter as an “attack sound”) is encoded as a LONG block, the resulting coded sound signal would have an artifact called “pre-echo” and consequent quality degradation.
The following section will discuss the problem of pre-echoes with reference to FIGS. 10 to 12, where the horizontal axis represents time and the vertical axis represents amplitude. FIG. 10 shows a source input signal containing an attack sound. Specifically, this input signal frame f1 contains both an attack sound and stationary signal components.
FIG. 11 illustrates a pre-echo appearing in a decoded sound (frame f1a) in the case where the frame f1 is encoded as a single LONG block. The frame f1 contains both an attack sound and a stationary signal, the components being quite distinct from each other. This frame f1 is encoded as a LONG block and quantized in the frequency domain. As FIG. 11 shows, the resulting signal has a significant quantization noise (appearing as fine distortions) across the entire frame f1, which is derived from the attack sound.
The quantization error appearing before the attack sound can be heard by the user as a grating noise called a pre-echo, which causes degradation of sound quality. The attack sound section is also affected by the quantization error. This is, however, masked by the attack sound itself, hardly causing noticeable problems.
The quantization error further appears as a noise signal after the attack sound section, which is called “post-echo.” The human hearing system, however, does not perceive such short-period noise after a loud sound. For this reason, post-echoes are not perceived as a problem in most cases.
It is pre-echoes that are audible to human ears and that eventually deteriorate the sound quality. The audio coding process thus places importance on how to suppress pre-echoes.
FIG. 12 shows a decoded sound whose source signal has been encoded as SHORT blocks. Pre-echoes are suppressed since the frame f1 has been encoded as SHORT blocks. While block b contains an attack sound, the resulting quantization error is confined within that block b, without affecting any other blocks. This is why the SHORT-block encoding can suppress pre-echoes.
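The confinement of quantization error to a single block can be reproduced with a toy transform coder. The sketch below is not the AAC algorithm: it substitutes a plain, non-overlapping DCT for the transform used by AAC, shrinks the frame to 64 samples with 8-sample SHORT blocks to keep the pure-Python transform fast, and uses a uniform quantizer with an arbitrary step size. All names and constants are assumptions made for illustration.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a list of samples."""
    n = len(x)
    return [math.sqrt((1.0 if k == 0 else 2.0) / n)
            * sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k)
                  for i in range(n))
            for k in range(n)]


def idct(c):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(c)
    return [sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * c[k]
                * math.cos(math.pi / n * (i + 0.5) * k)
                for k in range(n))
            for i in range(n)]


def code_frame(frame, block_len, step):
    """Transform, uniformly quantize, and reconstruct each block."""
    decoded = []
    for start in range(0, len(frame), block_len):
        coeffs = dct(frame[start:start + block_len])
        quantized = [round(c / step) * step for c in coeffs]
        decoded.extend(idct(quantized))
    return decoded


def pre_echo_energy(frame, decoded, attack_start):
    """Energy of the coding error located before the attack."""
    return sum((d - s) ** 2
               for s, d in zip(frame[:attack_start], decoded[:attack_start]))


FRAME_LEN, SHORT_LEN, ATTACK_START, STEP = 64, 8, 56, 0.25  # toy values
# Silence followed by an abrupt attack in the last SHORT block.
frame = [0.0] * ATTACK_START + [1.0] * (FRAME_LEN - ATTACK_START)

long_pre = pre_echo_energy(
    frame, code_frame(frame, FRAME_LEN, STEP), ATTACK_START)
short_pre = pre_echo_energy(
    frame, code_frame(frame, SHORT_LEN, STEP), ATTACK_START)
print(long_pre, short_pre)
```

In the single-block case the error energy before the attack is nonzero, which corresponds to the pre-echo of FIG. 11. In the short-block case the silent blocks are reconstructed exactly, so the error stays inside the block containing the attack.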
The coding process thus decides to encode a frame as SHORT blocks when it contains a steeply changing signal such as an attack sound, thereby suppressing pre-echoes. Specifically, the attack-containing frame exhibits a large perceptual entropy exceeding a threshold since the attack sound produces a larger number of quantized bits when it is encoded. This large perceptual entropy causes the coding process to choose SHORT-block encoding.
As an example of an existing technique, Japanese Patent Application Publication No. 2005-3835 (paragraph Nos. 0028 to 0045, FIG. 1) proposes an audio coding technique to produce a bitstream with suppressed pre-echoes.
Most audio coding devices including AAC encoders have a bit reservoir function to implement pseudo-variable bitrate control to absorb fluctuations in the number of quantized bits.
FIG. 13 shows the concept of how a bit reservoir works. Graph G1 in this figure shows how many bits are used to quantize frames, where the horizontal axis represents a sequence of frames and the vertical axis represents the number of quantized bits consumed by each frame. Graph G2, on the other hand, shows how many bits remain unused in the bit reservoir when each frame is quantized, where the horizontal axis represents a sequence of frames and the vertical axis represents the number of reserve bits.
It is assumed here that the average number of quantized bits is set to 100 bits. The average number of quantized bits is a parameter used to determine the number of available bits, and it is calculated in accordance with transmission bitrates.
The number of bits required to represent a quantized frame may fall below or exceed the average number of quantized bits. In the former case, their difference is accumulated as available bits. In the latter case, the exceeding bits are supplied from the pool of available bits.
As can be seen from the figure, frame #1 is encoded into 100 quantized bits, which is equal to the average number of quantized bits. This leaves no available bits. Frame #2 is, on the other hand, encoded into 80 quantized bits, which is 20 bits fewer than the average number of quantized bits. Accordingly, the available bits amount to 20 (=100−80).
Frame #3 is now encoded into 70 quantized bits. The number of available bits is now 50 (=100−70+20), including those not spent by frame #2.
Frame #4 is then encoded into 120 quantized bits, exceeding the average number of quantized bits by 20. In such a case, the excess 20 bits are withdrawn from the pool of 50 available bits accumulated by the end of frame #3. The number of available bits thus decreases to 30 (=50−20). The subsequent frames are assigned an appropriate number of bits in the same way to absorb the fluctuations, thus achieving a variable bitrate control.
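The bookkeeping in the example of frames #1 to #4 can be checked with a short sketch. The function name track_reservoir is hypothetical, and the model is deliberately minimal: it assumes an initially empty reservoir with no upper size limit, and it treats an overdraw as an error, which stands in for the bit starvation discussed later.

```python
AVERAGE_BITS = 100  # average number of quantized bits, from the example


def track_reservoir(bits_per_frame, average_bits=AVERAGE_BITS):
    """Return the number of available bits after each frame."""
    available = 0
    history = []
    for used in bits_per_frame:
        # A frame below the average deposits the difference;
        # a frame above the average withdraws it.
        available += average_bits - used
        if available < 0:
            raise ValueError("bit starvation: reservoir overdrawn")
        history.append(available)
    return history


history = track_reservoir([100, 80, 70, 120])  # frames #1 to #4
print(history)  # [0, 20, 50, 30]
```

The printed values match the available-bit counts given in the text for frames #1 through #4.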
Suppose now that frames #2 and #3 are encoded as LONG blocks while frame #4 is encoded as SHORT blocks. LONG-block coding tends to leave more available bits since it requires a smaller number of bits for quantization.
SHORT-block coding, on the other hand, requires a larger number of bits for quantization, thus consuming the available bits that have accumulated during the time of LONG-block coding.
Some circumstances may accept low compression ratios and allow the use of many bits for quantization. In such high-bitrate conditions, the encoder can select SHORT block for a frame containing an attack sound or a large variation exhibiting a high perceptual entropy. The SHORT-block coding suppresses pre-echoes, as well as permitting the bit reservoir to raise the average number of quantized bits. This means that the encoder is free from bit starvation in such conditions.
Other circumstances do not allow the use of many bits for quantization and thus require high compression ratios. In such low-bitrate conditions, the bit reservoir has to operate with a smaller average number of quantized bits (i.e., it is not allowed to use many bits). Selecting SHORT-block coding because of a large perceptual entropy would use up the available bits, and the encoder would soon fall into bit starvation. This results in a significant degradation of sound quality.
Quality degradation due to bit starvation is perceived to be more annoying than that caused by pre-echoes. That is, the sound degradation becomes worse in this situation despite the fact that SHORT blocks are selected to suppress pre-echoes in a frame containing a large variation like an attack sound.
Meanwhile, recent years have seen the emergence of a new broadcasting service whose bitrate is as low as 96 kbps to deliver stereo signals with a sampling rate of 48 kHz (at a compression ratio of 1/16 or higher). One example is the terrestrial digital broadcasting for mobile phones, which is known as “one segment broadcasting” service.
Without compression, transmission of 48-kHz sampled stereo signals requires a bandwidth of 1,536 kbps (48,000×16×2) since 48,000 samples of two 16-bit channels have to be transmitted per second. One sixteenth of 1,536 kbps is 96 kbps. Generally the CD-quality audio signals sampled at 44.1 kHz are compressed to about 128 kbps for use with player equipment using the MPEG Audio Layer 3 (MP3) format. The aforementioned terrestrial digital broadcasting for mobile phones requires even lower bitrates, e.g., 96 kbps. The compression ratios required in those applications are so high that the encoder faces difficulties in preventing sound quality degradation.
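The arithmetic behind these figures is reproduced below. The 48-kHz case uses only numbers from the text. The CD comparison assumes 16-bit stereo samples, which is the standard CD format but is not stated explicitly above; the variable names are illustrative.

```python
SAMPLE_BITS = 16
CHANNELS = 2

# 48-kHz stereo, uncompressed: 48,000 samples x 16 bits x 2 channels.
uncompressed_bps = 48_000 * SAMPLE_BITS * CHANNELS
target_bps = uncompressed_bps // 16           # compression ratio of 1/16

# CD-quality audio at 44.1 kHz compressed to about 128 kbps.
cd_uncompressed_bps = 44_100 * SAMPLE_BITS * CHANNELS
cd_reduction_factor = cd_uncompressed_bps / 128_000

print(uncompressed_bps, target_bps)           # 1536000 96000
print(round(cd_reduction_factor, 3))          # 11.025
```

Under these assumptions the 128-kbps case reduces the data by a factor of about 11, whereas the 96-kbps broadcasting case requires a factor of 16.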
Audio signals may include a large transient component (e.g., attack sound) or a continuously varying component. If this is the case, broadcasting and communications services operating in a low-bitrate condition could encounter a sudden exhaustion of usable bits as a result of increased consumption of available bits in a bit reservoir.
Bit starvation during the process of encoding bit-consuming SHORT blocks will greatly reduce the performance of the encoder, thus spoiling the sound quality more than pre-echoes would.
For this reason, the conventional AAC encoders used in digital terrestrial broadcasting or other low-bitrate services produce significant degradation of sound quality in spite of the fact that they select SHORT blocks correctly according to the nature of input signals.
Referring back to the foregoing conventional technique (Japanese Patent Application Publication No. 2005-3835), the encoder determines a perceptual entropy threshold according to the number of available bits under control of a bit reservoir. This perceptual entropy threshold is used to select either LONG block or SHORT block. When too few bits are available, frames containing an attack sound are coded not as SHORT blocks, but as LONG blocks to keep the resulting sound from degrading.
This conventional technique, however, simply switches the choice from SHORT block to LONG block in a starving condition where the sound quality would be worse than the case of pre-echoes. LONG block coding in this case eventually develops pre-echoes and consequent quality degradation. The foregoing technique is not an optimal solution for the problem of sound quality degradation.
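The reservoir-dependent threshold attributed above to the conventional technique might be modeled as follows. This is a rough sketch and not the method of the cited publication: the step-shaped mapping from available bits to threshold, the function names, and the constants 1000.0 and 30 are all assumptions introduced for illustration.

```python
BASE_THRESHOLD = 1000.0   # hypothetical perceptual entropy threshold
MIN_AVAILABLE_BITS = 30   # hypothetical floor below which SHORT is avoided


def adaptive_threshold(available_bits):
    """Raise the threshold out of reach when the reservoir runs low
    (assumed mapping; the source does not specify one)."""
    if available_bits < MIN_AVAILABLE_BITS:
        return float("inf")
    return BASE_THRESHOLD


def select_block_length(perceptual_entropy, available_bits):
    """Select SHORT only when perceptual entropy exceeds the
    reservoir-dependent threshold."""
    threshold = adaptive_threshold(available_bits)
    return "SHORT" if perceptual_entropy > threshold else "LONG"


plenty = select_block_length(2500.0, available_bits=50)
starving = select_block_length(2500.0, available_bits=10)
print(plenty, starving)
```

The second call illustrates the limitation noted above: a frame with high perceptual entropy is forced to LONG block when few bits remain, so the pre-echo returns.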