1. Field of the Invention
The present invention relates to a digital audio coding method, a digital audio coding apparatus and a recording medium. More particularly, the present invention relates to a compression and coding technique of a digital audio signal used for DVD, digital broadcast and the like.
2. Description of the Related Art
As is well known, human psychoacoustic characteristics are utilized in techniques for high quality compression and coding of a digital audio signal. One of these characteristics is that a small sound is masked by a large sound so that the small sound cannot be heard. That is, when a large sound having a certain frequency occurs, a small sound near that frequency is masked so that it cannot be heard. The intensity limit below which a sound is masked and cannot be heard is called a masking threshold.
Irrespective of masking, the sensitivity of the human ear is highest for sound around 4 kHz, and the sensitivity decreases as the frequency moves away from 4 kHz. This characteristic can be represented as a lower limit intensity which the human ear can perceive in a silent situation. This lower limit intensity is called an absolute hearing threshold.
The characteristics will be described more particularly with reference to FIG. 1. Intensity of audio signal is represented by the thick solid line. The masking threshold for the audio signal is represented by the dotted line. The thin solid line represents the absolute hearing threshold. That is, the human ear can perceive a sound only when the intensity is larger than the values represented by the dotted line and the thin solid line. Therefore, if information which is larger than the dotted line and the thin solid line is extracted from information represented by the thick solid line, the human ear perceives the extracted information to be the same as the original audio signal.
When performing coding, this is equivalent to assigning coding bits only to parts indicated by shaded regions in FIG. 1. When assigning coding bits in this example, the whole frequency band of the audio signal is divided into a plurality of small bands so that coding bits are assigned to each divided band. The width of each shaded area corresponds to the divided bandwidth.
In each divided bandwidth, the human ear can not perceive a sound of intensity equal to or smaller than the lower limit of the shaded area. Thus, if the intensity difference between original sound and coded/decoded sound does not exceed this lower limit, the sound can not be heard. In this sense, the intensity of the lower limit is called an allowed distortion level. When an audio signal is compressed by performing quantization, the audio signal can be compressed without loss of quality of the original sound by performing quantization such that quantization distortion level of coded/decoded sound with respect to the original sound becomes equal to or smaller than the allowed distortion level.
Accordingly, assigning coding bits only to the shaded regions shown in FIG. 1 corresponds to performing quantization such that quantization distortion level in each divided band becomes just the allowed distortion level.
Coding methods for an audio signal include MPEG Audio, Dolby Digital and the like. Each of these methods uses the property described above. Among these methods, MPEG-2 Audio AAC (Advanced Audio Coding) standardized in ISO/IEC13818-7 is regarded as being the most efficient for coding.
FIG. 2 shows a basic block diagram of a coding apparatus for AAC. The psychoacoustic model part 1 calculates the allowed distortion level for each divided band of an input audio signal which is divided into frames along time base.
For the input audio signal which is divided into frames, a gain control part 2 performs gain control, a filter bank 3 converts the input audio signal to the frequency domain by MDCT (Modified Discrete Cosine Transform), a TNS 4 performs a temporal noise shaping process, an intensity/coupling stereo part 5 performs intensity/coupling, a prediction part 6 performs a predictive coding process, and an M/S stereo part 7 performs a middle/side stereo process. After that, a part 8 determines normalized coefficients, and a quantization part 9 quantizes the audio signal based on the normalized coefficients. The normalized coefficients correspond to the allowed distortion level shown in FIG. 1 which is determined for each divided band.
After quantization, a noiseless coding part 10 performs a noiseless coding process by providing each of the normalized coefficients and the quantized values with a Huffman code based on a predetermined Huffman code table. Finally, a code bit stream is formed by a multiplexor 11.
According to the MDCT in the filter bank 3, as shown in FIG. 3, DCT is performed in which each transform region overlaps with another transform region by 50% with respect to time axis. Accordingly, occurrence of distortion in boundary parts can be suppressed for each transform region. The number of MDCT coefficients is half of the number of samples of the transform region. According to AAC, a long transform region (long block) including 2048 samples or eight short transform regions including 256 samples in each transform region (short block) is applied for an input audio signal frame. Thus, the number of MDCT coefficients is 1024 for the long block and 128 for the short block. As for the short block, eight blocks are always used successively so that the number of the MDCT coefficients becomes the same as that of the long block.
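The relation between transform region length, overlap and coefficient count can be illustrated with a minimal sketch. The direct MDCT formula below is the textbook definition without the window function, and the function names `mdct` and `block_starts` are hypothetical; this is an illustration of the block arithmetic, not the filter bank of the standard.

```python
import math

def mdct(block):
    """Direct, unwindowed O(N^2) MDCT: N input samples give N/2 coefficients."""
    n = len(block)
    if n % 2:
        raise ValueError("block length must be even")
    half = n // 2
    coeffs = []
    for k in range(half):
        acc = 0.0
        for i, sample in enumerate(block):
            acc += sample * math.cos(
                math.pi / half * (i + 0.5 + half / 2.0) * (k + 0.5)
            )
        coeffs.append(acc)
    return coeffs

def block_starts(total_samples, block_samples):
    """Start offsets of transform regions that overlap their neighbours by 50%."""
    hop = block_samples // 2
    return list(range(0, total_samples - block_samples + 1, hop))

LONG_BLOCK = 2048   # samples in one long transform region
SHORT_BLOCK = 256   # samples in one short transform region

long_coeffs = LONG_BLOCK // 2          # coefficients for one long block
short_coeffs = 8 * (SHORT_BLOCK // 2)  # eight successive short blocks
```

Both `long_coeffs` and `short_coeffs` evaluate to 1024, which is why a frame carries the same number of MDCT coefficients whichever block type is chosen.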
Generally, as shown in FIG. 4, the long block is used for a steady-state part where variation of a signal waveform is small. As shown in FIG. 5, the short block is used for an attack part where variation of a signal waveform is large.
It is important to use the long block or the short block appropriately. When the long block is used for a signal like that shown in FIG. 5, noise which is called pre-echo occurs before attack. In addition, when the short block is used for a part shown in FIG. 4, bit assignment is not properly performed due to lack of resolution in the frequency domain so that coding efficiency decreases and noise also occurs.
As mentioned above, it is important to calculate the allowed distortion level for each divided band and to select the long block or the short block properly. The psychoacoustic model part 1 shown in FIG. 2 performs these processes. ISO/IEC13818-7 shows examples of a method of calculating the allowed distortion level for each divided band and a method of determining whether the long block or the short block is used for each current frame. In the following, an outline of these methods will be described. For details of these processes, refer to B.2.1.4 (p.93) of ISO/IEC13818-7.
Step 1) Reconstruction of Audio Signal
For the long block, 1024 samples (128 samples for the short block) are newly read and a signal series of 2048 samples (256 samples) is reconstructed by concatenating the newly read samples and samples already read from a previous frame.
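As a minimal sketch of step 1, the reconstruction can be written as a concatenation of the previously read samples and the newly read samples. The function name `reconstruct_block` and the choice of plain Python lists are assumptions made for illustration.

```python
def reconstruct_block(previous_samples, new_samples):
    """Concatenate the samples kept from the previous frame with the new ones.

    For the long block both inputs hold 1024 samples (result: 2048);
    for the short block both hold 128 samples (result: 256).
    """
    if len(previous_samples) != len(new_samples):
        raise ValueError("previous and new sample runs must have equal length")
    return list(previous_samples) + list(new_samples)

long_series = reconstruct_block([0.0] * 1024, [1.0] * 1024)
short_series = reconstruct_block([0.0] * 128, [1.0] * 128)
```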
Step 2) Windowing by a Hann Window and FFT
The audio signal of 2048 samples (256 samples) reconstructed in step 1 is windowed by a Hann window and FFT (Fast Fourier Transform) is calculated so that 1024 (128) FFT coefficients are calculated.
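Step 2 can be sketched as follows. The Hann window definition used here is one common textbook form and may differ in detail from the window specified in the standard; the recursive radix-2 FFT and the names `hann_window`, `fft` and `windowed_spectrum` are illustrative assumptions. Only the first half of the spectrum is kept, giving 1024 coefficients for 2048 samples and 128 for 256 samples.

```python
import cmath
import math

def hann_window(n_samples):
    """One common definition of the Hann window (an assumption here)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_samples)
            for n in range(n_samples)]

def fft(values):
    """Recursive radix-2 FFT; the input length must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def windowed_spectrum(block):
    """Window the block and return the first N/2 FFT coefficients."""
    n = len(block)
    window = hann_window(n)
    spectrum = fft([s * w for s, w in zip(block, window)])
    return spectrum[: n // 2]
```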
Step 3) Calculation of Predicted Values of FFT Coefficients
Real parts and imaginary parts of FFT coefficients of a current frame are predicted from real parts and imaginary parts of FFT coefficients of previous two frames so that 1024 (128) predicted values are calculated for each of the real part and imaginary part.
Step 4) Calculation of an Unpredictability Measure
The unpredictability measure is calculated from the real part and the imaginary part of each FFT coefficient calculated in step 2 and the predicted values of the real part and the imaginary part of each FFT coefficient calculated in step 3. The unpredictability measure takes a value from 0 to 1. The nearer to 0 the unpredictability measure is, the nearer to a simple tone the audio signal is. Conversely, the nearer to 1 the unpredictability measure is, the nearer to noise the audio signal is.
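Steps 3 and 4 can be sketched under stated assumptions. The text above does not give the prediction rule, so the sketch assumes a linear extrapolation of magnitude and phase from the previous two frames, converted back to real and imaginary parts, and a distance normalized so that the result lies between 0 and 1. The function name `unpredictability` is hypothetical.

```python
import cmath

def unpredictability(current, prev1, prev2):
    """Unpredictability of one FFT coefficient, in the range 0 to 1.

    current, prev1, prev2: complex FFT coefficients of the current frame,
    the previous frame and the frame before that.
    Assumption: magnitude and phase are extrapolated linearly.
    """
    r_pred = 2.0 * abs(prev1) - abs(prev2)
    phi_pred = 2.0 * cmath.phase(prev1) - cmath.phase(prev2)
    predicted = cmath.rect(r_pred, phi_pred)  # predicted real and imaginary parts
    denom = abs(current) + abs(r_pred)
    if denom == 0.0:
        return 0.0  # silence: treated as perfectly predictable (assumption)
    return abs(current - predicted) / denom
```

A steady tone whose phase advances by a constant amount per frame is predicted exactly and yields a value near 0, while a coefficient that lands opposite to its prediction yields a value near 1.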
Step 5) Calculation of Intensity and Unpredictability of the Audio Signal for Each Divided Band
The divided band here corresponds to that shown in FIG. 1. The intensity of the audio signal is calculated for each divided band based on each FFT coefficient calculated in step 2. In addition, the unpredictability calculated in step 4 is weighted by the intensity so that weighted unpredictability is calculated for each divided band.
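Step 5 can be sketched as below, assuming that the intensity of a band is the sum of the squared FFT magnitudes in the band and that the weighting multiplies each unpredictability value by the squared magnitude of its coefficient. The band edge list and the function name are hypothetical.

```python
def band_energy_and_unpredictability(magnitudes, unpredictability, band_edges):
    """Return (e, c) per divided band.

    magnitudes: FFT magnitudes per coefficient (from step 2)
    unpredictability: unpredictability per coefficient (from step 4)
    band_edges: coefficient indices [k0, k1, ..., kB]; band b covers
                coefficients k_b up to but not including k_(b+1)
    """
    energies = []
    weighted = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        e = sum(magnitudes[k] ** 2 for k in range(lo, hi))
        c = sum(magnitudes[k] ** 2 * unpredictability[k] for k in range(lo, hi))
        energies.append(e)
        weighted.append(c)
    return energies, weighted

e, c = band_energy_and_unpredictability(
    [1.0, 2.0, 3.0, 4.0], [0.0, 0.5, 1.0, 1.0], [0, 2, 4]
)
```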
Step 6) Convolution of the Intensity and the Unpredictability with a Spreading Function
For each divided band, the influence of the other divided bands on the audio signal intensity and on the unpredictability is calculated by the spreading function, and each of the audio signal intensity and the unpredictability is convoluted and normalized accordingly.
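A minimal sketch of step 6 follows. The actual spreading function is defined in the standard and is not reproduced here; `toy_spreading` is a hypothetical stand-in, and the normalization shown (dividing the convoluted unpredictability by the convoluted intensity, and the convoluted intensity by the sum of the spreading weights) is an assumption about how the normalization is carried out.

```python
def spread(energies, weighted_unpred, spreading):
    """Convolute per-band intensity and weighted unpredictability.

    spreading(src, dst) gives the contribution weight of band src to band dst.
    Returns (conv_energy, conv_unpred), both normalized.
    """
    n = len(energies)
    conv_energy = []
    conv_unpred = []
    for dst in range(n):
        weights = [spreading(src, dst) for src in range(n)]
        ecb = sum(w * e for w, e in zip(weights, energies))
        ct = sum(w * c for w, c in zip(weights, weighted_unpred))
        norm = sum(weights)
        conv_unpred.append(ct / ecb if ecb > 0.0 else 0.0)
        conv_energy.append(ecb / norm if norm > 0.0 else 0.0)
    return conv_energy, conv_unpred

def toy_spreading(src, dst):
    """Hypothetical spreading function: halves with each band of distance."""
    return 0.5 ** abs(src - dst)

en, cb = spread([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], toy_spreading)
```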
Step 7) Calculation of Tonality Index
In each divided band b, the tonality index (tb(b)) is calculated by the following equation (1) based on the convoluted unpredictability (cb(b)) calculated in step 6.
$$tb(b) = -0.299 - 0.43 \log_e(cb(b)) \qquad (1)$$
In addition, the tonality index is limited to a range from 0 to 1. The nearer to 1 the tonality index is, the nearer to a simple tone the audio signal is. In addition, the nearer to 0 the tonality index is, the nearer to noise the audio signal is.
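Equation (1) together with the limitation to the range from 0 to 1 can be sketched as follows. The handling of a non-positive cb(b), where the logarithm is undefined, is an assumption added for robustness.

```python
import math

def tonality_index(cb):
    """Tonality index tb(b) from the convoluted unpredictability cb(b)."""
    if cb <= 0.0:
        # log_e is undefined here; the limit of equation (1) is clamped to 1
        # (assumption, not stated in the text).
        return 1.0
    tb = -0.299 - 0.43 * math.log(cb)
    return min(1.0, max(0.0, tb))
```

For example, cb(b) = 1 (noise-like) gives -0.299 before limiting and therefore 0 after limiting, whereas a small cb(b) such as 0.01 exceeds 1 before limiting and is therefore set to 1.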
Step 8) Calculation of SNR
In each divided band, SNR is calculated based on the tonality index calculated in step 7. In the calculation, a property that masking effect of noise component is larger than that of simple tone component is utilized.
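One way to express step 8 is a linear interpolation between a value for tone-like maskers and a value for noise-like maskers. The two constants below (18 dB for a tone masking noise and 6 dB for noise masking a tone) are assumptions used for illustration; the smaller value for the noise case reflects the larger masking effect of noise components.

```python
TONE_MASKING_NOISE_DB = 18.0  # assumed value for a tone-like masker
NOISE_MASKING_TONE_DB = 6.0   # assumed value for a noise-like masker

def required_snr_db(tb):
    """SNR for one divided band, interpolated by the tonality index tb (0 to 1)."""
    return tb * TONE_MASKING_NOISE_DB + (1.0 - tb) * NOISE_MASKING_TONE_DB
```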
Step 9) Calculation of Intensity Ratio
In each divided band, the ratio between the convoluted audio signal intensity and the masking threshold is calculated based on the SNR calculated in step 8.
Step 10) Calculation of Masking Threshold
In each divided band, the masking threshold is calculated based on the convoluted audio signal intensity calculated in step 6 and the ratio between the audio signal intensity and the masking threshold calculated in step 9.
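Steps 9 and 10 can be sketched together, assuming that the SNR is expressed in decibels on a power scale so that the ratio is 10 raised to the power of -SNR/10. The function names are hypothetical.

```python
def intensity_ratio(snr_db):
    """Step 9: ratio of masking threshold to signal intensity (power scale)."""
    return 10.0 ** (-snr_db / 10.0)

def masking_threshold(conv_energy, snr_db):
    """Step 10: masking threshold from the convoluted intensity and the ratio."""
    return conv_energy * intensity_ratio(snr_db)
```

With these assumptions, a convoluted intensity of 1000 and an SNR of 10 dB give a masking threshold of about 100.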
Step 11) Pre-echo Control and Consideration of Absolute Hearing Threshold
In each divided band, pre-echo control is performed on the masking threshold calculated in step 10 by using the allowed distortion level of a previous block. In addition, a larger value between the controlled value and the absolute hearing threshold is set to be the allowed distortion level of the current frame.
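Step 11 can be sketched as below. The text does not state how the allowed distortion level of the previous block enters the control, so the sketch assumes that the masking threshold is limited to a multiple of the previous value; the parameter `pre_echo_factor` and its default of 2.0 are hypothetical.

```python
def allowed_distortion_level(masking_thr, previous_allowed, absolute_thr,
                             pre_echo_factor=2.0):
    """Allowed distortion level of one divided band for the current frame.

    Pre-echo control (assumed form): the masking threshold may not exceed
    pre_echo_factor times the allowed distortion level of the previous block.
    The larger of the controlled value and the absolute hearing threshold
    is then returned.
    """
    controlled = min(masking_thr, pre_echo_factor * previous_allowed)
    return max(controlled, absolute_thr)
```

The final `max` is the point at which the absolute hearing threshold enters the calculation: whenever the controlled masking threshold falls below it, the absolute hearing threshold itself becomes the allowed distortion level.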
Step 12) Calculation of Perceptual Entropy (PE)
For each of the long block and the short block, the perceptual entropy which is defined by the following equation (2) is calculated,
$$PE = -\sum_{b} w(b) \cdot \log_{10}\left(\frac{nb(b)}{e(b) + 1}\right) \qquad (2)$$
wherein w(b) is the width of the divided band b, nb(b) is the allowed distortion level in the divided band b calculated in step 11, and e(b) is the audio signal intensity of the divided band b calculated in step 5. PE corresponds to the total area of the bit assigned regions (diagonally shaded regions) shown in FIG. 1.
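Equation (2) can be evaluated with a short sketch. The function name `perceptual_entropy` and the use of plain lists for the per-band values are assumptions.

```python
import math

def perceptual_entropy(widths, allowed, energies):
    """PE = -sum over b of w(b) * log10(nb(b) / (e(b) + 1)).

    widths: w(b), width of each divided band
    allowed: nb(b), allowed distortion level of each band (must be positive)
    energies: e(b), audio signal intensity of each band
    """
    total = 0.0
    for w, nb, e in zip(widths, allowed, energies):
        total += w * math.log10(nb / (e + 1.0))
    return -total
```

For a single band of width 4 with nb(b) = 1 and e(b) = 99, the ratio is 1/100, so the band contributes 4 x 2 = 8 to PE. A band whose intensity lies far above its allowed distortion level thus contributes a large value, in line with the area interpretation of FIG. 1.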
Step 13) Determining Whether the Long Block or the Short Block is Used
When the PE for the long block calculated in step 12 is larger than a predetermined constant (switch_pe), the current frame is judged to be the short block. When the PE is smaller than the constant, the current frame is judged to be the long block. The predetermined constant (switch_pe) is a value which is determined according to an application.
The above-mentioned methods are methods of calculation of the allowed distortion level and determining long block or short block described in the ISO/IEC13818-7.
In the above-mentioned determining method, the absolute hearing threshold is used in step 11 in which, in each divided band, the larger of the pre-echo controlled masking threshold and the absolute hearing threshold is set as the allowed distortion level of the divided band. Then, in a divided band where the intensity of the original sound is smaller than the absolute hearing threshold, it is regarded that the original sound cannot be heard, so that no coding bits or only a few coding bits are assigned in the band.
In principle, the absolute hearing threshold should be constant, that is, it should not vary according to input sound. In the ISO/IEC13818-7, it is recommended that a predetermined table value is used as the absolute hearing threshold.
However, when the allowed distortion level is obtained according to the above-mentioned processes by using a fixed absolute hearing threshold, and bit assignment and coding are performed based on the resulting allowed distortion level, there are cases where satisfactory sound quality cannot be obtained. For example, for a female vocal song which has the frequency distribution of FIG. 6, good sound quality can be obtained with the absolute hearing threshold shown in FIG. 6. However, when this absolute hearing threshold is applied to an orchestra sound as shown in FIG. 7, grating noise is heard. The reason is that, although sound near 10 kHz-15 kHz is important for the orchestra sound, when the absolute hearing threshold shown in FIG. 7 is used, it is judged that sound near 10 kHz-15 kHz is lower than the absolute hearing threshold so that adequate bits are not assigned. When the absolute hearing threshold is lowered as a whole as shown in FIG. 8, the sound quality improves since the sound near 10 kHz-15 kHz becomes larger than the absolute hearing threshold so that adequate bits are assigned.
However, when the absolute hearing threshold of FIG. 8 is applied to the female voice vocal sound of FIG. 6 as shown in FIG. 9, the sound quality deteriorates. The reason is that, although sound of frequencies smaller than 10 kHz is important for the female voice vocal sound, bits are also assigned to sound near 12 kHz-15 kHz so that the number of bits which are assigned to frequencies under 10 kHz becomes relatively small.
Thus, according to the conventional method where the absolute hearing threshold is fixed, there is a problem in that adequately good sound quality is not necessarily obtained.
In addition, several methods of coding audio signals by using masking effect based on the psychoacoustic model are proposed, for example, in Japanese laid-open patent applications No.5-248972, No.7-46137 and No.9-101799. However, setting methods of the absolute hearing threshold are not proposed in any publication.
It is an object of the present invention to provide a digital audio coding apparatus, a digital audio coding method and a recording medium for improving sound quality by varying the absolute hearing threshold according to input audio data.
The above object of the present invention is achieved by a digital audio coding apparatus comprising:
a part which converts a frame of digital audio data into a frequency domain;
a part which divides the digital audio data into a plurality of bands;
a part which calculates an allowed distortion level by using an absolute hearing threshold for each divided band and assigns coding bits;
a change part which changes the absolute hearing threshold adaptively on the basis of intensity distribution of the digital audio data in the frequency domain.
The above object of the present invention is also achieved by a digital audio coding apparatus comprising:
a part which divides input digital audio data into frames along a time axis;
a part which performs processes including sub-band division and conversion into a frequency domain on each frame;
a part which divides the digital audio data into a plurality of bands and assigns coding bits to each band;
a part which obtains normalized coefficients according to the number of coding bits and encodes the digital audio data by quantizing with the normalized coefficients;
a change part which changes an absolute hearing threshold adaptively on the basis of intensity distribution of the digital audio data in the frequency domain; and
a part which calculates an allowed distortion level for each band by using the absolute hearing threshold and assigns the coding bits by using the allowed distortion level.
According to the above-mentioned invention, since the absolute hearing threshold is changed adaptively, the problems of the conventional technique can be solved so that sound quality is improved.
In the above-mentioned digital audio coding apparatus, the change part may change the absolute hearing threshold on the basis of logarithmic values of intensity of the digital audio data for each frame in the frequency domain.
Accordingly, the absolute hearing threshold can be properly changed.
In the above-mentioned digital audio coding apparatus, a straight line may be placed on a graph representing logarithmic values of intensity of the digital audio data in the frequency domain and the absolute hearing threshold may be set according to an area of a part between a curve representing the logarithmic values of intensity and the straight line.
In the above-mentioned digital audio coding apparatus, the change part may set the absolute hearing threshold to be high when the area of the part between the curve representing the logarithmic values of intensity and the straight line is larger than a predetermined value, and set the absolute hearing threshold to be low when the area is smaller than the predetermined value.
According to the above-mentioned invention, the absolute hearing threshold can be set properly according to input audio data so that sound quality is improved.
In the above-mentioned digital audio coding apparatus, an inclination of the straight line and a frequency range over which the area is calculated may be predetermined, and an initial point of the straight line may be set according to input digital audio data.
Accordingly, the absolute hearing threshold can be set easily.
In the above-mentioned digital audio coding apparatus, a maximum value among initial several points in the curve on a low frequency side in a frequency range over which the area is calculated may be set to be a value of the straight line for the lowest frequency in the frequency range.
According to the above-mentioned invention, the straight line can be placed properly.
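The straight line placement and the area-based setting described above can be sketched under several assumptions. The text does not state which part of the region between the curve and the straight line is counted, so the sketch counts only the part where the curve lies below the line, with unit spacing between frequency points. The number of initial points, the inclination, and all names are hypothetical.

```python
def area_between(log_intensity, start, stop, slope, initial_points=3):
    """Area between the log-intensity curve and a straight line.

    The line starts, at index `start`, at the maximum of the first
    `initial_points` curve values and proceeds with the predetermined
    inclination `slope` per frequency point.
    Assumption: only parts where the curve is below the line are counted.
    """
    anchor = max(log_intensity[start:start + initial_points])
    area = 0.0
    for i in range(start, stop):
        line_value = anchor + slope * (i - start)
        area += max(0.0, line_value - log_intensity[i])
    return area

def choose_threshold(area, area_limit, high_threshold, low_threshold):
    """High threshold when the area exceeds the predetermined value, else low."""
    return high_threshold if area > area_limit else low_threshold

steep = area_between([10.0, 9.0, 8.0, 2.0, 1.0, 0.0], 0, 6, -1.0)
flat = area_between([10.0, 9.0, 8.0, 7.0, 6.0, 5.0], 0, 6, -1.0)
```

Under these assumptions, a spectrum that falls off steeply at high frequencies produces a large area and therefore a high absolute hearing threshold, while a spectrum that follows the line produces a small area and a low threshold.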
In the above-mentioned digital audio coding apparatus, the change part may divide the frame into a plurality of small blocks and calculate the area for each of the small blocks.
In the above-mentioned digital audio coding apparatus, the change part may calculate a sum of areas of the small blocks, and set the absolute hearing threshold to be high when the sum is larger than a predetermined value, and set the absolute hearing threshold to be low when the sum is smaller than the predetermined value.
The above object of the present invention is also achieved by a digital audio coding apparatus comprising:
a part which divides digital audio data into frames;
a part which converts each frame of the digital audio data to a frequency domain by using a long transform block or a plurality of short transform blocks;
a part which divides the frame of the digital audio data in the frequency domain into a plurality of bands;
a part which calculates an allowed distortion level by using an absolute hearing threshold for each divided band and assigns coding bits; wherein:
when the long transform block is used for conversion,
the frame is divided into a plurality of small blocks and each of the small blocks is converted to the frequency domain;
for each of the small blocks, a straight line is placed on a graph representing logarithmic values of intensity of the digital audio data in the frequency domain and an area of a part between a curve representing the logarithmic values of intensity and the straight line is calculated;
a sum of the areas of the small blocks is calculated, and the absolute hearing threshold is set to be high when the sum is larger than a predetermined value, and the absolute hearing threshold is set to be low when the sum is smaller than the predetermined value; and
when the short transform blocks are used for conversion, a predetermined fixed absolute hearing threshold is used.
According to the above-mentioned invention, the absolute hearing threshold is changed adaptively so that sound quality is improved when the digital audio coding apparatus which converts audio data by using a long transform block or a plurality of short transform blocks is used.
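The selection between the adaptive and the fixed absolute hearing threshold can be sketched as follows, assuming the per-block areas have already been calculated. The function and parameter names are hypothetical, and the case where the sum equals the predetermined value, which the text leaves open, is treated here as the low case.

```python
def select_absolute_threshold(uses_long_block, small_block_areas,
                              area_sum_limit, high, low, fixed):
    """Select the absolute hearing threshold for one frame.

    uses_long_block: True when the long transform block is used
    small_block_areas: area calculated for each small block of the frame
    area_sum_limit: the predetermined value compared with the sum of areas
    high, low: the two adaptive threshold settings
    fixed: the predetermined fixed threshold used with short transform blocks
    """
    if not uses_long_block:
        return fixed
    total_area = sum(small_block_areas)
    return high if total_area > area_sum_limit else low
```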