1. Field of the Invention
The present invention generally relates to a digital-audio-signal coding device, a digital-audio-signal coding method and a medium in which a digital-audio-signal coding program is stored, and, in particular, to compressing/coding of a digital audio signal used for a DVD, digital broadcast and so forth.
2. Description of the Related Art
In the related art, a human psychoacoustic characteristic is used in high-quality compression/coding of a digital audio signal. This characteristic is such that a small sound is rendered inaudible as a result of being masked by a large sound. That is, when a large sound develops at a certain frequency, small sounds at nearby frequencies become inaudible to the human ear as a result of being masked. The limit of the sound pressure level below which any signal is inaudible due to masking is called a masking threshold. Further, regardless of masking, the human ear is most sensitive to sounds having frequencies in the vicinity of 4 kHz, and the sensitivity decreases as the frequency of the sound moves further away from 4 kHz. This feature is expressed by the limit of the sound pressure level at which a sound is audible in an otherwise quiet environment, and this limit is called an absolute hearing threshold.
Such matters will now be described with reference to FIG. 1, which shows an intensity distribution of an audio signal. The thick solid line (A) represents the intensity distribution of the audio signal. The broken line (B) represents the masking threshold for the audio signal. The thin solid line (C) represents the absolute hearing threshold. As shown in the figure, only the sounds having sound pressure levels higher than both the respective masking thresholds for the audio signal and the absolute hearing threshold are audible to the human ear. Accordingly, even when only the information from the portions in which the sound pressure levels exceed both of these thresholds is extracted from the intensity distribution of the audio signal, the thus-obtained signal can be sensed as being acoustically the same as the original audio signal.
This is equivalent to allocation of coding bits only to the hatched portions in FIG. 1 in coding of the audio signal. This bit allocation is performed in units of scalefactor bands (D) which are obtained as a result of the entire band of the audio signal being divided. The lateral width of each hatched portion corresponds to the respective scalefactor-band width.
In each scalefactor band, the sounds having intensities lower than the lower limit of the respective hatched portion are inaudible to the human ear. Accordingly, as long as the error in intensity between the original signal and the coded and decoded signal does not exceed this lower limit, the difference therebetween cannot be sensed by the human ear. In this sense, the lower limit of the sound pressure level for each scalefactor band is called an allowable distortion level. When quantizing and compressing an audio signal, it is possible to compress the audio signal without degrading the sound quality of the original sound by performing quantization in such a way that the quantization-error intensity of the coded and decoded sound with respect to the original sound does not exceed the allowable distortion level for each scalefactor band. Therefore, allocating coding bits only to the hatched portions is equivalent to quantizing the original audio signal in such a manner that the quantization-error intensity in each scalefactor band is just equal to the allowable distortion level.
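By way of illustration only, the per-band condition described above can be sketched as a simple check; the function names and the (start, end) band layout here are hypothetical and do not come from any standard:

```python
def band_quantization_errors(original, decoded, band_edges):
    # Quantization-error intensity (sum of squared errors) per scalefactor band;
    # band_edges is a hypothetical list of (start, end) coefficient indices.
    return [sum((o - d) ** 2 for o, d in zip(original[lo:hi], decoded[lo:hi]))
            for lo, hi in band_edges]

def within_allowable_distortion(original, decoded, band_edges, allowed):
    # True when every band's quantization-error intensity stays at or below
    # that band's allowable distortion level, i.e. the error is inaudible.
    errors = band_quantization_errors(original, decoded, band_edges)
    return all(e <= a for e, a in zip(errors, allowed))
```

A coder satisfying this check for every scalefactor band introduces no audible quantization noise, in the sense described above.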
As such methods of coding an audio signal, MPEG (Moving Picture Experts Group) Audio, Dolby Digital and so forth are known. In each of these methods, the feature described above is used. Among them, the method of MPEG-2 Audio AAC (Advanced Audio Coding) standardized in ISO/IEC 13818-7: 1997(E), 'Information technology - Generic coding of moving pictures and associated audio information - Part 7: Advanced Audio Coding (AAC)' (simply referred to as ISO/IEC 13818-7, hereinafter) is presently said to have the highest coding efficiency. The entire contents of ISO/IEC 13818-7 are hereby incorporated by reference.
FIG. 2 is a block diagram showing a basic arrangement of an AAC (Advanced Audio Coding) encoder. An audio signal input to the AAC encoder is a sequence of blocks of samples which are produced along the time axis such that adjacent blocks overlap with one another. (The rate at which the sound samples constituting the digital audio signal are taken is called the 'sampling frequency of the digital audio signal'.) Each block of the audio signal is transformed into a number of spectral scalefactor-band components via a filter bank 73. A psychoacoustic model 71 calculates an allowable distortion level for each scalefactor-band component of the audio signal. A gain control 72 and the filter bank 73 map the blocks of the audio signal into the frequency domain through MDCT (Modified Discrete Cosine Transform). A TNS (Temporal Noise Shaping) 74 and a predictor 76 perform predictive coding. An intensity/coupling 75 and an MS stereo (Middle Side Stereo) (abbreviated as M/S, hereinafter) 77 perform stereophonic correlation coding. Then, scalefactors are determined by a scalefactor module 78, and a quantizer 79 quantizes the audio signal based on the scalefactors. The scalefactors correspond to the allowable distortion level shown in FIG. 1, and are determined for the respective scalefactor bands. After the quantization, based on a predetermined Huffman-code table, a noiseless coding module 80 provides Huffman codes for the scalefactors and for the quantized values, and performs noiseless coding. Finally, a multiplexer 81 forms a code bitstream.
MDCT performed by the filter bank 73 is such that DCT is performed on the audio signal in such a way that adjacent transformation ranges overlap by 50% along the time axis, as shown in FIG. 3. Thereby, distortion developing at a boundary portion between adjacent transformation ranges can be suppressed. Further, the number of MDCT coefficients generated is half the number of samples included in the transformation range. In AAC, either a long transformation range (defined by a long window) or short transformation ranges (each defined by a short window) is/are used for mapping the audio signal into the frequency domain. The portion of each block of the input audio signal defined by the long window is called a long block, and the portion defined by the short window is called a short block, wherein the long block includes 2048 samples and the short block includes 256 samples. Hereinafter, defining long blocks from an audio signal, each having a first predetermined number of samples (2048 samples in the above-mentioned example, as shown in FIG. 4), with the long window, and performing MDCT on the thus-defined long blocks to map the audio signal into the frequency domain, will be referred to as 'using the long block type'. Similarly, defining short blocks from an audio signal, each having a second predetermined number of samples smaller than the first predetermined number (256 samples in the above-mentioned example, as shown in FIG. 5), with the short window, and performing MDCT on the thus-defined short blocks to map the audio signal into the frequency domain, will be referred to as 'using the short block type'. The number of MDCT coefficients generated from the long block is 1024, and the number of MDCT coefficients generated from each short block is 128.
When the short block type is used, 8 short blocks are defined successively at any time (as shown in FIG. 5). Thereby, the number of MDCT coefficients generated is the same when using the short block type and using the long block type.
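The sample and coefficient counts mentioned above can be verified with a short sketch (Python is used here purely for illustration):

```python
# Block lengths described above: a long block of 2048 samples, or
# 8 successive short blocks of 256 samples each.
LONG_BLOCK_SAMPLES = 2048
SHORT_BLOCK_SAMPLES = 256
SHORT_BLOCKS_PER_FRAME = 8

# MDCT yields half as many coefficients as samples in the transformation range.
long_coeffs = LONG_BLOCK_SAMPLES // 2                                     # 1024
short_coeffs_total = SHORT_BLOCKS_PER_FRAME * (SHORT_BLOCK_SAMPLES // 2)  # 8 x 128

# Either block type therefore produces the same number of MDCT coefficients.
assert long_coeffs == short_coeffs_total == 1024
```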
Generally, for a steady portion in which variation in the signal waveform is small, as shown in FIG. 4, the long block type is used. For an attack portion in which variation in the signal waveform is abrupt, as shown in FIG. 5, the short block type is used. Which of the two is used is important. When the long block type is used for a signal such as that shown in FIG. 5, noise called pre-echo develops preceding an attack portion. When the short block type is used for a signal such as that shown in FIG. 4, suitable bit allocation is not performed due to the lack of resolution in the frequency domain, the coding efficiency decreases, and noise develops as well. Such drawbacks are especially remarkable for a low-frequency sound.
When the short block type is used, grouping is performed. The grouping is to divide the above-mentioned 8 successive short blocks into groups, each group including one or a plurality of successive blocks for which the scalefactor is the same. By treating a plurality of blocks sharing a common scalefactor as members of one group, it is possible to improve the reduction in the amount of information. Specifically, when the Huffman codes are allocated to the scalefactors in the noiseless coding module 80 shown in FIG. 2, allocation is performed not in short-block units but in group units. FIG. 6 shows an example of grouping. In the case of FIG. 6, the number of groups is 3, the 0-th group includes 5 blocks, the 1st group includes 1 block, and the 2nd group includes 2 blocks. When grouping is not performed appropriately, an increase in the number of codes and/or degradation of the sound quality occurs. When the number of groups is too large with respect to the number of blocks, scalefactors which otherwise could be coded in common are coded repeatedly, and, thereby, the coding efficiency decreases. When the number of groups is too small with respect to the number of blocks, common scalefactors are used even when the variation of the audio signal is abrupt. As a result, the sound quality is degraded. In ISO/IEC 13818-7, with regard to grouping, although rules for the syntax of codes are included, no specific standards/methods for grouping are included.
As described above, when coding is performed, the long block type and short block type are appropriately used for an input audio signal. Deciding whether the long or short block type is used is performed by the psychoacoustic model 71 in FIG. 2. ISO/IEC 13818-7 includes an example of a method for making a decision as to whether the long or short block type is used for each target block. This deciding processing will now be described in general.
Step 1: Reconstruction of an Audio Signal
1024 samples for a long block (128 samples for a short block) are newly read, and, together with the 1024 samples (128 samples) already read for the preceding block, a sequence of 2048 samples (256 samples) is reconstructed.
Step 2: Windowing by Hann Window and FFT
The 2048 samples (256 samples) of the audio signal reconstructed in the step 1 are windowed by a Hann window, FFT (Fast Fourier Transform) is performed on the signal, and 1024 (128) FFT coefficients are calculated.
Step 3: Calculation of Predicted Values for FFT Coefficient
From the real parts and imaginary parts of the FFT coefficients for the preceding two blocks, the real parts and imaginary parts of the FFT coefficients for the target block are predicted, and 1024 (128) predicted values are calculated for each of them.
Step 4: Calculation of Unpredictability
From the real parts and imaginary parts of the FFT coefficients calculated in the step 2 and the predicted values for the real parts and imaginary parts of the FFT coefficients calculated in the step 3, unpredictability is calculated for each of them. Unpredictability has a value in the range of 0 to 1. When unpredictability is close to 0, this indicates that the tonality of the signal is high. When unpredictability is close to 1, this indicates that the tonality of the signal is low.
Step 5: Calculation of the Intensity of the Audio Signal and Unpredictability for Each Scalefactor Band
The scalefactor bands correspond to those shown in FIG. 1. For each scalefactor band, the intensity of the audio signal is calculated based on the respective FFT coefficients calculated in the step 2. Then, the unpredictability calculated in the step 4 is weighted with the intensity, and the unpredictability is calculated for each scalefactor band.
Step 6: Convolution of the Intensity and Unpredictability with Spreading Function
For each scalefactor band, the influences of the intensities and unpredictabilities in the other scalefactor bands are obtained using the spreading function; the intensity and the unpredictability are thereby convolved with the spreading function and normalized, respectively.
Step 7: Calculation of Tonality Index
For each scalefactor band b, based on the convolved unpredictability cb(b) calculated in the step 6, the tonality index tb(b) = -0.299 - 0.43 loge(cb(b)) is calculated. Further, the tonality index is limited to the range of 0 to 1. The tonality index indicates the degree of tonality of the audio signal. When the index is close to 1, this means that the tonality of the audio signal is high. When the index is close to 0, this means that the tonality of the audio signal is low.
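The expression in the step 7 can be sketched as follows (the clamping to the range 0 to 1 is taken from the surrounding text):

```python
import math

def tonality_index(cb):
    # tb(b) = -0.299 - 0.43 * loge(cb(b)), limited to the range 0 to 1.
    tb = -0.299 - 0.43 * math.log(cb)
    return min(1.0, max(0.0, tb))
```

With this expression, an unpredictability near 0 yields an index near 1 (high tonality), and an unpredictability of 1 yields an index clamped to 0 (low tonality), consistent with the descriptions in the steps 4 and 7.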
Step 8: Calculation of S/N Ratio
For each scalefactor band, based on the tonality index calculated in the step 7, an S/N ratio is calculated. Here, a property that the masking effect is larger for low-tonality signal components than for high-tonality signal components is used.
Step 9: Calculation of Intensity Ratio
For each scalefactor band, based on the S/N ratio calculated in the step 8, the ratio between the convolved audio signal intensity and masking threshold is calculated.
Step 10: Calculation of Allowable Distortion Level
For each scalefactor band, based on the audio signal intensity calculated in the step 6, and the ratio between the audio signal intensity and masking threshold calculated in the step 9, the masking threshold is calculated.
Step 11: Consideration of Pre-echo Adjustment and Absolute Hearing Threshold
Pre-echo adjustment is performed on the masking threshold calculated in the step 10 using the allowable distortion level of the preceding block. Then, the larger one between the thus-obtained adjusted value and the absolute hearing threshold is used as the allowable distortion level of the currently processed block.
Step 12: Calculation of Perceptual Entropy (PE)
For each block type, that is, for the long block type and for the short block type, a perceptual entropy (PE) defined by the following equation is calculated:

PE = -Σb w(b) · log10( nb(b) / (e(b) + 1) )
In the above equation, w(b) represents the width of the scalefactor band b, nb(b) represents the allowable distortion level in the scalefactor band b calculated in the step 11, and e(b) represents the audio signal intensity in the scalefactor band b calculated in the step 5. It can be considered that PE corresponds to the sum total of the areas of the bit allocation ranges (hatched portions) shown in FIG. 1.
Step 13: Decision of Long/Short Block Type (see a flow chart shown in FIG. 7 for decision as to whether the long or short block type is used).
When the value of PE (obtained in a step S10 in FIG. 7) calculated for the long block type in the step 12 is larger than a predetermined constant (switch_pe), the short block type is used for the target block (in steps S11 and S12, in FIG. 7). When the value of PE calculated for the long block type in the step 12 is not larger than the predetermined constant (switch_pe), the long block type is used for the target block (in steps S11 and S13, in FIG. 7). The constant, switch_pe, is determined depending on the application.
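The steps 12 and 13 above can be sketched as follows; since switch_pe is application-dependent, the value used in the usage example in the test is purely illustrative:

```python
import math

def perceptual_entropy(w, nb, e):
    # PE = -sum over b of w(b) * log10(nb(b) / (e(b) + 1)),
    # where w(b) is the scalefactor-band width, nb(b) the allowable
    # distortion level, and e(b) the audio signal intensity in band b.
    return -sum(w[b] * math.log10(nb[b] / (e[b] + 1.0)) for b in range(len(w)))

def choose_block_type(pe_long, switch_pe):
    # Step 13: use the short block type when PE computed for the
    # long block type exceeds the constant switch_pe.
    return "short" if pe_long > switch_pe else "long"
```

Because nb(b) is normally far below e(b) in the bands carrying signal energy, each log term is negative and PE comes out positive, corresponding to the total area of the bit allocation ranges of FIG. 1.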
The above-described method is the method, described in ISO/IEC 13818-7, for deciding whether the long or short block type is used. However, with this method, an appropriate decision is not always reached. That is, the long block type is selected even in a case where the short block type should be selected, or the short block type is selected even in a case where the long block type should be selected. As a result, the sound quality may be degraded.
Japanese Laid-Open Patent Application No. 9-232964 discloses a method in which an input signal is taken at every predetermined section, the sum of squares is obtained for each section, and a transient condition is detected from the degree of change in the sum of squares between at least two sections. Thereby, it is possible to detect the transient condition, that is, to detect when the block type to be used is to be changed between the long and short block types, merely by calculating the sum of squares of the input signal on the time axis without performing orthogonal transformation processing or filtering processing. However, this method uses only the sum of squares of the input signal and does not consider the perceptual entropy. Therefore, a decision not necessarily suitable for the acoustic property may be made, and the sound quality may be degraded.
A method will now be described. In this method, the short blocks of a block of an input audio signal are grouped in such a manner that the difference between the maximum value and minimum value of the perceptual entropies of the short blocks in the same group is smaller than a threshold. Then, when the result thereof is such that the number of groups is 1, or this condition and another condition are satisfied, the block of the input audio signal is mapped into the frequency domain using the long block type. In the other cases, the block of the input audio signal is mapped into the frequency domain using the short block type. This method is performed by the arrangement shown in FIG. 8B. An entropy calculating portion 31 calculates the perceptual entropy for each short block. A grouping portion 32 groups the short blocks. A difference calculating portion 33 calculates the difference between the maximum value and minimum value of the perceptual entropies of the short blocks included in the thus-obtained group. A grouping determining portion determines, based on the thus-obtained difference, whether the grouping is allowed. A long/short-block-type deciding portion 35 decides to use the long block type when the number of the thus-allowed groups is 1, and the short block type otherwise.
This method will now be described in detail in accordance with FIG. 8A showing an operation flow of this method. As an example of an input audio signal, audio data shown in FIG. 9 is used. In FIG. 9, corresponding consecutive numbers are given to 8 successive short blocks. The perceptual entropy PE(i) of the audio data shown in FIG. 9 for each short block i is shown in FIG. 10.
First, 8 short blocks are obtained from a block of an input audio signal, as shown in FIG. 9. Then, for the 8 short blocks, the perceptual entropies are calculated, respectively, and are represented by PE(i) (0 ≤ i ≤ 7), in sequence, in a step S20. This calculation can be achieved by performing, on each short block, the steps 1 through 12 of the above-described method of ISO/IEC 13818-7 for deciding whether the long or short block type is used for each target block. Then, initializing is performed such that group_len[0]=1 and group_len[gnum]=0 (1 ≤ gnum ≤ 7) in a step S21, wherein gnum represents a respective one of the consecutive numbers of the groups resulting from grouping, and group_len[gnum] represents the number of short blocks included in the gnum-th group. Then, initializing is performed such that gnum=0, min=PE(0) and max=PE(0), in a step S22. Here, min and max represent the minimum value and the maximum value of PE(i), respectively. Then, the index i is initialized so that i=1, in a step S23. This index corresponds to a respective one of the consecutive numbers of the short blocks.
Then, min and max are updated with PE(i). That is, when PE(i) < min, min=PE(i), and when PE(i) > max, max=PE(i), in a step S24. Then, a decision is made as to grouping, in a step S25. That is, the difference, max - min, is obtained and compared with a predetermined threshold th, and, when the difference is equal to or larger than the threshold th, the operation proceeds to a step S26 so that the short blocks i-1 and i are included in different groups. When the difference is smaller than the threshold th, a decision is made such that the short blocks i-1 and i are included in the same group, and the operation proceeds to a step S27. In this example, it is assumed that th=50. That is, grouping is performed such that the difference between the maximum value and minimum value of PE(i) within a group remains smaller than 50. A decision is made such that the short blocks 0 and 1 are included in the same group, and the operation proceeds to the step S27. Because gnum=0 at this time, the short blocks 0 and 1 are included in the 0-th group. Then, the value of group_len[gnum] is incremented by 1 in the step S27. This means that the number of short blocks included in the gnum-th group is increased by 1. In this example, because initializing is performed such that gnum=0 and group_len[0]=1 in the steps S21 and S22, group_len[0]=2 in the step S27. This corresponds to the fact that the two blocks, block 0 and block 1, are already fixed as the short blocks included in the 0-th group.
Then, the index i is incremented by 1 in a step S28. Then, when i is equal to or smaller than 7, the operation returns to the step S24, in a step S29.
Then, operations similar to those described above are repeated until i=4. When i=4, in the example shown in FIGS. 9 and 10, min=96 and max=137 in the step S24. Then, in the step S25, max - min = 41 < 50 = th. As a result, the operation proceeds to the step S27 from the step S25. Then, in the step S27, group_len[0]=5. This corresponds to the fact that the five blocks, blocks 0, 1, 2, 3 and 4, are fixed as the short blocks included in the 0-th group. Then, after i=5 in the step S28, the operation again returns to the step S24 through the step S29. Then, because PE(5)=152 at this time, min=96 and max=152. Then, in the step S25, max - min = 56 > 50 = th. As a result, the operation proceeds to the step S26. This means that the short blocks 4 and 5 are included in different groups. In the step S26, the value of gnum is incremented by 1, and each of min and max is replaced by the latest PE(i). Here, gnum=1, min=152 and max=152. The value gnum=1 corresponds to the fact that the group including the short block 5 is the 1st group.
Then, in the step S27, group_len[1] is incremented by 1. Because group_len[1] is initialized to 0 in the step S21, group_len[1]=1 here. This corresponds to the fact that one block, the block 5, is fixed as the short block included in the 1st group.
Then, similarly, i=6 in the step S28 in FIG. 8A, and the operation returns to the step S24 from the step S29. At this time, because PE(6)=269, min=152 and max=269. Then, in the step S25, max - min = 117 > 50 = th, and, as a result, the operation proceeds to the step S26. That is, the short blocks 5 and 6 are included in different groups. Then, in the step S26, gnum=2, min=269 and max=269. Then, in the step S27, group_len[2]=1. Then, in the step S28, i=7. Then, similarly to the above, because PE(7)=231 in the step S24, min=231 and max=269. Then, in the step S25, max - min = 38 < 50 = th. As a result, the operation proceeds to the step S27. That is, both the short blocks 6 and 7 are included in the 2nd group. Correspondingly, group_len[2]=2 in the step S27. Then, in the next step S28, i=8. Then, in the step S29, the operation is decided to proceed to a step S30. Thus, grouping is completed for all the 8 short blocks.
In this example, in the end, gnum=2, group_len[0]=5, group_len[1]=1 and group_len[2]=2. That is, the number of groups is 3, the 0-th group includes 5 short blocks, the 1st group includes one short block and the 2nd group includes two short blocks.
How to decide, from the number of groups resulting from the grouping, whether the long or short block type is used will now be described. In the step S30, it is determined whether or not the value of gnum is 0. When the value of gnum is 0, the number of groups is 1. When the value of gnum is not 0, the number of groups is equal to or larger than 2. Therefore, when gnum=0, the operation proceeds to a step S31, and it is decided to perform MDCT on the block of the input audio signal using the long block type, that is, a single long block is obtained from the block of the input audio signal for performing MDCT on the input audio signal. When gnum ≠ 0, the operation proceeds to a step S32, and it is decided to perform MDCT on the block of the input audio signal using the short block type, that is, 8 short blocks are obtained from the block of the input audio signal for performing MDCT on the input audio signal.
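The grouping-and-decision procedure of the steps S20 through S32 can be sketched as follows. In the usage example, only PE(5)=152, PE(6)=269, PE(7)=231 and the values min=96, max=137 at i=4 are stated above, so the remaining PE values are hypothetical stand-ins consistent with FIG. 10:

```python
def decide_block_type(pe, th=50):
    # pe: perceptual entropies PE(0)..PE(7) of the 8 short blocks (step S20).
    group_len = [0] * 8
    group_len[0] = 1            # step S21: block 0 opens the 0-th group
    gnum = 0                    # step S22
    mn = mx = pe[0]
    for i in range(1, 8):       # steps S23, S28, S29
        mn = min(mn, pe[i])     # step S24: update min and max with PE(i)
        mx = max(mx, pe[i])
        if mx - mn >= th:       # step S25: spread too large within the group?
            gnum += 1           # step S26: block i starts a new group
            mn = mx = pe[i]
        group_len[gnum] += 1    # step S27: block i joins the gnum-th group
    # steps S30-S32: long block type only when a single group results
    block_type = "long" if gnum == 0 else "short"
    return block_type, gnum + 1, group_len[:gnum + 1]
```

For pe = [96, 110, 120, 137, 130, 152, 269, 231], this sketch yields the short block type with 3 groups of 5, 1 and 2 blocks, matching the result described above.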
However, also in this method, there is a case where an appropriate decision as to whether the long or short block type is used cannot be made. This is the case where audio data including low-frequency components having high tonalities is coded. MDCT using the short block type results in an increase in the resolution in the time domain, but a decrease in the resolution in the frequency domain. Further, the human ear has a masking property such that the resolution is high in a low-frequency range, and, in particular, only a very narrow frequency-band component is masked in audio data having high tonality. When audio data including low-frequency components having high tonalities is mapped into the frequency domain using the short block type, due to the decrease in the resolution in the frequency domain, the energy of the original audio data is dispersed into surrounding frequency bands. Then, when the energy thus spreads outside the masking range for low-frequency components of the human ear, the human ear senses degradation in the sound quality. This indicates that a decision as to whether the long or short block type is used based only on the perceptual entropies of the short blocks is not sufficient, and it is necessary to further consider the tonality of the audio data and the frequency dependency of the masking property.
The present invention has been devised for solving these problems, and an object of the present invention is to provide, with the tonality of input audio data and the frequency dependency of the masking property of the human ear in mind, conditions for enabling an appropriate decision as to whether the long or short block type is used without resulting in degradation in the sound quality, and to provide a digital-audio-signal coding device, a digital-audio-signal coding method and a medium in which a digital-audio-signal coding program is stored, in which it is possible to make a decision as to whether the long or short block type is used appropriately depending on the sampling frequency of the input audio data.
In order to achieve the above-mentioned objects, a device for coding a digital audio signal according to the present invention comprises:
a converting portion which converts each of blocks of an input digital audio signal into a number of frequency-band components, the blocks being produced from the signal along a time axis;
a bit-allocating portion which allocates coding bits to each frequency band;
a scalefactor determining portion which determines a scalefactor in accordance with the number of the coding bits thus allocated; and
a quantizing portion which quantizes the digital audio signal using the thus-determined scalefactors,
wherein:
the converting portion comprises a block-type deciding portion which makes a decision as to whether a long or short block type is used for mapping the input digital audio signal into the frequency domain;
the block-type deciding portion comprises:
a tonality-index calculating portion which calculates a tonality index of the digital audio signal in each of a predetermined one or plurality of frequency bands of the number of frequency bands;
a comparing portion which compares each of the thus-calculated tonality indexes with a predetermined one or plurality of thresholds; and
a deciding portion which makes a decision as to whether the long or short block type is used based on the thus-obtained comparison result.
The block-type deciding portion may further comprise a parameter deciding portion which decides parameters and/or a determining expression to be used in a process of making a decision as to whether the long or short block type is used, depending on the sampling frequency of the input digital audio signal.
The block-type deciding portion may further comprise a decision method deciding portion which decides that the decision as to whether the long or short block type is used is to be made using the tonality indexes, when the sampling frequency of the input digital audio signal is larger than a predetermined threshold.
The parameter deciding portion may increase the number of the frequency bands to be used and shift the frequency bands to be selected to higher ones, when the sampling frequency is lower.
Thereby, the following problems can be solved: when the number of frequency bands used for the decision is small, only the tonality in the limited number of frequency bands is considered. Accordingly, in a case where the tonality is high in other frequency bands and, therefore, the long block type should be used, a decision is nevertheless made to use the short block type. Further, when the number of frequency bands used for the decision is large, a decision to use the long block type is made only in the special case where the tonality is high in every one of those frequency bands.
As a result, it is possible to provide appropriate determination conditions for making a decision as to whether the long or short block type is used, with the tonality of input audio data and the frequency dependency of the masking property of the human ear in mind, so that the use of the thus-provided determination conditions does not result in degradation in the sound quality.
Other objects and further features of the present invention will become more apparent from the following detailed description when read in conjunction with the accompanying drawings.