1. Field of the Invention
The present invention relates to a coding method and a coding apparatus for a sound signal, in which a sound signal such as digital data is coded by a so-called high-efficient coding method.
2. Description of the Related Art
Various types of methods and apparatus for high-efficient coding of audio or sound signals are conventionally used. For example, a so-called transform coding scheme (described below) is used. That is, a signal on a time axis is framed in units of predetermined periods of time, the signal on the time axis of each frame is transformed into a signal on a frequency axis (spectrum transform), and the transformed signal is divided into a plurality of frequency bands, thereby performing a coding operation in each band. In addition, a so-called band division coding (sub-band coding: SBC) is available in which an audio signal or the like on a time axis is not framed but divided into a plurality of frequency bands to be coded.
A high-efficient coding method and apparatus obtained by combining the band division coding scheme with the transform coding scheme has also been proposed. In this case, for example, after band division is performed by the band division coding scheme, a signal of each band is spectrum-transformed into a signal on a frequency axis, and the spectrum-transformed signal in each band is coded.
In this case, as a band dividing filter used in the band division coding scheme, for example, a filter such as a QMF (Quadrature Mirror Filter) is used. This filter is described in the letter "Digital coding of speech in subbands", R. E. Crochiere, Bell Syst. Tech. J., Vol. 55, No. 8, 1976. This QMF is used to divide a band into two bands having equal bandwidths. A characteristic of this filter is that so-called aliasing does not occur when the divided bands are synthesized.
The letter "Polyphase Quadrature filters--A new subband coding technique", Joseph H. Rothweiler, ICASSP 83, Boston, describes a filter dividing method for dividing a band into bands having equal bandwidths. A characteristic of the polyphase quadrature filter is that a signal can be divided into a plurality of bands having equal widths at once.
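The two-band split and its alias-free resynthesis can be sketched as follows. This is a minimal illustration, assuming the shortest filter pair that satisfies the mirror relation (the two-tap pair [1, 1]/sqrt(2) and [1, -1]/sqrt(2)); practical coders use much longer filters, and the function names `qmf_analysis` and `qmf_synthesis` are hypothetical.

```python
import math

S = 1.0 / math.sqrt(2.0)  # normalisation of the assumed two-tap filter pair

def qmf_analysis(x):
    """Split x (even length) into a low band and a high band, each decimated by 2."""
    low = [S * (x[2 * i] + x[2 * i + 1]) for i in range(len(x) // 2)]
    high = [S * (x[2 * i] - x[2 * i + 1]) for i in range(len(x) // 2)]
    return low, high

def qmf_synthesis(low, high):
    """Recombine the two bands; the aliasing of each band cancels in the sum."""
    x = []
    for l, h in zip(low, high):
        x.append(S * (l + h))
        x.append(S * (l - h))
    return x

signal = [0.0, 1.0, 3.0, -2.0, 0.5, 0.5, -1.0, 4.0]
low, high = qmf_analysis(signal)
rebuilt = qmf_synthesis(low, high)
max_err = max(abs(a - b) for a, b in zip(signal, rebuilt))
```

Each band carries half as many samples as the input, so the total number of samples is unchanged, and `max_err` stays at the level of floating-point rounding.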
As the spectrum transform described above, for example, an input audio signal is framed in units of predetermined periods of time, and discrete Fourier transform (DFT), discrete cosine transform (DCT), modified discrete cosine transform (MDCT), or the like is performed in each frame, thereby transforming the signal on the time axis into a signal on the frequency axis. The MDCT is described in the letter "Subband/Transform Coding Using Filter Bank Designs Based on Time Domain Aliasing Cancellation," J. P. Princen and A. B. Bradley, Univ. of Surrey, Royal Melbourne Inst. of Tech., ICASSP 1987.
When signals divided by a filter or spectrum transform in units of bands as described above are quantized, a band in which quantization noise is generated can be controlled, and aurally high-efficient coding can be performed using properties of the so-called masking effect or the like. In addition, when normalization is performed by the maximum value of the absolute value of a signal component in each band before the quantization, coding can be more efficiently performed.
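The normalization step can be illustrated with a short sketch. It assumes a symmetric uniform quantizer and at least 2 bits per component; the names `encode_band` and `decode_band` and the sample values are hypothetical and are not taken from any particular coder.

```python
def encode_band(coeffs, bits):
    """Normalise a band by its largest magnitude, then quantise uniformly."""
    scale = max(abs(c) for c in coeffs) or 1.0   # avoid dividing by zero for a silent band
    levels = (1 << (bits - 1)) - 1               # e.g. 7 steps per side for 4 bits
    codes = [round(c / scale * levels) for c in coeffs]
    return scale, codes

def decode_band(scale, codes, bits):
    """Undo the quantisation and the normalisation."""
    levels = (1 << (bits - 1)) - 1
    return [q * scale / levels for q in codes]

quiet_band = [0.02, -0.012, 0.004, 0.015]
loud_band = [20.0, -12.0, 4.0, 15.0]             # same shape, 1000 times larger
bits = 4

quiet_scale, quiet_codes = encode_band(quiet_band, bits)
loud_scale, loud_codes = encode_band(loud_band, bits)
quiet_decoded = decode_band(quiet_scale, quiet_codes, bits)
```

Because each band is scaled to its own maximum, the quiet band and the loud band use the full quantizer range equally; only the scale value, sent once per band, differs.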
In this case, as a frequency division width used when frequency components (to be referred to as spectrum components) divided into frequency bands are quantized, a bandwidth obtained in consideration of the aural characteristics of human beings is often used. More specifically, an audio signal may be divided into a plurality of bands (e.g., 25 bands) each having a bandwidth equal to that of a critical band, whose bandwidth generally increases with an increase in frequency. When data in each band at this time is coded, coding is performed by predetermined bit distribution in each band or adaptive bit allocation in each band. For example, when coefficient data obtained by performing the MDCT process is coded by the bit allocation, MDCT coefficient data in each band obtained by the MDCT process in each frame is coded at the number of adaptively allocated bits.
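The grouping of spectrum components into critical bands can be sketched as follows. The band edges are an assumption (a commonly tabulated set of 24 critical bands reaching 15.5 kHz, so coefficients above that edge are simply left out here), and the transform size and sampling rate are illustrative values only.

```python
from bisect import bisect_right

# Assumed critical-band edges in Hz (commonly tabulated values, 24 bands).
BAND_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
                 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
                 9500, 12000, 15500]

def coefficients_per_band(num_coeffs, sample_rate):
    """Count how many transform coefficients fall into each critical band."""
    bin_width = sample_rate / (2.0 * num_coeffs)   # Hz covered by one coefficient
    counts = [0] * (len(BAND_EDGES_HZ) - 1)
    for k in range(num_coeffs):
        centre = (k + 0.5) * bin_width             # centre frequency of coefficient k
        band = bisect_right(BAND_EDGES_HZ, centre) - 1
        if band < len(counts):                     # skip coefficients above the last edge
            counts[band] += 1
    return counts

counts = coefficients_per_band(num_coeffs=512, sample_rate=44100)
```

The lowest band receives only a couple of coefficients while the highest band receives many, which reflects the widening of the critical bands with frequency.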
As the bit distribution method, the two following methods are known.
For example, in the letter "Adaptive Transform Coding of Speech Signals", R. Zelinski, P. Noll, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-25, No. 4, August 1977, bit allocation is performed on the basis of the size of a signal of each band. In this scheme, a quantization noise spectrum is flat, and noise energy is minimum. However, since the masking effect is not used, the noise as actually perceived by the hearing sense is not optimum.
In addition, for example, the letter "The critical band coder--digital encoding of the perceptual requirements of the auditory system", M. A. Krasner, MIT, ICASSP 1980, describes a method in which a signal-noise ratio required for each band is obtained by using aural masking to perform fixed bit allocation. However, in this method, even if characteristics are measured by a sine wave input, a characteristic value is not always good because the bit allocation is fixed.
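Allocation based on the size of the signal in each band can be sketched with the textbook log-energy rule, which may differ in detail from the procedure of the cited letter. The function names and the noise model (each extra bit lowers the noise energy by a factor of 4) are assumptions made for illustration.

```python
import math

def allocate_bits(band_energies, average_bits):
    """Give each band the average plus half the log2 ratio of its energy to the geometric mean."""
    n = len(band_energies)
    mean_log = sum(math.log2(e) for e in band_energies) / n
    return [average_bits + 0.5 * (math.log2(e) - mean_log) for e in band_energies]

def noise_energy(energy, bits):
    """Assumed model: quantization noise energy falls by a factor of 4 per bit."""
    return energy * 2.0 ** (-2.0 * bits)

energies = [16.0, 4.0, 1.0, 0.25]
bits = allocate_bits(energies, average_bits=3.0)
noise = [noise_energy(e, b) for e, b in zip(energies, bits)]
```

Under this model every band ends up with the same noise energy, which is the flat noise spectrum mentioned above. The allocations are real-valued; a practical coder would round them and clamp negative values.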
In order to solve this problem, a high-efficient coding apparatus having the following arrangement is proposed. That is, all bits which can be used in bit allocation are divided between a fixed allocation pattern predetermined for each of the sub-blocks obtained by dividing each block and bit distribution depending on the size of a signal in each block, and the division ratio is made dependent on a signal related to the input signal. For example, the division ratio to the fixed bit allocation pattern is set to be large when the spectrum distribution of the signal is smooth.
According to this method, when energy is concentrated on a specific spectrum component as in a sine wave input, the overall signal-noise characteristics can be considerably improved because a large number of bits are allocated to a block including the spectrum component. In general, the hearing sense of a human being is very sensitive to a signal having a sharp spectrum distribution. For this reason, when the signal-noise characteristics are improved by using such a method, not only is the measured numerical value improved, but aural tone quality is also effectively improved.
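The division of the bit budget between a fixed pattern and a signal-dependent distribution can be sketched as a weighted mix of the two allocations. The smoothness measure used here (spectral flatness, the ratio of the geometric mean to the arithmetic mean of the band energies) and the function names are assumptions chosen for illustration; the text above does not specify how the division ratio is derived.

```python
import math

def spectral_flatness(energies):
    """1.0 for a perfectly smooth spectrum, close to 0 when one band dominates."""
    n = len(energies)
    geometric = math.exp(sum(math.log(e) for e in energies) / n)
    arithmetic = sum(energies) / n
    return geometric / arithmetic

def mixed_allocation(energies, fixed_pattern, total_bits):
    """Blend a fixed pattern and an energy-dependent allocation by the flatness ratio."""
    n = len(energies)
    ratio = spectral_flatness(energies)           # share given to the fixed pattern
    mean_log = sum(math.log2(e) for e in energies) / n
    bits = []
    for e, f in zip(energies, fixed_pattern):
        fixed_part = total_bits * f / sum(fixed_pattern)
        adaptive_part = total_bits / n + 0.5 * (math.log2(e) - mean_log)
        bits.append(ratio * fixed_part + (1.0 - ratio) * adaptive_part)
    return ratio, bits

fixed_pattern = [4, 3, 3, 2]
smooth_ratio, smooth_bits = mixed_allocation([1.0, 1.0, 1.0, 1.0], fixed_pattern, 12)
tonal_ratio, tonal_bits = mixed_allocation([1.0, 1000.0, 1.0, 1.0], fixed_pattern, 12)
```

For the smooth spectrum the result follows the fixed pattern, while for the tonal spectrum nearly the whole budget is distributed by energy and the dominant band receives far more bits. Negative or fractional results would again be clamped and rounded in practice.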
Various types of method for bit allocation other than the above are proposed. When a model related to hearing sense is made accurate, and the capability of the coding apparatus is improved, coding which is aurally efficient can be performed.
In this case, when the DFT or DCT is used as a method of performing spectrum transform to a waveform signal consisting of waveform elements (sample data) such as digital audio signals in a time area, blocks are constituted in units of M sample data, and spectrum transform for the DFT or DCT is performed. When the spectrum transform is performed to such blocks, M independent real number data (DFT coefficient data or DCT coefficient data) are obtained. The M real number data obtained as described above are quantized and coded to be coded data.
When a waveform signal is reproduced by decoding the coded data, the coded data is decoded and inversely quantized, and inverse spectrum transform by inverse DFT or inverse DCT is performed on the obtained real number data in units of blocks corresponding to the blocks in the coding operation to obtain waveform element signals. The blocks constituted by the waveform element signals are connected to each other to reproduce the waveform signal.
The reproduced waveform signal generated as described above has connection distortion generated in block connection, and is not aurally preferable. For this reason, in order to reduce connection distortion between the blocks, when spectrum transform is performed using DFT or DCT in actual coding, M1 sample data of adjacent blocks are made to overlap, and these sample data are subjected to the spectrum transform.
When spectrum transform is performed such that M1 sample data of the adjacent blocks overlap, M real number data are obtained with respect to (M-M1) (average number) sample data. As a result, the number of real number data obtained by the spectrum transform is larger than the number of original sample data actually used in the spectrum transform. Since the real number data are to be quantized and coded later, it is not preferable in terms of coding efficiency that the number of real number data obtained by the spectrum transform is larger than the number of original sample data as described above.
In contrast to this, when the MDCT is used as a method of performing spectrum transform on a waveform signal constituted by sample data such as digital audio signals, spectrum transform is performed by using 2M sample data obtained by making M sample data of adjacent blocks overlap to reduce connection distortion between the blocks, thereby obtaining M independent real number data (MDCT coefficient data). For this reason, in the spectrum transform by the MDCT, M real number data are obtained with respect to M (average number) sample data. Therefore, coding which is more efficient than the spectrum transform using the DFT or DCT can be performed.
When coded data obtained by quantizing and coding the real number data obtained by the MDCT spectrum transform is decoded to generate a reproduced waveform signal, the coded data is decoded and inversely quantized. Inverse spectrum transform by inverse MDCT is performed on the obtained real number data to obtain waveform elements in the blocks, and the waveform elements in the blocks are added to each other while overlapping (interfering with) each other, thereby reconstructing the waveform signal.
In this case, in general, when the length (dimension of a block in the time direction) of a block for spectrum transform is increased, a frequency resolving power is improved. When a waveform signal such as a digital audio signal is subjected to spectrum transform in such a long block, energy is concentrated on a specific spectrum component. As described above, when the spectrum transform is performed on adjacent blocks which overlap over a large length, the inter-block distortion of the waveform signal can be preferably reduced.
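The MDCT round trip described above can be sketched in a few lines. This is a direct, unoptimized illustration, assuming a sine window (one window that satisfies the condition w[n]^2 + w[n+M]^2 = 1) and zero padding of M samples at both ends; the function names are hypothetical, and quantization is omitted so that the overlap-add reconstruction itself is visible.

```python
import math

def sine_window(M):
    """Window of length 2M with w[n]^2 + w[n+M]^2 = 1 (assumed choice)."""
    return [math.sin(math.pi * (n + 0.5) / (2 * M)) for n in range(2 * M)]

def mdct(block, window):
    """2M windowed samples -> M coefficients."""
    M = len(block) // 2
    return [sum(window[n] * block[n] *
                math.cos(math.pi / M * (n + 0.5 + M / 2) * (k + 0.5))
                for n in range(2 * M))
            for k in range(M)]

def imdct(coeffs, window):
    """M coefficients -> 2M windowed samples that still contain time-domain aliasing."""
    M = len(coeffs)
    return [window[n] * (2.0 / M) *
            sum(coeffs[k] * math.cos(math.pi / M * (n + 0.5 + M / 2) * (k + 0.5))
                for k in range(M))
            for n in range(2 * M)]

def roundtrip(signal, M):
    """Transform blocks of 2M samples advancing by M, then overlap-add the inverse."""
    window = sine_window(M)
    padded = [0.0] * M + list(signal) + [0.0] * M
    out = [0.0] * len(padded)
    total_coeffs = 0
    for start in range(0, len(padded) - M, M):
        coeffs = mdct(padded[start:start + 2 * M], window)
        total_coeffs += len(coeffs)
        for n, v in enumerate(imdct(coeffs, window)):
            out[start + n] += v                    # aliasing cancels in the overlap
    return out[M:-M], total_coeffs

signal = [math.sin(0.3 * n) + 0.1 * n for n in range(16)]
rebuilt, total_coeffs = roundtrip(signal, M=4)
max_err = max(abs(a - b) for a, b in zip(signal, rebuilt))
```

Each block of 2M samples yields only M coefficients while the block position advances by M samples, so apart from the one extra block caused by the padding, the number of coefficients matches the number of samples.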
In addition, when spectrum transform is performed to blocks in which a half number of sample data of adjacent blocks overlap, and the MDCT in which the number of real number data obtained by the spectrum transform does not increase with respect to the number of sample data of the original waveform signal is used, coding which is more efficient than that by the spectrum transform using the DFT or DCT can be performed.
When a method in which the waveform signal is divided into blocks, each block is resolved into spectrum components (real number data obtained by the spectrum transform in the above example), and the obtained spectrum components are quantized and coded is used, quantization noise is generated in a waveform signal obtained such that signals constituted by the spectrum components are decoded and synthesized in each block.
If the original waveform signal includes a portion in which a signal component sharply changes (transition portion whose waveform element level sharply changes), and the waveform signal is temporarily coded and then decoded, large quantization noise caused by the transition portion may extend in a portion of the original waveform signal other than the transition portion.
Assume that the following waveform signal SW1 is used as an audio signal to be coded. That is, in the waveform signal SW1, as shown in FIG. 7A, an attack portion AT in which sound sharply increases as the transition portion is present next to a quasi stationary signal FL which slightly changes and has a low level, and signals each having a high level are subsequent to the attack portion AT. The waveform signal SW1 is divided into blocks each having a unit time width, and a signal component in each block is subjected to spectrum transform. When the obtained spectrum components are quantized and coded, and then subjected to decoding, inverse quantization, and inverse spectrum transform, the reproduced waveform signal SW1 includes large quantization noise QN1 caused by the attack portion AT throughout the block as shown in FIG. 7C.
For this reason, as shown in FIG. 7C, in the portion of the quasi stationary signal FL before the attack portion AT, large (e.g., higher level than that of the quasi stationary signal FL) quantization noise QN1 caused by the attack portion AT appears.
Since the quantization noise QN1 appearing in the quasi stationary signal FL before the attack portion AT is not shielded by simultaneous masking performed by the attack portion AT, the quantization noise QN1 acts as aural hindrance. As described above, the quantization noise QN1 appearing before the attack portion AT in which sound sharply increases is generally called a pre-echo.
When the signal components in the blocks are subjected to spectrum transform, the spectrum transform is performed after the blocks are multiplied by a transform window function (window function) TW having a characteristic curve whose end portions moderately change as shown in FIG. 7B. In this manner, a spectrum distribution is prevented from extending in a wide area.
In particular, when a waveform signal is subjected to spectrum transform in a long block to improve the frequency resolving power as described above, a time resolving power is degraded, and a pre-echo may be generated for a long period of time.
In this case, when the block in spectrum transform is shortened, the period of time in which the quantization noise is generated is also shortened. For this reason, for example, if the length of the block subjected to spectrum transform near the attack portion is decreased, the period of time in which a pre-echo is generated can be shortened, and aural hindrance caused by the pre-echo can be reduced.
More specifically, a case wherein the pre-echo is prevented by shortening the block near the attack portion will be described below. Near the transition portion such as the attack portion AT in which the magnitude of sound sharply changes in the waveform signal SW including the quasi stationary signal FL and the attack portion AT as shown in FIG. 2A, a block for spectrum transform is shortened, and the spectrum transform is performed on a signal component in the short block. As a result, a period of time in which a pre-echo is generated can be sufficiently shortened in the short block.
If the period of time in which a pre-echo is generated in the block can be sufficiently shortened, an aural hindrance can be reduced by a so-called inverse masking effect obtained by the attack portion AT. In this short block, when a signal component in the short block is to be subjected to spectrum transform, the signal component is subjected to the spectrum transform after the signal component is multiplied by a short transform window function (short transform window function TWS) as shown in FIG. 2B.
When the block for spectrum transform is shortened with respect to the portion of the quasi stationary signal FL and the signal portions subsequent to the attack portion AT, a frequency resolving power is degraded, and coding efficiency in these portions is also degraded. For this reason, when the block for spectrum transform is increased in length with respect to these portions, energy is concentrated on a specific spectrum component. As a result, coding efficiency is desirably improved.
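The spreading of quantization noise over a whole block, and its confinement by a shorter block, can be reproduced with a small sketch. It assumes an orthonormal DCT without a window function, a deliberately coarse uniform quantizer, and a toy signal of silence followed by a one-sample attack; none of these choices are taken from the text above.

```python
import math

def dct(x):
    """Orthonormal DCT-II of one block."""
    N = len(x)
    return [math.sqrt((1 if k == 0 else 2) / N) *
            sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
            for k in range(N)]

def idct(X):
    """Inverse of dct()."""
    N = len(X)
    return [sum(math.sqrt((1 if k == 0 else 2) / N) * X[k] *
                math.cos(math.pi * (n + 0.5) * k / N) for k in range(N))
            for n in range(N)]

def code_block(block, step):
    """Transform, quantize every coefficient with a uniform step, and decode."""
    quantized = [step * round(c / step) for c in dct(block)]
    return idct(quantized)

signal = [0.0] * 7 + [1.0]   # silence, then an attack in the last sample
step = 0.2                   # coarse quantizer step (assumption)

long_decoded = code_block(signal, step)                                      # one block of 8
short_decoded = code_block(signal[:4], step) + code_block(signal[4:], step)  # two blocks of 4

pre_echo_long = max(abs(v) for v in long_decoded[:4])    # noise well before the attack
pre_echo_short = max(abs(v) for v in short_decoded[:4])  # first short block is untouched
```

With the long block, noise appears in samples that were silent in the original; with two short blocks, the block that lies entirely before the attack is decoded without any noise, at the cost of coarser frequency resolution.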
For these reasons, in fact, the length of the block for spectrum transform is selectively switched depending on the nature of each portion of the waveform signal SW. When the length of the block is selectively switched as described above, the transform window function TW is also switched depending on the selection of the length of the block. For example, the following selective switching operation is performed. That is, as shown in FIG. 2B, a long transform window function (long transform window function TWL) is used for a block constituted by the quasi stationary signal FL except for a portion near the attack portion AT, and a short transform window function (short transform window function TWS) is used for a short block near the attack portion AT.
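The selective switching of the block length can be sketched as follows. The attack criterion (a sub-block peak that exceeds the preceding sub-block peak by a fixed ratio) and every name and threshold in the sketch are assumptions for illustration; it also only looks for attacks inside a long block and assumes the signal length is a multiple of the long block length.

```python
def choose_block_lengths(samples, long_len, short_len, ratio=4.0, floor=1e-9):
    """Return the sequence of block lengths: short blocks where an attack is found."""
    lengths = []
    for start in range(0, len(samples), long_len):
        block = samples[start:start + long_len]
        # Peak level of each short sub-block inside the long block.
        peaks = [max(abs(v) for v in block[i:i + short_len])
                 for i in range(0, len(block), short_len)]
        # Hypothetical attack test: a sharp rise from one sub-block to the next.
        attack = any(later > ratio * max(earlier, floor)
                     for earlier, later in zip(peaks, peaks[1:]))
        if attack:
            lengths.extend([short_len] * len(peaks))   # short window TWS for each
        else:
            lengths.append(long_len)                   # long window TWL
    return lengths

# Quasi stationary part, then an attack inside the second long block, then a loud part.
samples = [0.01] * 16 + [0.01] * 12 + [1.0] * 4 + [1.0] * 16
lengths = choose_block_lengths(samples, long_len=16, short_len=4)
# lengths == [16, 4, 4, 4, 4, 16]
```

Only the long block containing the attack is split, so the quasi stationary and loud portions keep the frequency resolution of the long block.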
However, as described above, when the method in which the length of a block in spectrum transform is switched depending on the nature (characteristics) of each portion of the waveform signal is realized on an actual arrangement, a spectrum transforming means which can cope with spectrum transform in blocks having different lengths must be arranged in the coding apparatus. In addition, an inverse spectrum transform means which can perform inverse spectrum transform which can cope with blocks having different lengths must be arranged in the decoding apparatus.
When the length of a block in spectrum transform is changed, the number of spectrum components obtained by the spectrum transform changes in proportion to the length of the block. When these spectrum components are coded in units of critical bands, the number of spectrum components included in each critical band changes depending on the length of a block. For this reason, a coding process to be described later and a decoding process become complicated.
In this manner, in the method in which the length of a block in spectrum transform is made variable, both the coding apparatus and the decoding apparatus are disadvantageously complicated.
For this reason, when the spectrum transform such as the DFT or DCT is applied to resolve a block into spectrum components, as a method in which a pre-echo can be effectively prevented from being generated while keeping the length of a block in the spectrum transform constant to assure a sufficiently high frequency resolving power, a technique disclosed in U.S. Pat. No. 5,117,228 is known. In this publication, the following method is disclosed. That is, an input signal waveform is cut into blocks constituted by a plurality of sample data in the coding apparatus, and the blocks are multiplied by a window function. Thereafter, an attack portion is detected, a small-amplitude waveform signal (i.e., quasi stationary signal) immediately before the attack portion is amplified, and spectrum components (real number data) are obtained by spectrum transform using DFT or DCT. Then, the spectrum components are coded.
In decoding corresponding to the above coding, the decoded spectrum components are subjected to inverse spectrum transform performed by inverse DFT (=IDFT) or inverse DCT (=IDCT), and a process of correcting amplification of a signal immediately before the attack portion in coding is performed. In this manner, a pre-echo is prevented from being generated. Since the length of a block subjected to spectrum transform can be kept constant by using the above method, the arrangements of the coding and decoding apparatuses can be simplified.
According to the technique described in the publication, by using a gain control process performed in coding on a small-amplitude signal immediately before an attack portion, and a gain control correction process performed in decoding in correspondence with the gain control performed on the signal immediately before the attack portion in coding, a pre-echo can be prevented from being generated while keeping the length of the block in spectrum transform constant.
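The effect of gain control in coding and gain control correction in decoding can be sketched as follows. The transform codec is replaced here by a stand-in that adds an error of equal size to every sample of the block, and the gain is applied as an abrupt step; both are simplifying assumptions, and all names are hypothetical.

```python
def lossy_codec(block, noise_level=0.05):
    """Stand-in for transform coding: an error of equal size lands on every sample."""
    return [v + (noise_level if n % 2 == 0 else -noise_level)
            for n, v in enumerate(block)]

def apply_gain(block, attack_pos, gain):
    """Coding side: amplify the small-amplitude samples before the attack."""
    return [v * gain if n < attack_pos else v for n, v in enumerate(block)]

def correct_gain(block, attack_pos, gain):
    """Decoding side: undo the amplification applied before the attack."""
    return [v / gain if n < attack_pos else v for n, v in enumerate(block)]

block = [0.01] * 6 + [1.0] * 2      # quasi stationary part followed by an attack
attack_pos, gain = 6, 8.0

plain = lossy_codec(block)
controlled = correct_gain(lossy_codec(apply_gain(block, attack_pos, gain)),
                          attack_pos, gain)

plain_noise = max(abs(a - b) for a, b in zip(block[:attack_pos], plain[:attack_pos]))
gain_noise = max(abs(a - b) for a, b in zip(block[:attack_pos], controlled[:attack_pos]))
```

The noise before the attack is reduced by the gain factor because the correction in decoding attenuates the noise together with the amplified signal, while the block length stays constant.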
For example, in the specification and drawings of U.S. Ser. No. 08/604,479 (filed on Feb. 21, 1996) applied by the present applicant, the present applicant has proposed a method of preventing not only a pre-echo but also a post-echo. In this specification and drawings, the following is proposed. That is, in a method and apparatus for coding a waveform signal, an attack portion in which the levels of the waveform elements of the waveform signal sharply rise is detected, and a release portion in which the levels of the waveform elements of the waveform signal sharply lower is detected. An adaptive gain control amount is selected depending on the characteristics of the waveform signal from a plurality of gain control amounts for a waveform element before at least the attack portion and a waveform element of the release portion, and gain control is performed to the waveform element before at least the attack portion and the waveform element of the release portion by using the selected gain control amount. The waveform signal is transformed into a plurality of frequency components (spectrum components), and control information for the gain control and a plurality of frequency components are coded.
More specifically, according to the coding method and apparatus, the attack and release portions are detected from the waveform signal, gain control is performed to the portion before the attack portion and the waveform element of the release portion at the gain control amount adaptively selected depending on the characteristics of the waveform signal, and the portion before the attack portion and the waveform element of the release portion are coded. In decoding, gain control correction is performed to the portion subjected to gain control in coding. For this reason, the energy of noise generated in the portion before the attack portion and the release portion when the waveform signal is coded and decoded can be lowered to a level at which a human being cannot easily sense the noise.
However, in a method of preventing generation of a pre-echo and a post-echo using the gain control and the gain control correction, a waveform signal obtained by amplifying small-amplitude waveform signals before and after the attack portion is used in coding. For this reason, a spectrum component which is coded and subjected to the gain control is considerably different from the spectrum component of actually reproduced voice.
For example, as a waveform signal in which the pre-echo and post-echo are generated, consider a waveform signal like the waveform signal SW shown in FIG. 3A, in which an attack portion AT is subsequent to a quasi stationary signal FL, and a release portion RE whose level sharply lowers is subsequent to the attack portion AT. As in the gain control function GC shown in FIG. 3A, assume that gain control having a gain control amount which is Ra times is performed on a signal component (waveform signal FL) serving as a portion immediately before the attack portion AT, and that gain control having a gain control amount which is Rr times is performed on the release portion RE after the attack portion AT. As a result, in fact, a waveform signal SW' shown in FIG. 3C is coded.
On the other hand, assume that a masking curve calculated using a psycho-acoustic model for spectrum components obtained by transforming the waveform signal SW shown in FIG. 3A is a curve MCa indicated by a dotted line in FIG. 4A. In this case, the relationship between spectrum components for the waveform signal SW' actually used for coding and the masking curve is calculated using the same psycho-acoustic model. As a result, for example, a masking curve MCb shown in FIG. 4B is obtained.
More specifically, when the two types of masking curves MCa and MCb are compared with each other with respect to the spectrum components obtained by transforming the waveform signal SW shown in FIG. 4A, the masking curves MCa and MCb are partially different from each other as in a masking curve shown in FIG. 4C. In FIG. 4C, a curve indicated by a dotted line shows the masking curve MCa, and a curve indicated by a solid line shows the masking curve MCb. FIG. 4C shows the following. That is, when the levels of the signal FL immediately before the attack portion and the signal RE of the release portion are raised by the gain control, the levels of the spectrum components are uniformly amplified in a wide frequency band, and a masking curve which has a level partially different from a level at which masking is actually performed by the waveform signal SW and which is improper as a psycho-acoustic model is calculated. As a result, when the signal coded using the masking curve MCb is decoded by the decoding apparatus, another quantization noise different from a pre-echo or a post-echo is generated.