1. Field of the Invention
This invention relates to a signal encoding method and apparatus for encoding input digital signals by the so-called high efficiency encoding, and a recording medium having the encoded signals recorded thereon. The invention also relates to a method for transmitting the encoded signals, and a signal decoding apparatus for decoding the encoded signals.
2. Description of the Related Art
There exist a variety of high efficiency encoding techniques of encoding audio or speech signals. Examples of these techniques include transform coding in which a frame of digital signals representing the audio signal on the time axis is converted by an orthogonal transform into a block of spectral coefficients representing the audio signal on the frequency axis, and a sub-band coding in which the frequency band of the audio signal is divided by a filter bank into a plurality of sub-bands without forming the signal into frames along the time axis prior to coding. There is also known a combination of sub-band coding and transform coding, in which digital signals representing the audio signal are divided into a plurality of frequency ranges by sub-band coding, and transform coding is applied to each of the frequency ranges.
Among the filters for dividing a frequency spectrum into a plurality of equal-width frequency ranges, there is the quadrature mirror filter (QMF) as discussed in R. E. Crochiere, Digital Coding of Speech in Sub-bands, 55 Bell Syst. Tech J. No.8 (1976). With such a QMF, the frequency spectrum of the signal is divided into two equal-width bands, and aliasing is not produced when the frequency bands resulting from the division are subsequently combined together.
In Joseph H. Rothweiler, "Polyphase Quadrature Filters - A New Subband Coding Technique", ICASSP 83, Boston, there is shown a technique of dividing the frequency spectrum of the signal into equal-width frequency bands. With this polyphase quadrature filter, the frequency spectrum of the signal can be divided at one time into plural equal-width frequency bands.
There is also known a technique of orthogonal transform including dividing the digital input audio signal into frames of a predetermined time duration, and processing the resulting frames using a discrete Fourier transform (DFT), discrete cosine transform (DCT) or modified DCT (MDCT) for converting the signal from the time axis to the frequency axis. Discussions on MDCT may be found in J. P. Princen and A. B. Bradley, "Subband Transform Coding Using Filter Bank Based on Time Domain Aliasing Cancellation", ICASSP 1987.
By quantizing the signals divided on the band basis by the filter or orthogonal transform, it becomes possible to control the band subjected to quantization noise and psychoacoustically more efficient coding may be performed by utilizing the so-called masking effects. If the signal components are normalized from band to band with the maximum value of the absolute values of the signal components, it becomes possible to achieve more efficient coding.
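The band-by-band normalization described above may be sketched as follows. This is only an illustrative sketch: the function names, the uniform quantizer and the symmetric code range are assumptions made for the example and are not taken from any of the cited techniques.

```python
def normalize_and_quantize(band, bits):
    """Normalize one band by its peak magnitude, then quantize uniformly.

    Assumes bits >= 2. Returns the scale factor and the integer codes.
    """
    # Scale factor: maximum absolute value in the band (1.0 for a silent band).
    scale = max(abs(x) for x in band) or 1.0
    # Symmetric signed code range, e.g. -7..+7 for 4 bits (assumed layout).
    levels = (1 << (bits - 1)) - 1
    codes = [round(x / scale * levels) for x in band]
    return scale, codes


def dequantize(scale, codes, bits):
    """Undo the quantization using the transmitted scale factor."""
    levels = (1 << (bits - 1)) - 1
    return [c / levels * scale for c in codes]
```

Because each band carries its own scale factor, the quantization step, and hence the quantization noise, follows the level of that band instead of the level of the loudest band.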
For quantizing signals split into plural frequency bands, it is known to divide the frequency spectrum into plural frequency bands taking into account the psychoacoustic characteristics of the human hearing mechanism. That is, spectral coefficients representing an audio signal on the frequency axis may be divided into a plurality of, for example, 25, critical frequency bands. The width of the critical bands increases with increasing frequency.
For encoding signals of the respective frequency bands, a pre-set number of bits are allocated from one frequency band to another, or encoding by adaptive bit allocation is performed from one frequency band to another. For example, when applying adaptive bit allocation to the spectral coefficient data resulting from MDCT, the spectral coefficient data generated by the MDCT within each of the critical bands is quantized using an adaptively allocated number of bits.
There are presently known the following two bit allocation techniques. For example, in IEEE Transactions of Acoustics, Speech and Signal Processing, vol.ASSP-25, No.4, August 1977, bit allocation is carried out on the basis of the amplitude of the signal in each critical band. This technique produces a flat quantization noise spectrum and minimizes the noise energy, but the noise level perceived by the listener is not optimum because the technique does not effectively exploit the psychoacoustic masking effect.
In the bit allocation technique described in M. A. Krassner, The Critical Band Encoder--Digital Encoding of the Perceptual Requirements of the Auditory System, ICASSP 1980, the psychoacoustic masking mechanism is used to determine a fixed bit allocation that produces the necessary signal-to-noise ratio for each critical band. However, if the signal-to-noise ratio of such a system is measured using a strongly tonal signal, for example, a 1 kHz sine wave, non-optimum results are obtained because of the fixed allocation of bits among the critical bands.
For overcoming these inconveniences, a high efficiency encoding apparatus has been proposed in which the total number of bits available for bit allocation is divided between a fixed bit allocation pattern pre-set for each small block and a block-based signal magnitude dependent bit allocation. The division ratio is set in dependence upon a signal which is relevant to the input signal, such that, the smoother the signal spectrum, the higher becomes the division ratio for the fixed bit allocation pattern, that is the smaller becomes the division ratio for block-based signal magnitude dependent bit allocation.
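The division of the bit budget described above may be sketched as follows. The text does not say how the smoothness of the spectrum is measured or how the signal-dependent share is distributed; the spectral flatness measure and the log-energy weighting used here are assumptions chosen purely for illustration, as are all function names.

```python
import math


def spectral_flatness(energies):
    """Geometric mean over arithmetic mean of the band energies.

    About 1.0 for a smooth (flat) spectrum, near 0.0 for a tonal one.
    Used here as an assumed stand-in for the smoothness indicator.
    """
    eps = 1e-12
    log_mean = sum(math.log(e + eps) for e in energies) / len(energies)
    arith_mean = sum(energies) / len(energies) + eps
    return min(1.0, math.exp(log_mean) / arith_mean)


def allocate_bits(energies, fixed_pattern, total_bits):
    """Split total_bits between a fixed pattern and a signal-dependent part."""
    ratio = spectral_flatness(energies)      # share given to the fixed pattern
    fixed_budget = ratio * total_bits
    dynamic_budget = total_bits - fixed_budget
    # Signal-dependent weights: log2 energy above the weakest band (assumed).
    levels = [math.log2(e + 1e-12) for e in energies]
    base = min(levels)
    weights = [lv - base for lv in levels]
    if sum(weights) == 0:                    # all bands equal: share evenly
        weights = [1.0] * len(energies)
    fixed_sum, weight_sum = sum(fixed_pattern), sum(weights)
    return [fixed_budget * f / fixed_sum + dynamic_budget * w / weight_sum
            for f, w in zip(fixed_pattern, weights)]
```

With a flat spectrum nearly the whole budget follows the fixed pattern, whereas with a single dominant band nearly the whole budget goes to that band. The returned values are fractional; rounding to whole bits is left out of the sketch.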
With this technique, if the energy is concentrated in a particular spectral component, as in the case of a sine wave input, a larger number of bits is allocated to the block containing the spectral component, significantly improving the overall signal-to-noise characteristics. Since the human auditory system is highly sensitive to a signal having acute spectral components, such a technique improves not only the measured signal-to-noise ratio but also the quality of the sound perceived by the listener.
In addition to the above techniques, a variety of other techniques have been proposed, and the model simulating the human auditory system has been refined, such that, if the encoding device is improved in its ability, encoding may be made with higher efficiency in light of the human auditory system.
If DFT or DCT is utilized as the method for transforming the waveform signal (sample data) such as the time-domain digital audio signals, into a spectral signal, a transform is executed using a time block made up of M sample data, and orthogonal transform such as DFT or DCT is carried out on the block basis. Such block-based orthogonal transform produces M independent real-number data (DFT coefficient data or DCT coefficient data). The M real-number data, thus produced, are subsequently quantized and encoded to give encoded data.
For decoding the encoded data to regenerate playback acoustic signals, the encoded data are decoded and dequantized to give real-number data, which then is inverse orthogonal-transformed by IDFT or IDCT. The resulting blocks made up of waveform element signals are linked together for regenerating acoustic signals.
The playback acoustic signals, thus generated, suffer from psychoacoustically undesirable linking distortion caused by block linking. For reducing the inter-block linking distortion, M1 sample data of both neighboring blocks are overlapped at the time of orthogonal transform by DFT or DCT.
However, if M1 sample data each are overlapped on both neighboring blocks for carrying out orthogonal transform, M real-number data are produced for (M-M1) sample data on an average, so that the number of real-number data obtained on orthogonal transform is larger than the number of the original sample data employed for orthogonal transform. Since the real-number data are subsequently quantized and encoded, such increase in the number of the real-number data obtained on orthogonal transform beyond the number of the original sample data is not desirable in view of the coding efficiency.
If MDCT is employed for orthogonal transform of acoustic data consisting of sample data such as digital audio signals, orthogonal transform is carried out using 2M sample data by overlapping M sample data on both neighboring blocks, for reducing the inter-block linking distortion for producing independent M real-number data (MDCT coefficient data). In this manner, M real-number data are obtained for M sample data on an average with MDCT so that higher efficiency encoding may be realized than with DFT or DCT.
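The property that blocks of 2M samples overlapped by M samples yield only M coefficients each, and that the overlapped blocks nevertheless add back to the original signal, may be illustrated by the following sketch. It uses the textbook MDCT definition with a sine window; the window choice, the zero padding at the signal ends and the function names are assumptions for the example, and the direct O(M^2) summation is used only for clarity.

```python
import math


def sine_window(M):
    """Sine window of length 2M; satisfies w[n]^2 + w[n+M]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * M)) for n in range(2 * M)]


def mdct(block, M):
    """2M windowed samples in, M coefficients out."""
    w = sine_window(M)
    return [sum(w[n] * block[n] *
                math.cos(math.pi / M * (n + 0.5 + M / 2) * (k + 0.5))
                for n in range(2 * M))
            for k in range(M)]


def imdct(coeffs, M):
    """M coefficients in, 2M windowed (still aliased) samples out."""
    w = sine_window(M)
    return [2.0 / M * w[n] *
            sum(coeffs[k] *
                math.cos(math.pi / M * (n + 0.5 + M / 2) * (k + 0.5))
                for k in range(M))
            for n in range(2 * M)]


def analyze_synthesize(x, M):
    """Transform x block by block and overlap-add the inverse transforms.

    len(x) is assumed to be a multiple of M. Returns the reconstructed
    signal and the total number of coefficients produced.
    """
    padded = [0.0] * M + list(x) + [0.0] * M   # so every sample is covered twice
    out = [0.0] * len(padded)
    n_coeffs = 0
    for start in range(0, len(padded) - M, M):  # hop of M, block of 2M
        coeffs = mdct(padded[start:start + 2 * M], M)
        n_coeffs += len(coeffs)
        for n, v in enumerate(imdct(coeffs, M)):
            out[start + n] += v                 # aliasing cancels in the sum
    return out[M:-M], n_coeffs
```

Apart from the one extra block caused by the padding at the ends, each hop of M new samples contributes exactly M coefficients, and the overlap-added output equals the input to within rounding error.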
For decoding the encoded data obtained on quantizing and encoding the real-number data by MDCT for generating playback acoustic signals, the encoded data is decoded and dequantized to give real-number data which is then inverse orthogonal-transformed by IMDCT on the basis of blocks corresponding to the overlapped blocks at the time of encoding to produce in-block waveform elements. These in-block waveform elements are added together with interference for reconstructing acoustic signals.
In general, if the length of a block for orthogonal transform (size of the block along the time axis) is increased, frequency resolution is improved. If the acoustic signals, such as digital audio signals, are orthogonal-transformed using such long blocks, the signal energy is concentrated in specified spectral components. On the other hand, if orthogonal transform is performed for blocks in which sufficiently long overlap is accorded in both neighboring blocks, inter-block distortion of acoustic signals may be reduced satisfactorily. If orthogonal transform is performed by MDCT on blocks in which the number of sample data equal to one-half the number of sample data of a block are overlapped between the neighboring blocks, the number of the real-number data obtained on orthogonal transform is not increased as compared to the number of the original sample data, so that a higher encoding efficiency may be achieved than in the case of orthogonal transform employing DFT or DCT.
Meanwhile, if the acoustic signals are blocked and resolved on the block basis into spectral components (real-number data obtained on orthogonal transform in the previous example) and the resulting spectral components are quantized and encoded, the quantization noise is produced in the acoustic signals subsequently produced at the time of block-based synthesis.
If the original acoustic signals contain signal components with abruptly changing signal levels, that is portions with abruptly changing levels (transient portions) in the waveform elements, and such acoustic signals are encoded and subsequently decoded, the quantization noise corresponding to the transient portions is spread to portions of the original acoustic signal other than the transient portions.
It is assumed that, as audio signals to be encoded, a waveform signal SW1 is employed, in which a quasi-stationary signal FL exhibiting only slight transition and low levels is followed by an attack portion AT with abruptly increasing sound level, as a transient portion, followed in turn by a succession of high level signals, as shown in FIG. 1A. If such waveform signal SW1 is blocked in a unit time width, signal components in each block are orthogonally transformed, and the resulting spectral signal components are quantized and encoded so as to be then decoded, dequantized and inverse orthogonally transformed, there is produced a waveform signal SW1 in which a larger quantization noise QN1 ascribable to the attack portion AT is superimposed over the entire block, as shown in FIG. 1C. The result is that the larger quantization noise QN1, higher in level than the quasi-stationary signal FL, is produced due to the attack portion AT in the quasi-stationary signal FL temporally previous to the attack portion AT, as shown in FIG. 1C. The quantization noise QN1, appearing in the quasi-stationary signal portion, temporally previous to the attack portion AT, cannot be masked by concurrent masking by the attack portion AT and hence proves hindrance to the hearing sense. Such quantization noise QN1, appearing ahead of the attack portion AT where the sound level rises abruptly, is generally termed pre-echo. For orthogonal transform of signal components in each block, the block is multiplied prior to orthogonal transform by a transform windowing function TW having a characteristic curve of being smoothly sloped at both skirt portions for prohibiting the spectral distribution from being spread over a wide range.
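The spreading of quantization noise ahead of the attack portion may be reproduced numerically with the following sketch. It is a simplified model under stated assumptions: a single block of 32 samples, an orthonormal DCT without the transform windowing function, a uniform quantizer with a fixed step, and an arbitrary test waveform. None of these values are taken from the figures.

```python
import math


def dct(x):
    """Orthonormal DCT-II of one block (direct summation)."""
    N = len(x)
    return [math.sqrt((1 if k == 0 else 2) / N) *
            sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
            for k in range(N)]


def idct(X):
    """Inverse of dct() above (orthonormal DCT-III)."""
    N = len(X)
    return [sum(math.sqrt((1 if k == 0 else 2) / N) * X[k] *
                math.cos(math.pi * (n + 0.5) * k / N) for k in range(N))
            for n in range(N)]


def codec(x, step):
    """Transform, quantize each coefficient to a multiple of step, invert."""
    return idct([step * round(c / step) for c in dct(x)])


# Quasi-stationary portion of amplitude 1e-4 followed by an attack of
# amplitude 1.0 starting at sample 24 (assumed test waveform).
N, attack_at = 32, 24
signal = [1e-4 * math.sin(0.3 * n) if n < attack_at else math.sin(1.1 * n)
          for n in range(N)]
decoded = codec(signal, step=0.25)

# Largest error in the samples that precede the attack: the pre-echo.
pre_echo = max(abs(d - s)
               for d, s in zip(decoded[:attack_at], signal[:attack_at]))
print("pre-echo peak:", pre_echo)
```

Since the quantization error is introduced on the frequency axis, the inverse transform distributes it over the whole block, so the samples preceding the attack receive noise far above their own level.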
In particular, if waveform signals are orthogonally transformed using a long block length for improving the frequency resolution as described previously, time resolution is lowered, thus generating pre-echo continuing for a prolonged time.
If the block length for orthogonal transform is reduced, the time period of generation of the quantization noise is reduced. Thus, if the block length for orthogonal transform is reduced in the vicinity of the attack portion, the time period of generation of pre-echo may be reduced, thus diminishing the hindrance to the hearing sense caused by pre-echo.
Referring to prevention of pre-echo by reducing the block length in the vicinity of the attack portion, the block for orthogonal transform may be reduced in length in the vicinity of the transient portion, such as the attack portion AT with abruptly increased sound level, in the waveform signal SW having the quasi-stationary signal FL in addition to the attack portion AT as shown in FIG. 2A, and orthogonal transform may be applied to signal components within the short block. In this manner, the time period of generation of pre-echo may be reduced sufficiently within the short block. If the time period of generation of pre-echo in a block can be reduced sufficiently, it becomes possible to reduce the hindrance to the hearing sense by the so-called backward masking effect by the attack portion AT. If orthogonal transform is applied to the signal components in the short block, the transform windowing function TWS as shown in FIG. 2B is applied before proceeding to orthogonal transform.
On the other hand, if the block length for orthogonal transform is reduced for the quasi-stationary signal FL and for signal portions downstream of the attack portion AT, frequency resolution is lowered thus lowering the encoding efficiency for these signal portions. Thus, it is preferred to increase the block length for orthogonal transform for these signal portions since the energy is then concentrated in particular spectral components thus raising the encoding efficiency.
Thus, in effect, the block length for orthogonal transform is selectively switched for orthogonal transform depending upon the properties of various portions of the waveform signals SW. If the block length is selectively switched in this manner, the transform windowing function is similarly switched depending upon the selected block length. For example, the transform windowing function TW is selectively switched so that a long transform windowing function TWL is applied for a block consisting of the quasi-stationary signal FL excluding the neighborhood of the attack portion AT, and a short transform windowing function TWS is applied to a short block in the neighborhood of the attack portion AT, as shown in FIG. 2B.
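The selection of the block length may be sketched as follows. The text does not specify how the attack portion is detected; comparing the peak levels of consecutive sub-segments against a fixed ratio is an assumption made for this example, as are the function names, the number of sub-segments and the threshold. As a further simplification, the whole block is split into short blocks once an attack is found, rather than only its neighborhood.

```python
def find_attack(block, n_sub=4, ratio=4.0):
    """Return the start index of the first sub-segment whose peak level
    exceeds that of the preceding sub-segment by more than ratio,
    or None if there is no such sub-segment."""
    size = len(block) // n_sub
    peaks = [max(abs(x) for x in block[i * size:(i + 1) * size])
             for i in range(n_sub)]
    for i in range(1, n_sub):
        # Small floor avoids comparing against an exactly silent segment.
        if peaks[i] > ratio * max(peaks[i - 1], 1e-9):
            return i * size
    return None


def choose_block_lengths(block, short_len):
    """One long block for quasi-stationary input, short blocks otherwise."""
    if find_attack(block) is None:
        return [len(block)]
    return [short_len] * (len(block) // short_len)
```

A transform windowing function of matching length would then be selected for each block length returned.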
However, if it is desired to implement the method of selectively switching the block length for orthogonal transform depending upon the characteristics of the various portions of the waveform signals in an actual configuration, it becomes necessary to provide orthogonal transform means capable of dealing with orthogonal transform with blocks of different lengths in an encoding apparatus, while it also becomes necessary to provide inverse orthogonal transform means capable of dealing with inverse orthogonal transform with blocks of different lengths in a decoding apparatus.
In addition, if the block length for orthogonal transform is changed, the number of spectral components resulting from orthogonal transform, which is proportional to the block length, changes as well. Thus, if these spectral components are grouped together in terms of critical bands as units for encoding, the number of spectral components contained in the critical bands differs with block lengths, thus complicating the subsequent encoding and decoding operations.
In short, the method of varying the block length for orthogonal transform has a drawback that both the encoding apparatus and the decoding apparatus become complex in structure.
A technique for effectively preventing the generation of pre-echo when an orthogonal transform such as DFT or DCT is applied for resolution into frequency components, while the block length for orthogonal transform is maintained at a constant value capable of assuring sufficient frequency resolution, is disclosed in, for example, JP Patent Kokai Publication 61-201526 or 63-7023, corresponding to European Patent Publication Nos. 0193143 and 0251028, which are not written in English.
In these EP publications, there is disclosed a method in which an input signal waveform is sliced at an interval of a block made up of plural data samples, a windowing function is applied to each block, an attack portion is detected, waveform signals of small amplitudes directly previous to the attack portion, that is quasi-stationary signals, are amplified and orthogonal transform, such as DFT or DCT, is applied to the amplified waveform signals to produce spectral components which are encoded.
For decoding, decoded spectral components are inverse orthogonal transformed by inverse DFT (IDFT) or inverse DCT (IDCT) and correction is made for amplification performed on the signals directly ahead of the attack portion at the time of encoding. This prohibits occurrence of the pre-echo. Since the block length for orthogonal transform may be perpetually maintained constant in this manner, the encoding apparatus and the decoding apparatus may be simplified in structure.
Referring to FIGS. 3A to 3C, the operating principle of encoding and decoding employing the windowing technique disclosed in the above publications is explained.
For encoding, the waveform signal SW shown in FIG. 3A is sliced in blocks each of a pre-set length, and sample data is overlapped at both ends with the neighboring blocks. The waveform signals SW in the respective blocks are multiplied with transform windowing functions TWa to TWc (FIG. 3B) for prohibiting diffusion of the spectral distribution. It is then checked if there is any attack portion AT in each block where the input waveform signal SW is abruptly increased in amplitude. In the example of FIGS. 3A and 3B, since the attack portion AT exists in the block associated with the transform windowing function TWb, the signal components in this block are multiplied with a gain control function GCb as shown at (b) in FIG. 3C for amplification. The gain control function GCb is a function that multiplies the signal of small amplitude directly ahead of the attack portion AT in the block, that is the quasi-stationary signal FL, by R, while multiplying the signal of the remaining portion by unity. In the example of FIGS. 3A to 3C, since there is no attack portion AT in the blocks associated with the transform windowing functions TWa and TWc, the signal components in these blocks are multiplied by unity by gain control functions GCa and GCc, respectively, so that no signal amplification is performed. The respective blocks are orthogonally transformed by DFT or DCT to produce spectral component signals which are encoded.
For decoding, decoded spectral components are inverse orthogonally transformed by IDFT or IDCT and corrected for gain control (amplification of small-amplitude signals) performed during encoding on the signals directly ahead of the attack portion.
With the above-described conventional technique, it becomes possible to prevent the pre-echo from occurring, with the block length for orthogonal transform remaining unchanged, by the gain control operation performed during encoding on the small amplitude signals directly ahead of the attack portion and by the corresponding gain control correction performed during decoding.
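The effect of the gain control and the gain control correction may be sketched as follows. The transform, quantization and windowing are replaced here by a hypothetical stand-in codec that adds an error of constant size to every sample of the block, which models quantization noise spread evenly over the block. The stand-in codec, the function names and the numerical values are assumptions for the example and do not appear in the cited publications.

```python
def gain_control(block, attack_at, R):
    """Encoder side: multiply the samples ahead of the attack by R and
    the remaining samples by unity."""
    return [x * R if n < attack_at else x for n, x in enumerate(block)]


def gain_correct(block, attack_at, R):
    """Decoder side: undo the amplification applied by gain_control()."""
    return [x / R if n < attack_at else x for n, x in enumerate(block)]


def noisy_codec(block, noise=0.05):
    """Hypothetical stand-in for transform coding: an error of fixed size
    and alternating sign on every sample of the block."""
    return [x + (noise if n % 2 == 0 else -noise) for n, x in enumerate(block)]


def encode_decode(block, attack_at, R, codec=noisy_codec):
    """Gain control, coding, decoding and gain control correction."""
    return gain_correct(codec(gain_control(block, attack_at, R)), attack_at, R)


# Quasi-stationary level 0.01 followed by an attack of level 1.0.
block = [0.01] * 6 + [1.0] * 2
without_gain = encode_decode(block, attack_at=6, R=1.0)
with_gain = encode_decode(block, attack_at=6, R=10.0)
```

In this model the error ahead of the attack is reduced by the factor R after the correction, whereas the error in and after the attack portion is unchanged.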
With the above-described method for preventing generation of pre-echo by gain control and gain control correction, however, the gain control amount for the attack portion is fixed. That is, a gain control function that multiplies the signal directly ahead of the attack portion by a fixed factor R is employed when an attack portion is detected, and a gain control function that multiplies the signal by unity is employed when no attack portion is detected. In other words, one of two gain control functions of fixed values is selected depending upon the presence or absence of the attack portion. Thus it is difficult to keep the sound quality from deteriorating, especially at a higher compression ratio.
Next, it is assumed that, as an audio signal to be encoded, a waveform signal SW2 shown in FIG. 4A is employed, in which a quasi-stationary signal FL with little transition and with a low signal level is followed by the attack portion AT with an abruptly rising sound level as the transient portion followed in turn by a release portion RE with abruptly decreased sound level. Such waveform signal SW2 is blocked with a unit block time width and signal components in the block are orthogonally transformed to produce spectral components which are quantized and encoded. If the resulting signals are decoded, dequantized and inverse orthogonally transformed, the resulting waveform signal SW2 is overlaid with the large quantization noise over the entire block due to the attack portion AT. Thus, the large quantization noise due to the attack portion AT appears in the quasi-stationary signal FL temporally previous to the attack portion AT and in the release portion RE temporally posterior to the attack portion AT, as shown in FIG. 4C. This quantization noise is larger in level than the quasi-stationary signal FL or the latter portion of the release portion RE. The quantization noise QN2F appearing in the signal portion temporally previous to the attack portion AT, that is pre-echo, and the quantization noise QN2B, appearing in the signal portion temporally posterior to the attack portion AT, cannot be masked by concurrent masking by the attack portion AT, thus proving hindrance to the hearing sense. The quantization noise QN2B appearing after the attack portion AT is generally termed post-echo. The transform windowing function TW similar to that shown in FIG. 1B is also shown in FIG. 4B.
It is possible with the technique disclosed in the prior-art system to prevent the pre-echo from occurring, while it is not possible to prevent post-echo from occurring.