This invention relates to a signal encoding method and apparatus for encoding acoustic signals, such as digital audio signals, by so-called high efficiency encoding, and a recording medium having the encoded signals recorded thereon. The invention also relates to a method for transmitting the encoded signals, and a signal decoding apparatus for decoding the encoded signals.
There exist a variety of high efficiency encoding techniques of encoding audio or speech signals. Examples of these techniques include transform coding in which a frame of digital signals representing the audio signal on the time axis is converted by an orthogonal transform into a block of spectral coefficients representing the audio signal on the frequency axis, and a sub-band coding in which the frequency band of the audio signal is divided by a filter bank into a plurality of sub-bands without forming the signal into frames along the time axis prior to coding. There is also known a combination of sub-band coding and transform coding, in which digital signals representing the audio signal are divided into a plurality of frequency ranges by sub-band coding, and transform coding is applied to each of the frequency ranges.
Among the filters for dividing a frequency spectrum into a plurality of equal-width frequency ranges, there is the quadrature mirror filter (QMF) as discussed in R. E. Crochiere, Digital Coding of Speech in Sub-bands, 55 Bell Syst. Tech J. No. 8 (1976). With such QMF filter, the frequency spectrum of the signal is divided into two equal-width bands. With the QMF, aliasing is not produced when the frequency bands resulting from the division are subsequently combined together.
In "Polyphase Quadrature Filters- A New Subband Coding Technique", Joseph H. Rothweiler ICASSP 83, Boston, there is shown a technique of dividing the frequency spectrum of the signal into equal-width frequency bands. With the present polyphase QMF, the frequency spectrum of the signals can be divided at a time into plural equal-width frequency bands.
There is also known a technique of orthogonal transform including dividing the digital input audio signal into frames of a predetermined time duration, and processing the resulting frames using a discrete Fourier transform (DFT), discrete cosine transform (DCT) and modified DCT (MDCT) for converting the signal from the time axis to the frequency axis. Discussions on MDCT may be found in J. P. Princen and A. B. Bradley, "Subband Transform Coding Using Filter Bank Based on Time Domain Aliasing Cancellation", ICASSP 1987.
By quantizing the signals divided on the band basis by the filter or orthogonal transform, it becomes possible to control the band subjected to quantization noise and psychoacoustically more efficient coding may be performed by utilizing the so-called masking effects. If the signal components are normalized from band to band with the maximum value of the absolute values of the signal components, it becomes possible to effect more efficient coding.
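The band-by-band normalization described above may be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names `normalize_and_quantize` and `dequantize`, the uniform symmetric quantizer, and the parameter `bits` are assumptions introduced here and are not specified in the text.

```python
def normalize_and_quantize(band, bits):
    """Normalize one band by its peak magnitude, then quantize uniformly.

    band : list of signal components belonging to one frequency band
    bits : word length allotted to each component (assumed >= 2)
    Returns the scale factor and the integer codes.
    """
    # Scale factor = maximum absolute value in the band (1.0 for an all-zero band).
    scale = max(abs(x) for x in band) or 1.0
    # Symmetric signed range, e.g. -7..+7 for 4 bits.
    levels = (1 << (bits - 1)) - 1
    codes = [round(x / scale * levels) for x in band]
    return scale, codes


def dequantize(scale, codes, bits):
    """Invert normalize_and_quantize up to the quantization error."""
    levels = (1 << (bits - 1)) - 1
    return [c / levels * scale for c in codes]
```

Because each band carries its own scale factor, a low-level band uses the full quantizer range, so that the quantization error in each band is bounded by half a quantization step of that band rather than by a step size common to the whole spectrum.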
For quantizing signals split into plural frequency bands, it is known to divide the frequency spectrum into plural frequency bands taking into account the psychoacoustic characteristics of the human hearing mechanism. That is, spectral coefficients representing an audio signal on the frequency axis may be divided into a plurality of, for example, 25, critical frequency bands. The width of the critical bands increases with increasing frequency.
For encoding signals of the respective frequency bands, a pre-set number of bits are allocated from one frequency band to another, or encoding by adaptive bit allocation is performed from one frequency band to another. For example, when applying adaptive bit allocation to the spectral coefficient data resulting from MDCT, the spectral coefficient data generated by the MDCT within each of the critical bands is quantized using an adaptively allocated number of bits.
The following two bit allocation techniques are presently known. In the technique described in IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-25, No. 4, August 1977, bit allocation is carried out on the basis of the amplitude of the signal in each critical band. This technique produces a flat quantization noise spectrum and minimizes the noise energy, but the noise level perceived by the listener is not optimum because the technique does not effectively exploit the psychoacoustic masking effect.
In the bit allocation technique described in M. A. Krassner, The Critical Band Encoder- Digital Encoding of the Perceptual Requirements of the Auditory System, ICASSP 1980, the psychoacoustic masking mechanism is used to determine a fixed bit allocation that produces the necessary signal-to-noise ratio for each critical band. However, if the signal-to-noise ratio of such a system is measured using a strongly tonal signal, for example, a 1 kHz sine wave, non-optimum results are obtained because of the fixed allocation of bits among the critical bands.
For overcoming these inconveniences, a high efficiency encoding apparatus has been proposed in which the total number of bits available for bit allocation is divided between a fixed bit allocation pattern pre-set for each small block and a block-based signal magnitude dependent bit allocation, and the division ratio is set in dependence upon a signal which is relevant to the input signal, such that, the smoother the signal spectrum, the higher becomes the division ratio for the fixed bit allocation pattern, that is the smaller becomes the division ratio for block-based signal magnitude dependent bit allocation.
With this technique, if the energy is concentrated in a particular spectral component, as in the case of a sine wave input, a larger number of bits is allocated to the block containing that spectral component, significantly improving the overall signal-to-noise characteristics. Since the human auditory system is highly sensitive to a signal having acute spectral components, this technique improves not only the measured signal-to-noise ratio but also the quality of the sound as perceived by the ear.
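The division of the bit budget between a fixed pattern and a signal magnitude dependent allocation may be sketched as follows. The text does not state how spectral smoothness is measured or how the dependent portion is distributed; the spectral flatness measure (geometric mean over arithmetic mean), the use of that measure directly as the division ratio, the logarithmic weighting, and the function name `split_bit_allocation` are all assumptions made for illustration only.

```python
import math


def split_bit_allocation(energies, fixed_pattern, total_bits):
    """Return (per-band fractional bit counts, share given to the fixed pattern).

    energies      : per-band signal energies, all > 0
    fixed_pattern : per-band weights of the pre-set allocation, sum > 0
    total_bits    : bit budget for the block
    """
    n = len(energies)
    # Assumed smoothness measure: spectral flatness in (0, 1]; 1 means a flat spectrum.
    log_mean = sum(math.log(e) for e in energies) / n
    flatness = min(1.0, math.exp(log_mean) / (sum(energies) / n))
    # Assumption: the smoother the spectrum, the larger the fixed share.
    fixed_share = flatness

    weight_sum = sum(fixed_pattern)
    fixed = [total_bits * fixed_share * w / weight_sum for w in fixed_pattern]

    # Signal-dependent part: weight each band by its level above the quietest band.
    floor = min(energies)
    levels = [math.log2(e / floor) for e in energies]
    level_sum = sum(levels)
    dependent_bits = total_bits * (1.0 - fixed_share)
    if level_sum == 0:
        dependent = [dependent_bits / n] * n  # all bands equal: spread evenly
    else:
        dependent = [dependent_bits * lv / level_sum for lv in levels]

    return [f + d for f, d in zip(fixed, dependent)], fixed_share
```

With a strongly tonal input, such as energies of `[1.0, 1.0, 1000.0, 1.0]`, the flatness is small, so that nearly the whole budget goes to the signal-dependent portion and hence to the band containing the tone. With a flat spectrum the allocation falls back to the fixed pattern.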
In addition to the above techniques, a variety of other techniques have been proposed, and the model simulating the human auditory system has been refined, such that, if the encoding device is improved in its ability, encoding may be made with higher efficiency in light of the human auditory system.
If DFT or DCT is utilized as the method for transforming the waveform signal (sample data) such as the time-domain digital audio signals, into a spectral signal, transform is executed using a time block made up of M sample data, and orthogonal transform such as DFT or DCT is carried out on the block basis. Such block-based orthogonal transform produces M independent real-number data (DFT coefficient data or DCT coefficient data). The M real-number data, thus produced, are subsequently quantized and encoded to give encoded data.
For decoding the encoded data to regenerate playback acoustic signals, the encoded data are decoded and dequantized to give real-number data, which then is inverse orthogonal-transformed by IDFT or IDCT. The resulting blocks made up of waveform element signals are linked together for regenerating acoustic signals.
The playback acoustic signals, thus generated, suffer from psychoacoustically undesirable linking distortion caused by block linking. For reducing the inter-block linking distortion, M1 sample data of both neighboring blocks are overlapped at the time of orthogonal transform employing DFT or DCT.
However, if M1 sample data each are overlapped on both neighboring blocks for carrying out orthogonal transform, M real-number data are produced for (M-M1) sample data on an average, so that the number of real-number data obtained on orthogonal transform is larger than the number of the original sample data employed for orthogonal transform. Since the real-number data are subsequently quantized and encoded, such increase in the number of the real-number data obtained in orthogonal transform beyond the number of the original sample data is not desirable in view of the coding efficiency.
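The overhead described above can be checked with a short calculation. The function name `coefficients_per_sample` and the example values M = 512 and M1 = 64 are illustrative assumptions, not figures taken from the text.

```python
from fractions import Fraction


def coefficients_per_sample(block_len, overlap):
    """Real-number data produced per new input sample.

    block_len : M, the number of sample data in one transform block
    overlap   : M1, the number of sample data shared with a neighboring block
    Each block yields M coefficients but advances by only (M - M1) samples.
    """
    if not 0 <= overlap < block_len:
        raise ValueError("overlap must satisfy 0 <= overlap < block_len")
    return Fraction(block_len, block_len - overlap)


# Example: M = 512 with M1 = 64 gives 512 / 448 = 8/7 coefficients per sample,
# that is, roughly 14 percent more data to be quantized and encoded.
ratio = coefficients_per_sample(512, 64)
```

Without overlap the ratio is exactly 1, and it grows without bound as M1 approaches M.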
If MDCT is employed for orthogonal transform of acoustic data consisting of sample data such as digital audio signals, orthogonal transform is carried out using 2M sample data, with M sample data overlapped with each of the two neighboring blocks for reducing the inter-block linking distortion, and M independent real-number data (MDCT coefficient data) are produced. In this manner, M real-number data are obtained for M sample data on an average with MDCT so that higher efficiency encoding may be realized than with DFT or DCT.
For decoding the encoded data obtained on quantizing and encoding the real-number data by MDCT for generating playback acoustic signals, the encoded data is decoded and dequantized to give real-number data which is then inverse orthogonal-transformed by IMDCT on the basis of blocks corresponding to the overlapped blocks at the time of encoding to produce in-block waveform elements. These in-block waveform elements are added together with interference for reconstructing acoustic signals.
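The MDCT analysis and overlap-add synthesis described above may be sketched as follows, with quantization omitted so that the cancellation of the time-domain aliasing can be observed directly. The sine window, the 2/M scaling of the inverse transform, the zero padding at both ends, and the names `sine_window`, `mdct`, `imdct` and `mdct_round_trip` are assumptions chosen for this illustration. The direct O(M²) summation is used for clarity rather than speed.

```python
import math


def sine_window(M):
    """Window of length 2M satisfying w[n]^2 + w[n + M]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * M)) for n in range(2 * M)]


def mdct(frame):
    """Transform 2M (windowed) samples into M real coefficients."""
    M = len(frame) // 2
    return [sum(frame[n] * math.cos(math.pi / M * (n + 0.5 + M / 2) * (k + 0.5))
                for n in range(2 * M))
            for k in range(M)]


def imdct(coeffs):
    """Transform M coefficients into 2M time-aliased samples."""
    M = len(coeffs)
    return [2.0 / M * sum(coeffs[k] * math.cos(math.pi / M * (n + 0.5 + M / 2) * (k + 0.5))
                          for k in range(M))
            for n in range(2 * M)]


def mdct_round_trip(signal, M):
    """Analyze with 50% overlapped blocks, then overlap-add the inverse blocks."""
    if len(signal) % M:
        raise ValueError("signal length must be a multiple of M")
    w = sine_window(M)
    # Pad by M on each side so that every input sample lies in two blocks.
    padded = [0.0] * M + list(signal) + [0.0] * M
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * M + 1, M):  # hop of M samples
        frame = [padded[start + n] * w[n] for n in range(2 * M)]
        block = imdct(mdct(frame))
        for n in range(2 * M):
            out[start + n] += block[n] * w[n]  # aliasing cancels in the sum
    return out[M:M + len(signal)]
```

Each hop of M new samples yields exactly M coefficients, yet the sum of the two overlapping inverse blocks reproduces the input to within rounding error. A single inverse block on its own does not, since it still contains the aliasing terms.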
In general, if the length of a block for orthogonal transform (size of the block along time axis) is increased, frequency resolution is improved. If the acoustic signals, such as digital audio signals, are orthogonal-transformed using such long blocks, the signal energy is concentrated in specified spectral components. On the other hand, if orthogonal transform is performed for blocks in which sufficiently long overlap is accorded in both neighboring blocks, inter-block distortion of acoustic signals may be reduced satisfactorily. If orthogonal transform is performed by MDCT on blocks in which the number of sample data equal to one-half the number of sample data of a block are overlapped between the neighboring blocks, and if the number of the real-number data obtained on orthogonal transform is not increased as compared to the number of the original acoustic signals, a higher encoding efficiency may be achieved than in the case of orthogonal transform employing DFT and DCT.
Meanwhile, if the acoustic signals are blocked and resolved on the block basis into spectral components (real-number data obtained by an orthogonal transform in the previous example) and the resulting spectral components are quantized and encoded, the quantization noise is produced in the acoustic signals subsequently produced on block-based synthesis.
If the original acoustic signals contain signal components with acutely changing signal levels, that is portions with acutely changing levels (transient portions) in the waveform elements, and such acoustic signals are encoded and subsequently decoded, the quantization noise corresponding to the transient portions is spread to portions of the original acoustic signal other than the transient portions.
If an acoustic signal SW, such as an audio signal shown in FIG. 1A, is employed, the above-mentioned transient portion of the acoustic signal SW is an attack portion AT in which the sound is increased in intensity. The signal temporally previous to the attack portion AT consists of a sub-stationary signal FL which is generally low in changes and in signal level. If the acoustic signal SW containing the sub-stationary signal FL and the attack portion AT is blocked with a time width as shown in FIG. 1A, and the signal components in the block are orthogonally transformed, quantized and encoded, the acoustic signal SW produced subsequent to decoding, dequantization and inverse orthogonal transform is such a signal in which the large quantization noise QN attributable to the attack portion AT is diffused to the entire block.
The result is that the large quantization noise QN, higher in level than the sub-stationary signal FL, is produced due to the attack portion AT in the portion of the sub-stationary signal FL temporally previous to the attack portion AT. Since this quantization noise QN precedes the attack portion AT, it is not masked by concurrent masking by the attack portion AT, and thus proves obstructive to the hearing sense.
Such quantization noise QN, appearing in the signal portion previous to the attack portion with the rapidly increased signal level is generally known as pre-echo. If the acoustic signal is orthogonally transformed over a long block for increasing the frequency resolution as described above, time resolution is worsened, thus occasionally producing pre-echo over a longer time period. For orthogonal transform of in-block signal components, a transform windowing function TW, having a characteristic curve which is smoothly changed at both end portions, as shown in FIG. 1B, is applied to the block prior to orthogonal transform for prohibiting the spectral distribution from being diffused over an excessively wide range.
If the block length for orthogonal transform is reduced in the vicinity of the attack portion, the time period of pre-echo generation may be reduced for reducing the acoustic obstructions otherwise caused by the pre-echo. That is, if, for the acoustic signal SW having the sub-stationary signal FL and the attack portion AT as shown for example in FIG. 2A, the block length for orthogonal transform is reduced in the vicinity of the transient portion with acutely changing signal amplitudes, such as the attack portion AT, and orthogonal transform is applied to signal components in the short block, the time duration of pre-echo may be sufficiently reduced in the short block. If the time duration of pre-echo is sufficiently reduced in the block, obstructions to the hearing sense may be reduced in the attack portion AT by the so-called reverse-direction masking effect. Meanwhile, when orthogonally transforming the signal components in the short block, a transform windowing function of short duration (short length transform window function TWS) as shown for example in FIG. 2B is applied before processing by orthogonal transform.
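The effect of the block length on the time duration of pre-echo may be illustrated with the following sketch. It uses a plain non-overlapped orthonormal DCT and a coarse uniform quantizer in place of the windowed transform of FIGS. 1B and 2B, purely to keep the example short; these simplifications, the test signal, and the names `dct`, `idct`, `code_blocks`, `make_attack_signal` and `pre_echo_length` are assumptions introduced here.

```python
import math


def dct(x):
    """Orthonormal DCT-II of one block."""
    N = len(x)
    return [math.sqrt((1 if k == 0 else 2) / N)
            * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
            for k in range(N)]


def idct(X):
    """Inverse of dct (orthonormal DCT-III)."""
    N = len(X)
    return [sum(math.sqrt((1 if k == 0 else 2) / N) * X[k]
                * math.cos(math.pi * (n + 0.5) * k / N) for k in range(N))
            for n in range(N)]


def code_blocks(signal, block_len, bits=4):
    """Transform, coarsely quantize and inverse transform block by block."""
    levels = (1 << (bits - 1)) - 1
    out = []
    for start in range(0, len(signal), block_len):
        X = dct(signal[start:start + block_len])
        peak = max(abs(c) for c in X) or 1.0  # per-block normalization
        Xq = [round(c / peak * levels) / levels * peak for c in X]
        out.extend(idct(Xq))
    return out


def make_attack_signal(length=64, attack_index=52):
    """Silence followed by a sudden sinusoid (the attack portion)."""
    return [0.0 if n < attack_index else math.sin(0.9 * (n - attack_index) + 0.5)
            for n in range(length)]


def pre_echo_length(original, decoded, attack_index, threshold=1e-6):
    """Number of samples before the attack that carry audible-scale error."""
    return sum(1 for n in range(attack_index)
               if abs(decoded[n] - original[n]) > threshold)
```

For the 64-sample signal produced by `make_attack_signal()`, coding with a single block of 64 samples spreads the quantization noise over the silent portion before the attack, whereas coding with blocks of 16 samples confines it to the few samples of the block that contains the attack, since the preceding blocks are silent and are reproduced without error.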
On the other hand, if the block length for orthogonal transform for the sub-stationary signal portion FL or for the signal portion subsequent to the attack portion AT is similarly reduced, the frequency resolution is lowered, thus lowering the encoding efficiency for these portions. Thus a longer block length for orthogonal transform is preferred for these portions, since the signal energy is then concentrated in specified spectral components and hence the encoding efficiency is improved.
Thus it is actually practiced to selectively change over the block length for orthogonal transform depending upon characteristics of various portions of the acoustic signals SW. For effecting such selective switching of the block length, the transform windowing function TW is similarly switched depending upon the block length selection. For example, a longer transform windowing function TWL is used for blocks consisting of signals of the sub-stationary portion FL and signals subsequent to the attack portion AT, while a shorter transform windowing function TWS is used for the block in the vicinity of the attack portion AT, by way of performing selective switching.
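The selective switching of the block length may be sketched as follows. The text does not specify how the attack portion is detected; the criterion used here, namely a sudden rise in energy between successive sub-segments of the block, together with the name `choose_block_length` and the default values of `n_sub`, `ratio` and `floor`, are hypothetical choices made only for illustration.

```python
def choose_block_length(block, long_len, short_len, n_sub=4, ratio=10.0, floor=1e-12):
    """Return short_len if the block appears to contain an attack, else long_len.

    block : samples of the candidate long block
    n_sub : number of equal sub-segments whose energies are compared
    ratio : energy rise between neighboring sub-segments treated as an attack
    floor : small constant so that digital silence does not trigger on noise
    """
    sub_len = len(block) // n_sub
    energies = [sum(x * x for x in block[i * sub_len:(i + 1) * sub_len])
                for i in range(n_sub)]
    for previous, current in zip(energies, energies[1:]):
        if current > ratio * (previous + floor):
            return short_len  # acute level increase: use the short window TWS
    return long_len           # sub-stationary: use the long window TWL
```

A decay in level does not trigger the short block in this sketch, since only rises in energy are tested, in keeping with the observation that the portion subsequent to the attack is preferably handled with the longer block.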
By increasing the length of the block for orthogonal transform in the sub-stationary portion or in the attack and subsequent portions, and by reducing the length of the block for orthogonal transform only in the vicinity of the attack portion with acutely changing signal levels, at the cost of the frequency resolution, sufficient frequency resolution may be maintained in the portions other than the attack portion, while the time duration of the pre-echo in the attack portion may be sufficiently reduced. If the time duration of the pre-echo in the attack portion can be sufficiently reduced in this manner, the pre-echo may be masked by the reverse-direction masking by the attack portion, so that psychoacoustically unobjectionable encoding may be achieved.
However, if the method for selectively switching the block length for orthogonal transform depending upon the properties or characteristics of respective components of the acoustic signals is employed, orthogonal transform means capable of dealing successfully with orthogonal transform for different block lengths need to be provided in the encoder, while inverse orthogonal transform means capable of dealing successfully with inverse orthogonal transform for different block lengths need to be provided in the decoder.
If the block length for orthogonal transform is changed, the number of spectral components obtained on orthogonal transform is proportional to the block length, such that, if these spectral components are grouped and encoded on, for example, the critical band basis, the numbers of the spectral components contained in the critical bands are also changed with the block lengths, thus complicating the subsequent encoding and decoding operations.
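The dependence of the per-band coefficient count on the block length may be illustrated as follows. The sampling frequency of 44.1 kHz, the band edges in `EDGES_HZ` (a coarse stand-in for a critical band table, not the table of any particular encoder) and the name `coefficients_per_band` are assumptions made for this example.

```python
# Hypothetical band edges in Hz, covering 0 to half the assumed sampling frequency.
EDGES_HZ = [0, 500, 1000, 2000, 4000, 8000, 16000, 22050]


def coefficients_per_band(num_coeffs, band_edges_hz, sample_rate_hz=44100.0):
    """Count the spectral components whose center frequency falls in each band.

    num_coeffs : number of spectral components produced by one transform block
                 (proportional to the block length)
    """
    bin_width = sample_rate_hz / (2.0 * num_coeffs)
    counts = [0] * (len(band_edges_hz) - 1)
    for k in range(num_coeffs):
        center = (k + 0.5) * bin_width
        for b in range(len(counts)):
            if band_edges_hz[b] <= center < band_edges_hz[b + 1]:
                counts[b] += 1
                break
    return counts


long_counts = coefficients_per_band(256, EDGES_HZ)   # long block
short_counts = coefficients_per_band(32, EDGES_HZ)   # short block
```

With these assumed values, the short block leaves the 500 Hz to 1000 Hz band with no spectral component at all, while the long block places several components in it, so that the grouping, bit allocation and packing stages must each handle a different layout for each block length.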
Said differently, the method of changing the block lengths for orthogonal transform has a drawback such that both the encoder and the decoder are increased in circuit scale.