To encode an input audio signal, the signal has conventionally been divided on the time axis into blocks, each corresponding to a predetermined time period (a frame). The frames are subjected, one by one, to a modified discrete cosine transform (MDCT), which transforms the time-series signal into a spectral signal on the frequency axis (a so-called spectral transform). The audio signal is thus encoded.
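The frame-by-frame MDCT described above can be sketched as follows. This is a minimal illustration in Python with NumPy; the 50% block overlap, the frame length in the usage example, and the direct-form basis computation are choices made here for clarity, not features of any particular conventional apparatus.

```python
import numpy as np

def mdct(block):
    """MDCT of one block of 2N time samples -> N spectral coefficients."""
    n2 = len(block)
    n = n2 // 2
    ns = np.arange(n2)
    ks = np.arange(n)
    # Direct-form MDCT basis: cos(pi/N * (n + 1/2 + N/2) * (k + 1/2))
    basis = np.cos(np.pi / n * np.outer(ns + 0.5 + n / 2, ks + 0.5))
    return block @ basis

def split_into_blocks(signal, n):
    """Divide the time-axis signal into 50%-overlapping blocks of 2N samples."""
    return [signal[s:s + 2 * n] for s in range(0, len(signal) - 2 * n + 1, n)]
```

For example, a 1024-sample signal split with N = 256 yields three overlapping 512-sample blocks, each transformed to 256 spectral coefficients.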
To encode the spectral signals, bits are allocated to each spectral signal obtained by performing the spectral transform on the time-series signal corresponding to one frame. Namely, a prescribed bit allocation or an adaptive bit allocation is carried out. For example, bit allocation may be performed in order to encode the coefficient data generated by the MDCT processing. In this case, an appropriate number of bits is allocated to the MDCT coefficient data acquired by performing the MDCT on the time-axis signal for each block.
The bit allocation is detailed in, for example, R. Zelinski and P. Noll, “Adaptive Transform Coding of Speech Signals,” IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-25, August 1977, and M. A. Krasner, MIT, “The Critical Band Coder: Digital Encoding of the Perceptual Requirements of the Auditory System,” ICASSP 1980.
Any audio signal input to an encoding apparatus contains various components such as the sounds of musical instruments and human voice. Even if a microphone records only voice or piano sound, the resultant signal does not represent the voice or piano sound alone. The signal usually contains background noise, i.e., the sound the recording device makes while being used, and also the electrical noise the recording device generates.
To the encoding apparatus, these noises, as well as the voice and the piano sound, are merely waveform information. The apparatus will therefore perform frequency-encoding on the noise components, too. This is a correct approach from the viewpoint of waveform reproducibility. In view of the human auditory characteristics, however, it can hardly be called an efficient encoding method.
Thus, bit allocation based on a psychoacoustic model may be carried out. That is, no bits are allocated to any frequency component whose level is lower than the lowest audible level, below which humans can hear nothing, or lower than a minimum encoding threshold value arbitrarily set in the encoding apparatus.
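Such psychoacoustic bit allocation amounts to a per-coefficient comparison of the spectrum against a threshold curve. A minimal sketch follows; the function name and the flat example threshold are assumptions made here for illustration, not part of any standardized apparatus.

```python
import numpy as np

def select_components(spectrum, audible_threshold, min_encoding_threshold=None):
    """Return True for spectral coefficients that will receive bits.

    A coefficient is kept only if its magnitude reaches both the lowest
    audible level and, when set, the apparatus's minimum encoding threshold.
    """
    threshold = np.asarray(audible_threshold, dtype=float)
    if min_encoding_threshold is not None:
        threshold = np.maximum(threshold, min_encoding_threshold)
    return np.abs(spectrum) >= threshold
```

Coefficients for which the mask is False are simply discarded and consume no bits in the code train.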
FIG. 1 outlines the configuration of a conventional encoding apparatus that performs such bit allocation as described above. In the encoding apparatus 100, a time-to-frequency transforming unit 101 transforms an input audio signal Si(t) to a spectral signal F(f), as is illustrated in FIG. 1. The spectral signal is supplied to a bit-allocation frequency-band determining unit 102. The bit-allocation frequency-band determining unit 102 analyzes the spectral signal F(f) and divides it into a frequency component F(f0) and a frequency component F(f1). The frequency component F(f0) is at a level equal to or higher than the lowest audible level, or equal to or greater than the minimum encoding threshold value, and will be subjected to bit allocation. The frequency component F(f1) will not be subjected to bit allocation. Only the frequency component F(f0) is supplied to a normalization/quantization unit 103; the frequency component F(f1) is thus discarded.
The normalization/quantization unit 103 carries out normalization and quantization on the frequency component F(f0), generating a quantized value Fq. The value Fq is supplied to an encoding unit 104. The encoding unit 104 encodes the quantized value Fq, generating a code train C. A recording/transmitting unit 105 records the code train C in a recording medium (not shown) or transmits the code train as a bit stream BS.
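The normalization and quantization performed by unit 103 can be sketched as follows. This is a simplified scalar scheme with a single scale factor per frame; a real apparatus would typically normalize per band and choose the word length adaptively, and the function name and default word length are assumptions made here.

```python
import numpy as np

def normalize_quantize(components, word_length=8):
    """Normalize the kept components by a scale factor, then quantize.

    Returns (quantized integers Fq, scale factor for the decoder).
    """
    scale = float(np.max(np.abs(components)))
    if scale == 0.0:
        scale = 1.0                      # silent frame: avoid division by zero
    levels = 2 ** (word_length - 1) - 1  # e.g. 127 for an 8-bit word length
    quantized = np.round(components / scale * levels).astype(int)
    return quantized, scale
```

The scale factor must travel with the quantized values in the code train so that the decoder can invert the normalization.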
The code train C generated by the encoding apparatus 100 may have such a format as is shown in FIG. 2. As FIG. 2 depicts, the code train C is composed of a header H, normalization information SF, quantization precision information WL, and frequency information SP.
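Under the format of FIG. 2, one frame of the code train can be modelled as a record with those four fields. The concrete field types and the contents shown in the comments and the usage example are illustrative assumptions, not the actual bit layout.

```python
from dataclasses import dataclass

@dataclass
class CodeTrain:
    header: bytes          # H:  e.g. sync pattern, frame length, sampling rate
    norm_info: list        # SF: scale factors used for normalization
    quant_precision: list  # WL: word lengths (bits) used for quantization
    spectrum_info: list    # SP: the quantized spectral coefficients Fq

# Hypothetical single-band frame built from illustrative values:
frame = CodeTrain(header=b"\x0b\x77", norm_info=[1.0],
                  quant_precision=[8], spectrum_info=[64, -127])
```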
FIG. 3 outlines the configuration of a decoding apparatus that may be used in combination with the encoding apparatus 100. In the decoding apparatus 120, a receiving/reading unit 121 restores the code train C from the bit stream BS received from the encoding apparatus 100, or from the recording medium (not shown), as is illustrated in FIG. 3. The code train C is supplied to a decoding unit 122. The decoding unit 122 decodes the code train C, generating a quantized value Fq. An inverse-quantization/inverse-normalization unit 123 performs inverse quantization and inverse normalization on the quantized value Fq, thus generating a frequency component F(f0). A frequency-to-time transforming unit 124 transforms the frequency component F(f0) to an output audio signal So(t). The output audio signal So(t) is output from the decoding apparatus 120.
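The inverse quantization and inverse normalization of unit 123 simply undo the encoder-side mapping. The sketch below assumes a simple scalar quantizer with a per-frame scale factor; the function name and word-length parameter are assumptions for illustration.

```python
import numpy as np

def inverse_quantize(quantized, scale, word_length=8):
    """Map quantized integers Fq back to spectral values F(f0)."""
    levels = 2 ** (word_length - 1) - 1
    return np.asarray(quantized, dtype=float) / levels * scale
```

Under this scheme the reconstructed value differs from the original by at most half a quantization step, which is the irreducible quantization error.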
FIG. 4 illustrates a case where no bit allocation is performed on any frequency component that is, in all frames, at a level lower than the lowest audible level A. As FIG. 4 shows, only frequency components of 0.60 f or less are encoded in the (n−1)th frame, all frequency components up to 1.00 f are encoded in the n-th frame, and only frequency components of 0.55 f or less are encoded in the (n+1)th frame. As a result, a component of a specific frequency is contained in some frames and not in others. Nonetheless, the code train can be regarded as equivalently containing all frequency components for all frames, because the components not contained in the code train are absolutely inaudible to humans. Hence, the music reproduced from the code train does not give the listener any sense of psychological auditory strangeness.
When all frequency components at levels equal to or higher than the lowest audible level are encoded, however, components that are not important, or white noise that need not be heard, are encoded, too. The encoding is therefore inefficient. Assume that the frequency components are encoded at a fixed bit rate, so that the same number of bits is allocated to each frame. Then, if the bit rate is too low, some frames may fail to receive enough bits to reproduce sound of satisfactory quality.
FIG. 5 illustrates a case where no bit allocation is performed on any frequency component that has a value smaller than the minimum encoding threshold value a set for each frame. As FIG. 5 shows, the encoding apparatus sets a minimum encoding threshold value a(n−1) for the (n−1)th frame. Any component whose level is lower than this value is regarded as not influencing the sound quality even if it is not recorded in the (n−1)th frame, because such a component is not so important to the sound quality. As a result, only frequency components of 0.60 f or less are encoded in the (n−1)th frame.
If the band of frequency components left unencoded is the same in all frames, the encoded frequency components are equivalent to components encoded after passing through a low-pass filter. The band may therefore be perceived as narrowed in some cases. Nevertheless, this sense of a narrowed band is not so problematic in consideration of the original frequency distribution and the human auditory characteristics.
However, the next frame, i.e., the n-th frame, has only small energy, and more frequency components are left unencoded than in the (n−1)th frame. In the (n+1)th frame, which has large energy, all frequency components are encoded, since the encoding apparatus determines that they are important to the auditory sense.
If the frequency components contained in the code train vary so much from frame to frame, the continuity between frames is jeopardized when they are reproduced, and the variation may be perceived as obvious noise. This noise is similar to the background noise of FM broadcasting, which varies with time as the radio-wave conditions change. Consequently, the listener feels that the music contains a specific noise, inevitably perceiving psychological auditory strangeness.
Jpn. Pat. Appln. Laid-Open Publication No. 8-166799, filed by the applicant hereof, discloses a technique for preventing the generation of such noise. In this technique, the bandwidth in which bit allocation was performed for the preceding frame is recorded and stored. The bandwidth in which bit allocation is performed for the present frame is then determined so as not to differ greatly from the stored bandwidth. This controls changes in the reproduction band and ultimately prevents the generation of noise.
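The band-stabilizing idea of the publication can be sketched as a simple clamp on the per-frame bit-allocation bandwidth. The function name and the maximum step size are assumptions made here for illustration; the publication itself does not prescribe this exact form.

```python
def limit_band_change(prev_band, candidate_band, max_step):
    """Keep this frame's bit-allocation bandwidth close to the previous frame's.

    prev_band and candidate_band are upper band edges (e.g. as fractions of
    the full bandwidth f); max_step bounds the frame-to-frame change.
    """
    low = prev_band - max_step
    high = prev_band + max_step
    return min(max(candidate_band, low), high)
```

For example, with a previous band edge of 0.60 f and a maximum step of 0.05 f, a frame whose analysis calls for the full band 1.00 f is limited to 0.65 f, smoothing the jump that would otherwise be heard as noise.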
The technique disclosed in Jpn. Pat. Appln. Laid-Open Publication No. 8-166799 indeed helps to stabilize the reproduction band. However, it cannot completely solve the auditory problem, since it still allows the reproduction band to fluctuate.
To stabilize the reproduction band, components of frequencies falling within an inherently unnecessary band may be recorded, or components of frequencies falling within an inherently necessary band may not be recorded. Either case is undesirable in view of encoding efficiency.
Alternatively, all frequencies may be analyzed over several frames or several tens of frames, and the same bandwidth for bit allocation may be applied to all of those frames. This method is not practical, however, in view of the real-time processing required and the cost of the memories and processors incorporated in consumer hardware. Further, the method does not seem to increase the encoding efficiency.