A variety of techniques exist for digitally encoding audio or speech signals using bit rates considerably lower than those required for pulse-code modulation (PCM). In sub-band coding (SBC), a filter bank divides the frequency band of the audio signal into a plurality of sub bands. In sub-band coding, the signal is not formed into frames along the time axis prior to coding. In transform coding, a frame of digital signals representing the audio signal on the time axis is converted by an orthogonal transform into a block of spectral coefficients representing the audio signal on the frequency axis.
In a combination of sub-band coding and transform coding, digital signals representing the audio signal are divided into a plurality of frequency ranges by sub-band coding, and transform coding is independently applied to each of the frequency ranges.
Known filters for dividing a frequency spectrum into a plurality of frequency ranges include the Quadrature Mirror Filter (QMF), as discussed in, for example, R. E. Crochiere, Digital Coding of Speech in Subbands, 55 BELL SYST. TECH. J., No. 8, (1976). The technique of dividing a frequency spectrum into equal-width frequency ranges is discussed in Joseph H. Rothweiler, Polyphase Quadrature Filters: A New Subband Coding Technique, ICASSP 83 BOSTON.
Known techniques for orthogonal transform include the technique of dividing the digital input audio signal into frames of a predetermined time duration, and processing the resulting frames using a Fast Fourier Transform (FFT), discrete cosine transform (DCT) or modified DCT (MDCT) to convert the signals from the time axis to the frequency axis. Discussion of an MDCT may be found in J. P. Princen and A. B. Bradley, Subband/Transform Coding Using Filter Bank Based on Time Domain Aliasing Cancellation, ICASSP 1987.
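By way of illustration only, the framing and transform steps may be sketched as follows for the MDCT. The sketch assumes a sine window and a hop of half a frame, which are common choices but are not taken from the cited reference; the function names are hypothetical. Overlapping and adding the inverse-transformed frames cancels the time domain aliasing, so fully overlapped samples are recovered.

```python
import math

def sine_window(n2):
    # Sine window of length 2N; satisfies w[n]**2 + w[n + N]**2 == 1.
    return [math.sin(math.pi * (n + 0.5) / n2) for n in range(n2)]

def mdct(frame):
    # Windowed MDCT: 2N time samples in, N spectral coefficients out.
    n2 = len(frame)
    n = n2 // 2
    w = sine_window(n2)
    return [sum(w[t] * frame[t] *
                math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                for t in range(n2))
            for k in range(n)]

def imdct(coeffs):
    # Windowed inverse MDCT: N coefficients in, 2N aliased samples out.
    n = len(coeffs)
    n2 = 2 * n
    w = sine_window(n2)
    return [w[t] * (2.0 / n) *
            sum(coeffs[k] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                for k in range(n))
            for t in range(n2)]

def analyze_synthesize(signal, n):
    # Transform overlapping frames of 2N samples (hop N) and overlap-add them.
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n + 1, n):
        block = imdct(mdct(signal[start:start + 2 * n]))
        for t, v in enumerate(block):
            out[start + t] += v
    return out

signal = [math.sin(0.3 * i) + 0.5 * math.cos(1.1 * i) for i in range(40)]
rebuilt = analyze_synthesize(signal, n=8)
```

Only the samples covered by two frames (here, samples 8 through 31) are reconstructed; the first and last half frames lack an overlapping partner.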
In a technique of quantizing the spectral coefficients resulting from an orthogonal transform, it is known to use sub bands that take advantage of the psychoacoustic characteristics of the human auditory system. In this technique, the spectral coefficients representing an audio signal on the frequency axis may be divided into a plurality of critical frequency bands. The width of the critical bands increases with increasing frequency. Normally, about 25 critical bands are used to cover the audio frequency spectrum of 0 Hz to 20 kHz. In such a quantizing system, bits are adaptively allocated among the various critical bands. For example, when applying adaptive bit allocation to the spectral coefficient data resulting from a Modified Discrete Cosine Transform (MDCT), the spectral coefficient data generated by the MDCT within each of the critical bands is quantized using an adaptively-allocated number of bits.
Known adaptive bit allocation techniques include that described in IEEE TRANS. ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, Vol. ASSP-25, No. 4 (1977, August) in which bit allocation is carried out on the basis of the amplitude of the signal in each critical band. This technique produces a flat quantization noise spectrum and minimizes noise energy, but the noise level perceived by the listener is not optimum because the technique does not effectively exploit the psychoacoustic masking effect.
In the bit allocation technique described in M. A. Krassner, The Critical Band Encoder: Digital Encoding of the Perceptual Requirements of the Auditory System, ICASSP 1980, the psychoacoustic masking mechanism is used to determine a fixed bit allocation that produces the necessary signal-to-noise ratio for each critical band. However, if the signal-to-noise ratio of such a system is measured using a strongly tonal signal, for example, a 1 kHz sine wave, non-optimum results are obtained because of the fixed allocation of bits among the critical bands.
Block floating is a normalization process applied to a block of data comprising plural words, such as a block of plural spectral coefficients. Block floating is applied by multiplying each word in the block by a common value for the block to improve quantization efficiency. In a typical block floating process, the maximum absolute value of the words in the data block is found and is used as a block floating coefficient common to all the words in the data block. Using the maximum absolute value in the band as the block floating coefficient prevents data overflow because the absolute value of no other word in the data block can be greater than the maximum absolute value. A simplified form of block floating determines the block floating coefficient using a shift quantity, which provides block floating in 6 dB steps.
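The two forms of block floating described above may be sketched as follows. This is a minimal illustration only: the function names and the 16-bit word size are assumptions, and the shift-based form is shown for integer words. Each one-bit shift doubles the amplitude, which corresponds to a step of about 6 dB.

```python
import math

def block_floating_coefficient(block):
    # Maximum absolute value of the words in the block.
    return max(abs(w) for w in block)

def normalize(block):
    # Scale every word by the same factor so that the largest magnitude is 1.0.
    sf = block_floating_coefficient(block)
    if sf == 0:
        return sf, [0.0] * len(block)
    return sf, [w / sf for w in block]

def shift_quantity(block, word_bits=16):
    # Simplified form: number of left shifts that fit without overflow.
    # word_bits is an assumed signed word size.
    peak = block_floating_coefficient(block)
    limit = (1 << (word_bits - 1)) - 1
    shift = 0
    while peak and (peak << (shift + 1)) <= limit:
        shift += 1
    return shift

block = [3, -100, 25, 7]
sf, normalized = normalize(block)
shift = shift_quantity(block)
shifted = [w << shift for w in block]
db_per_shift = 20 * math.log10(2)   # about 6.02 dB
```

Because the common factor is derived from the largest word, no scaled word can exceed the available range, which is the overflow protection noted above.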
A system for compressing an input audio signal using an audio signal processing method that includes block floating is likely to suffer from a phenomenon known as pre-echo. Pre-echo seriously impairs the sound quality obtained when the compressed digital signal is subsequently expanded, decoded, and reproduced.
Pre-echo manifests itself when there is a transient in the audio input signal being coded, i.e., when the amplitude of the audio input signal increases rapidly. Pre-echo occurs when the transient in the audio input signal occurs part-way through the frame subject to block floating, especially when the transient occurs towards the end of the frame. The coding system sets a quantization noise level that remains constant during each frame according to the maximum signal level in the frame. This quantization noise level is inappropriate for the part of the frame occurring before the transient, when the signal level is not high enough to mask the quantization noise level. As a result, quantization noise is audible before the transient occurs.
Pre-echo is not heard in the latter part of the frame after the transient because of the psychoacoustic property of the human sense of hearing called "masking." Masking is a psychoacoustic phenomenon in which a signal is rendered inaudible, or "masked," by other signals occurring simultaneously with, or slightly earlier than, or later than, the signal. Masking effects may be classed into time axis masking effects, that is, masking by signals occurring earlier or later than the masked signal, and concurrent masking effects, that is, masking by simultaneously-occurring signals having a frequency different from the frequency of the masked signal. Backward temporal masking, i.e., the masking, by a high-level sound, of noise occurring before the high-level sound, has a considerably shorter duration than forward temporal masking, i.e., the masking, by a high-level sound, of noise occurring after the high-level sound has ceased.
Masking enables a signal to render inaudible any noise within its time or frequency masking range. This means that a digital coding system that produces quantizing noise may have quantizing noise levels that are high compared with the noise level that is allowable in the absence of a signal provided that the quantizing noise lies within the masking range of the signal. Since relatively high levels of quantizing noise are allowable if masked by the signal, the number of bits required to represent the signal, or parts of the signal, may be significantly reduced.
A critical band is a measure of the range of frequencies that can be masked by a signal. A critical band is the band of noise that can be masked by a pure signal that has the same intensity as the noise and has a frequency in the middle of the critical band. The width of successive critical bands increases with increasing frequency of the pure signal. The audio frequency range of 0 Hz to 20 kHz is normally divided into, e.g., 25 critical bands.
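The division into critical bands may be illustrated with the commonly tabulated band edges attributed to Zwicker. The table below is not taken from this document, and the final edge of 20000 Hz is an assumption made here so that 25 bands span 0 Hz to 20 kHz; the function name is hypothetical.

```python
from bisect import bisect_right

# Commonly tabulated critical band edges in Hz. The last edge (20000 Hz) is an
# assumption chosen so that 25 bands cover 0 Hz to 20 kHz.
BAND_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
                 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
                 9500, 12000, 15500, 20000]

# Width of each band; the widths grow with increasing frequency.
BAND_WIDTHS_HZ = [hi - lo for lo, hi in zip(BAND_EDGES_HZ, BAND_EDGES_HZ[1:])]

def critical_band_index(freq_hz):
    # Zero-based index of the critical band containing freq_hz.
    if not 0 <= freq_hz <= BAND_EDGES_HZ[-1]:
        raise ValueError("frequency outside 0 Hz to 20 kHz")
    return min(bisect_right(BAND_EDGES_HZ, freq_hz) - 1, len(BAND_WIDTHS_HZ) - 1)
```

With this table, a 1 kHz tone falls in the band from 920 Hz to 1080 Hz, which is 160 Hz wide, whereas the lowest bands are 100 Hz wide.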
FIG. 7 shows a digital representation of an audio signal containing a transient in which the level of the audio signal rapidly increases. The digital signal is divided into frames that include a predetermined number of samples, i.e., the frames T1 through T4, and block floating is applied. If the digital signal subsequent to compression by block floating is expanded, decoded, and reproduced, the quantization noise level in the part of the frame T2 prior to the transient is too high to be masked by the signal level in that part of the frame T2, and pre-echo will be audible.
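The effect may be reproduced numerically with a simplified sketch. The following assumes a single frame of 64 samples that is silent until a burst begins at sample 48, a plain DCT in place of whatever transform a practical coder uses, and a 4-bit uniform quantizer whose step size is fixed by the largest coefficient in the block. All names and values are illustrative assumptions. Because each quantized coefficient extends over the whole frame, the quantization error appears before the burst, where there is no signal to mask it.

```python
import math

def dct(x):
    # Unnormalized DCT-II of one frame.
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]

def idct(c):
    # Inverse of dct() above (a scaled DCT-III).
    n = len(c)
    return [(c[0] / 2 + sum(c[k] * math.cos(math.pi * (t + 0.5) * k / n)
                            for k in range(1, n))) * 2 / n
            for t in range(n)]

def code_frame(frame, bits):
    # Transform, quantize with one step size for the whole block, transform back.
    coeffs = dct(frame)
    peak = max(abs(c) for c in coeffs)       # block floating coefficient
    if peak == 0:
        return list(frame)
    step = peak / ((1 << (bits - 1)) - 1)    # step size set by the block maximum
    return idct([round(c / step) * step for c in coeffs])

N, ONSET = 64, 48                            # assumed frame length and onset
frame = [0.0] * ONSET + [math.sin(2 * math.pi * 0.2 * t) for t in range(ONSET, N)]
decoded = code_frame(frame, bits=4)
noise = [d - x for d, x in zip(decoded, frame)]

signal_before = sum(x * x for x in frame[:ONSET])   # energy before the transient
noise_before = sum(e * e for e in noise[:ONSET])    # error energy before it
noise_after = sum(e * e for e in noise[ONSET:])     # error energy after onset
```

In this sketch signal_before is zero while noise_before is not, which corresponds to the audible noise in the part of the frame T2 that precedes the transient.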
FIG. 9 shows a schematic arrangement of a conventional audio signal processing apparatus for compressing the digital signal representing an audio signal. The signal processing apparatus processes a digital input signal by a method in which the input digital signal is divided in time into frames of a predetermined number of samples, each frame of samples in the time domain is orthogonally transformed into a block of spectral coefficients in the frequency domain, block floating is applied to each block of spectral coefficients, each block of spectral coefficients is quantized by adaptive bit allocation, and the quantized signals are transmitted simultaneously with parameters relevant to block floating and adaptive bit allocation. The process by which pre-echo is caused will next be described in connection with the arrangement shown in FIG. 9.
In FIG. 9, the frames of the digital audio input signal TS in the time domain are fed into the input terminal 101. Each frame of the input digital signal TS is transformed into plural spectral coefficients SP by the orthogonal transform circuit 111. The spectral coefficients SP are transmitted to the spectral coefficient quantization circuit 115.
The spectral coefficient quantization circuit 115 normalizes the spectral coefficients SP by block floating, and then quantizes the normalized spectral coefficients SP with an adaptively allocated number of bits.
Each block of spectral coefficients SP resulting from transforming one frame of the input signal is also supplied to the block floating coefficient calculating circuit 113, which calculates at least one block floating coefficient SF for each block, and provides the block floating coefficients SF to the spectral coefficient quantization circuit 115. The spectral coefficient quantization circuit then carries out block floating on each block using the block floating coefficient SF received from the block floating coefficient calculating circuit 113. The spectral coefficients may be divided into bands, a block floating coefficient may be calculated for each band, and block floating may be applied to the spectral coefficients in each band.
The spectral coefficients SP are also supplied to the bit allocation calculating circuit 114 which provides a word length WL that indicates the number of bits to be used by the spectral coefficient quantization circuit 115 for quantizing the spectral coefficients. The number of bits used for quantizing determines the quantizing noise level: increasing the number of bits reduces the quantizing noise level. The number of quantizing bits is calculated in response to an allowable noise level calculated from the energy level in each block of spectral coefficients SP, taking masking into consideration.
Quantized spectral coefficients QSP produced by processing each block of spectral coefficients with block floating in response to the block floating coefficient SF, and by quantizing each block of spectral coefficients with a number of bits adaptively allocated in response to the word length WL, are fed from the spectral coefficient quantization circuit 115 to the bit stream converting circuit 116. The bit stream converting circuit converts the quantized spectral coefficients QSP, the block floating coefficient SF, and the word length WL, for each block into a bit stream. The output of the bit stream converting circuit is provided as an output bit stream to the output terminal 102.
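The operation attributed to the spectral coefficient quantization circuit 115 may be sketched as below, assuming a uniform mid-tread quantizer with signed integer codes. The function names are hypothetical, and packing of the codes, the block floating coefficient SF, and the word length WL into a bit stream is omitted.

```python
def quantize_block(coeffs, scale_factor, word_length):
    # Normalize by the block floating coefficient, then round to signed codes.
    if word_length < 2 or scale_factor == 0:
        return [0] * len(coeffs)
    levels = (1 << (word_length - 1)) - 1    # largest positive code
    return [round(c / scale_factor * levels) for c in coeffs]

def dequantize_block(codes, scale_factor, word_length):
    # Decoder side: rebuild coefficients from the codes, SF and WL.
    if word_length < 2:
        return [0.0] * len(codes)
    levels = (1 << (word_length - 1)) - 1
    return [q * scale_factor / levels for q in codes]

coeffs = [0.5, -2.0, 1.2, 0.25]
sf, wl = 2.0, 4                              # assumed SF and WL for this block
codes = quantize_block(coeffs, sf, wl)
decoded = dequantize_block(codes, sf, wl)
max_error = max(abs(c - d) for c, d in zip(coeffs, decoded))
```

The largest reconstruction error is half a quantizer step, i.e., sf / (2 * levels), so each additional bit of word length roughly halves the error for a given block floating coefficient.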
FIG. 10 shows a practical arrangement of the bit allocation calculating circuit 114. Each block of spectral coefficients SP is supplied via the terminal 141 to the energy calculating circuit 145 which determines the energy distribution. This can involve determining the energy distribution among the critical bands. The output of the energy calculating circuit 145, the energy level EN, is supplied to the allowable noise level calculating circuit 146, which determines the allowable noise level AN for each block in response to the energy distribution found by the energy calculating circuit 145, and taking masking into account.
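The energy calculation and the derivation of an allowable noise level may be sketched as follows. The masking model here is a deliberately crude assumption, not the one used by the circuit 146: each band masks noise lying a fixed offset below its own energy, and the masking falls off by a fixed number of decibels per band of separation. The offset, the slope, and the function names are all hypothetical.

```python
def band_energies(coeffs, band_edges):
    # Energy (sum of squares) of the spectral coefficients in each band.
    return [sum(c * c for c in coeffs[lo:hi])
            for lo, hi in zip(band_edges, band_edges[1:])]

def allowable_noise_levels(energies, offset_db=20.0, slope_db_per_band=10.0):
    # Assumed model: noise in band i is allowed up to the strongest masking
    # contribution from any band j, attenuated by offset plus slope * |i - j|.
    levels = []
    for i in range(len(energies)):
        best = 0.0
        for j, energy in enumerate(energies):
            atten_db = offset_db + slope_db_per_band * abs(i - j)
            best = max(best, energy * 10 ** (-atten_db / 10))
        levels.append(best)
    return levels

coeffs = [0.0] * 4 + [1.0] * 4 + [0.0] * 4
edges = [0, 4, 8, 12]                        # three equal bands, for brevity
en = band_energies(coeffs, edges)
an = allowable_noise_levels(en)
```

In this example only the middle band carries energy, yet the neighboring bands still receive a nonzero allowable noise level because of the assumed spread of masking across bands.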
The output of the allowable noise level calculating circuit 146, the allowable noise level AN, is supplied to the word length calculating circuit 147, which also receives the block floating coefficient SF from the block floating coefficient calculating circuit 113 via the terminal 143. The word length calculating circuit 147 determines the word length WL, indicating the number of allocated bits, from the allowable noise level AN and the value of the block floating coefficient SF for each respective block.
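One way the word length might follow from these two inputs is sketched below. It assumes a uniform quantizer whose step size is the block floating coefficient divided by the largest positive code, and the usual approximation that such a quantizer produces noise power of step squared over twelve. The function names and the 16-bit ceiling are assumptions.

```python
def quantization_noise_power(scale_factor, word_length):
    # Approximate noise power of a uniform quantizer: step**2 / 12.
    levels = (1 << (word_length - 1)) - 1
    step = scale_factor / levels
    return step * step / 12.0

def word_length_for(scale_factor, allowable_noise, max_bits=16):
    # Smallest word length whose quantization noise does not exceed the
    # allowable noise level; max_bits is an assumed upper limit.
    if scale_factor == 0:
        return 0
    for bits in range(2, max_bits + 1):
        if quantization_noise_power(scale_factor, bits) <= allowable_noise:
            return bits
    return max_bits

wl_quiet = word_length_for(1.0, 1e-4)
wl_loud = word_length_for(4.0, 1e-4)
wl_relaxed = word_length_for(1.0, 1e-2)
```

Raising the block floating coefficient by a factor of four at the same allowable noise level costs two more bits in this sketch, and relaxing the allowable noise level saves bits, consistent with the relation between the number of bits and the quantizing noise level described above.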
FIG. 11 shows the units to which block floating is applied. Each unit is represented by a rectangle. Block floating is applied to each unit produced by dividing along the frequency axis and the time axis. The division along the time axis represents the division of the input signal into frames. The division along the frequency axis represents a division of the spectral coefficients resulting from the orthogonal transform of one frame of the input signal into one or more bands, preferably into 25 critical bands. Thus, each rectangle shown in FIG. 11 represents a block consisting of one band of the spectral coefficients resulting from transforming one frame of the input signal.
For example, for each of the four successive frames of the input signal shown in FIG. 11 (frames T1 through T4), the bit allocation calculating circuit 114 performs a set of processing operations to determine the number of bits to allocate for quantizing the spectral coefficients in each of the blocks B1 through B4, each of the blocks B1 through B4 being a block in one band of the spectral coefficients resulting from transforming the frames T1 through T4, respectively, of the input signal.
FIG. 12 shows how the bit allocation calculating circuit 114 adaptively allocates bits to the blocks B1 through B4 in the four successive frames T1 through T4, respectively, of the audio signal shown in FIG. 7. The bit allocation calculating circuit 114 finds the energy levels EN1 through EN4 in the blocks B1 through B4, respectively. From the respective energy levels EN1 through EN4, the bit allocation calculating circuit calculates the allowable noise levels AN1 through AN4. The bit allocation calculating circuit 114 also calculates the word length WL and the value of the block floating coefficient SF for each of the blocks B1 through B4 from the respective allowable noise levels AN1 through AN4.
Referring to FIGS. 7 and 12, since the signal level in the latter part of the frame T2 is considerably greater than the signal level in the former part, the energy level EN2 of the block B2, and the allowable noise level calculated in response to the energy level EN2, i.e., the quantizing noise level masked by the energy level EN2, are determined in response to the energy level in the latter part of the frame. As a result, the number of bits allocated for quantizing the spectral coefficients SP in the block B2, i.e., the number of bits indicated by the word length WL2, is allocated such that the quantization noise level in the block B2 is below the allowable noise level AN2, determined in response to the high energy level EN2 in the latter part of the frame.
The signal level is low in the former part of the frame T2, as shown in FIG. 7, so the allowable noise level AN21 of the block B21 corresponding to the former part of the frame T2 should be a relatively low value, as shown in FIG. 13. On the other hand, the signal level in the latter part of the frame T2 is considerably greater than in the former part, as shown in FIG. 7, so the allowable noise level AN22 of the block B22 corresponding to the latter part of the frame T2 has a considerably higher value, as shown in FIG. 13. FIG. 13 also shows energy levels EN21 and EN22 in the blocks B21 and B22 corresponding to the former part and the latter part of the frame T2, for the input signal shown in FIG. 7.
Because of the differences in energy level and allowable noise level between the two parts of the frame T2, as shown in FIG. 13, if the number of bits allocated for quantizing the block B2 is determined in the manner shown in FIG. 12, the quantizing noise level is set at a level that exceeds the allowable noise level AN21. This level of quantizing noise is audible as pre-echo in the former part of the frame T2 shown in FIG. 13 (in the block B21).
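The mismatch may be made concrete with a small numerical sketch. It assumes a frame whose first half has one hundredth the amplitude of its second half, uses mean power as a stand-in for the energy levels so that parts of different length can be compared, and derives each allowable noise level with a hypothetical fixed 20 dB offset. None of these values is taken from the figures.

```python
import math

def mean_power(samples):
    # Mean power, used here as a comparable stand-in for the energy level.
    return sum(x * x for x in samples) / len(samples)

def allowable_noise(power, offset_db=20.0):
    # Hypothetical model: allowable noise lies a fixed offset below the signal.
    return power * 10 ** (-offset_db / 10)

N = 64                                       # assumed frame length
frame_t2 = ([0.01 * math.sin(0.5 * t) for t in range(N // 2)] +
            [math.sin(0.5 * t) for t in range(N // 2, N)])

an2 = allowable_noise(mean_power(frame_t2))             # whole frame (B2)
an21 = allowable_noise(mean_power(frame_t2[:N // 2]))   # former part (B21)
an22 = allowable_noise(mean_power(frame_t2[N // 2:]))   # latter part (B22)
excess_db = 10 * math.log10(an2 / an21)
```

Under these assumptions the noise level permitted for the whole frame exceeds the level tolerable in the former part by several tens of decibels, which is the condition under which pre-echo becomes audible.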
It is known to select the duration of a frame to be as short as possible to reduce the possibility of pre-echo. This reduces the time during which pre-echo can occur, which renders pre-echo inaudible by taking advantage of backward masking. In backward masking, a low level sound or noise is masked by a later occurring loud sound. Backward temporal masking operates over a relatively short time span. If the low level sound or noise persists for more than a millisecond or so before the loud sound occurs, the high level sound will not mask the noise or low level sound, and the noise or low-level sound will be heard by the listener. Backward temporal masking is to be contrasted with forward temporal masking in which a high level sound masks a low level sound or noise that occurs after the high level sound has ceased. Forward temporal masking operates over a considerably longer time span than backward temporal masking.
Reducing the duration in time of a frame, i.e., reducing the number of samples in a frame, to take advantage of backward temporal masking is undesirable because reducing the frame duration reduces the efficiency of the data compressor. The duration of a frame cannot be reduced far enough for backward masking to be relied upon while providing an acceptable compression efficiency.
It is also known to vary the duration of a frame such that the duration is shortened only in those frames in which the signal level increases rapidly. Since block floating is performed on the block of spectral coefficients resulting from transforming into the frequency domain a frame of the input digital signal in the time domain, it is not possible to vary the frame duration over too wide a range because of the window shape applied when performing the transform. Moreover, reducing the length of the frame reduces the resolution in the frequency domain. Thus, pre-echo cannot be reduced beyond a certain limit by temporarily reducing the duration of the frame.
It is also known to detect frames in which there is a rapid increase in signal level, and to allocate redundant bits to such frames in an attempt to reduce quantizing distortion. However, in such a method, it has proved difficult to decide correctly how many redundant bits should be allocated.