Many techniques for compressing digital audio or speech signals are known. For example, sub-band coding is a non-block-forming frequency band dividing system, in which the input audio signal is not divided in time into blocks but is divided in frequency by a filter into plural frequency bands, which are then quantized. In a block-forming frequency band dividing system, such as a transform coding system, the input audio signal in the time domain is converted, block by block, into spectral coefficients in the frequency domain by an orthogonal transform. The resulting spectral coefficients are divided by frequency into plural frequency bands, and the spectral coefficients in each band are quantized.
A technique that combines sub-band coding and transform coding is also known. In this technique, the frequency-range signals produced by dividing the input audio signal in frequency, without dividing it into blocks, are each orthogonally transformed into spectral coefficients. These spectral coefficients are then divided by frequency into plural frequency bands, and the spectral coefficients in each band are quantized.
Among the filters suitable for dividing a digital audio input signal into frequency ranges without dividing it into blocks is the quadrature mirror filter (QMF), which is described, for example, in R. E. Crochiere, Digital Coding of Speech in Sub-bands, 55 BELL SYST. TECH. J. No. 8 (1976). A technique for dividing the audio input signal in frequency into frequency bands of equal width is discussed in Joseph H. Rothweiler, Polyphase Quadrature Filters: a New Sub-band Coding Technique, ICASSP 83, BOSTON (1983).
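A minimal sketch of a two-band QMF split and its inverse follows. It assumes the 2-tap Haar filter pair h0 = [1, 1]/sqrt(2) and h1 = [1, -1]/sqrt(2), which satisfies the mirror relation h1[n] = (-1)^n h0[n]; practical QMF banks, such as those in the references above, use much longer filters. The names `qmf_analysis` and `qmf_synthesis` are hypothetical. The sketch shows that the two half-rate sub-band signals together carry the whole input and can be recombined without loss.

```python
import numpy as np

# Haar filter pair: the shortest filters obeying the QMF mirror relation
# h1[n] = (-1)**n * h0[n]. Assumed here purely for illustration.
H0 = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass analysis filter
H1 = np.array([1.0, -1.0]) / np.sqrt(2.0)  # mirrored high-pass analysis filter


def qmf_analysis(x):
    """Split x into two half-rate sub-band signals (low, high)."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:
        raise ValueError("input length must be even")
    pairs = x.reshape(-1, 2)  # each row is (x[2k], x[2k+1])
    # Filtering and keeping every second output (high band up to a sign).
    low = pairs @ H0
    high = pairs @ H1
    return low, high


def qmf_synthesis(low, high):
    """Recombine the two sub-bands; exact reconstruction for the Haar pair."""
    out = np.empty(2 * len(low))
    out[0::2] = (low + high) / np.sqrt(2.0)
    out[1::2] = (low - high) / np.sqrt(2.0)
    return out


# A slowly varying tone lands almost entirely in the low band.
t = np.arange(64)
x = np.cos(2 * np.pi * 0.01 * t)
low, high = qmf_analysis(x)
print(np.sum(low**2), np.sum(high**2))
```

Because the filter pair is orthogonal, the energy of the input is split between the two bands rather than duplicated, so each band can be quantized at half the original sample rate.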
Known techniques for orthogonally transforming an input signal include the technique of dividing the digital input audio signal in time into blocks having a predetermined duration, and processing the resulting blocks using a fast Fourier transform (FFT), a discrete cosine transform (DCT), or a modified DCT (MDCT) to convert each block of the digital audio signal in the time domain into a set of spectral coefficients in the frequency domain. A modified DCT is discussed in J. P. Princen and A. B. Bradley, Subband/Transform Coding Using Filter Bank Based on Time Domain Aliasing Cancellation, ICASSP 1987.
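The block-based MDCT described above can be sketched as follows. This is a minimal illustration, not a production transform: it assumes blocks of 2N samples taken with a hop of N, a sine window applied at both analysis and synthesis (the window satisfies the Princen-Bradley condition w[n]^2 + w[n+N]^2 = 1), and a direct matrix product instead of a fast algorithm. All function names are hypothetical. Overlap-adding the inverse-transformed blocks cancels the time-domain aliasing, so the signal is recovered exactly wherever two blocks overlap.

```python
import numpy as np


def mdct_basis(N):
    """Cosine basis of shape (2N, N) for an MDCT with N coefficients per block."""
    n = np.arange(2 * N)[:, None]
    k = np.arange(N)[None, :]
    return np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))


def sine_window(N):
    """Sine window of length 2N; satisfies w[n]**2 + w[n + N]**2 == 1."""
    return np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))


def mdct_blocks(x, N):
    """Windowed MDCT of overlapping 2N-sample blocks taken with hop size N."""
    C, w = mdct_basis(N), sine_window(N)
    num_blocks = len(x) // N - 1
    blocks = np.stack([w * x[i * N:i * N + 2 * N] for i in range(num_blocks)])
    return blocks @ C  # shape (num_blocks, N): N spectral coefficients per block


def imdct_overlap_add(X, N):
    """Inverse MDCT of each block, windowed again and overlap-added."""
    C, w = mdct_basis(N), sine_window(N)
    out = np.zeros((len(X) + 1) * N)
    for i, coeffs in enumerate(X):
        # Scale 2/N pairs with a window meeting the Princen-Bradley condition.
        out[i * N:i * N + 2 * N] += w * (C @ coeffs) * (2.0 / N)
    return out


N = 16
x = np.random.default_rng(0).standard_normal(8 * N)
X = mdct_blocks(x, N)
y = imdct_overlap_add(X, N)
# Aliasing cancels wherever two blocks overlap, i.e. everywhere except the
# first and last N samples, which are each covered by only one block.
print(np.max(np.abs(y[N:-N] - x[N:-N])))
```

Note that each block of 2N input samples yields only N coefficients, so the MDCT is critically sampled despite the 50% block overlap.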
A known technique for quantizing the spectral coefficients obtained by frequency division is to divide them by frequency into bands that take account of the frequency resolution characteristics of the human sense of hearing. The audio frequency range of 0 Hz to 20 or 22 kHz may be divided into, for example, 25 critical bands, whose bandwidth increases with increasing frequency. The spectral coefficients in each band are then quantized with a number of bits determined by adaptive bit allocation for that band. For example, the spectral coefficients resulting from a modified discrete cosine transform (MDCT) are divided by frequency into bands, and the spectral coefficients in each band are quantized using an adaptively determined number of bits.
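One way to group spectral coefficients into critical bands is sketched below. It assumes the Zwicker-Terhardt approximation of the critical-band rate, z = 13 arctan(0.00076 f) + 3.5 arctan((f / 7500)^2) in Bark, and assigns each MDCT coefficient to band floor(z) of its centre frequency. The function names, the 512-coefficient block size, and the 44.1 kHz sampling rate are illustrative assumptions, not values taken from the text. With these choices the range up to 22.05 kHz falls into 25 bands whose widths (in coefficients) grow with frequency.

```python
import numpy as np


def bark(f_hz):
    """Zwicker-Terhardt approximation of the critical-band rate (Bark)."""
    f = np.asarray(f_hz, dtype=float)
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)


def band_of_each_coefficient(num_coeffs, fs_hz):
    """Critical-band index for each of num_coeffs MDCT coefficients."""
    # Centre frequency of coefficient k is (k + 0.5) * fs / (2 * num_coeffs).
    centres = (np.arange(num_coeffs) + 0.5) * fs_hz / (2 * num_coeffs)
    return np.floor(bark(centres)).astype(int)


bands = band_of_each_coefficient(512, 44_100)
counts = np.bincount(bands)  # number of coefficients in each critical band
print(len(counts), counts[:4], counts[-4:])
```

Each band's coefficients can then be quantized together with a single bit allocation, which is the grouping the paragraph above describes.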
Two known adaptive bit allocation techniques will now be described. First, in the technique described in IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, Vol. ASSP-25, No. 4, August 1977, bit allocation is based on the magnitude of the signal in each band. Although this technique provides a flat quantizing noise spectrum and minimizes the noise energy, it does not minimize the noise perceived by the listener, because it does not exploit the masking characteristics of the human sense of hearing.
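Magnitude-based allocation can be sketched with a greedy rule: hand out bits one at a time, each to the band whose estimated quantizing noise is currently largest, assuming that every added bit divides that band's noise power by four (about 6 dB). This is a common textbook stand-in, not necessarily the exact procedure of the cited paper; the function name and the band energies are hypothetical. The result leaves the same noise power in every band that receives bits, which is the flat noise spectrum mentioned above.

```python
import heapq

import numpy as np


def greedy_bit_allocation(band_energy, total_bits):
    """Allocate total_bits across bands according to band signal energy."""
    bits = np.zeros(len(band_energy), dtype=int)
    # Max-heap (via negated keys) of the current noise estimate per band;
    # with no bits allocated, the noise equals the band energy.
    heap = [(-float(e), k) for k, e in enumerate(band_energy)]
    heapq.heapify(heap)
    for _ in range(total_bits):
        neg_noise, k = heapq.heappop(heap)
        bits[k] += 1
        # One more bit divides the quantizing noise power by 4 (about 6 dB).
        heapq.heappush(heap, (neg_noise / 4.0, k))
    return bits


energies = [64.0, 16.0, 4.0, 1.0]  # hypothetical band energies
bits = greedy_bit_allocation(energies, 6)
print(bits)  # [3 2 1 0]
```

Here every band ends with a residual noise power of 1, regardless of how the listener would perceive that noise in each band.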
Second, the technique described in M. A. Krasner, The Critical Band Coder-Digital Encoding of the Perceptual Requirements of the Auditory System, ICASSP 1980, uses the masking characteristics of the human sense of hearing to determine the signal-to-noise ratio required in each band, and then makes a fixed quantizing bit allocation on that basis. However, because the bit allocation is fixed and cannot concentrate bits in the band that contains the tone, this technique gives relatively poor results with a single sine-wave input.
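A fixed allocation of this kind can be sketched by converting each band's required signal-to-noise ratio into bits with the rule of thumb that one quantizer bit buys about 20 log10(2), roughly 6.02 dB, of SNR. The per-band SNR targets below are hypothetical placeholders for values that a masking model would supply, and `fixed_allocation` is a hypothetical name. The resulting table is computed once and does not depend on the input signal.

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB of SNR per quantizer bit


def fixed_allocation(required_snr_db):
    """Bits per band needed to reach each band's required SNR."""
    return [max(0, math.ceil(snr / DB_PER_BIT)) for snr in required_snr_db]


# Hypothetical per-band SNR targets, derived offline from a masking model.
required = [30.0, 24.0, 18.0, 12.0, 6.0]
table = fixed_allocation(required)
print(table)  # [5, 4, 3, 2, 1]
```

Because `table` is the same for every input, a pure tone falling in the last band would still receive only one bit there, however much of the signal energy that band carries.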
ATRAC, a high-efficiency compression system for digital audio signals that employs, for example, the sub-band coding described above, is already in practical use. Using adaptive transform acoustic coding, which takes advantage of the characteristics of the human sense of hearing, ATRAC compresses digital audio signals to about 20% of their original bit requirement. ATRAC is a registered trademark of one of the present assignees (Sony Corporation).
Multi-channel audio or speech signals with four to eight channels are encountered not only in ordinary audio equipment but also in stereo and multi-channel sound systems, such as those found in motion picture theaters, high-quality television systems, video tape recorders, and video disc players. In such systems, high-efficiency compression is desirable to reduce the bit rate required to represent the large number of audio signals.
In commercial applications in particular, there is a growing trend towards multi-channel digital sound signals and towards equipment that handles eight channels of digital sound. Typical of such equipment are motion picture theater sound systems and apparatus that electronically reproduces the pictures and sound of a motion picture film via various electronic media, in particular high-quality television systems, video tape recorders, and video disc players. The sound systems of such apparatus are likewise tending towards between four and eight channels.
Motion picture theater sound systems have recently been proposed that record on motion picture film digital sound signals for the following eight channels: left, left-center, center, right-center, right, left surround, right surround, and sub-woofer. These channels are reproduced, respectively, by a left loudspeaker, a left-center loudspeaker, a center loudspeaker, a right-center loudspeaker, and a right loudspeaker, all arranged behind the screen; a left-surround loudspeaker and a right-surround loudspeaker; and a sub-woofer located behind or in front of the screen. Each surround channel is in fact reproduced by a group of loudspeakers: the left-surround group is arranged on the left side wall and the left part of the back wall of the auditorium, and the right-surround group on the right side wall and the right part of the back wall. These two groups of loudspeakers on the sides and back of the auditorium generate a sound field rich in ambience to accompany spectacular optical effects on the large-format screen of the motion picture theater. For simplicity, they will from now on be referred to as the "left-surround loudspeaker" and the "right-surround loudspeaker."
It is difficult to record on motion picture film eight channels of 16-bit linearly quantized digital audio sampled at 44.1 kHz, as used on the compact disc (CD), because the film has no area capable of accommodating a sound track wide enough for such a signal. The width of the film and the width of the picture area on the film are standardized: the film cannot be made wider, nor can the picture area be made narrower, to accommodate a sound track of the width that digital audio signals of this type require. A standard-width film, with a standard picture area, a standard analog sound track, and standard perforations, has only a narrow area in which digital audio signals can be recorded. Eight channels of digital sound can therefore be recorded only if the digital sound signals are compressed before they are recorded on the film, for example using the ATRAC high-efficiency compression system mentioned above.
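The scale of the problem follows from simple arithmetic. The sketch below computes the bit rate of uncompressed linear PCM at the CD parameters quoted above, and the rate remaining after compression to about 20% of the original, the figure given earlier for ATRAC; the helper name `pcm_bit_rate` is hypothetical.

```python
def pcm_bit_rate(channels, bits_per_sample, sample_rate_hz):
    """Bit rate of uncompressed linear PCM, in bits per second."""
    return channels * bits_per_sample * sample_rate_hz


cd_stereo = pcm_bit_rate(2, 16, 44_100)      # 1411200 bit/s (CD audio)
eight_channel = pcm_bit_rate(8, 16, 44_100)  # 5644800 bit/s
compressed = eight_channel // 5              # about 20%: 1128960 bit/s

print(cd_stereo, eight_channel, compressed)
```

Eight uncompressed channels need four times the bit rate of CD stereo; compressing them to about one fifth brings the total below the rate of uncompressed CD stereo.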
Motion picture film is susceptible to scratches, which cause drop-outs in digital sound signals recorded without any form of error detection and correction. The use of error correction codes is therefore essential, and the additional data that such codes require must be taken into account when the signal compression is performed.
Optical discs have become a popular medium for providing motion pictures in the home. It is desirable to be able to record multi-channel sound with four to eight channels on an optical disc to provide more realistic sound than conventional stereo. On an optical disc, the volume of video data can be as much as ten times that of the sound data, and only a limited recording area is provided for the sound signal. Especially when the picture signal must provide high picture quality, as the current trend towards larger screens requires, as much of the recording area as possible is devoted to the picture signal. The sound signal must therefore be highly compressed if the desired number of channels is to fit in the recording area available for it.
When the above-mentioned ATRAC high-efficiency compression system proposed by one of the present assignees (Sony Corporation) is used in a stereo (two-channel) audio system, the audio signal in each channel is compressed independently of the other. This enables each channel to be used independently, and simplifies the processing algorithm used to compress the audio signals. Operated this way, the ATRAC system provides sufficient compression for most applications, and the sound quality obtained when an audio signal is compressed and expanded using the ATRAC system is well regarded.
However, because the present ATRAC system compresses each audio signal independently, its bit allocation cannot be said to operate at the highest efficiency. For example, if the signal level in one channel is very low, that signal can be represented adequately with a small number of bits, while the signal in another channel may need many more bits to be represented adequately. Yet the present ATRAC system allocates the same number of bits to each channel, irrespective of the number of bits actually required to represent each channel's signal adequately. Thus, to achieve its high reproduction quality, the present system must build some redundancy into its bit allocation.
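The inefficiency can be illustrated numerically. The sketch below compares an equal per-channel split of a fixed bit budget with one hypothetical alternative, sharing the same budget across channels in proportion to each channel's need. The per-channel needs, the budget, the proportional policy, and all function names are assumptions made for illustration; this is not presented as the allocation method of ATRAC or of any particular system.

```python
def equal_split(needs, budget):
    """Give every channel the same share, as independent per-channel coding does."""
    return [budget // len(needs)] * len(needs)


def pooled_split(needs, budget):
    """Hypothetical joint allocation: share the budget in proportion to need."""
    total_need = sum(needs)
    if total_need == 0:
        return equal_split(needs, budget)
    raw = [budget * n / total_need for n in needs]
    alloc = [int(r) for r in raw]  # round down (all values are non-negative)
    # Give the bits lost to rounding to the largest fractional remainders.
    leftover = budget - sum(alloc)
    by_remainder = sorted(range(len(needs)),
                          key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


def shortfall(needs, alloc):
    """Total bits missing in channels that received fewer than they need."""
    return sum(max(0, n - a) for n, a in zip(needs, alloc))


# Hypothetical bits needed per frame: two quiet channels and two busy ones.
needs = [40, 400, 300, 60]
budget = 800

equal = equal_split(needs, budget)    # [200, 200, 200, 200]
pooled = pooled_split(needs, budget)  # [40, 400, 300, 60]
print(shortfall(needs, equal), shortfall(needs, pooled))  # 300 0
```

With the equal split, the quiet channels receive 260 bits more than they need while the busy channels are 300 bits short; the same total budget covers every channel once bits can move between channels.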