This invention relates to an audio encoding apparatus and, more particularly, to an audio encoding apparatus for splitting an audio signal into a plurality of bands, allocating the number of quantization bits for each band and transmitting the audio signal of each band upon quantizing the audio signal by the allocated number of bits.
An example of an apparatus that employs highly efficient encoding of acoustic (audio) signals is a remote monitoring apparatus that multiplexes audio and video and transmits them in one direction in real time. Such a remote monitoring apparatus makes it possible to monitor a situation by way of dynamic images and sound (audio) without requiring that an individual make rounds for inspection. Such an apparatus has a variety of applications. For example, by deploying such an apparatus at a plurality of stores, conditions within the stores can be monitored collectively at the main office. By deploying the apparatus at various points along a road, traffic tie-ups along the road can be ascertained. Another application besides use as a remote monitoring apparatus is a TV conferencing system, in which two-way communication is required.
FIG. 11 is a diagram showing the configuration of a remote monitoring system. The system includes a decoding unit 1 serving as a central monitor provided at a monitoring center, and an encoding unit 2 serving as a monitor provided at a location where monitoring is required. A number of the encoding units 2 are provided and are capable of transmitting audio and video to the decoding unit 1 via transmission lines 3. The encoding unit 2 includes input devices such as a camera 2a and microphone 2b for entering video and sound (audio) signals, respectively, an image encoder 2c and an audio encoder 2d for compressing the video and audio signals, respectively, and a multiplexer (MUX) 2e for multiplexing the compressed video and audio signals. The multiplexed signals are transmitted to another unit (the decoding unit 1) via the transmission line 3. The decoding unit 1 includes a demultiplexer (DEMUX) 1a for demultiplexing the compressed signals, which have been transmitted from the encoding unit side, into video and audio signals, and a video decoder 1b and audio decoder 1c for decompressing the compressed video and audio signals, respectively. The decompressed video and audio signals are output from output devices such as a monitor 1d and speaker 1e, respectively.
Compression employs 32 subband encoding (band-splitting encoding) as the technique for highly efficient encoding of audio signals and utilizes a psychoacoustic characteristic to realize highly efficient compression. The human ear cannot hear sounds below a certain level. A characteristic curve obtained by plotting this level for each and every band is referred to as a minimum masking threshold value curve (minimum audible threshold curve) MTC (see FIG. 12). The masking effect varies depending upon the conditions of sound in the surrounding area and small sounds cannot be heard because of large sounds even if the sound has a level greater than the minimum masking threshold value curve MTC. The reason for this is that the masking threshold value curve is changed by large sounds, as indicated by MTC' in FIG. 12. Sound components A, B below this curve are masked and are inaudible to the human ear. Components C, D extending beyond the masking threshold value curve MTC' can be heard.
In view of the foregoing, the sounds A, B below the masking threshold value level MTC' are not quantized but the sounds C, D above the masking threshold value level are. In case of quantization, this is carried out upon allocating numbers of quantization bits in dependence upon the difference between an audio level S and masking threshold value level M in each subband. The quantized data and the numbers of bits allocated are output.
More specifically, as shown in FIG. 13, one frame is constituted by audio signals in 36 sub-frames (one sub-frame consists of 32 samples), the audio signal of each sub-frame is subdivided into 32 subbands, and subband encoding of 32 bands is carried out. That is, the entire band is split into 32 equally spaced frequency widths, each sample signal is encoded by being quantized in dependence upon the number of quantization bits of each subband (described later), and 1152 (=36×32) items of sample data are adopted as one frame.
One scale factor is decided in common for 36 items of sample data of one subband sbi (i=0-31). In other words, normalization is performed in such a manner that the maximum value of each of 36 waveforms will become 1.0, and the normalization scale factor is encoded as the scale factor.
Further, the number of quantization bits of each subband sbi is decided and adopted as the number of allocated bits. The masking effect can be utilized most effectively by specifying the quantization precision (number of quantization bits) to the very limit of the masking level that takes the width of the critical band into account. Masking makes it possible to completely eliminate information concerning a band that contains only signals whose level cannot be sensed by the auditory system. In such a case, no bits are allocated to the sample data. In other words, sample data is non-existent for any subband in which the number of quantization bits is zero.
FIG. 14 is a diagram useful in describing the structure of one frame of an audio bit stream. Numeral 10 denotes the smallest unit capable of being decoded into an audio signal individually. The smallest unit 10 always includes data of a fixed number of samples, i.e., 1152 (=36×32). The smallest unit 10 is composed of a 32-bit header 11, an error-check code (optional) 12 and an audio data field 13. The audio data field 13 has a quantization bit count 13a, a scale factor 13b and sample data 13c. The header 11 includes a 12-bit all "1"s synchronization word 11a, an ID 11b that is always "1", a layer identification 11c and information such as a bit-rate index, sampling frequency and mode.
The audio data field 13 has the structure shown in FIG. 15. The quantization bit count 13a indicates the number Bi of quantization bits in each of 36 items of sampling data in each subband sbi (i=0-31), and the scale factor 13b indicates the normalization scale factors of those items of sampling data in each subband sbi (i=0-31) for which the numbers of quantization bits are other than zero. Each item of sampling data of a subband sbi for which the quantization bit count is not zero is multiplied by the corresponding scale factor Si and the product is quantized by the quantization bit count Bi to obtain the sample data 13c.
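The normalization and quantization of one subband described above can be sketched as follows. This is a minimal illustration, not the encoder's actual quantizer: the function names `scale_factor` and `quantize` are hypothetical, a symmetric uniform quantizer is assumed, and the sketch assumes a bit count of zero or of at least two.

```python
def scale_factor(samples):
    """Return the factor Si that maps the largest magnitude in `samples` to 1.0."""
    peak = max(abs(x) for x in samples)
    # An all-zero band needs no scaling; 1.0 is an arbitrary safe choice here.
    return 1.0 / peak if peak > 0.0 else 1.0


def quantize(samples, scale, bits):
    """Multiply each sample by `scale` and quantize it to a signed integer code.

    Assumption: a symmetric uniform quantizer with 2**bits - 1 levels.
    A real codec would use the quantizer defined by its standard.
    """
    if bits == 0:
        return []  # no sample data exists for a band with zero quantization bits
    half = (1 << (bits - 1)) - 1  # largest code magnitude
    return [round(x * scale * half) for x in samples]


# Hypothetical band data (only 3 of the 36 samples shown for brevity).
band = [0.5, -0.2, 0.1]
si = scale_factor(band)      # 2.0, so the peak 0.5 becomes 1.0
codes = quantize(band, si, 4)
```

The decoder would need both Bi and Si to invert this mapping, which is why the quantization bit count 13a and scale factor 13b accompany the sample data 13c in the audio data field.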
FIG. 16 is a diagram showing the construction of an audio encoder according to the prior art. The encoder includes a band splitting filter 21 for splitting an input audio signal into a signal of n frequency bands (e.g., n=32 subbands), and a psychoacoustic model 22 constituted by an FFT analyzer. The psychoacoustic model 22 obtains the masking threshold value characteristic MTC' (described above in connection with FIG. 12) whenever audio signals of m samples per frame (m=32×36=1152) enter, and calculates an SMR (signal-to-mask ratio) for every subband sbi (i=0-31) from the masking level M and signal level S in each subband sbi of the masking threshold value characteristic MTC'. The SMR is the ratio of signal level S to masking level M and is measured in decibels, obtained by 10 log(S/M).
The encoder further includes a bit allocator 23 for allocating the quantization bit count Bi to each band sbi (i=0-31) in accordance with bit allocation processing, described later. The bit allocator 23 calculates an MNR (mask-to-noise ratio) of each band based upon the SMR of each band sbi output by the psychoacoustic model 22 and increments the quantization bit number of the band having the smallest MNR (i.e., performs the operation Bi+1→Bi). The MNR is the ratio of masking level M to quantization noise N and is measured in decibels, obtained by 10 log(M/N). The larger the quantization noise N, i.e., the smaller the number of quantization bits, the smaller the value of MNR. The smaller the quantization noise N, i.e., the larger the number of quantization bits, the greater the value of MNR. Further, the quantization noise N is decided by the number of quantization bits. If the number of quantization bits is known, therefore, the SNR [=10 log(S/N)] of the audio signal level S to the quantization noise level N will be known.
Thus, if the SMR of a band of interest is subtracted from the SNR obtained from the quantization bit number of this band, the MNR of the band of interest can be calculated. In other words, MNR can be calculated as follows: $$\mathrm{MNR} = \mathrm{SNR} - \mathrm{SMR} = 10\log(S/N) - 10\log(S/M) = 10\log(M/N) \tag{1}$$
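The relation between SNR, SMR and MNR stated above can be checked numerically with a short sketch. The power levels S, M and N below are hypothetical values chosen only for illustration, and the helper names `db` and `mnr_db` are not taken from the apparatus.

```python
import math


def db(ratio):
    """Express a power ratio in decibels, i.e. 10 log10(ratio)."""
    return 10.0 * math.log10(ratio)


def mnr_db(snr, smr):
    """Mask-to-noise ratio of a band: SNR minus SMR, all in dB."""
    return snr - smr


# Hypothetical power levels for one band (illustrative values only).
S, M, N = 1000.0, 10.0, 1.0
smr = db(S / M)          # signal-to-mask ratio, about 20 dB
snr = db(S / N)          # signal-to-noise ratio, about 30 dB
mnr = mnr_db(snr, smr)   # about 10 dB

# The difference agrees with the direct definition 10 log(M/N).
assert math.isclose(mnr, db(M / N))
```

A negative MNR would mean that the quantization noise exceeds the masking level in that band, which is why the allocator gives the next bit to the band with the smallest MNR.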
The bit allocator 23 repeats calculation of the MNR of each band, determination of the smallest MNR and processing for incrementing the quantization bit count of the band having this smallest MNR until it distributively allocates the total number A of bits per frame, obtained from the number of quantization bits of the band of interest, to all bands sb0-sb31. When the total number A of bits per frame has been distributively allocated to all bands, control for allocation of the quantization bit numbers to the bands sb0-sb31 is terminated.
The audio encoder further includes an encoding unit 24 for encoding the quantization bit count (the number of allocated bits) of each band, and a bit-rate setting unit 25 for setting the bit rate from an external unit in advance. A total of 14 bit rates (32-448 kbps) are stipulated and the prescribed bit rate is set. A scale factor computing unit 26 calculates one scale factor Si in common for 36 items of sample data in each band sbi (i=0-31). The scale factor computing unit 26 performs normalization in such a manner that the maximum value of each of 36 waveforms will become 1.0, and calculates the normalization scale factor as the scale factor. An encoding unit 27 codes this scale factor. The results obtained by multiplying each of the 36 items of sample data of each band sbi by the scale factor Si of the band are applied to a quantizer 28. The latter quantizes these results by the quantization bit count Bi of the band. The quantized data, scale factor and quantization bit count that have been encoded are applied to a bit multiplexer 29, which multiplexes the bits of these inputs and transmits them as a bit stream at the set bit rate.
The band splitting filter 21 splits the input audio signal into a signal of n frequency bands (e.g., n=32), and the psychoacoustic model 22 calculates the SMR for each of the n bands sb0-sb31 upon taking into account the masking effect, which is the auditory characteristic of the human ear. The bit allocator 23 calculates the MNR of each band in accordance with Equation (1) based upon the SMR of each of the n bands sb0-sb31. Next, the bit allocator 23 calculates the number A of bits per frame from the bit rate set by the bit-rate setting unit 25 and allocates quantization bits one bit at a time to the band indicating the smallest MNR until the total number of allocated bits attains the bit count A. The scale factor computing unit 26 calculates the scale factor using 36 items of sample data of each band sbi (i=0-31) resulting from band splitting by the band splitting filter 21, and the quantizer 28 quantizes each sample signal of each band sbi using the scale factor Si (i=0-31) and quantization bit count Bi (i=0-31). The bit multiplexer 29 multiplexes (1) the quantization code, which is the output of the quantizer, (2) the code obtained by encoding the output (scale factor) of the scale factor computing unit 26, and (3) the code obtained by encoding the bit allocation information, and transmits these codes in the form of a bit stream based upon the bit rate set by the bit-rate setting unit 25.
FIG. 17 is a diagram useful in describing bit allocation by the bit allocator 23 according to the prior art. Components in FIG. 17 identical with those shown in FIG. 16 are designated by like reference characters. Shown in FIG. 17 are the psychoacoustic model 22, the bit allocator 23 and the bit-rate setting unit 25.
When an audio signal enters the psychoacoustic model 22, the latter calculates the SMR value of each band sbi (i=0-31) taking into account the auditory characteristic of the human ear. Using the calculated SMR of each band, the bit allocator 23 allocates bits for quantization to each band sbi (i=0-31). More specifically, the bit allocator 23 calculates the number A of allocable bits per frame from the bit rate set by the bit-rate setting unit 25 (i.e., from one of the 14 bit rates of 32-448 kbps) (step 101). Highly efficient audio encoding processes audio signals in blocks of a fixed size, each of which is referred to as a frame. By way of example, one frame consists of 36 sub-frames by 32 subbands. The length of time used for one frame generally is 20-40 ms because it is believed that there will be no significant change in sonic quality during this period of time. The number A of bits per one such frame is calculated in accordance with the following equation: $$A = \text{set bit rate} \times \text{frame length} \tag{2}$$
Accordingly, if we let Fs (kHz) represent the sampling frequency and Br (kbps) the bit rate, then the frame length is 32×36/Fs (ms), since one frame holds 32×36 samples, and Equation (2) may be written $$A = Br \times \frac{32 \times 36}{Fs} \tag{2'}$$
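As a worked example of the bits-per-frame relation above, the following sketch evaluates it for a few settings. The function names are hypothetical, and truncating a fractional result to an integer is an assumption made here for simplicity; the source does not say how a non-integer bit count is handled.

```python
SAMPLES_PER_FRAME = 32 * 36  # 1152 samples in one frame


def frame_length_ms(sampling_khz):
    """Duration of one frame in milliseconds for a sampling frequency in kHz."""
    return SAMPLES_PER_FRAME / sampling_khz


def bits_per_frame(bit_rate_kbps, sampling_khz):
    """Number A of bits per frame: bit rate (kbps) times frame length (ms).

    Assumption: a fractional result is truncated toward zero.
    """
    return int(bit_rate_kbps * frame_length_ms(sampling_khz))


# At Fs = 48 kHz one frame lasts 24 ms, inside the 20-40 ms range mentioned above.
length = frame_length_ms(48)    # 24.0
a_128 = bits_per_frame(128, 48)  # 128 kbps * 24 ms = 3072 bits
a_32 = bits_per_frame(32, 32)    # 32 kbps * 36 ms = 1152 bits
```

The units work out because kbps multiplied by milliseconds gives bits directly.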
In actuality, the number of bits allocated as the quantization bits is the number obtained by subtracting, from the bit count A, the number of bits needed for reporting the scale factor and number of quantization bits of each band.
Next, the bit allocator 23 calculates the MNR of each band sbi (i=0-31) in accordance with Equation (1) (step 102). When the MNR of each band sbi has been obtained, then the bit allocator 23 searches these MNRs for the smallest MNR (step 103) and increments the number of quantization bits in the band having the smallest MNR (step 104). More specifically, the quantization bit count Bi (i=0-31) is stored in memory means 23a for each band sbi (i=0-31) and the quantization bit count of the band conforming to the smallest MNR is incremented (Bi+1→Bi).
Next, the bit allocator 23 subtracts 36 from the allocable number of bits per frame (step 105). The reason for subtracting 36 is that there are 36 items of sampling data per band and the quantization bit count of each item of sampling data is incremented by one.
Thus, since the number of allocated bits has changed, the MNR of each band sbi is calculated again (step 106). Next, the bit allocator 23 compares the number A of allocable bits per frame with zero (step 107). If A is equal to or greater than zero, then loop processing from step 103 onward is repeated. If A is less than zero, then the bit allocator 23 adopts the immediately preceding number of allocated bits stored in the memory means 23a of each band sbi (i=0-31) as the final quantization bit count Bi (i=0-31).
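Steps 102 through 107 above amount to a greedy loop, which can be sketched as follows. This is an illustration under stated assumptions, not the prior-art implementation: the function names are hypothetical, the SNR for a given bit count is approximated by the common rule of thumb of about 6.02 dB per bit (the source only says the SNR is known from the bit count), and no upper limit on bits per band is enforced.

```python
SAMPLES_PER_BAND = 36  # each one-bit increment costs 36 bits per frame


def snr_db(bits):
    """Assumed SNR for a bit count: roughly 6.02 dB per bit (rule of thumb)."""
    return 6.02 * bits


def allocate_bits(smr_db_per_band, allocable_bits):
    """Greedily give one more bit to the band with the smallest MNR.

    `allocable_bits` is the number A left after side information is deducted.
    Returns the last allocation reached before A went below zero.
    """
    bits = [0] * len(smr_db_per_band)
    remaining = allocable_bits
    while True:
        # Steps 102/106: MNR = SNR - SMR for every band.
        mnr = [snr_db(b) - smr for b, smr in zip(bits, smr_db_per_band)]
        # Step 103: find the band with the smallest MNR (first one on a tie).
        worst = min(range(len(bits)), key=mnr.__getitem__)
        previous = list(bits)          # snapshot, as held in memory means 23a
        bits[worst] += 1               # step 104
        remaining -= SAMPLES_PER_BAND  # step 105
        if remaining < 0:              # step 107
            return previous


# Hypothetical SMR values (dB) for three bands and a budget of 4 increments.
result = allocate_bits([20.0, 5.0, -10.0], 4 * SAMPLES_PER_BAND)
```

With these illustrative inputs the band with the highest SMR receives most of the bits, and the band whose signal lies well below the masking level receives none.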
Up to 14 bit rates (32-448 kbps) are stipulated for highly efficient coding of audio. The state of the art is such that if highly efficient encoding processing is applied to an audio encoder and an audio decoder, the bit rate allocated to video and the bit rate allocated to audio are each fixed, and the overall bit rate is the sum of the video and audio bit rates. The encoded video and audio data is transmitted at this bit rate.
An audio encoding apparatus for remote monitoring of stores and roads encodes and transmits even audio signals having little importance (audio signals during quiet periods or noisy periods in which there is much noise from the surroundings) at the preset fixed bit rate. Consequently, the conventional audio encoding method is undesirable in terms of effective utilization of transmission lines. That is, though it would suffice to transmit audio signals at a low bit rate during quiet and noisy periods, the prior art is such that transmission of audio code data at a variable bit rate cannot be done, thereby making transmission at a low bit rate impossible. In a case where the overall bit rate of the apparatus is held low, it is preferred that the bit rate of an audio signal having little importance be suppressed and the bit rate of important video be raised correspondingly. However, such audio encoding at a variable bit rate cannot be carried out by the conventional audio encoding method.