Audio signals are ubiquitous. They are transmitted as radio signals and as part of television signals. Other signals, such as speech, share pertinent characteristics with audio signals, such as the importance of spectral domain representations. For many applications, it is beneficial to store and transmit audio type data encoded in a digital form, rather than in an analogue form. Such encoded data is stored on various types of digital media, including compact audio discs, digital audio tape, magnetic disks, and computer memory, both random access (RAM) and read only (ROM), to name a few.
It is beneficial to minimize the amount of digital data required to adequately characterize an audio-type analogue signal. Minimizing the amount of data minimizes the amount of physical storage media required, thus reducing the cost and increasing the convenience of whatever hardware is used in conjunction with the data. Minimizing the amount of data required to characterize a given temporal portion of an audio signal also permits faster transmission of a digital representation of the audio signal over any given communication channel. This also results in a cost saving, since compressed data representing the same temporal portion of an audio signal can be sent more quickly than uncompressed data, or can be sent over a communications channel having a narrower bandwidth, either of which is typically less costly.
The principles of digital audio signal processing are well known and set forth in a number of sources, including Watkinson, John, The Art of Digital Audio, Focal Press, London (1988). An analogue audio signal x(t) is shown schematically in FIG. 1. The horizontal axis represents time. The amplitude of the signal at a time t is shown on the vertical axis. The scale of the time axis is in milliseconds, so approximately two thousandths of a second of audio signal is represented schematically in FIG. 1. A basic first step in the storage or transmission of the analogue audio signal as a digital signal is to sample the signal into discrete signal elements, which will be further processed.
Sampling the signal x(t) is shown schematically in FIG. 2. The signal x(t) is evaluated at many discrete moments in time, for example at a rate of 48 kHz. By sampling, it is meant that the amplitude of the signal x(t) is noted and recorded forty-eight thousand times per second. Thus, for a period of one msec (1×10⁻³ sec), the signal x(t) will be sampled forty-eight times. The result is a temporal series x(n) of amplitudes, as shown in FIG. 2, with gaps between the amplitudes for the portions of the analogue audio signal x(t) which were not measured. If the sampling rate is high enough relative to the time-wise variations in the analogue signal, then the magnitudes of the sampled values will generally follow the shape of the analogue signal. As shown in FIG. 2, the sampled values follow signal x(t) rather well.
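As a minimal sketch of the sampling step just described, the following Python fragment evaluates a stand-in signal at 48 kHz for one millisecond. The 1 kHz sine used for x(t) and the names x_analogue, sample_signal and SAMPLE_RATE_HZ are illustrative assumptions and do not come from the figures.

```python
import math

SAMPLE_RATE_HZ = 48_000  # sampling rate used as the example in the text


def x_analogue(t):
    """Hypothetical stand-in for the analogue signal x(t): a 1 kHz sine."""
    return math.sin(2.0 * math.pi * 1000.0 * t)


def sample_signal(signal, duration_s, rate_hz=SAMPLE_RATE_HZ):
    """Return the series x(n) = signal(n / rate_hz) covering duration_s seconds."""
    count = round(duration_s * rate_hz)
    return [signal(n / rate_hz) for n in range(count)]


x_n = sample_signal(x_analogue, 1e-3)  # one millisecond of signal
print(len(x_n))  # 48 samples in 1 msec at 48 kHz
```

The amplitudes between sampling instants are simply never evaluated, which corresponds to the gaps shown between the sampled amplitudes.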
The outline of a general method of digital signal processing is shown schematically in FIG. 4a. The initial step of obtaining the audio signal is shown at 99 and the step of sampling is indicated at 102. Once the signal has been sampled, it is typically transformed from the time domain, the domain of FIGS. 1 and 2, to another domain that facilitates analysis. Typically, a signal in time can be written as a sum of a number of simple harmonic functions of time, such as cos ωt and sin ωt, for each of the various harmonic frequencies ω. The expression of a time varying signal as a series of harmonic functions is treated generally in Feynman, R., Leighton, R., and Sands, M., The Feynman Lectures on Physics, Addison-Wesley Publishing Company, Reading, Mass. (1963) Vol. I, §50, which is incorporated herein by reference. Various transformation methods (sometimes referred to as “subband” methods) exist and are well known. See Baylon, David and Lim, Jae, “Transform/Subband Analysis and Synthesis of Signals,” pp. 540-544, ISSPA 90, Gold Coast, Australia, Aug. 27-31 (1990). One such method is the Time-Domain Aliasing Cancellation method (“TDAC”). Another such transformation is known as the Discrete Cosine Transform (“DCT”). The transformation is achieved by applying a transformation function to the original signal.
An example of a DCT transformation is:

$$X(k) = \begin{cases} \displaystyle\sum_{n=0}^{N-1} 2\,x(n)\cdot\cos\!\left(\frac{\pi}{2N}\,k\,(2n+1)\right), & \text{for } 0 \le k \le N-1 \\ 0, & \text{otherwise,} \end{cases}$$

where k is the frequency variable and N is typically the number of samples in the window.
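The example transformation can be evaluated directly from its definition. The sketch below assumes the summand is 2·x(n)·cos(πk(2n+1)/(2N)) and computes all N coefficients by brute force; the function name dct and the constant test frame are illustrative choices, and a practical coder would use a fast algorithm rather than this O(N²) loop.

```python
import math


def dct(x):
    """X(k) = sum_{n=0}^{N-1} 2 x(n) cos(pi k (2n+1) / (2N)) for 0 <= k <= N-1."""
    N = len(x)
    return [
        sum(2.0 * x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
            for n in range(N))
        for k in range(N)
    ]


frame = [1.0, 1.0, 1.0, 1.0]  # a constant (zero-frequency) frame
X = dct(frame)
print(X)  # X(0) carries the whole signal; the other coefficients are ~0
```

For a constant frame only the k = 0 coefficient is significantly nonzero, which illustrates how the transform concentrates a signal into amplitudes at discrete frequencies.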
The transformation produces a set of amplitude coefficients of a variable other than time, typically frequency. The coefficients can be either real valued or complex valued. (If X(k) is complex valued, then the present invention can be applied to the real and imaginary parts of X(k) separately, or the magnitude and phase parts of X(k) separately, for example. For purposes of discussion, it will be assumed, however, that X(k) is real valued.) A typical plot of a portion of the signal x(n) transformed to X(k) is shown schematically in FIG. 3. If the inverse of the transform operation is applied to the transformed signal X(k), then the original sampled signal x(n) will be produced.
The transform is taken by applying the transformation function to a time-wise slice of the sampled analogue signal x(n). The slice (known as a “frame”) is selected by applying a window at 104 to x(n). Various windowing methods are appropriate. The windows may be applied sequentially, or, more typically, there is an overlap. The window must be consistent with the transform method, which in a typical case is the TDAC method. As shown in FIG. 2, a window w1(n) is applied to x(n), and encompasses forty-eight samples, covering a duration of one msec (1×10⁻³ sec). (Forty-eight samples have been shown for illustration purposes only. In a typical application, many more samples than forty-eight are included in a window.) The window w2(n) is applied to the following msec. The windows are typically overlapped, but non-overlapping windows are shown for illustration purposes only. Transformation of signals from one domain to another, for example from time to frequency, is discussed in many basic texts, including: Oppenheim, A. V., and Schafer, R. W., Digital Signal Processing, Englewood Cliffs, N.J., Prentice Hall (1975); Rabiner, L. R., and Gold, B., Theory and Application of Digital Signal Processing, Englewood Cliffs, N.J., Prentice Hall (1975), both of which are incorporated herein by reference.
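The selection of frames can be sketched as follows. This fragment only slices x(n) into frames (a rectangular selection) and does not apply any tapered window shape, which a TDAC-consistent window would require. The function name frames, the hop parameter and the 50% overlap value are assumptions made for illustration.

```python
def frames(x, window_len, hop):
    """Split x(n) into frames of window_len samples, advancing hop samples each time.

    hop == window_len gives the non-overlapping windows of the illustration;
    hop < window_len gives overlapping windows. Trailing samples that do not
    fill a whole window are dropped in this sketch.
    """
    return [x[start:start + window_len]
            for start in range(0, len(x) - window_len + 1, hop)]


x_n = list(range(96))             # stand-in for 2 msec of samples at 48 kHz
sequential = frames(x_n, 48, 48)  # w1(n) then w2(n), no overlap
overlapped = frames(x_n, 48, 24)  # 50% overlap, an assumed example value
print(len(sequential), len(overlapped))
```

With sequential windows every sample belongs to exactly one frame, whereas with overlapping windows interior samples appear in more than one frame.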
Application of the transformation, indicated at 106 of FIG. 4a, to the window of the sampled signal x(n) results in a set of coefficients for a range of discrete frequencies. Each coefficient of the transformed signal frame represents the amplitude of a component of the transformed signal at the indicated frequency. The number of frequency components is typically the same for each frame. Of course, the amplitudes of components of corresponding frequencies will differ from frame to frame.
As shown in FIG. 3, the signal X(k) is a plurality of amplitudes at discrete frequencies. This signal is referred to herein as a “spectrum” of the original signal. According to known methods, the next step is to encode the amplitudes for each of the frequencies according to some binary code, and to transmit or store the coded amplitudes.
An important task in coding signals is to allocate the fixed number of available bits to the specification of the amplitudes of the coefficients. The number of bits assigned to a coefficient, or any other signal element, is referred to herein as the “allocated number of bits” of that coefficient or signal element. This step is shown in relation to the other steps at 107 of FIG. 4a. Generally, for each frame, a fixed number of bits, N, is available. N is determined from considerations such as: the bandwidth of the communication channel over which the data will be transmitted; or the capacity of storage media; or the amount of error correction needed. As mentioned above, each frame generates the same number, C, of coefficients (even though the amplitude of some of the coefficients may be zero).
Thus, a simple method of allocating the N available bits is to distribute them evenly among the C coefficients, so that each coefficient can be specified by N/C bits. (For discussion purposes, it is assumed that N/C is an integer.) Thus, considering the transformed signal X(k) as shown in FIG. 3, the coefficient 32, having an amplitude of approximately one hundred, would be represented by a code word having the same number of bits (N/C) as would the coefficient 34, which has a much smaller amplitude, of only about ten. According to most methods of encoding, more bits are required to specify or encode a number within a larger range than are required to specify a number within a smaller range, assuming that both are specified to the same precision. For instance, to encode integers between zero and one hundred with perfect accuracy using a simple binary code, seven bits are required, while four bits are required to specify integers between zero and ten. Thus, if seven bits were allocated to each of the coefficients in the signal, then three bits would be wasted for every coefficient that could have been specified using only four bits. Where only a limited number of bits are available to allocate among many coefficients, it is important to conserve, rather than to waste bits. The waste of bits can be reduced if the range of the values is known accurately.
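The arithmetic in this example can be checked with a short sketch. The helper names bits_for_range and even_allocation are hypothetical; the first counts the bits a plain binary code needs for integers from zero up to a given maximum, and the second implements the even N/C split under the stated assumption that N/C is an integer.

```python
def bits_for_range(max_value):
    """Bits needed to code every integer in 0..max_value with a plain binary code."""
    return max(1, max_value.bit_length())


def even_allocation(total_bits, num_coefficients):
    """Uniform allocation: every coefficient receives N/C bits (assumed integral)."""
    if total_bits % num_coefficients:
        raise ValueError("this sketch assumes N/C is an integer")
    return total_bits // num_coefficients


print(bits_for_range(100))  # 7
print(bits_for_range(10))   # 4
wasted = bits_for_range(100) - bits_for_range(10)
print(wasted)               # 3 bits wasted on the small coefficient
```

If every coefficient is given the seven bits that the largest one needs, each coefficient no larger than ten carries three bits that convey no information.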
There are various known methods for allocating the number of bits to each coefficient. However, all such known methods result in either a significant waste of bits, or a significant sacrifice in the precision of quantizing the coefficient values. One such method is described in Davidson, G., Fielder, L., and Antill, M., of Dolby Laboratories, Inc., “High-Quality Audio Transform Coding at 128 Kbits/s,” ICASSP, pp. 1117-1120, Apr. 3-6, Albuquerque, N. Mex. (1990) (referred to herein as the “Dolby paper”), which is incorporated herein by reference.
According to this method, the transform coefficients are grouped to form bands, with the widths of the bands determined by critical band analysis. Transform coefficients within one band are converted to a band block floating-point representation (exponent and mantissa). The exponents provide an estimate of the log-spectral envelope of the audio frame under examination, and are transmitted as side information to the decoder.
The log-spectral envelope is used by a dynamic bit allocation routine, which derives step-size information for an adaptive coefficient quantizer. Each frame is allocated the same number of bits, N. The dynamic bit allocation routine uses only the exponent of the peak spectral amplitude in each band to increase quantizer resolution for psychoacoustically relevant bands. Each band's mantissa is quantized to a bit resolution defined by the sum of a coarse, fixed-bit component and a fine, dynamically-allocated component. The fixed bit component is typically established without regard to the particular frame, but rather with regard to the type of signal and the portion of the frame in question. For instance, lower frequency bands may generally receive more bits as a result of the fixed bit component. The dynamically allocated component is based on the peak exponent for the band. The log-spectral estimate data is multiplexed with the fixed and adaptive mantissa bits for transmission to the decoder.
Thus the method makes a gross analysis of the maximum amplitude of a coefficient within a band of the signal, and uses this gross estimation to allocate the number of bits to that band. The gross estimate tells only the integral part of the power of 2 of the coefficient. For instance, if the coefficient is seven, the gross estimate determines that the maximum coefficient in the band is between 2² and 2³ (four and eight), or, if it is twenty-five, that it is between 2⁴ and 2⁵ (sixteen and thirty-two). The gross estimate (which is an inaccurate estimate) causes two problems: the bit allocation is not accurate, and the bits that are allocated are not used efficiently, since the range of values for any given coefficient is not known accurately. In the above procedure, each coefficient in a band is specified to the same level of accuracy as other coefficients in the band. Further, information regarding the maximum amplitude coefficients in the bands is encoded in two stages: first, the exponents are encoded and transmitted as side information; second, the mantissa is transmitted along with the mantissas for the other coefficients.
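The coarseness of such an exponent-only estimate can be illustrated with a short sketch. It is not the routine of the Dolby paper; it merely computes the integral part of the base-2 logarithm of a positive peak amplitude and the resulting range. The names gross_exponent and gross_range are hypothetical.

```python
import math


def gross_exponent(peak_amplitude):
    """Return e such that 2**e <= peak_amplitude < 2**(e + 1)."""
    if peak_amplitude <= 0:
        raise ValueError("this sketch handles positive amplitudes only")
    # math.frexp returns (m, exp) with peak = m * 2**exp and 0.5 <= m < 1.
    _, exp = math.frexp(peak_amplitude)
    return exp - 1


def gross_range(peak_amplitude):
    """Range of values that the exponent alone cannot tell apart."""
    e = gross_exponent(peak_amplitude)
    return 2 ** e, 2 ** (e + 1)


print(gross_range(7))   # (4, 8)
print(gross_range(25))  # (16, 32)
```

Every peak amplitude within such a range yields the same exponent, so the upper bound on the coefficient can be off by nearly a factor of two.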
In addition to determining how many bits to allocate to each coefficient for encoding that coefficient's amplitude, an encoding method must also divide the entire amplitude range into a number of amplitude divisions, shown at 108 in FIG. 4a, and allocate a code to each division, at 109. The number of bits in the code is equal to the number of bits allocated for each coefficient. The divisions are typically referred to as “quantization levels,” because the actual amplitudes are quantized into the available levels, or “reconstruction levels” after coding, transmission or storage and decoding. For instance, if three bits are available for each coefficient, then 2³ or eight reconstruction levels can be identified.
FIG. 5 shows a simple scheme for allocating a three bit code word for each of the eight regions of amplitude between 0 and 100. The code word 000 is assigned to all coefficients whose transformed amplitude, as shown in FIG. 3, is between 0 and 12.5. Thus, all coefficients between 0 and 12.5 are quantized at the same value, typically the middle value of 6.25. The codeword 001 is assigned to all coefficients between 12.5 and 25.0, all of which are quantized to the value of 18.75. Similarly, the codeword 100 is assigned to all coefficients between 50.0 and 62.5, all of which are quantized to the value of 56.25. Rather than assigning uniform length codewords to the coefficients, with uniform quantization levels, it is also known to assign variable length codewords to encode each coefficient, and to apply non-uniform quantization levels to the coded coefficients.
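The uniform three-bit scheme just described can be sketched as follows. The function name quantize and the treatment of amplitudes lying exactly on a boundary (assigned here to the upper region) are assumptions, since the text does not specify the boundary convention.

```python
def quantize(amplitude, bits=3, lo=0.0, hi=100.0):
    """Return (codeword, reconstruction level) for a uniform quantizer.

    The range lo..hi is split into 2**bits equal regions; each region is
    reconstructed at its middle value.
    """
    levels = 2 ** bits
    step = (hi - lo) / levels
    index = int((amplitude - lo) / step)
    index = min(max(index, 0), levels - 1)  # clamp out-of-range inputs
    codeword = format(index, "0{}b".format(bits))
    return codeword, lo + (index + 0.5) * step


print(quantize(10.0))  # ('000', 6.25)
print(quantize(20.0))  # ('001', 18.75)
print(quantize(55.0))  # ('100', 56.25)
```

With eight regions of width 12.5, the reconstructed value can differ from the original amplitude by as much as 6.25.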
It is also useful to determine a masking level. The masking level relates to human perception of acoustic signals. For a given acoustic signal, it is possible to calculate approximately the level of signal distortion (for example, quantization noise) that will not be heard or perceived, because of the signal. This is useful in various applications. For example, some signal distortion can be tolerated without the human listener noticing it. The masking level can thus be used in allocating the available bits to different coefficients.
The entire basic process of digitizing an audio signal, and synthesizing an audio signal from the encoded digital data, is shown schematically in FIG. 4a, and the basic apparatus is shown schematically in FIG. 4b. An audio signal, such as music, speech, traffic noise, etc., is obtained at 99 by a known device, such as a microphone. The audio signal x(t) is sampled 102, as described above and as shown in FIG. 2. The sampled signal x(n) is windowed 104 and transformed 106. After transformation (which may be a subband representation), the bits are allocated 107 among the coefficients, the amplitudes of the coefficients are quantized 108 by assigning each to a reconstruction level, and these quantized points are coded 109 by binary codewords. At this point, the data is transmitted 112 either along a communication channel or to a storage device.
The preceding steps, 102, 104, 106, 107, 108, 109, and 112 take place in hardware that is generally referred to as the “transmitter,” as shown at 150 in FIG. 4b. The transmitter typically includes a signal coder (also referred to as an encoder) 156 and may include other elements that further prepare the encoded signal for transmission over a channel 160. However, all of the steps mentioned above generally take place in the coder, which may itself include multiple components.
Eventually, the data is received by a receiver 164 at the other end of the data channel 160, or is retrieved from the memory device. As is well known, the receiver includes a decoder 166 that is able to reverse the coding process of the signal coder 156 with reasonable precision. The receiver typically also includes other elements, not shown, to reverse the effect of the additional elements of the transmitter that prepare the encoded signal for transmission over channel 160. The signal decoder 166 is equipped with a codeword table, which correlates the codewords to the reconstruction levels. The data is decoded 114 from binary into the quantized reconstruction amplitude values. An inverse transform is applied 116 to each set of quantized amplitude values, resulting in a signal that is similar to a frame of x(n), i.e., it is in the time domain, and it is made up of a discrete number of values for each inverse transformed result. However, the signal will not be exactly the same as the corresponding frame of x(n), because of the quantization into reconstruction levels and the specific representation used. The difference between the original value and the value of the reconstruction level typically cannot be recovered. A stream of inverse transformed frames is combined 118, and an audio signal is reproduced 120, using known apparatus, such as a D/A converter and an audio speaker.
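The effect of quantization on the reconstructed frame can be illustrated with a small round trip. The sketch assumes a DCT of the form X(k) = Σ 2·x(n)·cos(πk(2n+1)/(2N)) together with its standard inverse, and a hypothetical uniform quantizer that snaps each coefficient to the nearest multiple of a step size. The step value and the test frame are arbitrary, and bit allocation and codeword assignment are omitted.

```python
import math


def dct(x):
    """Forward transform, assuming X(k) = sum 2 x(n) cos(pi k (2n+1) / (2N))."""
    N = len(x)
    return [sum(2.0 * x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N)) for k in range(N)]


def idct(X):
    """Matching inverse: x(n) = (1/N) [X(0)/2 + sum_{k>=1} X(k) cos(...)]."""
    N = len(X)
    return [(X[0] / 2.0 + sum(X[k] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                              for k in range(1, N))) / N for n in range(N)]


def to_reconstruction_level(value, step):
    """Hypothetical uniform quantizer: snap value to the nearest multiple of step."""
    return round(value / step) * step


frame = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]  # arbitrary frame of x(n)
STEP = 4.0                                            # assumed quantizer step

exact = idct(dct(frame))  # no quantization between transform and inverse
lossy = idct([to_reconstruction_level(c, STEP) for c in dct(frame)])

exact_error = max(abs(a - b) for a, b in zip(frame, exact))
lossy_error = max(abs(a - b) for a, b in zip(frame, lossy))
print(exact_error)  # floating-point rounding only
print(lossy_error)  # nonzero: the quantization error is not recoverable
```

Without quantization the inverse transform returns the frame to within rounding error; once the coefficients are replaced by reconstruction levels, the decoder has no information from which to restore the discarded differences.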