Audio encoding is a process by which a typically digitized audible sound or "audio source" is converted ("encoded") into an "encoded audio" form for storage, transfer and/or manipulation. In a complementary fashion, audio decoding converts encoded audio (typically received from storage and/or via data transfer) into decoded audio, which can then be rendered and played back as audible sound. An audio encoding system typically includes at least one encoder and one decoder as integrated elements within one or more host processing systems.
Particularly problematic in encoding system design have been conflicting requirements of providing high fidelity audio, in a desired form, and using a minimally complex system. More specifically, an audio encoder will ideally deliver perceptually "lossless" encoded audio. That is, the encoded audio, when decoded and rendered, should sound identical to the source audio (i.e. with no audible artifacts or other perceivable loss of fidelity). On the other hand, it is also desirable to minimize the amount of encoded data in order to preserve available storage, throughput and bandwidth for other uses. To pursue these requirements, encoding system designers have relied largely on established data reduction methods which should theoretically preserve audio fidelity. However, to achieve high fidelity, such methods have failed to provide sufficiently low bit rates. Conversely, these methods, particularly when merely approximated to reduce bit rates and/or complexity, have not assured high fidelity.
Adding to these problems is the manner in which "processed" encoded audio is typically provided using conventional low bit-rate encoding methods. Processing, such as time and frequency modification, is conducted on non-encoded audio. Audio is typically stored and/or transferred in encoded form (thereby conserving storage or bandwidth), then decoded, then time or frequency modified, then re-encoded, and then once again transferred and/or stored (again conserving system resources). However, given the complexity of conventional encoding methods, decoding and re-encoding has become computationally expensive.
For example, in transform coding, a digital audio source is broken down into frames (typically about 5 to 50 milliseconds long). Each frame is then converted into spectral coefficients using a time-domain aliasing cancellation filter bank. Finally, the spectral coefficients are quantized according to a psychoacoustic model. During decoding, the quantized spectral coefficients are used to re-synthesize the encoded audio.
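The transform coding steps just described can be illustrated with a minimal sketch. It assumes a modified discrete cosine transform (MDCT) with a sine window as the time-domain aliasing cancellation filter bank, and it substitutes a single uniform quantizer step for the psychoacoustic model, which is not specified here. The function names and the parameters `half` and `step` are hypothetical.

```python
import math

def sine_window(length):
    # Sine window; it satisfies w[n]^2 + w[n + N]^2 = 1 for length 2N.
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]

def mdct(frame, window):
    # 2N windowed time samples -> N spectral coefficients.
    half = len(frame) // 2
    return [
        sum(frame[n] * window[n]
            * math.cos(math.pi / half * (n + 0.5 + half / 2) * (k + 0.5))
            for n in range(2 * half))
        for k in range(half)
    ]

def imdct(coeffs, window):
    # N spectral coefficients -> 2N windowed, time-aliased samples.
    half = len(coeffs)
    return [
        window[n] * (2.0 / half)
        * sum(coeffs[k]
              * math.cos(math.pi / half * (n + 0.5 + half / 2) * (k + 0.5))
              for k in range(half))
        for n in range(2 * half)
    ]

def encode(signal, half, step):
    # Assumes len(signal) is a multiple of `half` and `half` is even.
    # Zero padding lets every input sample fall inside two overlapping frames.
    window = sine_window(2 * half)
    padded = [0.0] * half + list(signal) + [0.0] * half
    frames = []
    for start in range(0, len(padded) - half, half):
        coeffs = mdct(padded[start:start + 2 * half], window)
        # Uniform quantizer: a placeholder for a psychoacoustic model.
        frames.append([round(c / step) for c in coeffs])
    return frames

def decode(frames, half, step):
    window = sine_window(2 * half)
    out = [0.0] * (half * (len(frames) + 1))
    for i, quantized in enumerate(frames):
        block = imdct([q * step for q in quantized], window)
        for n, value in enumerate(block):
            out[i * half + n] += value  # overlap-add cancels the aliasing
    return out[half:-half]
```

With a very small `step` the decoded signal closely matches the input. Increasing `step` lowers the amount of encoded data at the expense of fidelity.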
Advantageously, transform coding is relatively computationally efficient and is capable of producing perceptually lossless encoding. It is therefore preferred where high fidelity encoding of an audio source is critical. Unfortunately, such high fidelity comes at the cost of a large amount of encoded data or "high bit rate". For example, a one-minute audio source would produce 480 kilobytes of transform-coded audio data, resulting in a compression ratio of only 11 to 1. Thus, transform coding is considered inappropriate for high compression applications. In addition, conventional methods used to quantize transform coded audio data can result in a substantial loss of its high fidelity benefits. Broadly stated, quantization is a form of data reduction in which approximations, which are considered substantially representative of actual data, are substituted for the actual data. Conventionally, transform coded data is quantized by encoding less than the complete frequency range of the audio source. Yet another disadvantage of transform coding is that it is not considered amenable to time or frequency modification. With time modification, audio data is modified to play back faster or slower at the same pitch. Conversely, frequency modification alters the playback pitch without affecting playback speed.
Another example, conventional sinusoidal modeling, is used as an alternative to transform coding. In sinusoidal modeling, an entire audio data stream is analyzed in time increments or "windows". For each window, a fast Fourier transform ("FFT") is used to determine the primary audio frequency components or "spectral peaks" of the source audio. The spectral peaks are then modeled as a number of sine waves, each sine wave having specific amplitude, frequency and phase characteristics. Next, these characteristics or "sinusoidal parameter triads" are quantized. The resultant encoded audio is then typically stored and/or transferred. During decoding, each of the representative sine waves is synthesized from a corresponding set of sinusoidal parameters.
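The analysis and synthesis steps of sinusoidal modeling can be sketched as follows. This is an illustration only: it uses a direct discrete Fourier transform in place of an FFT for brevity, a Hann window, and simple local-maximum peak picking. The window choice, the function names, and the parameter `max_peaks` are assumptions, and the quantization of the parameter triads is omitted.

```python
import cmath
import math

def analyze_window(samples, sample_rate, max_peaks):
    # Returns up to `max_peaks` (amplitude, frequency_hz, phase) triads.
    n = len(samples)
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]
    windowed = [s * w for s, w in zip(samples, hann)]
    # Direct DFT of the non-negative frequency bins (an FFT gives the same values).
    spectrum = [
        sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]
    mags = [abs(c) for c in spectrum]
    # Spectral peaks: bins that are local maxima of the magnitude spectrum.
    peaks = [k for k in range(1, n // 2)
             if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]]
    peaks.sort(key=lambda k: mags[k], reverse=True)
    gain = sum(hann) / 2  # window gain seen by one side of a real sinusoid
    return [(mags[k] / gain, k * sample_rate / n, cmath.phase(spectrum[k]))
            for k in peaks[:max_peaks]]

def synthesize(triads, length, sample_rate):
    # Decoder side: rebuild the window as a sum of sine waves.
    return [
        sum(a * math.cos(2 * math.pi * f * t / sample_rate + p)
            for a, f, p in triads)
        for t in range(length)
    ]
```

For a stationary tone centered on an analysis bin, a single triad is enough to rebuild the window. Tones between bins would need frequency interpolation, which this sketch does not attempt.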
An advantage of sinusoidal modeling is that it tends to represent an audio source using a relatively small amount of data. For example, the above one-minute audio source can be represented using only 120 kilobytes of encoded audio. Comparing the encoded audio to the audio source, this represents a data reduction or "compression ratio" of 44 to 1. (In practice, encoder designs are generally targeted at achieving a compression ratio of 10 to 1 or more.)
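The two compression ratios quoted above (11 to 1 for 480 kilobytes and 44 to 1 for 120 kilobytes) can be checked with a short calculation. The format of the one-minute source is not stated, so the sketch below assumes 16-bit mono audio sampled at 44.1 kHz and kilobytes of 1000 bytes, under which both figures come out as quoted after rounding.

```python
SAMPLE_RATE_HZ = 44_100  # assumption: CD-quality sampling rate
BYTES_PER_SAMPLE = 2     # assumption: 16-bit samples
CHANNELS = 1             # assumption: mono source
DURATION_S = 60          # one-minute audio source

source_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * DURATION_S

def compression_ratio(encoded_kilobytes):
    # Ratio of source size to encoded size (kilobyte taken as 1000 bytes).
    return source_bytes / (encoded_kilobytes * 1000)

def bit_rate_kbps(encoded_kilobytes):
    # Average encoded bit rate in kilobits per second.
    return encoded_kilobytes * 1000 * 8 / DURATION_S / 1000

transform_ratio = compression_ratio(480)   # about 11 to 1
sinusoidal_ratio = compression_ratio(120)  # about 44 to 1
```

Under the same assumptions, the encoded sizes correspond to average bit rates of 64 and 16 kilobits per second respectively.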
Sinusoidal modeling is also generally well-suited to such audio data modifications as time and frequency scaling. In sinusoidal modeling, time scaling is conventionally achieved by altering the decoder's window length relative to the window length in the encoder. Frequency modification is further conventionally achieved by scaling the frequency information in the "parameter triads". Both are well established methods.
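Both modifications can be sketched on a sequence of frames, each holding (amplitude, frequency, phase) triads. The sketch assumes that every frame lists the same partials in the same order and that phase is simply accumulated from the first frame onward. These simplifications, together with the function names and the `time_stretch` parameter, are assumptions made for illustration.

```python
import math

def scale_frequency(frames, ratio):
    # Frequency modification: scale the frequency entry of every triad.
    return [[(a, f * ratio, p) for a, f, p in frame] for frame in frames]

def synthesize(frames, analysis_hop, sample_rate, time_stretch=1.0):
    # Time scaling: the decoder's window length is the encoder's window
    # length multiplied by `time_stretch`; pitch is left unchanged.
    synth_hop = max(1, round(analysis_hop * time_stretch))
    out = []
    phases = [p for _, _, p in frames[0]]  # starting phase of each partial
    for frame in frames:
        for t in range(synth_hop):
            out.append(sum(
                a * math.cos(phases[i] + 2 * math.pi * f * t / sample_rate)
                for i, (a, f, _) in enumerate(frame)))
        # Advance each partial's phase so consecutive windows join smoothly.
        phases = [phases[i] + 2 * math.pi * f * synth_hop / sample_rate
                  for i, (_, f, _) in enumerate(frame)]
    return out
```

For example, `synthesize(frames, 80, 8000, time_stretch=2.0)` produces twice as many samples as the unstretched call, while `synthesize(scale_frequency(frames, 2.0), 80, 8000)` raises the pitch by an octave at the original duration.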
Unfortunately, conventional sinusoidal modeling also has certain disadvantages. First, sinusoidal modeling poorly models certain audio components. Sound can be viewed as comprising a combination of short tonal or atonal attacks or "transients" (e.g. striking a drum), as well as relatively stable tonal components ("steady-state") and noise components. While sinusoidal modeling represents steady-state portions relatively well, it does a relatively poor job of representing transients and noise. Transients, many of which are crisp combinations of tone and noise, tend to become muddled. Further, an attempt to represent transients or noise using sinusoids requires a large number of short sine waves, thereby increasing bit rate. In addition, while sinusoidal encoding is generally well suited to data modification or "audio processing", it tends to exaggerate the above deficiencies with regard to transients. For example, time compression and expansion tend to unnaturally sharpen or muddle transients, and frequency modification tends to unnaturally color the tone quality of transients.
Another sinusoidal modeling disadvantage is that conventional methods used to quantize sinusoidal models tend to cause a readily perceived degradation of audio fidelity. One approach to sinusoidal model quantization is based on established human hearing limitations. It is well-known that, where a listener is presented with two sound components that are close in frequency, a lower energy component can be masked by a higher energy component. More simply, louder audio components can mask softer ones. Thus, an analysis of an audio source can be conducted according to a "psychoacoustic model" of human hearing. Then, those frequency components which the model suggests would be masked (i.e. would not be heard by a listener) are discarded.
This first approach is typically implemented according to one of the following methods. In a first method, masking is presumed. That is, all sinusoids measured as being below a predetermined threshold energy level are summarily presumed to be masked and are therefore discarded. In a second method, a psychoacoustic analysis is performed on each frame of sinusoids and those sinusoids which are deemed inaudible due to determined masking effects are discarded. A third method adds an iterative aspect to the frame-by-frame psychoacoustic modeling of the second method. In this case, sinusoids are discarded in an order of decreasing likelihood of being masked (e.g. the sinusoid that is most likely to be masked is discarded first, then the next most likely, and so on). This process of discarding is repeated until the remaining amount of audio data bits within the frame (i.e. frame "bit rate") is within a predetermined maximum. Unfortunately, while psychoacoustic modeling might be expected to provide predictable results, this inventor's listening tests have revealed variably degraded audio fidelity when using each of these methods.
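The three discarding methods can be contrasted in a short sketch. The psychoacoustic model itself is not specified here, so `masking_margin_db` below is a deliberately crude stand-in: it treats a sinusoid as masked when a neighbor within a fixed bandwidth is louder by more than a fixed offset. That rule, its default values, and all function names are hypothetical. Each sinusoid is a (frequency_hz, level_db) pair.

```python
def masking_margin_db(sinusoid, frame, bandwidth_hz=100.0, offset_db=10.0):
    # Toy stand-in for a psychoacoustic model. A positive margin means the
    # sinusoid is treated as masked by a louder neighbor close in frequency.
    freq, level = sinusoid
    neighbors = [lv for f, lv in frame
                 if (f, lv) != sinusoid and abs(f - freq) <= bandwidth_hz]
    if not neighbors:
        return float("-inf")
    return max(neighbors) - offset_db - level

def discard_below_threshold(frame, threshold_db):
    # First method: masking is presumed below a fixed energy level.
    return [s for s in frame if s[1] >= threshold_db]

def discard_masked(frame):
    # Second method: per-frame analysis, drop sinusoids deemed inaudible.
    return [s for s in frame if masking_margin_db(s, frame) <= 0.0]

def discard_until_budget(frame, bits_per_sinusoid, max_frame_bits):
    # Third method: drop the most-likely-masked sinusoid first, and repeat
    # until the frame fits within the predetermined maximum.
    kept = list(frame)
    while kept and len(kept) * bits_per_sinusoid > max_frame_bits:
        victim = max(kept, key=lambda s: masking_margin_db(s, kept))
        kept.remove(victim)
    return kept
```

Note that the third method keeps discarding until the budget is met even when no remaining sinusoid would be judged masked, which is one way such a scheme can remove audible components.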
A second approach to sinusoidal model quantization is based on an alternative presumption regarding human hearing limitations. As discussed, sinusoidal modeling selects tonal peaks as representative of a window of audio data. It has been observed, however, that a series of consecutive tonal peaks will tend to vary linearly, such that the entire series can be represented by a single peak-to-peak line segment or "trajectory". Thus, the amount of data required to represent the series of tonal peaks can be reduced by replacing a series of sinusoid parameters with its corresponding trajectory. The presumption here is that a sufficiently short trajectory, depending upon the nature of the audio source (e.g. speech, mono or polyphonic, specific musical work, etc.), will not be heard. In practice, a user of the encoding system sets a threshold trajectory length according to the nature of the audio source. Thereafter, all trajectories that are shorter than the threshold are summarily discarded. Once again, this inventor's listening tests have revealed various degrees of degraded audio fidelity using this method.
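The trajectory-based approach can be sketched in a few lines. Here a trajectory is assumed to be a list of (frame_index, frequency_hz, amplitude) peaks, one per consecutive analysis frame, and its length is taken as the number of frames multiplied by the frame hop. The data layout, the function names, and the parameters `hop_ms` and `min_length_ms` are hypothetical.

```python
def to_line_segment(trajectory):
    # Replace a series of consecutive peaks with its two end points.
    return trajectory[0], trajectory[-1]

def prune_short_trajectories(trajectories, hop_ms, min_length_ms):
    # Summarily discard every trajectory shorter than the user-set threshold.
    return [t for t in trajectories if len(t) * hop_ms >= min_length_ms]
```

For instance, with a 10 ms hop and a 30 ms threshold, a two-frame trajectory is dropped regardless of how loud its peaks are, which is consistent with the fidelity loss noted above.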
A different approach to signal modification is the use of the phase vocoder. The phase vocoder splits a signal into frames, each 6 to 50 msec long, and performs an FFT on each frame. This complex FFT data is converted into separate magnitude and phase information. In order to time-stretch the audio, the magnitude and phase data are temporally interpolated. The inverse FFT synthesis window is then made shorter or longer than the original analysis window, depending on the desired time-stretching factor. While the phase vocoder sounds quite good for large time-scale stretching factors, it is not designed as a data compression tool. To date, no one has shown a phase vocoder that can both perform data compression and time-scale modification. In addition, the phase vocoder has difficulties handling transients; attack transients will sound smeared when time-scaled using the phase vocoder.
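The magnitude and phase conversion and the temporal interpolation step can be sketched as follows. This is a simplified illustration rather than a complete phase vocoder: it uses a direct DFT in place of an FFT, interpolates phase linearly along the shortest arc between two frames, and omits windowing and the inverse transform. The function names and the parameter `alpha` are assumptions.

```python
import cmath
import math

def dft_polar(frame):
    # Convert one frame into separate magnitude and phase lists.
    n = len(frame)
    bins = [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]
    return [abs(b) for b in bins], [cmath.phase(b) for b in bins]

def interpolate_polar(frame_a, frame_b, alpha):
    # Temporal interpolation between two analyzed frames, 0 <= alpha <= 1.
    mags_a, phases_a = frame_a
    mags_b, phases_b = frame_b
    mags = [(1 - alpha) * ma + alpha * mb for ma, mb in zip(mags_a, mags_b)]
    phases = []
    for pa, pb in zip(phases_a, phases_b):
        # Wrap the phase difference into [-pi, pi) before interpolating.
        delta = (pb - pa + math.pi) % (2 * math.pi) - math.pi
        phases.append(pa + alpha * delta)
    return mags, phases
```

Inserting interpolated frames between the analyzed ones lengthens the signal, and skipping analyzed frames shortens it.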
This inventor's prior U.S. patent application Ser. No. 09/007,995, filed Jan. 16, 1998, teaches a number of improvements to conventional encoding methods. Among these improvements are multiresolution sinusoidal encoding, sinusoidal transient encoding, and a composite encoding system employing these encoding methods in combination with noise modeling. U.S. patent application Ser. No. 09/007,995 is hereby incorporated by reference as if repeated verbatim immediately hereinafter.
Multiresolution sinusoidal encoding and sinusoidal transient modeling can be viewed in a rather simplified, summary fashion for present purposes as follows. Multiresolution sinusoidal encoding, among other aspects, broadly includes the use of variable window sizes for analyzing and encoding source audio. Preferably, selected source audio frequency bands are matched to corresponding window sizes. Thus, fidelity is improved by using an optimal window size for each frequency band. Sinusoidal transient modeling ("STM"), among other aspects, broadly includes performing a long audio frame discrete cosine transform, dividing the result into smaller frames and then performing an FFT on the smaller frames to produce frequency domain encoded audio.
The advent of multiresolution sinusoidal encoding and STM provides certain advantages. Most relevant to the present discussion is that the two methods can be readily combined to form a composite encoder that facilitates low complexity quantization and compression-mode processing. More specifically, the two methods are well matched. Both are full-frequency methods (i.e. encode the entire frequency range of an audio source) and both form encoded audio that is comprised of sinusoids. Thus, the two methods are necessarily compatible with one another. In addition, similar sinusoid-based quantization and processing methods can be utilized not only for both methods, but for both methods over the entire frequency range of the encoded audio. Further, methods for quantizing and processing sinusoids, alone or with the residual captured by noise, are well known.
Unfortunately, the above combination also presents certain disadvantages. For example, while multiresolution sinusoidal modeling enables higher fidelity encoding, it does so at the cost of a higher bit-rate. Also, while STM provides generally high fidelity, listening tests have revealed degraded fidelity where polyphonic sources are encoded. (Note that polyphonic music encoding using sinusoidal modeling is a relatively new concept.) Listening tests and experimentation have also revealed that certain characteristics of the combination can be vastly improved, particularly where conventional quantization and processing have been relied upon. For example, time and frequency compression ratios must be limited if fidelity is to be retained.
Finally, given the successful implementation of newly discovered techniques of the invention, certain conventional assumptions and methods are clearly problematic. In addition, conventional data reduction versus fidelity and complexity tradeoffs can be improved with regard to both singular and composite encoding systems.
Accordingly, there is a need for audio encoding/decoding apparatus and methods capable of providing improved data reduction versus fidelity and complexity tradeoffs with regard to both singular and composite encoding systems. There is further a need for encoding/decoding apparatus and methods that facilitate high fidelity compression domain processing.