It is a basic premise of audio signal encoding techniques that if one has a perfect model of the instrument or device that is creating a sound, then the amount of data required to encode the sound will be very small, resulting in very high data compression ratios. For instance, recording a piano (or any other instrument) playing a single note, such as middle C, using full compact disc (CD) recording techniques (e.g., 44,100 samples per second, 16 bits per sample) results in a huge amount of information per second (e.g., 705.6 kbps or 88,200 bytes per second). However, if it is known that the sound being recorded emanates from a piano, and both the sound analysis system that is recording the sound and the receiving systems that will reproduce the recorded sound have perfect models of the piano, then the only data required will be the data required to indicate the note being played (1 byte is more than sufficient to indicate which of the 88 notes on a piano), and the note's amplitude (perhaps 1 additional byte), plus data sufficient to identify the beginning and ending of the playing of that note. (This is equivalent to the data on a printed page of music.) In a simple data recording system using a piano model, data identifying the piano note being played can be recorded once every sample period, where a typical sample period would be 10 or 20 milliseconds, resulting in a data recording rate of 100 to 200 bytes per second. Obviously a data rate of 200 bytes per second represents a great deal of data compression from the full 88,200 bytes per second rate, and in fact indicates a compression ratio of 441 to 1. In more realistic, real world audio analysis and recording systems, compression ratios of 10 to 1 or so are generally considered to be very good.
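The arithmetic in the preceding paragraph can be checked with a short sketch. It assumes a single (mono) channel, which is what the 705.6 kbps figure corresponds to, and 2 bytes per sample period (one for the note, one for the amplitude). The function and variable names are illustrative only and do not come from the original text.

```python
# Back-of-the-envelope data rates for the piano example.
# Assumptions (not stated as code in the source): one audio channel,
# and 2 bytes per sample period (1 byte note number + 1 byte amplitude).

SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16


def pcm_bytes_per_second(sample_rate_hz, bits_per_sample):
    """Raw PCM data rate of one channel, in bytes per second."""
    return sample_rate_hz * bits_per_sample // 8


def model_bytes_per_second(sample_period_ms, bytes_per_period=2):
    """Data rate when only model parameters are sent once per sample period."""
    periods_per_second = 1000 / sample_period_ms
    return periods_per_second * bytes_per_period


pcm_rate = pcm_bytes_per_second(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)
model_rate_10ms = model_bytes_per_second(10)
model_rate_20ms = model_bytes_per_second(20)
compression_ratio = pcm_rate / model_rate_10ms

print(pcm_rate, model_rate_10ms, model_rate_20ms, compression_ratio)
```

Under these assumptions the 10 millisecond sample period gives the 200 bytes per second figure, and the 20 millisecond period gives 100 bytes per second.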
As presented in U.S. Pat. No. 5,029,509, the use of sinusoidal modeling for speech and audio signals is well established. In audio signal analysis and recording systems using sinusoidal modeling, an audio signal is analyzed each sample period to determine the sinusoidal signal components of the signal during that sample period. For example, the sinusoidal components will often be a fundamental frequency component and a set of harmonics. Any portion of the signal not easily represented as sinusoidal components is typically represented as stochastic noise through the use of noise envelope parameters.
However, actual applications of sinusoidal modeling have been generally limited to single-speaker speech and single-instrument (monophonic) audio. More recently, there have been various attempts to perform sinusoidal modeling on wideband, polyphonic (or multisource) audio signals for the purposes of data compression. The present invention provides an improved audio signal analysis and representation method that provides significant benefits and better compression than the prior systems known to the inventors.
In traditional sinusoidal analysis methods, the input audio signal is first broken into uniformly sized segments (e.g., 5 to 50 millisecond segments), and then processed through one or several fast Fourier transforms (FFT) to determine the primary frequency components of the signal being processed. The process of breaking the input sound into segments is referred to in the literature as "windowing", that is, multiplying the input digital audio by a finite-length window function. Once the spectral peaks have been identified, parameters (such as frequency, amplitude, and phase) for each spectral component are determined, quantized and then stored or transmitted. This method works well if the input is a monophonic source and the traditional analysis methods can determine what the single fundamental frequency happens to be.
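As a rough illustration of the traditional analysis just described, the following sketch windows one segment, transforms it, and picks spectral peaks. It is a minimal sketch under stated assumptions, not the method of any particular prior system: a Hann window is assumed, a direct DFT stands in for the FFT so that only the standard library is needed, the test tones fall exactly on transform bins, and the quantization step is omitted. All names (`hann_window`, `dft`, `analyze_frame`, `min_amplitude`) are hypothetical.

```python
import cmath
import math


def hann_window(n):
    """Periodic Hann window of n samples (an assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def dft(frame):
    """Direct O(n^2) DFT of a real frame for bins 0..n/2.

    An FFT would produce the same values faster.
    """
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def analyze_frame(samples, sample_rate_hz, min_amplitude=0.05):
    """Return (frequency_hz, amplitude, phase_rad) for each spectral peak."""
    n = len(samples)
    window = hann_window(n)
    spectrum = dft([s * w for s, w in zip(samples, window)])
    # Undo the window's gain so a bin-centered sinusoid of amplitude A reads as A.
    gain = sum(window) / 2.0
    amps = [abs(c) / gain for c in spectrum]
    peaks = []
    for k in range(1, len(amps) - 1):
        is_local_max = amps[k] > amps[k - 1] and amps[k] >= amps[k + 1]
        if is_local_max and amps[k] >= min_amplitude:
            peaks.append((k * sample_rate_hz / n, amps[k], cmath.phase(spectrum[k])))
    return peaks


# One 32 ms segment: a 1000 Hz fundamental plus a half-amplitude 2000 Hz harmonic.
SAMPLE_RATE_HZ = 8000
FRAME_LEN = 256
frame = [
    math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE_HZ)
    + 0.5 * math.sin(2 * math.pi * 2000 * t / SAMPLE_RATE_HZ)
    for t in range(FRAME_LEN)
]
peaks = analyze_frame(frame, SAMPLE_RATE_HZ)
for freq, amp, phase in peaks:
    print(f"{freq:.1f} Hz  amplitude {amp:.3f}  phase {phase:.3f} rad")
```

With a single monophonic source such as this one, the peaks line up with the fundamental and its harmonic. Frequencies that fall between bins, or several overlapping sources, would need interpolation and more careful peak selection than this sketch attempts.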
In the case of general audio signal compression, there can be any number of audio sources (polyphonic) and thus multiple fundamental pitches. It is well known that the traditional methods of windowing and frequency component identification give poor results on wideband audio signals.
The present invention is premised on the theory that the aforementioned poor results are caused primarily by two problems: 1) a fundamental tradeoff between time resolution and frequency resolution, and 2) failure to accurately model the onset of each note or other audio event. The present invention also addresses the failure of prior art systems to provide graceful degradation of signal quality as the data transmission bandwidth is gradually decreased and/or as an increasing fraction of the transmitted data is lost during transmission.
The tradeoff between time resolution and frequency resolution manifests itself in the following scenario. If a signal analysis procedure is designed to have very good pitch resolution, say, ±5 Hz, which may be necessary for resolving bass notes, then the corresponding window will have to be about 200 milliseconds long. As a result, the analysis procedure will have very good pitch resolution, but the time resolution (i.e., the determination of the temporal onset and termination of each frequency component) will be very poor. Any time a partial begins (a new frequency track), its attack will be smeared across the entire window of 200 milliseconds. This makes the attack dull, and gives rise to a problem called "pre-echo": when a receiving system synthesizes an audio signal based on the audio parameters generated while using wide windows, synthesized coding error noise (such as smeared partial attacks) is heard before the actual attack begins.
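The 200 millisecond figure follows from the common rule of thumb that frequency resolution is roughly the reciprocal of the window duration. The sketch below applies that rule. The exact constant depends on the window shape, so the results are approximate, and the function names are illustrative only.

```python
# Rule-of-thumb relation between frequency resolution and window length:
#   resolution_hz ~= 1 / window_duration_seconds
# The exact factor depends on the window function, so treat results as estimates.


def window_length_ms(resolution_hz):
    """Approximate window duration (ms) needed for a given frequency resolution."""
    return 1000.0 / resolution_hz


def window_length_samples(resolution_hz, sample_rate_hz=44_100):
    """The same window expressed in samples at the given sample rate."""
    return round(sample_rate_hz / resolution_hz)


for resolution_hz in (5.0, 50.0, 100.0):
    print(
        f"{resolution_hz:6.1f} Hz resolution -> "
        f"{window_length_ms(resolution_hz):6.1f} ms window, "
        f"{window_length_samples(resolution_hz)} samples at 44.1 kHz"
    )
```

A 5 Hz resolution therefore implies a window of about 200 milliseconds, while a 10 millisecond window, short enough to localize an attack, resolves frequencies only to about 100 Hz.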
Another problem associated with prior art audio data encoders is that the compressed audio data produced by those encoders is not easily scaled down to lower data rates. Most high-quality wideband audio algorithms in use as of the end of 1996 (such as MPEG and AC-3) use perceptual transform coders. In these systems the digital audio is broken into frames (usually 5 to 50 milliseconds long), each frame is converted into spectral coefficients using a time-domain aliasing cancellation filter bank, and then the spectral coefficients are quantized according to a psychoacoustic model. The most recent version of these "transform-based" audio coders, known as MPEG2-AAC, can achieve very good compression results. A CD-quality sound signal having 44,100 samples per second and 16 bits per sample, with a 22 kHz bandwidth and a data rate of 705.6 kbps, is compressed to a signal having a data rate of about 64 kbps, which represents a compression ratio of 11:1.
While 11:1 is a very good compression ratio, transform coders have their limitations. First of all, if the available transmission data rate (i.e., between a server system on which the compressed audio data is stored and a client decoder system) drops below 64 kbps, the sound quality decreases dramatically. In order to compensate for this loss of quality, the original audio input must be band limited in order to reduce the data rate of the compressed signal. For example, instead of compressing all audible frequencies from 0-20000 Hz, the encoding system may need to lowpass filter the input to remove frequencies above 5500 Hz in order to compress the audio to fit in a 28.8 kbps transmission channel, which is the typical bandwidth available using the modems most frequently found on desktop computers in 1997.
Another limitation of transform encoders is that the encoding technique is not scalable. On a computer network like the Internet, the actual bandwidth available to a user with a 28.8 kbps modem is not guaranteed to be 28.8 kbps. At times the user will actually receive 28.8 kbps, but the actual available bandwidth can easily drop at various times to 18 kbps, 6 kbps, or anywhere in between. If a transform coder compresses audio to generate encoded data having a data rate of 28.8 kbps, and the data rate suddenly drops to only 20 kbps, the audio quality of the sounds produced by client decoder systems will not gracefully degrade. Rather, the transform coder will produce silence, noise bursts, or poor time-domain interpolation. Clearly, it would be highly desirable for the quality of the sounds synthesized by client decoders to degrade gracefully as the available bandwidth decreases and when random data packets are dropped or lost during transmission. Graceful degradation means that the listener will not hear silence or noise, but rather a gradual decrease in perceptual quality.