Tonal sounds can be effectively modeled as a sum of sinusoids with time-varying parameters consisting of frequency, amplitude, and phase. The key word here is "effectively" because, in fact, all sounds can be modeled as sums of sinusoids, but the number of sinusoids may be extremely large, and the time-varying sinusoidal parameters may not have intuitive significance. Colored noise signals like breath noise, ocean waves, and snare drums are examples of sounds that are not effectively modeled by sums of sinusoids. Musical instruments such as clarinet, trumpet, gongs, and certain cymbals, as well as ensembles of these instruments, are examples of tonal sounds that are effectively modeled as sums of sinusoids.
Many sounds are modeled as a combination of tonal and non-tonal, or colored noise, sounds. Flute and violin both have tonal and colored noise components. Human speech is often modeled as a mixture of tonal or "voiced" speech, and colored noise or "unvoiced" speech. The present invention is concerned with encoding and synthesizing tonal audio signals. This invention can be used in conjunction with systems for encoding and synthesizing non-tonal or colored noise signals.
Pitched signals are a special class of tonal audio signals in which the sinusoidal frequencies are harmonically related. The present invention can be used for encoding and synthesizing both pitched and unpitched tonal audio signals. Specifically optimized embodiments are proposed for encoding and synthesizing pitched tonal audio signals.
In this specification we use the term "tonal audio signal" to refer to all audio signals that can be effectively modeled as a sum of sinusoids with time-varying parameters consisting of frequency, amplitude, and phase. These are all signals that are not noise-like in character. We use the term "pitched tonal audio signal" or simply "pitched signal" to refer to tonal audio signals whose sinusoidal frequencies are harmonically related. The term "voiced signal" is a common term of art that refers to the pitched tonal audio signal component of a speech signal. The term "unvoiced signal" is a term of art that refers to the noise-like component of a speech signal. This is the non-tonal part of the signal that cannot be effectively modeled as a sum of sinusoids with time-varying parameters consisting of frequency, amplitude, and phase.
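The distinction between pitched and unpitched tonal signals can be illustrated with a minimal sketch. The function name `is_harmonic`, the candidate fundamental passed in by the caller, and the tolerance value are assumptions made for illustration; the specification does not prescribe a particular test.

```python
def is_harmonic(frequencies_hz, fundamental_hz, tolerance=0.01):
    """Return True if every frequency is close to an integer multiple of
    the candidate fundamental (hypothetical helper).

    tolerance is the allowed deviation of f / fundamental from the nearest
    integer, so 0.01 means one percent of the fundamental.
    """
    for f in frequencies_hz:
        ratio = f / fundamental_hz
        nearest = round(ratio)
        # A partial below the fundamental, or one that falls between
        # harmonics, means the set is not harmonically related.
        if nearest < 1 or abs(ratio - nearest) > tolerance:
            return False
    return True
```

Under these assumptions, partials at 220, 440, and 660 Hz are classified as pitched with respect to a 220 Hz fundamental, while a set with partials between the harmonics, as is typical of a gong, is classified as tonal but unpitched.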
One method of encoding and synthesizing tonal audio signals is additive sinusoidal encoding and synthesis. This method provides excellent results since the encoding and synthesis model is the same model as the signal: a sum of sinusoids with time-varying parameters. U.S. Pat. Nos. 4,885,790 and 4,937,873, both to McCauley et al., and U.S. Pat. No. 4,856,068, to Quatieri, Jr. et al., teach systems for encoding and synthesizing sound waveforms as sums of sinusoids with time-varying amplitude, frequency, and phase. While sinusoidal encoding and synthesis provides excellent results for tonal audio signals, the synthesis requires large computational resources because many tonal audio signals may involve one hundred or more individual sinusoids.
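The computational cost of time-domain additive synthesis can be seen in a minimal sketch of an oscillator bank. The function name `additive_frame`, its parameter layout, and the linear interpolation of amplitude and frequency across a frame are assumptions for illustration and are not taken from the cited patents.

```python
import math

def additive_frame(partials, num_samples, sample_rate, phases=None):
    """Synthesize one frame as a sum of sinusoids (hypothetical helper).

    partials: list of (f_start, f_end, a_start, a_end), one per sinusoid;
              frequency (Hz) and amplitude are interpolated linearly.
    phases:   starting phase in radians per sinusoid (defaults to zero).
    Returns (samples, end_phases) so that successive frames can be made
    phase-continuous by passing end_phases to the next call.
    """
    if phases is None:
        phases = [0.0] * len(partials)
    out = [0.0] * num_samples
    end_phases = []
    for (f0, f1, a0, a1), phase in zip(partials, phases):
        for n in range(num_samples):
            t = n / num_samples               # position within the frame
            amp = a0 + (a1 - a0) * t          # interpolated amplitude
            freq = f0 + (f1 - f0) * t         # interpolated frequency
            out[n] += amp * math.sin(phase)
            phase += 2.0 * math.pi * freq / sample_rate
        end_phases.append(phase % (2.0 * math.pi))
    return out, end_phases
```

The inner loop runs once per sinusoid per output sample, so a signal with one hundred sinusoids requires one hundred sine evaluations and parameter interpolations for every sample of every output channel.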
To reduce the computational requirement of sinusoidal synthesis, U.S. Pat. Nos. 5,401,897 to Depalle et al., 5,686,683, to Freed, and 5,327,518 teach systems for sinusoidal synthesis using Inverse Fast Fourier Transform (IFFT) techniques. While this approach somewhat reduces the computation requirements for synthesis of a large number of parameters, the computation is still expensive and new problems are introduced. Many synthesis environments, for example musical synthesizers, require multi-channel output. Using IFFT approaches, a separate IFFT system must be used for every channel. In addition, IFFT systems limit sinusoidal parameter updates to once per frame, where the frame length must be at least as long as the period of the lowest frequency. This parameter update rate may be insufficient at higher frequencies.
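The update-rate limitation follows from simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes non-overlapping frames with exactly one parameter update per frame; overlapped frames would raise the update rate proportionally.

```python
import math

def min_frame_samples(lowest_freq_hz, sample_rate):
    """Shortest frame, in samples, that spans one period of the lowest frequency."""
    return math.ceil(sample_rate / lowest_freq_hz)

def max_update_rate_hz(lowest_freq_hz, sample_rate):
    """Parameter updates per second when parameters change once per frame."""
    return sample_rate / min_frame_samples(lowest_freq_hz, sample_rate)

def periods_between_updates(partial_freq_hz, lowest_freq_hz, sample_rate):
    """Number of periods of a given partial that elapse within one frame."""
    frame = min_frame_samples(lowest_freq_hz, sample_rate)
    return partial_freq_hz * frame / sample_rate
```

For example, with a 50 Hz lowest frequency at a 48 kHz sample rate, the frame is at least 960 samples long, parameters can change at most 50 times per second, and a 5 kHz partial completes 100 periods between updates.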
U.S. Pat. Nos. 5,581,656, 5,195,166, and 5,226,108, all to Hardwick et al., teach a system where a certain number of sinusoids, the dominant or low-frequency sinusoids, are synthesized using traditional time-domain sinusoidal additive synthesis, while the remaining sinusoids are synthesized using an IFFT approach. This permits a higher update rate for the dominant sinusoid components while taking advantage of the lower IFFT computation rate for the bulk of the sinusoids. This approach retains the disadvantage of IFFT computation cost, especially with multi-channel synthesis. In addition, the dominant sinusoid components are usually at lower frequencies, and it is the higher frequencies that often require an increased parameter update rate.
A number of less compute-intensive systems have been proposed for encoding and synthesizing tonal audio signals. Linear Predictive Coding (LPC) is well known in the art of speech coding and synthesis. Methods for using LPC for synthesizing tonal or voiced speech concentrate on methods for generating the tonal excitation signal. The numerous approaches include generating a pulse train at the desired pitch, generating a multi-pulse excitation signal at the desired pitch, vector quantizing (VQ) the excitation signal, and simply transmitting the excitation signal with fewer bits. U.S. Pat. No. 5,744,742, to Lindemann et al., teaches a system for encoding excitation signals as single pitch period loops. To synthesize excitation signals at different pitches or amplitudes, weighted sums of pitch period excitation signal loops are created. The excitation signal pitch periods are stored in single pitch period waveform memory tables. The phase response of all excitation signal waveforms is forced to be the same so that weighted sums of the waveforms do not cause phase cancellation. All of these techniques, with the exception of simply transmitting the excitation signal, give poorer results than full additive sinusoidal encoding and synthesis. The pulse-based techniques in particular sound "buzzy" and unnatural.
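The simplest of these excitation schemes, a pulse train driving an LPC synthesis filter, can be sketched as follows. The function names and the sign convention of the predictor coefficients are assumptions for illustration; a practical coder would obtain the coefficients by LPC analysis of the input, which is not shown.

```python
def pulse_train(period_samples, num_samples):
    """Unit pulses spaced one pitch period apart."""
    return [1.0 if n % period_samples == 0 else 0.0 for n in range(num_samples)]

def all_pole_filter(excitation, coeffs):
    """All-pole synthesis filter (assumed sign convention):

        y[n] = x[n] - sum_k coeffs[k] * y[n - 1 - k]
    """
    out = []
    for n, x in enumerate(excitation):
        y = x
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                y -= a * out[n - 1 - k]
        out.append(y)
    return out
```

Every pitch period of the output is the same filter impulse response restarted by an identical pulse, which is one way to understand the "buzzy" quality noted above.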
U.S. Pat. Nos. 5,369,730 to Yajima, 5,479,564 to Vogten et al., European Patent 813,184 A1 to Dutoit et al., and European Patents 0,363,233A1 and 0,363,233B1, both to Hamon, teach methods of pitch synchronous concatenated waveform encoding and synthesis. With this method a number of single pitch period waveforms are stored in memory. To synthesize a time-varying signal, a sequence of single pitch period waveforms is selected from waveform memory and concatenated over time. The waveforms are usually overlap-added for continuity. To shift the pitch of the synthesized signal, the overlap rate is modulated. While relatively inexpensive in terms of compute resources, this approach suffers from distortions, especially those associated with the pitch shifting mechanism. It is audibly inferior to full additive synthesis for most tonal audio signals.
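The overlap-add mechanism and its use for pitch shifting can be sketched as follows. The function names, the two-period Hann-windowed grain, and the fixed hop size are assumptions chosen for simplicity and are not drawn from the cited patents.

```python
import math

def hann(length):
    """Periodic Hann window; copies spaced length/2 apart sum to one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]

def make_grain(period_waveform):
    """Two repetitions of a stored single-period waveform under a Hann window."""
    two_periods = list(period_waveform) + list(period_waveform)
    window = hann(len(two_periods))
    return [x * w for x, w in zip(two_periods, window)]

def overlap_add(grains, hop):
    """Sum equal-length grains placed `hop` samples apart."""
    length = len(grains[0])
    out = [0.0] * (hop * (len(grains) - 1) + length)
    for i, grain in enumerate(grains):
        start = i * hop
        for n, x in enumerate(grain):
            out[start + n] += x
    return out
```

When the hop equals the stored period, the overlapping windows sum to one and the stored waveform is reproduced in the fully overlapped region. A smaller hop raises the pitch and a larger hop lowers it, but the windows then no longer sum to one and the spectrum is altered, which is one source of the distortion described above.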
In the music synthesizer field, an approach similar to concatenated waveform synthesis is referred to as waveform sequencing. With waveform sequencing each single pitch period waveform is pitch shifted using sample rate conversion techniques and looped for a specified time to generate a stable magnitude spectrum. To generate time-varying magnitude spectra the waveforms are generally cross-faded over time. U.S. Pat. Nos. 3,816,664, to Koch, 4,348,929, to Gallitzendorfer, 4,461,199 and Reissue 34,913, to Hiyoshi et al., and U.S. Pat. No. 4,611,522 to Hideo teach systems of waveform sequencing relative to music synthesis. Waveform sequencing can be economical in computation resources but much of the complex time-varying character of the magnitude spectra is lost due to reduction to a limited number of waveforms.
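A minimal sketch of waveform sequencing with two tables is given below. The function names, the linear-interpolation table lookup used as a stand-in for sample rate conversion, and the linear cross-fade are assumptions for illustration.

```python
def read_table(table, phase):
    """Linearly interpolated lookup; phase in [0, 1) spans one pitch period."""
    pos = phase * len(table)
    i = int(pos)
    frac = pos - i
    a = table[i % len(table)]
    b = table[(i + 1) % len(table)]   # wraps around, so the table loops
    return a + (b - a) * frac

def sequence_waveforms(table_a, table_b, pitch_hz, sample_rate, num_samples):
    """Loop two single-period tables at the given pitch and cross-fade
    linearly from table_a to table_b over num_samples."""
    out = []
    phase = 0.0
    for n in range(num_samples):
        fade = n / (num_samples - 1) if num_samples > 1 else 0.0
        a = read_table(table_a, phase)
        b = read_table(table_b, phase)
        out.append((1.0 - fade) * a + fade * b)
        phase = (phase + pitch_hz / sample_rate) % 1.0
    return out
```

Both tables are read with the same phase, so the cross-fade interpolates between two fixed magnitude spectra. Any spectral evolution that does not lie between the stored waveforms cannot be reproduced, which corresponds to the loss of time-varying character noted above.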
A number of hybrid systems have been proposed that use additive sinusoidal encoding and synthesis for one part of a signal--usually the tonal part--and some other technique for another part of the signal--usually the colored noise part. U.S. Pat. No. 5,029,509 to Serra et al. teaches a system for full sinusoidal encoding and synthesis of the tonal part of a signal and LPC coding of the non-tonal part of the signal. This approach has the computational expense of full sinusoidal additive encoding and synthesis plus the expense of LPC coding and synthesis. A similar approach is applied to speech signals in U.S. Pat. No. 5,774,837, to Yeldener et al., and U.S. Pat. No. 5,787,387 to Aquilar.
In "A Switched Parametric & Transform Audio Coder", Scott Levine et al., Proceedings of the IEEE ICASSP, May 15-19, 1999 Phoenix, Ariz., a system is taught wherein low frequencies are encoded and synthesized using full sinusoidal additive synthesis, and high frequencies are encoded using LPC with a white noise excitation signal. This is economical in terms of computation, but the high-frequency synthesized signal sounds excessively noise-like for tonal audio signals. A similar approach is applied to voiced speech signals in "HNS: Speech Modification Based on a Harmonic+Noise Model," J. Laroche et al., Proceedings of IEEE ICASSP, April 1993, Minneapolis, Minn. The use of colored noise to model the high frequencies of tonal audio signals is less objectionable when applied to speech signals, but still results in some "buzziness" at high frequencies.
U.S. Pat. No. 5,806,024, to Ozawa, teaches a system wherein the short time magnitude spectrum of the tonal audio signal is determined in frames. The tonal audio signal is assumed to have a harmonic component with time-varying pitch. The pitch varies slowly enough that it can be considered constant over each frame. For each frame, a pitch is determined. A harmonic spectrum is determined for each frame as the values of the magnitude spectrum at multiples of the pitch frequency. A residual spectrum is determined for each frame as the magnitude spectrum minus the harmonic spectrum. The harmonic spectrum frames and residual spectrum frames are vector quantized (VQ) to form a harmonic spectrum codebook, residual spectrum codebook, and a gain codebook. The signal is encoded as a sequence of unique coding vector numbers identifying coding vectors in these codebooks. Thus the harmonic spectrum codebook sequence codes the pitched part of the signal, and the residual codebook sequence codes the non-tonal and non-pitched-but-tonal part of the signal. This approach can be economical, but with VQ much of the richness in time-varying behavior is lost. This is especially true for complex tonal audio signals such as high-fidelity music signals.
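The per-frame split into harmonic and residual spectra, followed by codebook lookup, can be sketched as follows. The function names, the nearest-bin sampling of the harmonics, the zeroing of those bins in the residual, and the squared-error codebook search are assumptions for illustration and may differ from the method of the cited patent.

```python
def split_spectrum(magnitudes, bin_hz, pitch_hz):
    """Split one frame's magnitude spectrum into harmonic and residual parts.

    magnitudes[i] is the magnitude at frequency i * bin_hz. The harmonic
    spectrum samples the bin nearest each multiple of the pitch; the
    residual is the original spectrum with those bins subtracted out.
    """
    residual = list(magnitudes)
    harmonic = []
    k = 1
    while True:
        index = int(round(k * pitch_hz / bin_hz))
        if index >= len(magnitudes):
            break
        harmonic.append(magnitudes[index])
        residual[index] = 0.0
        k += 1
    return harmonic, residual

def nearest_codevector(vector, codebook):
    """Index of the codebook entry with the smallest squared error."""
    def squared_error(entry):
        return sum((x - y) ** 2 for x, y in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: squared_error(codebook[i]))
```

Each frame is then represented only by codebook indices, so every frame that maps to the same index is reconstructed identically. This is the mechanism by which fine time-varying detail is lost.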