This invention relates generally to music and speech synthesis and, in particular, to sinusoidal model-based synthesis.
In 1986, McAulay and Quatieri of Lincoln Laboratory, MIT, proposed to represent speech/music signals as a sum of sinusoids parameterized by time-varying amplitudes, frequencies and phases. See, R. J. McAulay and T. F. Quatieri, "Speech Analysis/Synthesis Based On A Sinusoidal Representation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, pp. 744-754, August 1986. Their Sinusoidal Transformation System (STS) based on this model greatly influenced the research and development of sinusoidal modeling-based music analysis/synthesis. Serra and Smith of Stanford University extended the sinusoidal model to include a stochastic part in their Spectral Modeling System (SMS). See, X. Serra, A System For Sound Analysis/Transformation/Synthesis Based On A Deterministic Plus Stochastic Decomposition, Ph.D. Thesis, Stanford University, Stanford, Calif., 1989. The extension provides a mechanism to model the audible characteristics and identity resulting from complicated turbulence in some sounds.
In both STS and SMS, the analysis and synthesis are performed on a frame-by-frame basis. In analysis, an average amplitude, frequency and phase for each sinusoid are obtained by measuring the magnitude, frequency and phase positions of each peak in the Fourier transform of the data frame. In synthesis, these parameters are interpolated to generate individual sine waves, and these sine waves are mixed to yield the sinusoidal part of the synthesized sound.
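The frame-by-frame analysis and synthesis described above can be sketched as follows. This is an illustrative sketch only: the Hann window, the simple local-maximum peak picker, and the uncompensated amplitude and phase estimators are assumptions made for illustration, not details prescribed by STS or SMS.

```python
import numpy as np

def analyze_frame(frame, fs):
    """Estimate an (amplitude, frequency, phase) triple for each spectral
    peak in one data frame, by measuring the magnitude, frequency and
    phase positions of the peaks in its Fourier transform."""
    n = len(frame)
    win = np.hanning(n)
    spec = np.fft.rfft(frame * win)
    mag = np.abs(spec)
    # Crude peak picking: every strict local maximum of the magnitude.
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] > mag[k + 1]]
    return [(2 * mag[k] / np.sum(win),  # amplitude, ignoring scalloping loss
             k * fs / n,                # bin frequency in Hz
             np.angle(spec[k]))         # raw bin phase (a refined estimator
            for k in peaks]             # would compensate the window phase)

def synthesize_frame(params, n, fs):
    """Mix one sine wave per measured peak (oscillator-bank synthesis)."""
    t = np.arange(n) / fs
    out = np.zeros(n)
    for amp, freq, phase in params:
        out += amp * np.cos(2 * np.pi * freq * t + phase)
    return out
```

In a full system the per-frame parameters would be interpolated across frame boundaries before mixing; the sketch above synthesizes a single frame from its own measurements.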
Generating those individual sine waves in a real-time music synthesizer imposes a major demand on the computation power. For example, a modern professional music synthesizer typically requires simultaneous generation of at least 32 notes. Each note contains about 40 sinusoids on average. Thus a total of 32×40=1,280 sinusoids need to be generated in real time at a sampling rate of at least 44.1 kHz. This requirement, when combined with other system overhead, makes the implementation difficult even with present high-speed digital signal processors (DSPs).
Reducing this computation requirement in synthesis is a first motivation for the present invention. In McAulay and Quatieri, above, the amplitude (in dB) and the phase track within a data frame are modeled by linear and cubic polynomials respectively. Clearly, the computational requirement for generating phase samples can be reduced by using quadratic phase polynomials in place of cubic ones. However, previous efforts in reducing the phase polynomial order have not been very successful. The main reason is that the phase and frequency, a total of four measurements at the two ends of a data frame, cannot in general be made in exact agreement with a quadratic polynomial, which has only three free parameters. The usual practice is to neglect phase measurements in favor of frequency measurements, but this seems to cause significant degradation in the fidelity of the synthesized sound. See, McAulay and Quatieri, above.
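For reference, the cubic phase construction of McAulay and Quatieri fixes the two higher-order polynomial coefficients from the four boundary measurements, resolving the 2π ambiguity in the measured end phase by a maximal-smoothness criterion. A sketch under these conventions (continuous time over a frame of length T, frequencies in radians per unit time) might read:

```python
import numpy as np

def cubic_phase_coeffs(theta0, omega0, theta1, omega1, T):
    """Cubic phase polynomial theta(t) = theta0 + omega0*t + a*t**2 + b*t**3
    matching the phase and frequency measurements at both ends of a frame
    of length T (the McAulay-Quatieri construction)."""
    # Phase is measured modulo 2*pi, so the true end phase is
    # theta1 + 2*pi*M; M is chosen to make the track maximally smooth.
    M = round(((theta0 + omega0 * T - theta1) + (omega1 - omega0) * T / 2)
              / (2 * np.pi))
    dtheta = theta1 + 2 * np.pi * M - theta0 - omega0 * T
    a = 3 * dtheta / T**2 - (omega1 - omega0) / T
    b = -2 * dtheta / T**3 + (omega1 - omega0) / T**2
    return a, b, M
```

The four boundary constraints determine a and b exactly; it is precisely this exact matching, with one more free parameter than a quadratic offers, that the quadratic approach below relaxes.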
About 90% of the computational cost of an analysis-based music synthesis system using the oscillator bank approach is spent on generating the sinusoidal samples. Computation of the phase samples of the sinusoids takes about one-half of that cost (assuming sinusoidal values are pre-stored).
The invention provides a quadratic phase model approach to music and speech analysis and synthesis, wherein the polynomial coefficients are determined by least-squares fitting the model to both frequency and phase measurements. Unlike existing quadratic algorithms, which ignore either the phase or the frequency measurements at the boundaries of the data frame, the proposed quadratic phase interpolation algorithm incorporates both measurements using a weighted least-squares fit. The underlying assumption is that the true frequency and phase at the two ends of a data frame conform to a quadratic phase model, and that an exact match between the measured phase and frequency and the quadratic model is unnecessary because of noise in the measurements.
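A minimal sketch of such a weighted least-squares fit follows. The uniform default weights, the trapezoidal unwrapping rule for the end phase, and the function name are illustrative assumptions; the weighting actually used by the invention is given in its detailed description.

```python
import numpy as np

def quadratic_phase_fit(theta0, omega0, theta1, omega1, T,
                        w_phase=1.0, w_freq=1.0):
    """Weighted least-squares fit of theta(t) = c0 + c1*t + c2*t**2 to the
    four frame-boundary measurements: phase and frequency at each end of
    a frame of length T. Returns the coefficients (c0, c1, c2)."""
    # Unwrap the measured end phase: predict it by integrating the
    # linearly interpolated frequency, then pick the nearest 2*pi multiple.
    pred = theta0 + (omega0 + omega1) * T / 2
    M = round((pred - theta1) / (2 * np.pi))
    theta1u = theta1 + 2 * np.pi * M
    # Design matrix rows: theta(0), theta'(0), theta(T), theta'(T) --
    # four equations, three unknowns, hence a least-squares compromise.
    A = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [1.0, T,   T**2],
                  [0.0, 1.0, 2 * T]])
    y = np.array([theta0, omega0, theta1u, omega1])
    sw = np.sqrt(np.diag([w_phase, w_freq, w_phase, w_freq]))
    c, *_ = np.linalg.lstsq(sw @ A, sw @ y, rcond=None)
    return c
```

When the measurements happen to be exactly consistent with some quadratic, the fit recovers that quadratic; otherwise the residual is distributed across the four measurements according to the weights.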
An advantage of the inventive approach is that the resulting frequency tracks for musical tones tend to be smoother (i.e., with fewer spurious oscillations) than those generated by the cubic algorithm. It can be shown (see below) that when the frequency does not vary much over a data frame, which is a typical case in a musical tone, the cubic-interpolated frequency track will always have slopes with opposite signs at the two ends of each data frame. This tends to cause oscillation in the interpolated frequency track, as illustrated by the solid line in FIG. 1. Although the oscillation is typically small and hardly noticeable when the frequency track is plotted at the usual scale, it is undesirable for synthesizing musical tones.
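This endpoint behavior can be checked numerically. With equal frequency measurements at the two frame ends (ω0 = ω1) and a residual phase mismatch dθ relative to a constant-frequency track, the McAulay-Quatieri cubic reduces to θ(t) = θ0 + ωt + at² + bt³ with a = 3dθ/T² and b = -2dθ/T³, and the frequency slopes at t = 0 and t = T come out equal in magnitude and opposite in sign (the concrete values of T and dθ below are arbitrary illustrations):

```python
# Frame of length T, identical end frequencies, small phase mismatch dtheta.
T, dtheta = 10.0, 0.3
a = 3 * dtheta / T**2    # quadratic phase coefficient of the cubic fit
b = -2 * dtheta / T**3   # cubic phase coefficient of the cubic fit
slope_start = 2 * a              # d(frequency)/dt at t = 0
slope_end = 2 * a + 6 * b * T    # d(frequency)/dt at t = T
print(slope_start, slope_end)    # equal magnitude, opposite sign
```

Algebraically, slope_end = 6dθ/T² - 12dθ/T² = -6dθ/T² = -slope_start, so any nonzero phase mismatch forces the interpolated frequency to turn around inside the frame.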
Another advantage of the proposed approach is that it can be used to reduce both the storage requirements and the computational complexity of the system. After the least-squares fitting is completed, the fitted frequency samples can be stored at the frame boundaries in place of the measured ones. Then the fitted phase track can be obtained simply by integration of the instantaneous frequency, which is taken to be the linear interpolation of the fitted frequency samples at the frame boundaries. This eliminates the need to store the phase samples at the frame boundaries and simplifies the computation needed to determine the polynomial coefficients. Compared with the commonly used cubic phase interpolation algorithm, the proposed algorithm eliminates one-third of the computational operations and reduces the parameter storage by 50%.
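The resulting synthesis loop for one sinusoidal partial can be sketched as follows, with frequency (and amplitude) samples stored only at frame boundaries and the phase obtained by running integration of the linearly interpolated frequency. The discrete left-Riemann integration and the function name are assumptions made for illustration:

```python
import numpy as np

def synthesize_partial(amps, freqs, phase0, frame_len):
    """Synthesize one partial across several frames. `amps` and `freqs`
    hold one value per frame boundary (freqs in radians per sample);
    no per-boundary phase samples are stored -- phase is accumulated by
    integrating the linearly interpolated frequency."""
    phase = phase0
    out = []
    for i in range(len(freqs) - 1):
        t = np.arange(frame_len)
        # Linear frequency over the frame implies a quadratic phase track.
        w = freqs[i] + (freqs[i + 1] - freqs[i]) * t / frame_len
        amp = amps[i] + (amps[i + 1] - amps[i]) * t / frame_len
        theta = phase + np.concatenate(([0.0], np.cumsum(w[:-1])))
        out.append(amp * np.cos(theta))
        phase += np.sum(w)  # carry the integrated phase into the next frame
    return np.concatenate(out)
```

Because the phase carried across each boundary is the same running sum used inside the frame, the synthesized phase track is continuous from frame to frame without any stored phase samples.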
Informal listening tests on about two dozen analyzed musical notes reveal no degradation in performance when the cubic phase interpolation algorithm is replaced by the proposed quadratic algorithm.