Synthesis of acoustic waveforms has applications in speech and musical processing. When an acoustic waveform is parametrically represented (e.g., modeled as a sum of sinusoids with time-varying amplitudes, frequencies and phases), it becomes possible to achieve data reduction, effective modification of time and frequency (pitch), and flexible control for resynthesis of the waveform.
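The sum-of-sinusoids representation described above can be sketched as follows. This is a minimal illustration, not a complete analysis/synthesis system; the function name and the per-sample parameter layout are assumptions for the example.

```python
import numpy as np

def additive_synthesis(amps, freqs, phases, sr=44100):
    """Resynthesize a waveform as a sum of sinusoids (illustrative sketch).

    amps, freqs, phases: arrays of shape (num_samples, num_partials) giving
    the time-varying amplitude, frequency (Hz) and phase offset of each
    sinusoidal partial at every sample instant.
    """
    num_samples, num_partials = amps.shape
    out = np.zeros(num_samples)
    for k in range(num_partials):
        # Integrate the instantaneous frequency to obtain a continuous
        # phase track, then accumulate this partial's contribution.
        inst_phase = 2 * np.pi * np.cumsum(freqs[:, k]) / sr + phases[:, k]
        out += amps[:, k] * np.sin(inst_phase)
    return out
```

Because the phase is obtained by integrating the instantaneous frequency, slowly varying frequency tracks yield a waveform free of the discontinuities that would result from concatenating fixed-frequency segments.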
In the field of speech signal processing, research on the synthesis and coding of speech signals has been motivated by the speech production model, where the speech waveform s(t) is assumed to be the output of passing a glottal excitation waveform e(t) through a linear time-varying system with frequency response H(f, t), representing the characteristics of the vocal tract. The excitation waveform e(t) can be modeled as a sum of sinusoids. From this speech production model, the so-called source-filter model (SFM) for speech synthesis follows naturally, as shown in FIG. 1. See, McAulay et al., "Speech Analysis/Synthesis Based on a Sinusoidal Representation," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-34, pp. 744-754, Aug. 1986; and Quatieri et al., "Speech Transformations Based on a Sinusoidal Representation," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-34, pp. 1449-1464, Dec. 1986. As indicated in FIG. 1, the sinusoidal parameters, i.e., the time-varying amplitudes a.sub.k (t), frequencies f.sub.k (t) and phases .phi..sub.k (t), k=1, 2, . . . , L(m), where L(m) is the number of sinusoids at frame m, and the frequency responses of the vocal tract H(f.sub.k, t) are all jointly estimated during the analysis of the original speech signal.
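The source-filter synthesis of one frame can be sketched as follows: each excitation sinusoid at frequency f.sub.k is scaled by the magnitude of the vocal-tract response and shifted by its phase. This is a hedged, frame-local sketch; the function name and the representation of H as a callable are assumptions for the example, not the method of the cited references.

```python
import numpy as np

def sfm_frame(freqs, exc_amps, exc_phases, H, sr=44100, frame_len=512):
    """Synthesize one frame under a source-filter model (illustrative sketch).

    freqs, exc_amps, exc_phases: per-partial excitation parameters for this
    frame. H: callable returning the (complex) vocal-tract frequency
    response H(f) for this frame.
    """
    t = np.arange(frame_len) / sr
    out = np.zeros(frame_len)
    for f_k, a_k, phi_k in zip(freqs, exc_amps, exc_phases):
        Hk = H(f_k)
        # The filter scales the excitation sinusoid by |H(f_k)| and
        # shifts its phase by arg H(f_k).
        out += a_k * np.abs(Hk) * np.cos(2 * np.pi * f_k * t
                                         + phi_k + np.angle(Hk))
    return out
```

In an actual system the per-frame outputs would be overlap-added or phase-matched at frame boundaries; that machinery is omitted here.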
The source-filter model has several disadvantages when used for synthesizing musical instrument sounds. First, according to Quatieri et al., above, the filtering of the excitation through the vocal tract model filter is done in the frequency domain, and the frequency responses H(f.sub.k, t) are stored. However, because musical instrument sounds require frequency modification (pitch transposition), either more frequency response points must be stored or additional frequency response values must be calculated by interpolation. This increases either the amount of data storage or the number of computations required. Second, because the amplitude of each individual sinusoid changes dynamically, the quality of the resulting acoustic waveform is more sensitive to possible phase discontinuities at frame boundaries. Third, when L(m) is large, the computational requirement of the source-filter model is difficult to meet for real-time implementation using existing low-cost programmable digital signal processors (DSPs). Finally, the speech production model does not apply to music synthesis, and there is no justification for extracting an excitation and a vocal-tract-type filter from a musical instrument sound.
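The interpolation cost mentioned in the first disadvantage can be illustrated concretely: when a partial's frequency is transposed to a value between the stored frequency response points, a new response value must be computed at run time. The helper below is hypothetical and uses simple linear interpolation of the stored magnitude samples; it is only meant to show the extra per-partial computation that transposition incurs.

```python
import numpy as np

def interp_response(freq_grid, H_mag, f_query):
    """Interpolate stored |H| samples at a transposed partial frequency.

    freq_grid: ascending frequencies (Hz) at which |H| was stored.
    H_mag: stored magnitude response values at those frequencies.
    f_query: the transposed frequency at which a value is now needed.
    """
    # One interpolation per partial per frame is the added cost of
    # pitch transposition under a stored-frequency-response scheme.
    return np.interp(f_query, freq_grid, H_mag)
```

Storing a denser grid avoids this computation at the cost of more memory, which is exactly the storage/computation trade-off noted above.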