For many years the most popular approach to representing speech signals parametrically has been linear predictive (LP) modeling. Linear prediction is described by J. Makhoul, "Linear Prediction: A Tutorial Review," Proc. IEEE, vol. 63, pp. 561-580, April 1975. In this approach, the speech production process is modeled as a linear time-varying, all-pole vocal tract filter driven by an excitation signal representing characteristics of the glottal waveform. While many variations on this basic model have been widely used in low bit-rate speech coding, the formulation known as pitch-excited LPC has been very popular for speech synthesis and modification as well. In pitch-excited LPC, the excitation signal is modeled either as a periodic pulse train for voiced speech or as white noise for unvoiced speech. By effectively separating and parameterizing the voicing state, pitch frequency and articulation rate of speech, pitch-excited LPC can flexibly modify analyzed speech as well as produce artificial speech given linguistic production rules (referred to as synthesis-by-rule).
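The pitch-excited excitation model described above can be illustrated with a minimal sketch. The function name, parameter names, and the sign convention of the predictor coefficients (y[n] = e[n] + sum of a[k] * y[n-1-k]) are assumptions made for illustration and are not taken from the cited references; the coefficients are assumed to define a stable filter.

```python
import random


def pitch_excited_lpc(a, n_samples, voiced, pitch_period=80, gain=1.0, seed=0):
    """Drive an all-pole filter with a pulse train (voiced) or white noise (unvoiced).

    `a` holds hypothetical predictor coefficients; the filter computed here is
    y[n] = e[n] + sum_k a[k] * y[n - 1 - k]  (assumed sign convention).
    """
    rng = random.Random(seed)
    out = []
    for n in range(n_samples):
        if voiced:
            # Periodic pulse train: one pulse every `pitch_period` samples.
            e = gain if n % pitch_period == 0 else 0.0
        else:
            # White Gaussian noise excitation.
            e = gain * rng.gauss(0.0, 1.0)
        # All-pole recursion over previously synthesized samples.
        y = e + sum(a[k] * out[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        out.append(y)
    return out
```

In this sketch, changing `pitch_period` alters the pitch of voiced output without touching the filter coefficients, which is the separation of parameters that makes modification straightforward in this model.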
However, pitch-excited LPC is inherently constrained and suffers from well-known distortion characteristics. LP modeling is based on the assumption that the vocal tract may be modeled as an all-pole filter; deviations of an actual vocal tract from this ideal thus result in an excitation signal without the purely pulse-like or noisy structure assumed in the excitation model. Pitch-excited LPC therefore produces synthetic speech with noticeable and objectionable distortions. Also, LP modeling assumes a priori that a given signal is the output of a time-varying filter driven by an easily represented excitation signal, which limits its usefulness to those signals (such as speech) which are reasonably well represented by this structure. Furthermore, pitch-excited LPC typically requires a "voiced/unvoiced" classification and a pitch estimate for voiced speech; serious distortions result from errors in either procedure.
Time-frequency representations of speech combine the observations that much speech information resides in the frequency domain and that speech production is an inherently non-stationary process. While many different types of time-frequency representations exist, to date the most popular for the purpose of speech processing has been the short-time Fourier transform (STFT). One formulation of the STFT, discussed in the article by J. L. Flanagan and R. M. Golden, "Phase Vocoder," Bell Sys. Tech. J., vol. 45, pp. 1493-1509, 1966, and known as the digital phase vocoder (DPV), parameterizes speech production information in a manner very similar to LP modeling and is capable of performing speech modifications without the constraints of pitch-excited LPC.
Unfortunately, the DPV is also computationally intensive, limiting its usefulness in real-time applications. An alternate approach to the problem of speech modification using the STFT is based on the discrete short-time Fourier transform (DSTFT), implemented using a Fast Fourier Transform (FFT) algorithm. This approach is described in the Ph.D. thesis of M. R. Portnoff, Time-Scale Modification of Speech Based on Short-Time Fourier Analysis, Massachusetts Institute of Technology, 1978. While this approach is computationally efficient and provides much of the functionality of the DPV, when applied to modifications the DSTFT generates reverberant artifacts due to phase distortion. An iterative approach to phase estimation in the modified transform has been disclosed by D. W. Griffin and J. S. Lim in "Signal Estimation from Modified Short-Time Fourier Transform," IEEE Trans. on Acoust., Speech and Signal Processing, vol. ASSP-32, no. 2, pp. 236-242, 1984. This estimation technique reduces phase distortion, but adds greatly to the computation required for implementation.
Sinusoidal modeling, which represents signals as sums of arbitrary amplitude- and frequency-modulated sinusoids, has recently been introduced as a high-quality alternative to LP modeling and the STFT and offers advantages over these approaches for synthesis and modification problems. As with the STFT, sinusoidal modeling operates without an "all-pole" constraint, resulting in more natural sounding synthetic and modified speech. Also, sinusoidal modeling does not require the restrictive "source/filter" structure of LP modeling; sinusoidal models are thus capable of representing signals from a variety of sources, including speech from multiple speakers, music signals, speech in musical backgrounds, and certain biological and biomedical signals. In addition, sinusoidal models offer greater access to and control over speech production parameters than the STFT.
The most notable and widely used formulation of sinusoidal modeling is the Sine-Wave System introduced by McAulay and Quatieri, as described in their articles "Speech Analysis/Synthesis Based on a Sinusoidal Representation," IEEE Trans. on Acoust., Speech and Signal Processing, vol. ASSP-34, pp. 744-754, August 1986, and "Speech Transformations Based on a Sinusoidal Representation," IEEE Trans. on Acoust., Speech and Signal Processing, vol. ASSP-34, pp. 1449-1464, December 1986. The Sine-Wave System has proven to be useful in a wide range of speech processing applications, and the analysis and synthesis techniques used in the system are well-justified and reasonable, given certain assumptions.
Analysis in the Sine-Wave System derives model parameters from peaks of the spectrum of a windowed signal segment. The theoretical justification for this analysis technique is based on an analogy to least-squares approximation of the segment by constant-amplitude, constant-frequency sinusoids. However, sinusoids of this form are not used to represent the analyzed signal; instead, synthesis is implemented with parameter tracks created by matching sinusoids from one frame to the next and interpolating the matched parameters using polynomial functions.
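The peak-based analysis step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published Sine-Wave System algorithm: the Hamming window, the direct (non-FFT) transform, the amplitude normalization, and the names `sine_wave_peaks` and `max_peaks` are all choices made here for clarity.

```python
import cmath
import math


def sine_wave_peaks(segment, sample_rate, max_peaks=10):
    """Return (amplitude, frequency_hz, phase) for the largest spectral peaks.

    Assumptions: Hamming window, direct DFT evaluation, and amplitude
    normalized by the window sum. Frequencies are quantized to DFT bins.
    """
    n = len(segment)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    wsum = sum(window)
    half = n // 2
    # Spectrum of the windowed segment for bins 0..n/2.
    spectrum = [
        sum(segment[i] * window[i] * cmath.exp(-2j * math.pi * k * i / n)
            for i in range(n))
        for k in range(half + 1)
    ]
    mags = [abs(c) for c in spectrum]
    peaks = []
    for k in range(1, half):
        # A peak is a bin whose magnitude exceeds that of its neighbors.
        if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]:
            peaks.append((2 * mags[k] / wsum, k * sample_rate / n,
                          cmath.phase(spectrum[k])))
    peaks.sort(key=lambda p: p[0], reverse=True)
    return peaks[:max_peaks]
```

Each returned triple corresponds to one constant-amplitude, constant-frequency sinusoid for the analyzed segment; the frame-to-frame matching and interpolation used in synthesis are not shown.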
This implementation, while making possible many of the applications of the system, represents an uncontrolled departure from the theoretical basis of the analysis technique. This can lead to distortions, particularly during non-stationary portions of a signal. Furthermore, the matching and interpolation algorithms add to the computational overhead of the system, and the continuously variable nature of the parameter tracks necessitates direct evaluation of the sinusoidal components at each sample point, a significant computational obstacle. A more computationally efficient synthesis algorithm for the Sine-Wave System has been proposed by McAulay and Quatieri in "Computationally Efficient Sine-Wave Synthesis and its Application to Sinusoidal Transform Coding," Proc. IEEE Int'l Conf. on Acoust., Speech and Signal Processing, pp. 370-373, April 1988, but this algorithm departs even farther from the theoretical basis of analysis.
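The computational burden of evaluating every sinusoidal component at every sample point can be seen in a minimal oscillator-bank sketch. It assumes linear interpolation of both amplitude and frequency across a frame with an accumulated phase; this is a deliberate simplification for illustration and is not the polynomial phase interpolation of the Sine-Wave System. The function name and the track tuple layout are hypothetical.

```python
import math


def synth_frame(tracks, frame_len, sample_rate):
    """Sum time-varying sinusoids over one frame by direct evaluation.

    Each track is (amp_start, amp_end, freq_start_hz, freq_end_hz, phase_start).
    Assumption: amplitude and frequency vary linearly within the frame.
    """
    out = [0.0] * frame_len
    for amp0, amp1, f0, f1, phase in tracks:
        for n in range(frame_len):
            frac = n / frame_len
            amp = amp0 + (amp1 - amp0) * frac
            freq = f0 + (f1 - f0) * frac
            # One cosine evaluation per component per sample.
            out[n] += amp * math.cos(phase)
            phase += 2 * math.pi * freq / sample_rate
    return out
```

The inner loop runs once per component per sample, so the cost in this sketch grows as the product of the number of tracks and the frame length.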
Many techniques for the digital generation of musical sounds have been studied, and many are used in commercially available music synthesizers. In all of these techniques a basic tradeoff is encountered; namely, the conflict between accuracy and generality (defined as the ability to model a wide variety of sounds) on the one hand and computational efficiency on the other. Some techniques, such as frequency modulation (FM) synthesis as described by J. M. Chowning, "The Synthesis of Complex Audio Spectra by Means of Frequency Modulation," J. Audio Eng. Soc., vol. 21, pp. 526-534, September 1973, are computationally efficient and can produce a wide variety of new sounds, but lack the ability to accurately model the sounds of existing musical instruments.
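The efficiency of FM synthesis can be illustrated with a minimal sketch of the basic two-oscillator form, in which a carrier sinusoid is phase-modulated by a second sinusoid scaled by a modulation index. The function and parameter names are assumptions for illustration, and the time-varying index and amplitude envelopes used in practical instruments are omitted.

```python
import math


def fm_tone(carrier_hz, modulator_hz, index, sample_rate, n_samples, amplitude=1.0):
    """Generate amplitude * sin(2*pi*fc*t + index * sin(2*pi*fm*t))."""
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        modulation = index * math.sin(2 * math.pi * modulator_hz * t)
        samples.append(amplitude * math.sin(2 * math.pi * carrier_hz * t + modulation))
    return samples
```

Only two sine evaluations per sample are needed regardless of how many sidebands the resulting spectrum contains, which is the source of the efficiency; with `index` set to zero the output reduces to a single sinusoid at the carrier frequency.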
On the other hand, sinusoidal additive synthesis implemented using the DPV is capable of analyzing the sound of a given instrument, synthesizing a perfect replica and performing a wide variety of modifications. However, as previously mentioned, the amount of computation needed to calculate the large number of time-varying sinusoidal components required prohibits real-time synthesis using relatively inexpensive hardware. As in the case of time-frequency speech modeling, the computational problems of additive synthesis of musical tones may be addressed by formulating the DPV in terms of the DSTFT and implementing this formulation using FFT algorithms. Unfortunately, this strategy produces the same type of distortion when applied to musical tone synthesis as to speech synthesis.
There clearly exists a need for better methods and devices for the analysis, synthesis and modification of audio waveforms. In particular, an analysis/synthesis system capable of altering the pitch frequency and articulation rate of speech and music signals and capable of operating with low computational requirements and therefore low hardware cost would satisfy long-felt needs and would contribute significantly to the art.