Typically, the problem of representing speech signals is approached by using a speech production model in which speech is viewed as the result of passing a glottal excitation waveform through a time-varying linear filter that models the resonant characteristics of the vocal tract. In many speech applications it suffices to assume that the glottal excitation can be in one of two possible states, corresponding to voiced or unvoiced speech. In the voiced speech state the excitation is periodic, with a period that is allowed to vary slowly over time relative to the analysis frame duration (typically 10-20 msec). For the unvoiced speech state the glottal excitation is modelled as random noise with a flat spectrum. In both cases the power level of the excitation is also considered to be slowly time-varying.
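By way of illustration only, the binary model described above can be sketched for a single analysis frame as follows. The sketch assumes an 8 kHz sampling rate (so that a 20 msec frame is 160 samples), represents the voiced excitation as a unit impulse train, and stands in for the vocal tract with an all-pole filter held fixed over the frame. The function names, the pitch period of 80 samples, and the two filter coefficients are hypothetical choices made for the example and are not taken from any particular system.

```python
import random

def excitation(voiced, n_samples, pitch_period=80, gain=1.0, seed=0):
    """Binary-model excitation for one frame (illustrative assumption)."""
    if voiced:
        # Voiced state: periodic impulse train, one pulse per pitch period.
        return [gain if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    # Unvoiced state: Gaussian white noise, whose spectrum is flat on average.
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(x, a):
    """Apply y[n] = x[n] - sum_k a[k-1] * y[n-k], starting from zero state."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y

def synthesize_frame(voiced, a, n_samples=160, pitch_period=80, gain=1.0):
    """Pass the chosen excitation through the frame's vocal tract filter."""
    return all_pole_filter(excitation(voiced, n_samples, pitch_period, gain), a)

# Hypothetical single resonance: pole radius 0.9, so a = [-2 r cos(theta), r^2].
coeffs = [-1.2, 0.81]
voiced_frame = synthesize_frame(True, coeffs)
unvoiced_frame = synthesize_frame(False, coeffs)
```

In a complete synthesizer the pitch period, gain, and filter coefficients would be updated from frame to frame, and the filter state would be carried across frame boundaries; the sketch omits both for brevity. It also makes the hard voiced/unvoiced switch explicit, since each frame must commit to exactly one of the two excitation types.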
While this binary model has been used successfully to design narrowband vocoders and speech synthesis systems, its limitations are well known. For example, the excitation is often mixed, having both voiced and unvoiced components simultaneously, and often only portions of the spectrum are truly harmonic. Furthermore, the binary model requires that each frame of data be classified as either voiced or unvoiced, a decision that is particularly difficult to make if the speech is also subject to additive acoustic noise.
Speech coders at rates compatible with conventional transmission lines (i.e., 2.4-9.6 kilobits per second) would meet a substantial need. At such rates the binary model is ill-suited for coding applications. Additionally, speech processing devices and methods that allow the user to modify various parameters in reconstructing a waveform would find substantial usage. For example, time-scale modification (without pitch alteration) would be a very useful feature for a variety of speech applications (e.g., slowing down speech for translation purposes or speeding it up for scanning purposes) as well as for musical composition or analysis. Unfortunately, time-scale (and other parameter) modifications are likewise not accomplished with high quality by devices employing the binary model.
Thus, there exists a need for better methods and devices for processing audible waveforms. In particular, speech coders operable at mid-band rates and in noisy environments, as well as synthesizers capable of maintaining the perceptual quality of speech while changing the rate of articulation, would satisfy long-felt needs and provide substantial contributions to the art.