The field of this invention is speech technology generally and, in particular, methods and devices for analyzing, digitally encoding, modifying, and synthesizing speech or other acoustic waveforms.
Digital speech coding methods and devices are the subject of considerable present interest, particularly at rates compatible with conventional transmission lines (i.e., 2.4-9.6 kilobits per second). At such rates, the typical approaches to speech modeling, such as the so-called "binary excitation models", are ill-suited for coding applications and, even with linear predictive coding or other state-of-the-art coding techniques, yield poor-quality speech transmissions.
In the binary excitation models, speech is viewed as the result of passing a glottal excitation waveform through a time-varying linear filter that models the resonant characteristics of the vocal tract. It is assumed that the glottal excitation can be in one of two possible states corresponding to voiced or unvoiced speech. In the voiced speech state the excitation is periodic with a period which varies slowly over time. In the unvoiced speech state, the glottal excitation is modeled as random noise with a flat spectrum.
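By way of illustration only, the binary excitation model described above can be sketched as follows. The sketch assumes a unit impulse train for the voiced state, Gaussian white noise for the unvoiced state, and a fixed two-pole filter standing in for the time-varying vocal tract filter within a single frame. The function names, the filter coefficients, and the 8 kHz sampling rate implied by the comments are hypothetical and are not taken from any referenced application.

```python
import random

def excitation_frame(voiced, n_samples, pitch_period, rng):
    """Two-state excitation: periodic impulses if voiced, white noise if unvoiced."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(excitation, coeffs):
    """Apply y[n] = x[n] + sum_k coeffs[k-1] * y[n-k] (vocal tract stand-in)."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for k, c in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += c * out[n - k]
        out.append(acc)
    return out

rng = random.Random(0)
# Hypothetical frame: 160 samples (20 ms at an assumed 8 kHz), pitch period 80 samples.
voiced = excitation_frame(True, 160, 80, rng)
unvoiced = excitation_frame(False, 160, 80, rng)
coeffs = [1.3, -0.8]  # assumed stable two-pole resonance (pole radius below 1)
voiced_speech = all_pole_filter(voiced, coeffs)
unvoiced_speech = all_pole_filter(unvoiced, coeffs)
```

The hard switch between the two excitation branches is the "binary" decision that the sinusoidal model discussed below avoids.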
The above-referenced parent application, U.S. Ser. No. 712,866, discloses an alternative to the binary excitation model in which speech analysis and synthesis as well as coding can be accomplished simply and effectively by employing a time-frequency representation of the speech waveform which is independent of the speech state. Specifically, a sinusoidal model for the speech waveform is used to develop a new analysis-synthesis technique.
The basic method of U.S. Ser. No. 712,866 includes the steps of: (a) selecting frames (i.e., windows of about 20-40 milliseconds) of samples from the waveform; (b) analyzing each frame of samples to extract a set of frequency components; (c) tracking the components from one frame to the next; and (d) interpolating the values of the components from one frame to the next to obtain a parametric representation of the waveform. A synthetic waveform can then be constructed by generating a set of sine waves corresponding to the parametric representation. The disclosures of U.S. Ser. No. 712,866 are incorporated herein by reference.
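Steps (a) and (b) above can be sketched, under stated assumptions, as windowing one frame and picking the largest peaks of its magnitude spectrum. The Hamming window, the direct (non-fast) discrete Fourier transform, the 8 kHz sampling rate, and all function names below are assumptions chosen for a self-contained illustration; they are not asserted to be the implementation of the referenced application.

```python
import cmath
import math

def hamming(n_samples):
    """Symmetric Hamming window (an assumed choice of analysis window)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def dft_magnitude(frame, n_fft):
    """Magnitude of the windowed frame's DFT for bins 0..n_fft/2."""
    window = hamming(len(frame))
    mags = []
    for k in range(n_fft // 2 + 1):
        acc = 0j
        for n, x in enumerate(frame):
            acc += window[n] * x * cmath.exp(-2j * math.pi * k * n / n_fft)
        mags.append(abs(acc))
    return mags

def pick_peaks(mags, max_peaks):
    """Return the bin indices of the largest local maxima, in ascending order."""
    peaks = [k for k in range(1, len(mags) - 1)
             if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]]
    peaks.sort(key=lambda k: mags[k], reverse=True)
    return sorted(peaks[:max_peaks])

fs = 8000            # assumed sampling rate in Hz
frame_len = 160      # 20 ms frame at the assumed rate
frame = [math.cos(2 * math.pi * 500 * n / fs)
         + 0.5 * math.cos(2 * math.pi * 1500 * n / fs + 0.7)
         for n in range(frame_len)]

mags = dft_magnitude(frame, frame_len)
peak_bins = pick_peaks(mags, max_peaks=2)
peak_freqs = [k * fs / frame_len for k in peak_bins]
```

For this two-tone test frame, `peak_freqs` holds the frequencies of the two strongest spectral peaks; the amplitude and phase of each component could be read from the same DFT bins.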
In one illustrated embodiment described in detail in U.S. Ser. No. 712,866, the method is employed to choose amplitudes, frequencies, and phases corresponding to the largest peaks in a periodogram of the measured signal, independently of the speech state. In order to reconstruct the speech waveform, the amplitudes, frequencies, and phases of the sine waves estimated on one frame are matched and allowed to continuously evolve into the corresponding parameter set on the successive frame. Because the number of estimated peaks is not constant and is slowly varying, the matching process is not straightforward. Rapidly varying regions of speech such as unvoiced/voiced transitions can result in large changes in both the location and number of peaks. To account for such rapid movements in spectral energy, the concept of "birth" and "death" of sinusoidal components is employed in a nearest-neighbor matching method based on the frequencies estimated on each frame. If a new peak appears, a "birth" is said to occur and a new track is initiated. If an old peak is not matched, a "death" is said to occur and the corresponding track is allowed to decay to zero. Once the parameters on successive frames have been matched, phase continuity of each sinusoidal component is ensured by unwrapping the phase. In one preferred embodiment the phase is unwrapped using a cubic phase interpolation function having parameter values that are chosen to satisfy the measured phase and frequency constraints at the frame boundaries while maintaining maximal smoothness over the frame duration. Finally, the corresponding sinusoidal amplitudes are simply interpolated in a linear manner across each frame.
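The frame-to-frame matching and the cubic phase interpolation described above can be sketched as follows. The greedy nearest-neighbor rule with a fixed maximum frequency jump, and the closed-form choice of the unwrapping integer M, are assumptions: the closed form is one commonly used way to satisfy the phase and frequency constraints at both frame boundaries while keeping the phase track smooth, and is not asserted to be the exact procedure of U.S. Ser. No. 712,866. All names and numeric values are hypothetical.

```python
import math

def match_tracks(prev_freqs, next_freqs, max_jump_hz):
    """Greedy nearest-neighbor matching of peak frequencies between two frames.

    Returns (matches, deaths, births): matched index pairs, unmatched indices
    of the previous frame, and unmatched indices of the next frame.
    """
    candidates = sorted(
        (abs(f0 - f1), i, j)
        for i, f0 in enumerate(prev_freqs)
        for j, f1 in enumerate(next_freqs)
        if abs(f0 - f1) <= max_jump_hz
    )
    matches, used_prev, used_next = [], set(), set()
    for _, i, j in candidates:
        if i not in used_prev and j not in used_next:
            matches.append((i, j))
            used_prev.add(i)
            used_next.add(j)
    deaths = [i for i in range(len(prev_freqs)) if i not in used_prev]
    births = [j for j in range(len(next_freqs)) if j not in used_next]
    return sorted(matches), deaths, births

def cubic_phase_coeffs(theta0, omega0, theta1, omega1, T):
    """Coefficients of theta(t) = theta0 + omega0*t + alpha*t**2 + beta*t**3.

    The cubic meets phase theta1 (modulo 2*pi) and frequency omega1 at t = T.
    M is the integer number of 2*pi unwraps (assumed closed-form choice).
    """
    x = (theta0 + omega0 * T - theta1) + (omega1 - omega0) * T / 2
    M = round(x / (2 * math.pi))
    d = theta1 + 2 * math.pi * M - theta0 - omega0 * T
    alpha = 3 * d / T**2 - (omega1 - omega0) / T
    beta = -2 * d / T**3 + (omega1 - omega0) / T**2
    return alpha, beta, M

# Hypothetical peak frequencies (Hz) on two successive frames.
matches, deaths, births = match_tracks(
    [200.0, 400.0, 610.0], [205.0, 395.0, 800.0, 1000.0], max_jump_hz=50.0)

# Hypothetical boundary values: phases in radians, frequencies in rad/s, T in s.
theta0, omega0 = 0.3, 2 * math.pi * 200
theta1, omega1 = 1.1, 2 * math.pi * 210
T = 0.01
alpha, beta, M = cubic_phase_coeffs(theta0, omega0, theta1, omega1, T)
end_phase = theta0 + omega0 * T + alpha * T**2 + beta * T**3
end_freq = omega0 + 2 * alpha * T + 3 * beta * T**2
```

In this sketch an index listed in `deaths` corresponds to a track that would be ramped to zero amplitude, and an index listed in `births` to a track that would be started from zero amplitude, with amplitudes of matched tracks interpolated linearly across the frame.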
In speech coding applications, U.S. Ser. No. 712,866 teaches that pitch estimates can be used to establish a set of harmonic frequency bins to which the frequency components are assigned. (Pitch is used herein to mean the fundamental rate at which a speaker's vocal cords are vibrating.) The amplitudes of the components are coded directly using adaptive differential pulse code modulation (ADPCM) across frequency or indirectly using linear predictive coding. In each harmonic frequency bin, the peak having the largest amplitude is selected and assigned to the frequency at the center of the bin. This results in a harmonic series based upon the coded pitch period. The phases are then coded by using the frequencies to predict the phase at the end of the frame, unwrapping the measured phase with respect to this prediction, and then coding the phase residual using 4-5 bits per phase peak.
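The harmonic binning and phase-residual coding described above can be sketched as follows. The bin boundaries (each bin centered on a multiple of the pitch and one pitch wide), the uniform residual quantizer, the 4 kHz bandwidth, and all names are assumptions made for illustration; the referenced application may define these details differently.

```python
import math

def quantize_to_harmonics(peaks, pitch_hz, bandwidth_hz):
    """Keep the largest-amplitude peak per harmonic bin, moved to the bin center.

    peaks is a list of (frequency_hz, amplitude) pairs.
    """
    n_harmonics = int(bandwidth_hz // pitch_hz)
    best = {}
    for freq, amp in peaks:
        k = int(freq / pitch_hz + 0.5)  # nearest harmonic number
        if 1 <= k <= n_harmonics and amp > best.get(k, 0.0):
            best[k] = amp
    return [(k * pitch_hz, best[k]) for k in sorted(best)]

def wrap(phase):
    """Wrap a phase in radians into the interval [-pi, pi)."""
    return (phase + math.pi) % (2 * math.pi) - math.pi

def code_phase_residual(prev_phase, freq_hz, frame_s, measured_phase, bits):
    """Predict the end-of-frame phase, then uniformly quantize the residual.

    Returns (index, decoded_phase); the uniform quantizer is an assumption.
    """
    predicted = prev_phase + 2 * math.pi * freq_hz * frame_s
    residual = wrap(measured_phase - predicted)
    levels = 2 ** bits
    step = 2 * math.pi / levels
    index = int((residual + math.pi) / step) % levels
    decoded = wrap(predicted + (index + 0.5) * step - math.pi)
    return index, decoded

# Hypothetical measured peaks (Hz, amplitude) and a 100 Hz pitch estimate.
harmonics = quantize_to_harmonics(
    [(98.0, 1.0), (104.0, 0.4), (210.0, 0.7), (333.0, 0.2)],
    pitch_hz=100.0, bandwidth_hz=4000.0)

# Hypothetical phase values for the first harmonic over a 20 ms frame, 4 bits.
measured = 0.5
index, decoded = code_phase_residual(0.2, 100.0, 0.02, measured, bits=4)
phase_error = abs(wrap(decoded - measured))
```

In this example the two peaks near 100 Hz compete for the first bin and only the larger survives, so the coded frequencies form a harmonic series determined by the pitch alone.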
When the above-described techniques are used at low data rates (i.e., 4.8 kilobits per second or less), there can sometimes be insufficient bits to code the amplitude information, especially for low-pitched speakers. Similarly, at low data rates, there can be insufficient bits available to code all the phase information. There exists a need for better methods and devices for coding acoustic waveforms, particularly for coding speech at low data rates.
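The bit shortage for low-pitched speakers can be illustrated with rough arithmetic. The 4 kHz speech bandwidth, the rate of 50 frames per second, and the two example pitch values below are assumptions chosen only for illustration; the figure of 4 bits per phase is the lower end of the range stated above.

```python
def bits_per_frame(bit_rate_bps, frames_per_second):
    """Total bits available for each frame at a given channel rate."""
    return bit_rate_bps // frames_per_second

def harmonic_count(pitch_hz, bandwidth_hz):
    """Number of harmonics of the pitch that fall within the bandwidth."""
    return bandwidth_hz // pitch_hz

budget = bits_per_frame(4800, 50)             # bits per frame at 4.8 kb/s
low_pitch_harmonics = harmonic_count(80, 4000)    # assumed low-pitched speaker
high_pitch_harmonics = harmonic_count(250, 4000)  # assumed high-pitched speaker

phase_bits_low = 4 * low_pitch_harmonics      # 4 bits per phase
phase_bits_high = 4 * high_pitch_harmonics

print(budget, low_pitch_harmonics, phase_bits_low)
print(budget, high_pitch_harmonics, phase_bits_high)
```

Under these assumptions the low-pitched speaker would need 200 bits per frame for phases alone against a budget of 96 bits, whereas the high-pitched speaker would need 64 bits, leaving some room for pitch and amplitude information.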