The field of this invention is speech technology generally and, in particular, methods and devices for analyzing, digitally encoding and synthesizing speech or other acoustic waveforms.
Systems for digital encoding and synthesis of speech are the subject of considerable present interest, particularly at rates compatible with existing transmission lines, which commonly carry digital information at 2.4-9.6 kilobits per second. At such rates, conventional systems based upon speech waveform modeling are inadequate for coding applications and yield poor quality speech transmission, even if linear predictive coding (LPC) and other efficient coding techniques are used.
Typically, the problem of representing speech signals is approached by using a speech production model in which speech is viewed as the result of passing a glottal excitation waveform through a time-varying, linear filter that models the resonant characteristics of the vocal tract. In a so-called "binary excitation model," it is assumed that the glottal excitation can be in one of two possible states corresponding to voiced or unvoiced speech.
In the voiced speech state, the excitation is periodic with a period which is allowed to vary slowly over time relative to the analysis frame rate, typically 10-20 msec. For the unvoiced speech state, the glottal excitation is modeled as random noise with a flat spectrum. In both cases, the power level in the excitation is also considered to be slowly time-varying.
While this binary model has been used successfully to design narrowband vocoders and speech synthesis systems, its limitations are well known. For example, the speech excitation is often mixed, having both voiced and unvoiced components simultaneously, and often only portions of the spectrum are truly harmonic. Additionally, the binary model requires that each frame of data be classified as either voiced or unvoiced, a decision which is difficult to make if the speech is subject to additive acoustic noise.
The above-referenced parent application, U.S. Ser. No. 712,866, discloses an alternative to the binary excitation model in which speech analysis and synthesis, as well as coding, can be accomplished simply and effectively by employing a time-frequency representation of the speech waveform which is independent of the speech state. In particular, a sinusoidal model for the speech waveform is utilized to develop a new analysis and synthesis method.
The basic method of U.S. Ser. No. 712,866 includes the steps of (i) selecting frames--i.e. windows of approximately 20-60 milliseconds--of samples from the waveform; (ii) analyzing each frame of samples to extract a set of frequency components; (iii) tracking the components from one frame to the next; and (iv) interpolating the values of the components from one frame to the next to obtain a parametric representation of the waveform. A synthetic waveform can then be constructed by generating a set of sine waves corresponding to the parametric representation. The disclosures of U.S. Ser. No. 712,866 are incorporated herein by reference.
In one illustrated embodiment described in detail in U.S. Ser. No. 712,866, the basic method is utilized to select amplitudes, frequencies and phases corresponding to the largest peaks in a periodogram of the measured signal, independently of the speech state. In order to reconstruct the speech waveform, the amplitudes, frequencies and phases of the sine waves estimated on one frame are matched and allowed to continuously evolve into the corresponding parameter set on the next frame.
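The peak-selection step can be illustrated with a minimal Python sketch. It is not the implementation of U.S. Ser. No. 712,866: the function name `pick_peaks`, the rectangular analysis window, the direct (non-fast) DFT and the amplitude scaling are all assumptions made for illustration, and a practical analyzer would use a tapered window and an FFT.

```python
import cmath
import math

def pick_peaks(frame, sample_rate, max_peaks):
    """Return (amplitude, frequency_hz, phase) for the largest spectral peaks.

    Hypothetical helper: rectangular window and direct DFT, for illustration only.
    """
    n = len(frame)
    # Direct DFT over the non-negative frequency bins (O(n^2), fine for a sketch).
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]
    mags = [abs(c) for c in spectrum]
    peaks = []
    for k in range(1, len(mags) - 1):
        # A peak is a local maximum of the magnitude spectrum.
        if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]:
            amp = 2.0 * mags[k] / n  # exact only for a bin-centred sinusoid
            peaks.append((amp, k * sample_rate / n, cmath.phase(spectrum[k])))
    # Keep the largest peaks, then report them in order of frequency.
    peaks.sort(key=lambda p: p[0], reverse=True)
    return sorted(peaks[:max_peaks], key=lambda p: p[1])
```

The selection depends only on the magnitude spectrum, so no voiced/unvoiced decision is needed before the peaks are chosen.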
Because the number of estimated peaks is not constant and is slowly varying, the matching process is not straightforward. Rapidly varying regions of speech, such as unvoiced/voiced transitions, can result in large changes in both the location and number of peaks.
To account for such rapid movements in spectral energy, the concept of "birth" and "death" of sinusoidal components is employed in a nearest-neighbor matching method based on the frequencies estimated on each frame. If a new peak appears, a "birth" is said to occur and a new track is initiated. If an old peak is not matched, a "death" is said to occur and the corresponding track is allowed to decay to zero.
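A simplified Python sketch of frequency matching with births and deaths follows. The greedy closest-pair-first strategy, the function name `match_tracks` and the threshold `max_delta_hz` are assumptions for illustration; the matching procedure actually used in U.S. Ser. No. 712,866 may resolve competing candidates differently.

```python
def match_tracks(prev_freqs, next_freqs, max_delta_hz):
    """Greedy nearest-neighbour matching of peak frequencies between two frames.

    Returns (matches, deaths, births):
      matches maps an index in prev_freqs to an index in next_freqs,
      deaths lists unmatched indices of prev_freqs,
      births lists unmatched indices of next_freqs.
    """
    # All candidate pairs within the allowed frequency jump, closest first.
    candidates = sorted(
        (abs(f_next - f_prev), i, j)
        for i, f_prev in enumerate(prev_freqs)
        for j, f_next in enumerate(next_freqs)
        if abs(f_next - f_prev) <= max_delta_hz
    )
    matches = {}
    used_next = set()
    for _, i, j in candidates:
        # Each peak may belong to at most one track.
        if i not in matches and j not in used_next:
            matches[i] = j
            used_next.add(j)
    deaths = [i for i in range(len(prev_freqs)) if i not in matches]
    births = [j for j in range(len(next_freqs)) if j not in used_next]
    return matches, deaths, births
```

With `prev_freqs = [500, 1000, 1500]`, `next_freqs = [510, 1490, 2200]` and a 50 Hz threshold, the 1000 Hz component has no successor (a death) and the 2200 Hz component has no predecessor (a birth).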
Once the parameters on successive frames have been matched, phase continuity of each sinusoidal component is ensured by unwrapping the phase. In one embodiment described in U.S. Ser. No. 712,866, the phase is unwrapped using a cubic phase interpolation function having parameter values that are chosen to satisfy the measured phase and frequency constraints at the frame boundaries while maintaining maximal smoothness over the frame duration.
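The cubic phase interpolation can be sketched as follows. The formulas are not given in the present text; the sketch assumes the commonly published form of the interpolator, theta(t) = theta0 + omega0*t + alpha*t^2 + beta*t^3, in which an integer M of added 2*pi turns is chosen to make the phase track as smooth as possible. The function names are hypothetical.

```python
import math

def cubic_phase_coeffs(theta0, omega0, theta1, omega1, T):
    """Coefficients of theta(t) = theta0 + omega0*t + alpha*t**2 + beta*t**3.

    theta0, omega0: measured phase (rad) and frequency (rad/s) at the frame start.
    theta1, omega1: measured phase and frequency at the frame end, T seconds later.
    Returns (alpha, beta, M), where M is the number of 2*pi turns added to theta1.
    """
    # Integer closest to the real-valued unwrapping term that gives the smoothest track.
    x = ((theta0 + omega0 * T - theta1) + (omega1 - omega0) * T / 2.0) / (2.0 * math.pi)
    M = round(x)
    # Phase mismatch left after linear extrapolation from the frame start.
    d = theta1 + 2.0 * math.pi * M - theta0 - omega0 * T
    alpha = 3.0 * d / T**2 - (omega1 - omega0) / T
    beta = -2.0 * d / T**3 + (omega1 - omega0) / T**2
    return alpha, beta, M

def cubic_phase(theta0, omega0, alpha, beta, t):
    """Evaluate the interpolated (unwrapped) phase t seconds into the frame."""
    return theta0 + omega0 * t + alpha * t**2 + beta * t**3
```

By construction the polynomial equals theta0 with slope omega0 at t = 0, and equals theta1 + 2*pi*M with slope omega1 at t = T, which are the frame-boundary constraints described above.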
In the final step of the illustrated embodiment, the corresponding sinusoidal amplitudes are interpolated in a linear manner across each frame.
In speech coding applications, U.S. Ser. No. 712,866 teaches that pitch estimates can be used to establish a set of harmonic frequency bins to which frequency components are assigned. The term "pitch" is used herein to denote the fundamental rate at which a speaker's vocal cords are vibrating. The amplitudes of the components are coded directly using adaptive differential pulse code modulation (ADPCM) across frequency, or indirectly using linear predictive coding (LPC).
In one embodiment of the coder, the peak in each harmonic frequency bin having the largest amplitude is selected and assigned to the frequency at the center of the bin. This results in a harmonic series based upon the coded pitch period. An amplitude envelope can then be constructed by connecting the resulting set of peaks and later sampled in a pitch-adaptive fashion (either linearly or non-linearly) to provide efficient coding at various bit rates. The phases can then be coded by measuring the phases of the edited peaks and then coding such phases using 4 to 5 bits per phase peak. Further details on coding acoustic waveforms in accordance with applicants' sinusoidal analysis techniques can be found in commonly-owned, copending U.S. patent application Ser. No. 034,097, entitled "Coding of Acoustic Waveforms," incorporated herein by reference.
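The bin-assignment step of this embodiment can be sketched in Python as shown below. The sketch assumes that each bin is one pitch interval wide and centered on a harmonic of the pitch; the bin boundaries, the function name `assign_to_harmonic_bins` and the `max_freq_hz` cutoff are illustrative assumptions rather than details taken from the referenced applications.

```python
import math

def assign_to_harmonic_bins(peaks, pitch_hz, max_freq_hz):
    """Keep the largest peak in each harmonic bin and move it to the bin centre.

    peaks: iterable of (amplitude, frequency_hz) pairs.
    Returns {harmonic_number: (amplitude, centre_frequency_hz)}.
    """
    best = {}
    for amp, freq in peaks:
        # Nearest harmonic number; bin k spans (k - 0.5)*pitch to (k + 0.5)*pitch.
        k = math.floor(freq / pitch_hz + 0.5)
        if k < 1 or k * pitch_hz > max_freq_hz:
            continue  # below the first bin or above the coded bandwidth
        if k not in best or amp > best[k][0]:
            best[k] = (amp, k * pitch_hz)
    return best
```

The retained amplitudes at the harmonic frequencies are the points through which an amplitude envelope would then be drawn.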
Analysis/synthesis systems constructed according to the invention disclosed in U.S. Ser. No. 712,866, based on a sinusoidal representation of speech, yield synthetic speech that is essentially indistinguishable from the original. Coding techniques as disclosed in U.S. Ser. No. 034,097 have led to the realization of multi-rate coders operating at rates from 2.4 to 9.6 kilobits per second. Such systems produce synthetic speech that is very intelligible at all rates and, in general, produce speech having progressively improving quality as the data rate is increased.
A practical limitation of the sinusoidal technique has been the computational complexity required to perform the sinusoidal synthesis. This complexity results because it is typically necessary to generate each sine wave on a per-sample basis and then sum the resulting set of sine waves. Good performance can be achieved in sinusoidal analysis/synthesis while operating at a 50 Hz frame rate, provided that the sine wave frequencies are matched from frame to frame and that either cubic phase or piece-wise quadratic phase interpolators are used to ensure consistency between the measured frequencies and phases at the frame boundaries. The disadvantage of this approach is the computational overhead associated with the interpolation process. Even if very powerful 125 nanosecond/cycle microprocessors are utilized, such as the ADSP2100 DSP integrated circuits manufactured by Analog Devices (Norwood, Mass.), two such microprocessors typically are required to synthesize 80 sine waves.
An alternative method for performing sinusoidal synthesis includes constructing a set of sine waves having constant amplitudes and frequencies and linearly-varying phases, applying a triangular window of twice the frame size, and then utilizing an overlap-and-add technique in conjunction with the sine waves generated on the previous frame. Such a set of sine waves can also be generated using conventional Fast Fourier Transform (FFT) methods. In this approach, an FFT buffer is filled with non-zero entries at the sine wave frequencies, an inverse FFT is executed, and then the overlap-and-add technique is applied. This process also leads to synthetic speech that is perceptually indistinguishable from the original, provided the frame rate is approximately 100 Hz (10 ms/frame).
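The time-domain form of this overlap-and-add synthesis can be sketched in Python as follows. The sketch generates the sine waves directly rather than by an inverse FFT, and the function names, the parameter layout and the choice of the window center as the time origin for each frame's phases are assumptions made for illustration.

```python
import math

def triangular_window(frame_len):
    """Triangle spanning two frames; copies shifted by one frame sum to one."""
    return [1.0 - abs(n - frame_len) / frame_len for n in range(2 * frame_len)]

def overlap_add_synthesis(frames, frame_len, sample_rate):
    """Overlap-add synthesis from per-frame sine wave parameters.

    frames: list with one entry per frame, each a list of
            (amplitude, frequency_hz, phase_rad) referenced to the window centre.
    Returns a list of (len(frames) + 1) * frame_len output samples.
    """
    out = [0.0] * ((len(frames) + 1) * frame_len)
    window = triangular_window(frame_len)
    for m, params in enumerate(frames):
        start = m * frame_len  # consecutive windows overlap by one frame
        for n in range(2 * frame_len):
            t = (n - frame_len) / sample_rate  # time relative to the window centre
            # Constant amplitude and frequency, hence linearly varying phase.
            s = sum(a * math.cos(2.0 * math.pi * f * t + p) for a, f, p in params)
            out[start + n] += window[n] * s
    return out
```

For a stationary sine wave whose per-frame phases are mutually consistent, the overlapped windows sum to one and the waveform is reproduced exactly. When the parameters change between frames, each output sample is a cross-fade between two fixed-parameter segments spanning two frame lengths in total.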
However, for low-rate coding applications, it is necessary to operate at a 50 Hz frame rate (20 ms/frame) or lower. At these frame rates, the FFT overlap-and-add method yields synthetic speech that sounds "rough" because the triangular parametric window is at least 40 ms wide, and this is too long a period compared to the rate of change of the vocal tract and vocal cord articulators.
An apparatus for computationally efficient coding of acoustic waveforms at frame rates of 50 Hz or less, without the "roughness" produced at low coding rates by the above-described methods, would meet a substantial need. In particular, speech processing devices and methods that reduce frame-to-frame discontinuities at low coding rates would be advantageous for coding of speech.
Accordingly, there exists a need for computationally efficient methods and devices for synthesizing sine waves for speech coding, analysis and synthesis systems which operate at low coding rates requiring frame rates of 50 Hz and below. In particular, techniques and apparatus for efficient synthesis of sine waves in connection with sinusoidal transform coding would satisfy long-felt needs and provide substantial contributions to the art.