This document relates to methods and systems for estimation of speech model parameters.
Speech models together with speech analysis and synthesis methods are widely used in applications such as telecommunications, speech recognition, speaker identification, and speech synthesis. Vocoders are a class of speech analysis/synthesis systems based on an underlying model of speech and have been extensively used in practice. Examples of vocoders include linear predication vocoders, homomorphic vocoders, channel vocoders, sinusoidal transform coders (STC), multiband excitation (MBE) vocoders, improved multiband excitation (IMBE™), and advanced multiband excitation vocoders (AMBE™).
Vocoders typically model speech over a short interval of time as the response of a system excited by some form of excitation. Generally, an input signal s(n) is obtained by sampling an analog input signal. For applications such as speech coding or speech recognition, the sampling rate commonly ranges between 6 kHz and 16 kHz. The method works well for any sampling rate with corresponding changes in the associated parameters. To focus on a short interval centered at time t, the input signal s(n) can be multiplied by a window w(t,n) centered at time t to obtain a windowed signal sw(t,n). The window used is typically a Hamming window or Kaiser window which can have a constant shape as a function of t so that w(t,n)=w(n−t) or can have characteristics which change as a function of t. The length of the window w(t,n) generally ranges between 5 ms and 40 ms. The windowed signal sw(t,n) can be computed at center times of t0, t1, . . . tm, tm+1. Typically, the interval between consecutive center times tm+1 . . . tm approximates the effective length of the window w(t,n) used for these center times. The windowed signal sw(t,n) for a particular center time is often referred to as a segment or frame of the input signal.
For each segment of the input signal, system parameters and excitation parameters are determined. The system parameters typically consist of the spectral envelope or the impulse response of the system. The excitation parameters typically consist of a fundamental frequency (or pitch period) and a voiced/unvoiced (V/UV) parameter which indicates whether the input signal has pitch (or indicates the degree to which the input signal has pitch). For vocoders such as MBE, IMBE, and AMBE, the input signal is divided into frequency bands and the excitation parameters may also include a V/UV decision for each frequency band. High quality speech reproduction may be provided using a high quality speech model, an accurate estimation of the speech model parameters, and high quality synthesis methods.
When the voiced/unvoiced information consists of a single voiced/unvoiced decision for the entire frequency band, the synthesized speech tends to have a “buzzy” quality that is especially noticeable in regions of speech which contain mixed voicing or in voiced regions of noisy speech. A number of mixed excitation models have been proposed as potential solutions to the problem of “buzziness” in vocoders. In these models, periodic and noise-like excitations which have either time-invariant or time-varying spectral shapes are mixed.
In excitation models having time-invariant spectral shapes, the excitation signal consists of the sum of a periodic source and a noise source with fixed spectral envelopes. The mixture ratio controls the relative amplitudes of the periodic and noise sources. Examples of such models are described by Itakura and Saito, “Analysis Syntheses Telephony Based upon the Maximum Likelihood Method,” Reports of 6th Int. Cong Acoust., Tokyo, Japan, Paper C-5-5, pp. C17-20, 1968; and Kwon and Goldberg, “An Enhanced LPC Vocoder with No Voiced/Unvoiced Switch,” IEEE Trans. on Acoust., Speech, and Signal Processing, vol. ASSP-32, no. 4, pp. 851-858, August 1984. In these excitation models, a white noise source is added to a white periodic source. The mixture ratio between these sources is estimated from the height of the peak of the autocorrelation of the LPC residual.
In excitation models having time-varying spectral shapes, the excitation signal consists of the sum of a periodic source and a noise source with time varying spectral envelope shapes. Examples of such models are described by Fujimara, “An Approximation to Voice Aperiodicity,” IEEE Trans. Audio and Electroacoust, pp. 68-72, March 1968; Makhoul et al, “A Mixed-Source Excitation Model for Speech Compression and Synthesis,” IEEE Int. Conf. on Acoust. Sp. & Sig. Proc., April 1978, pp. 163-166; Kwon and Goldberg, “An Enhanced LPC Vocoder with No Voiced/Unvoiced Switch,” IEEE Trans. on Acoust., Speech and Signal Processing, vol. ASSP-32, no. 4, pp. 851-858, August 1984; and Griffin and Lim, “Multiband Excitation Vocoder,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-36, pp. 1223-1235, August 1988.
In the excitation model proposed by Fujimara, the excitation spectrum is divided into three fixed frequency bands. A separate cepstral analysis is performed for each frequency band and a voiced/unvoiced decision for each frequency band is made based on the height of the cepstram peak as a measure of periodicity.
In the excitation model proposed by Makhoul et al., the excitation signal consists of the sum of a low-pass periodic source and a high-pass noise source. The low-pass periodic source is generated by filtering a white pulse source with a variable cut-off low-pass filter. Similarly, the high-pass noise source is generated by filtering a white noise source with a variable cut-off high-pass filter. The cut-off frequencies for the two filters are equal and are estimated by choosing the highest frequency at which the spectrum is periodic. Periodicity of the spectrum is determined by examining the separation between consecutive peaks and determining whether the separations are the same, within some tolerance level.
In a second excitation model implemented by Kwon and Goldberg, a pulse source is passed through a variable gain low-pass filter and added to itself, and a white noise source is passed through a variable gain high-pass filter and added to itself. The excitation signal is the sum of the resultant pulse and noise sources with the relative amplitudes controlled by a voiced/unvoiced mixture ratio. The filter gains and voiced/unvoiced mixture ratio are estimated from the LPC residual signal with the constraint that the spectral envelope of the resultant excitation signal is flat.
In the multiband excitation model proposed by Griffin and Lim, a frequency dependent voiced/unvoiced mixture function is proposed. This model is restricted to a frequency dependent binary voiced/unvoiced decision for coding purposes. A further restriction of this model divides the spectrum into a finite number of frequency bands with a binary voiced/unvoiced decision for each band. The voiced/unvoiced information is estimated by comparing the speech spectrum to the closest periodic spectrum. When the error is below a threshold, the band is marked voiced, otherwise, the band is marked unvoiced.
In U.S. Pat. No. 6,912,495, “Speech Model and Analysis, Synthesis, and Quantization Methods” the multiband excitation model is augmented beyond the time and frequency dependent voiced/unvoiced mixture function to allow a mixture of three different signals. In addition to parameters which control the proportion of quasi-periodic and noise-like signals in each frequency band, a parameter is added to control the proportion of pulse-like signals in each frequency band. In addition to the typical fundamental frequency parameter of the voiced excitation, parameters are included which control one or more pulse amplitudes and positions for the pulsed excitation. This model allows additional features of speech and audio signals important for high quality reproduction to be efficiently modeled.
The Fourier transform of the windowed signal sw(t,n) will be denoted by Sw(t,w) and will be referred to as the signal Short-Time Fourier Transform (STFT). Suppose s(n) is a periodic signal with a fundamental frequency w0 or pitch period n0. The parameters w0 and n0 are related to each other by 2π/w0=n0. Non-integer values of the pitch period n0 are often used in practice.
A speech signal s(n) can be divided into multiple frequency bands or channels using bandpass filters. Characteristics of these bandpass filters are allowed to change as a function of time and/or frequency. A speech signal can also be divided into multiple bands by applying frequency windows or weightings to the speech signal STFT Sw(t,w).