Speech is traditionally modeled in a manner that mimics the human vocal tract. Such traditional models view speech as originating from two excitation signals: a voiced speech excitation signal and an unvoiced speech excitation signal. Each excitation signal can be convolved with the impulse response of a filter to produce a synthesized speech signal. FIG. 1 illustrates synthesis in the traditional speech model. The voiced excitation signal 12 and the unvoiced excitation signal 14 are applied to an LPC filter 10 to produce synthetic speech 16.
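The synthesis path described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the filter is taken to be all-pole, the single coefficient in `A` is a hypothetical value chosen for readability, the voiced excitation is an idealized impulse train, and the unvoiced excitation is uniform noise. None of these choices, and none of the function names, come from the text.

```python
import random


def all_pole_filter(excitation, a):
    """Apply H(z) = 1 / (1 + a[0] z^-1 + ... + a[p-1] z^-p) to a sample list.

    Difference equation: y[n] = x[n] - sum_k a[k-1] * y[n-k].
    """
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y


def voiced_excitation(num_samples, pitch_period):
    """Idealized voiced excitation: one unit pulse per pitch period."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]


def unvoiced_excitation(num_samples, seed=0):
    """Idealized unvoiced excitation: uniform noise (an assumption)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]


# Hypothetical coefficient set: a single pole at z = 0.9.
A = [-0.9]

voiced = all_pole_filter(voiced_excitation(16, pitch_period=8), A)
unvoiced = all_pole_filter(unvoiced_excitation(16), A)
```

Each pulse of the voiced excitation starts a decaying copy of the filter's impulse response, which is the convolution referred to above.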
For convenience, models of speech analysis and synthesis are generally represented as mathematical formulas. In particular, the voiced excitation signal, the unvoiced excitation signal, and the resulting speech signal are often each represented as a series of time-varying samples of their respective analog waveforms. The filter, in turn, is viewed as a transform that operates upon the series of samples. A frequency domain representation of the filter can be obtained by using a z transform. When such a z transform is employed, the filter can usually be represented as a transfer function, H(z). This transfer function equals the z transform of the output signal, Y(z), divided by the z transform of the input signal, X(z). In equation form, the transfer function can be represented as
H(z) = Y(z)/X(z)
where
Y(z)=z transform of the output signal;
X(z)=z transform of the input signal.
The z transform of the input signal and the z transform of the output signal can be represented as polynomials. The resulting transfer function H(z) can be represented as the product of factors of polynomials. In particular, when so represented ##EQU1## where M,N=lengths of the respective sequences;
The roots of the factors of the numerator are known as zeros, and the roots of the factors of the denominator are known as poles.
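As a worked illustration of poles and zeros, the following sketch evaluates a small transfer function written as a ratio of polynomials in z^-1. The coefficients `NUM` and `DEN` are hypothetical values chosen so that the roots are easy to check by hand; they are not taken from the text, and the example does not attempt to reproduce the factored equation referred to above.

```python
import cmath


def poly_in_z_inverse(coeffs, z):
    """Evaluate c[0] + c[1] z^-1 + c[2] z^-2 + ... at the point z."""
    zi = 1 / z
    return sum(c * zi ** k for k, c in enumerate(coeffs))


def transfer_function(z, num, den):
    """H(z) as the ratio of numerator and denominator polynomials in z^-1."""
    return poly_in_z_inverse(num, z) / poly_in_z_inverse(den, z)


def quadratic_roots(b, c):
    """Roots of z^2 + b z + c (complex roots allowed)."""
    d = cmath.sqrt(b * b - 4 * c)
    return (-b + d) / 2, (-b - d) / 2


# Hypothetical filter: H(z) = (1 - 0.5 z^-1) / (1 - 1.2 z^-1 + 0.72 z^-2)
NUM = [1.0, -0.5]
DEN = [1.0, -1.2, 0.72]

# Multiplying numerator and denominator by powers of z gives ordinary
# polynomials in z, whose roots are the zeros and poles.
zero = 0.5                            # root of z - 0.5
poles = quadratic_roots(-1.2, 0.72)   # roots of z^2 - 1.2 z + 0.72
```

Here H(z) vanishes at the zero and its denominator vanishes at each pole. Both poles lie inside the unit circle, which is the usual stability condition for a causal filter.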
Filters may be used to obtain a parametric representation of the speech signal, as opposed to a representation that attempts to duplicate the analog waveform of the speech signal. Linear Predictive Coding (LPC) is one technique for obtaining such a parametric representation. LPC speech synthesis as originally devised sought to operate on two separate excitation signals. The first excitation signal represented the voiced speech component and had only a single pulse per pitch period. The other excitation signal represented unvoiced speech and was not limited with regard to the number of pulses per pitch period. In fact, the unvoiced excitation signal typically had several pulses per pitch period.
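To make the idea of a parametric representation concrete, the sketch below estimates LPC coefficients from a block of samples. The text does not say how the coefficients are obtained; the autocorrelation method with the Levinson-Durbin recursion is used here as an assumption because it is one common choice. The function names and the test signal are illustrative only.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum_n x[n] * x[n - lag] for lag = 0..max_lag."""
    return [
        sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r.

    Returns (a, err), where a holds a[1..order] of the prediction error
    filter A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order, and err is the
    final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                       # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


# Illustrative signal: impulse response of 1 / (1 - 0.9 z^-1).
signal = [0.9 ** n for n in range(200)]
lpc, residual_energy = levinson_durbin(autocorrelation(signal, 2), order=2)
```

For this signal the first coefficient comes out close to -0.9 and the second close to zero, so a handful of filter parameters describe the whole block of samples, and the synthesis filter is 1/A(z).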
One of the primary difficulties with the traditional single pulse model for LPC, when applied to voiced speech, was its simplifying assumption that there is only one pulse per pitch period in voiced speech. It is known, however, that there is generally secondary excitation within each pitch period of voiced speech. Because of this inaccuracy in the model, speech synthesized by filters devised under the traditional model has proven to be unnatural sounding. In response to this problem, Atal and Remde proposed an LPC model that operates on multiple pulses per pitch period and thereby accounts for the secondary excitation. This model has become known as the multipulse model.
The multipulse model makes no a priori assumption about the nature of the excitation signal. Each frame of speech is modeled by its LPC filter and a fixed number of pulses. As a result, the critical estimate of the pitch period of the excitation signal that the single pulse model requires is no longer necessary. The result of Atal and Remde's innovation has been a model and filters that produce more natural sounding speech.
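The idea of modeling a frame by its LPC filter plus a fixed number of pulses can be sketched as follows. This is a simplified greedy search written for illustration, not the procedure of Atal and Remde: it omits perceptual weighting, places pulses one at a time, and never re-optimizes earlier amplitudes. The filter coefficient, frame length, and function names are all assumptions.

```python
def impulse_response(a, length):
    """Impulse response of the all-pole filter 1 / (1 + sum_k a[k-1] z^-k)."""
    h = []
    for n in range(length):
        acc = 1.0 if n == 0 else 0.0
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * h[n - k]
        h.append(acc)
    return h


def synthesize(pulses, h, frame_len):
    """Sum of shifted, scaled impulse responses, one per (position, amplitude)."""
    out = [0.0] * frame_len
    for pos, amp in pulses:
        for n in range(pos, frame_len):
            out[n] += amp * h[n - pos]
    return out


def greedy_multipulse(target, h, num_pulses):
    """Pick num_pulses pulses, each chosen to minimize the squared error."""
    frame_len = len(target)
    residual = list(target)
    pulses = []
    for _ in range(num_pulses):
        best = None
        for pos in range(frame_len):
            corr = sum(residual[n] * h[n - pos] for n in range(pos, frame_len))
            energy = sum(h[n - pos] ** 2 for n in range(pos, frame_len))
            gain = corr * corr / energy      # error reduction for this position
            if best is None or gain > best[0]:
                best = (gain, pos, corr / energy)
        _, pos, amp = best
        pulses.append((pos, amp))
        for n in range(pos, frame_len):
            residual[n] -= amp * h[n - pos]
    return pulses, sum(r * r for r in residual)


FRAME_LEN = 40
h = impulse_response([-0.9], FRAME_LEN)   # hypothetical single-pole filter
target = synthesize([(5, 1.0), (25, 0.5)], h, FRAME_LEN)
pulses, error = greedy_multipulse(target, h, num_pulses=2)
```

No pitch estimate appears anywhere in the search: pulse positions and amplitudes are chosen purely by how much each reduces the error against the target frame.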
The multipulse model has typically employed an all-pole LPC filter. Such a filter, however, performs poorly when the modeled voiced segment has a mixture of minimum and non-minimum phase characteristics. In an attempt to remedy this problem, pole-zero filters have been substituted for the all-pole LPC filters.