In speech synthesis, the speech pattern is usually divided into frames of a few tens of ms. Conventional vocoders classify stationary speech phonemes as voiced or unvoiced speech (see, e.g., U.S. Pat. Nos. 2,151,091 and 2,243,526). In the art, a single phoneme is regarded as evolving over a time scale on the order of 100 ms. On each of the smaller windows, a linear filter (or digital circuit) is then designed to model the vocal tract. A voiced speech signal can be modeled, and regenerated, as a deterministic process obtained by passing a quasi-periodic signal containing the appropriate pitch frequencies through a linear filter. Analogously, an unvoiced speech signal is modeled, and regenerated, as a random signal obtained by passing white noise through the same linear filter, which models the vocal tract. In this time frame, the parameters characterizing the linear filter as an input/output device are identified using, for example, methods from linear predictive coding (LPC) filter design, and encoded for regeneration. For applications in cellular telephone communications using pitch-excited vocoders, in this same window, the speech pattern is segmented into an identified sequence, also encoded for regeneration, of voiced and unvoiced phonemes. In some popular forms of vocoders, for each unvoiced expression a code book, or look-up table, of white noise signals is searched for the signal which, when passed through the LPC filter, regenerates the response closest to the sampled unvoiced signal. The code for this signal is then transmitted for regeneration. A similar procedure is performed for voiced signals, with periodic pulse train signals in lieu of white noise. Here, however, the vocoder must also perform pitch detection in order to regenerate the voiced signal.
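The excitation-plus-filter model described above can be illustrated with a minimal sketch. It is not taken from any cited vocoder: the 8 kHz sampling rate, the 160-sample frame, the pitch period of 80 samples, the second-order filter coefficients, and all function names are assumptions chosen only for illustration.

```python
import math
import random

def all_pole_filter(excitation, a, gain=1.0):
    """Apply y[n] = gain*x[n] - sum_k a[k]*y[n-k], with a = [a1, ..., ap]."""
    y = []
    for n, x in enumerate(excitation):
        acc = gain * x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y

def pulse_train(num_samples, pitch_period):
    """Quasi-periodic excitation for voiced frames: one unit pulse per pitch period."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]

def white_noise(num_samples, seed=0):
    """Random excitation for unvoiced frames (seeded so the sketch is repeatable)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

# Hypothetical stable second-order all-pole filter:
# complex pole pair at radius 0.9 and angle pi/4.
r, theta = 0.9, math.pi / 4
a = [-2.0 * r * math.cos(theta), r * r]

FRAME = 160  # 20 ms at an assumed 8 kHz sampling rate

# The same filter is driven by two different excitations.
voiced = all_pole_filter(pulse_train(FRAME, 80), a)
unvoiced = all_pole_filter(white_noise(FRAME), a)
```

In this sketch only the excitation differs between the voiced and unvoiced cases; the filter coefficients are shared, which mirrors the description above. A code-book search would replace `white_noise` with a loop over stored candidate excitations.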
Linear Predictive Coding (LPC) can be used in a variety of different speech coders, such as pitch-excited vocoders, voice-excited vocoders, waveform coders, analysis-by-synthesis coders, and frequency-domain coders (see T. P. Barnwell III, K. Nayebi and C. H. Richardson, Speech Coding: A Computer Laboratory Textbook, John Wiley & Sons, New York, 1996, at 85), and the invention disclosed herein can be used in all these contexts and is not confined to a particular vocoder architecture. In fact, LPC filters, sometimes referred to as maximum entropy filters, have been used in the prior art in devices for such digital signal processing and speech synthesis (see, e.g., U.S. Pat. Nos. 4,209,836 and 5,048,088; D. Quarmby, Signal Processing Chips, Prentice Hall, 1994; and L. R. Rabiner, B. S. Atal, and J. L. Flanagan, Current Methods of Digital Speech Processing, Selected Topics in Signal Processing, S. Haykin, editor, Prentice Hall, 1989, 112-32).
In applications to automatic speaker recognition, a person's identity is determined from a voice sample. This class of problems comes in two types, namely speaker verification and speaker identification. In speaker verification, the person to be identified claims an identity, for example by presenting a personal smart card, and then speaks into an apparatus that will confirm or deny this claim. In speaker identification, on the other hand, the person makes no claim about his identity, does not present a smart card, and the system must decide the identity of the speaker, individually or as part of a group of enrolled people, or decide whether to classify the person as unknown. Common to both applications is that each person to be identified must first enroll in the system. The enrollment (or training) is a procedure in which the person's voice is recorded and the characteristic features are extracted and stored. A feature set which is commonly used in the art is the LPC coefficients for each frame of the speech signal, or some (nonlinear) transformation of these (see, e.g., J. M. Naik, Speaker Verification: A tutorial, IEEE Communications Magazine, January 1990, 42-48, at p. 43; J. P. Campbell Jr., Speaker Recognition: A tutorial, Proceedings of the IEEE 85 (1997), 1436-1462; S. Furui, Recent advances in Speaker Recognition, Lecture Notes in Computer Science 1206, 1997, 237-252, Springer-Verlag, at p. 239).
The circuit, or integrated circuit device, which implements the LPC filter is designed and fabricated using ordinary skill in the art of electronics (see, e.g., U.S. Pat. Nos. 4,209,836 and 5,048,088) on the basis of the specified parameters (specs) which appear as coefficients (linear prediction coefficients) in the mathematical description (transfer function) of the LPC filter. For example, the expression of the specified parameters (specs) is often conveniently displayed in the lattice filter representation of the circuit shown in FIG. 1, containing unit delays z^{-1}, summing junctions, and gains.
This is also known as a PARCOR system. The gain (PARCOR) parameters, which are also the reflection coefficients of the random signal (as in FIG. 1), are easily determined from the speech waveform. The design of the associated circuit is immediate with ordinary skill in the art of electronics. In fact, this filter design has been fabricated by Texas Instruments, starting from the lattice filter representation, and is used in the LPC speech synthesizer chips TMS 5100, 5200, 5220 (see, e.g., Quarmby, Signal Processing Chips, supra, at 27-29).
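As an illustration of how the reflection (PARCOR) coefficients may be obtained from a frame and then used as lattice gains, the following is a minimal sketch of the standard Levinson-Durbin recursion together with an all-pole lattice synthesis filter. It is not a description of the TMS 5100 family or of FIG. 1. The function names and the sign convention A(z) = 1 + a1 z^{-1} + ... + ap z^{-p} are assumptions; some texts define the PARCOR coefficients with the opposite sign.

```python
def autocorrelation(frame, max_lag):
    """Raw autocorrelation lags r[0..max_lag] of one speech frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (a, k, err): predictor polynomial [1, a1, ..., ap],
    reflection coefficients [k1, ..., kp], and final prediction error."""
    a = [1.0] + [0.0] * order
    k = []
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        km = -acc / err
        k.append(km)
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + km * prev[m - i]
        a[m] = km
        err *= (1.0 - km * km)
    return a, k, err

def lattice_synthesis(excitation, k):
    """All-pole lattice filter built from unit delays, sums, and gains k."""
    order = len(k)
    b_prev = [0.0] * order  # b_prev[m] holds the delayed backward signal b_m[n-1]
    out = []
    for x in excitation:
        f = x  # forward signal entering the top stage
        b_new = [0.0] * order
        for m in range(order, 0, -1):
            f = f - k[m - 1] * b_prev[m - 1]             # f_{m-1}[n]
            if m < order:
                b_new[m] = k[m - 1] * f + b_prev[m - 1]  # b_m[n]
        if order:
            b_new[0] = f  # b_0[n] equals the output sample
        b_prev = b_new
        out.append(f)
    return out

# Hypothetical autocorrelation lags of a first-order process (r[m] = 0.5**m).
a, k, err = levinson_durbin([1.0, 0.5, 0.25], 2)
```

For these assumed lags the recursion yields a first reflection coefficient of -0.5 and a second of zero, so the second lattice stage contributes nothing. In practice the lags would come from `autocorrelation` applied to a windowed frame.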
Two advantages of LPC filter design are that it is possible to find parameter specs so that the LPC filter produces a signal which reproduces much of the observed spectral properties, and that there exist algorithms for finding the filter parameters from the spectral properties of the observed speech waveform. FIG. 2 shows a periodogram determined from a frame of speech data together with the power spectral density of a 6th order LPC filter designed from this frame.
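The two curves compared in FIG. 2 can be computed, in principle, as in the following minimal sketch. It does not reproduce the data of FIG. 2. The function names, the normalization of the periodogram by the frame length, and the first-order example coefficients are assumptions made for illustration.

```python
import cmath
import math

def periodogram(frame, omega):
    """Periodogram of one frame at angular frequency omega (radians/sample)."""
    n = len(frame)
    s = sum(x * cmath.exp(-1j * omega * t) for t, x in enumerate(frame))
    return abs(s) ** 2 / n

def lpc_psd(a, gain_sq, omega):
    """Power spectral density gain_sq / |A(e^{j omega})|^2 of an all-pole
    filter with predictor polynomial a = [1, a1, ..., ap]."""
    A = sum(ak * cmath.exp(-1j * omega * k) for k, ak in enumerate(a))
    return gain_sq / abs(A) ** 2

# Hypothetical first-order LPC model: A(z) = 1 - 0.5 z^-1, error power 0.75.
a_example = [1.0, -0.5]
grid = [math.pi * i / 8 for i in range(9)]
psd_curve = [lpc_psd(a_example, 0.75, w) for w in grid]
```

Because the denominator polynomial of a stable all-pole model has no roots on the unit circle, every value in `psd_curve` is strictly positive, whereas a periodogram can drop to zero at some frequencies.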
A disadvantage of the LPC filter is that its power spectral density cannot match the "valleys," or "notches," in the periodogram and results in speech which is rather "flat," reflecting the fact that the LPC filter is an "all-pole model." This is related to the technical fact that the LPC filter only has poles and has no transmission zeros. To say that a filter has a transmission zero at a frequency ζ is to say the filter, or corresponding circuit, will absorb damped periodic signals which oscillate at a frequency equal to the phase of ζ and with a damping factor equal to the modulus of ζ. This is the well-known blocking property of transmission zeros of circuits (see, e.g., L. O. Chua, C. A. Desoer and E. S. Kuh, Linear and Nonlinear Circuits, McGraw-Hill, 1989, at 659). This technical fact is reflected in the fact, illustrated in FIG. 2, that the power spectral density of the LPC filter will not match the periodogram at frequencies near its notches. It is also widely appreciated in the signal and speech processing literature that regeneration of human speech requires the design of filters having zeros, without which the speech will sound flat or artificial (see, e.g., C. G. Bell, H. Fujisaki, J. M. Heinz, K. N. Stevens and A. S. House, Reduction of Speech Spectra by Analysis-by-Synthesis Techniques, J. Acoust. Soc. Am. 33 (1961), at 1726; J. D. Markel and A. H. Gray, Linear Prediction of Speech, Springer Verlag, Berlin, 1976, at 271-72; L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, N.J., 1978, at 105 and 76-78). Indeed, while all-pole filters can reproduce much of human speech sounds, the acoustic theory teaches that nasals and fricatives require both zeros and poles (see Markel et al., Linear Prediction of Speech, supra, at 271-72; Rabiner et al., Digital Processing of Speech Signals, supra, at 105; J. P. Campbell Jr., Speaker Recognition: A tutorial, supra, at 1442).
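The blocking property of a transmission zero can be checked numerically with a minimal sketch. The particular zero location (modulus 0.95, phase pi/3), the second-order filter with a conjugate pair of zeros, and the function names are assumptions chosen only for illustration.

```python
import cmath
import math

def fir_with_zero_pair(zeta):
    """Coefficients of (1 - zeta z^-1)(1 - conj(zeta) z^-1), a real filter
    with transmission zeros at zeta and its complex conjugate."""
    return [1.0, -2.0 * zeta.real, abs(zeta) ** 2]

def fir_filter(x, b):
    """y[n] = sum_k b[k] * x[n-k], with zero initial conditions."""
    return [sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
            for n in range(len(x))]

# Hypothetical zero location.
zeta = cmath.rect(0.95, math.pi / 3)

# Damped sinusoid: damping factor |zeta| per sample, frequency arg(zeta).
x = [abs(zeta) ** n * math.cos(n * cmath.phase(zeta)) for n in range(50)]

y = fir_filter(x, fir_with_zero_pair(zeta))

# After the two-sample start-up transient the output is zero up to rounding.
residual = max(abs(v) for v in y[2:])
```

Only the first two output samples are nonzero; they arise from the zero initial conditions of the filter. An all-pole filter has no such zero and therefore cannot absorb the signal in this way.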
As it relates to speech synthesis, this observation is a partial motivation for the device disclosed in U.S. Pat. No. 5,293,448, in which a zero filter is used as a prefilter to the all-pole filter to generate higher quality voiced signals. However, the lack of a clear and useful delineation of the extent to which zeros may be arbitrarily assigned to implementable linear filters for both voiced and unvoiced speech has remained a limiting factor in the design of improved devices for signal and speech processing.