1. Technical Field
The present application relates generally to speech recognition and, more particularly, to an acoustic signal processing system and method for providing wavelet-based energy binning cepstral features for automatic speech recognition.
2. Description of the Related Art
In general, there are many well-known signal processing techniques which are utilized in speech-based applications, such as speech recognition, for extracting spectral features from acoustic speech signals. The extracted spectral features are used to generate reference patterns (acoustic models) for certain identifiable sounds (phonemes) of the input acoustic speech signals.
Referring now to FIG. 1, a generalized speech recognition system in accordance with the prior art is shown. The speech recognition system 100 generally includes an acoustic front end 102 for preprocessing speech signals, i.e., input utterances for recognition and training speech. Typically, the acoustic front end 102 includes a microphone to convert the acoustic speech signals into an analog electrical signal having a voltage which varies over time in correspondence with the variations in air pressure caused by the input speech utterances. The acoustic front end also includes an analog-to-digital (A/D) converter for digitizing the analog signal by sampling the voltage of the analog waveform at a desired "sampling rate" and converting the sampled voltage to a corresponding digital value. The sampling rate is typically selected to be at least twice the highest frequency component (e.g., 16 kHz for pure speech or 8 kHz for a communication channel having a 4 kHz bandwidth).
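The sampling and conversion steps described above can be illustrated with a minimal sketch. The function names, the 16-bit resolution, the unit full-scale voltage, and the 1 kHz test tone are assumptions chosen for illustration; they are not taken from the system of FIG. 1.

```python
import math

def min_sampling_rate_hz(highest_frequency_hz):
    """Sampling rate equal to twice the highest frequency component."""
    return 2.0 * highest_frequency_hz

def quantize(voltage, full_scale=1.0, bits=16):
    """Map a voltage in [-full_scale, +full_scale] to a signed integer code.

    The 16-bit default and the symmetric code range are assumptions.
    """
    max_code = 2 ** (bits - 1) - 1
    clipped = max(-full_scale, min(full_scale, voltage))
    return round(clipped / full_scale * max_code)

def sample_and_digitize(waveform, duration_s, rate_hz, bits=16):
    """Sample a continuous-time waveform(t) and quantize each sample."""
    count = round(duration_s * rate_hz)
    return [quantize(waveform(i / rate_hz), bits=bits) for i in range(count)]

def tone(t):
    """Hypothetical 1 kHz test tone standing in for the microphone voltage."""
    return 0.5 * math.sin(2.0 * math.pi * 1000.0 * t)

# A channel with a 4 kHz bandwidth leads to an 8 kHz sampling rate.
rate = min_sampling_rate_hz(4000.0)
samples = sample_and_digitize(tone, 0.001, rate)
```

One millisecond of signal at 8 kHz yields eight integer samples; a real front end would also apply an anti-aliasing filter before the converter, which this sketch omits.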
Digital signal processing is performed on the digitized speech utterances (via the acoustic front end 102) by extracting spectral features to produce a plurality of feature vectors which, typically, represent the envelope of the speech spectrum. Each feature vector is computed for a given frame (or time interval) of the digitized speech, with each frame typically representing 10 ms to 30 ms. In addition, each feature vector includes "n" dimensions (parameters) to represent the sound within the corresponding time frame.
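The framing step can be sketched as follows. This is a toy illustration only: the 20 ms frame length, the 10 ms step between frames (which implies overlapping frames), the eight equal-width bands, and the function names are all assumptions, and the log band energies stand in for whatever spectral features a real front end would compute.

```python
import cmath
import math

def frame_signal(samples, rate_hz, frame_ms=20.0, step_ms=10.0):
    """Split digitized speech into fixed-length frames (overlap is assumed)."""
    frame_len = round(rate_hz * frame_ms / 1000.0)
    step = round(rate_hz * step_ms / 1000.0)
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, step)]

def band_log_energies(frame, n=8):
    """Toy n-dimensional feature vector: log energy in n equal-width bands."""
    size = len(frame)
    half = size // 2
    power = []
    for k in range(half):
        # Naive discrete Fourier transform of the frame at bin k.
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / size)
                  for i, x in enumerate(frame))
        power.append(abs(acc) ** 2)
    width = half // n
    # Small constant avoids log(0) for empty bands.
    return [math.log(sum(power[b * width:(b + 1) * width]) + 1e-12)
            for b in range(n)]

rate = 8000
signal = [math.sin(2.0 * math.pi * 440.0 * i / rate) for i in range(800)]
frames = frame_signal(signal, rate)
features = [band_log_energies(f) for f in frames]
```

Each entry of `features` is one n-dimensional vector for one frame; for the 440 Hz test tone the lowest band carries the most energy.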
The system includes a training module 104 which uses the feature vectors generated by the acoustic front end 102 from the training speech to train a plurality of acoustic models (prototypes) which correspond to the speech baseforms (e.g., phonemes). A decoder 106 uses the trained acoustic models to decode (i.e., recognize) speech utterances by comparing and matching the acoustic models with the feature vectors generated from the input utterances. The matching uses techniques such as the Hidden Markov Model (HMM) and Dynamic Time Warping (DTW) methods disclosed in "Statistical Methods For Speech Recognition", by Fred Jelinek, MIT Press, 1997, which are well known to those skilled in the art of speech recognition.
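As a rough illustration of the matching step, the following sketch applies a textbook form of Dynamic Time Warping to sequences of feature vectors. It is not the decoder 106 itself: the Euclidean local distance, the template dictionary, and the function names are assumptions, and a practical recognizer would more likely use HMM-based acoustic models.

```python
def dtw_distance(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    def dist(u, v):
        # Euclidean distance between two feature vectors (an assumption).
        return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[rows][cols]

def recognize(utterance, templates):
    """Return the label of the template with the lowest DTW cost."""
    return min(templates,
               key=lambda label: dtw_distance(utterance, templates[label]))

# Hypothetical one-dimensional templates for two words.
templates = {
    "yes": [[0.0], [1.0], [2.0]],
    "no": [[2.0], [1.0], [0.0]],
}
# A slower rendition of "yes": the same shape, stretched in time.
utterance = [[0.0], [0.0], [1.0], [2.0], [2.0]]
best = recognize(utterance, templates)
```

The time-stretched utterance aligns with the "yes" template at zero cost, which is the property that makes DTW tolerant of speaking-rate variation.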
Conventional feature extraction methods for automatic speech recognition generally rely on power spectrum approaches, whereby the acoustic signals are generally regarded as a one-dimensional signal with the assumption that the frequency content of the signal captures the relevant feature information. This is the case for the spectrum representation (with its Mel or Bark variations), the FFT-derived (Fast Fourier Transform) or LPC-derived (Linear Predictive Coding) cepstrum, other LPC-derived features, the autocorrelation, the energy content, and all the associated delta and delta-delta coefficients.
Cepstral parameters are, at present, widely used for efficient speech and speaker recognition. Basic details and justifications can be found in various references: J. R. Deller, J. G. Proakis, and J. H. L. Hansen, "Discrete Time Processing of Speech Signals", Macmillan, New York, N.Y., 1993; S. Furui, "Digital Speech Processing, Synthesis and Recognition", Marcel Dekker, New York, N.Y., 1989; L. Rabiner and B-H. Juang, "Fundamentals of Speech Recognition", Prentice-Hall, Englewood Cliffs, N.J., 1993; and A. V. Oppenheim and R. W. Schafer, "Digital Signal Processing", Prentice-Hall, Englewood Cliffs, N.J., 1975. Originally introduced to separate the pitch contribution from the rest of the vocal cord and vocal tract spectrum, the cepstrum has the additional advantage of approximating the Karhunen-Loeve transform of the speech signal. This property is highly desirable for recognition and classification.
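For readers unfamiliar with the cepstrum, a minimal sketch of the FFT-style real cepstrum (inverse transform of the log magnitude spectrum) is given below. The naive DFT, the magnitude floor, and the short test frame are assumptions made to keep the example self-contained; a practical front end would use an FFT and windowing.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(frame, floor=1e-12):
    """Real cepstrum: inverse DFT of the log magnitude spectrum."""
    n = len(frame)
    # The floor guards against log(0); its value is an assumption.
    log_mag = [math.log(max(abs(c), floor)) for c in dft(frame)]
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

frame = [1.0, 0.5, 0.25, 0.125, 0.0, 0.0, 0.0, 0.0]
quiet = real_cepstrum(frame)
loud = real_cepstrum([2.0 * x for x in frame])
```

Because the logarithm turns products in the spectrum into sums, a pure gain change appears only in the zeroth coefficient: `loud[0] - quiet[0]` is approximately log 2, while the remaining coefficients are unchanged. The same mechanism is what allows slowly and rapidly varying spectral contributions to be separated.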
Speech production models, coding methods, and text-to-speech technology often lead to the introduction of modulation models, which represent speech signals with primary components that are amplitude- and phase-modulated sine functions. For example, the conventional modulation model (MM) represents speech signals as a linear combination of amplitude- and phase-modulated components: ##EQU1##
where $A_k(t)$ is the instantaneous amplitude, $\omega_k(t) = \frac{d}{dt}\theta_k(t)$ is the instantaneous frequency of component (or formant) $k$, and where $N(t)$ takes into account the errors of modelling. In a more sophisticated model, the components are viewed as "ribbons" in the time-frequency plane rather than curves, and instantaneous bandwidths $\Delta\omega_k(t)$ are associated with each component. These parameters can be extracted and processed to generate feature vectors for speech recognition.
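For orientation, a modulation model consistent with the definitions above is commonly written in the following form. The equation is not reproduced in this text, so the form shown is an assumption: in particular the signal name $s(t)$, the number of components $K$, and the use of a cosine carrier are illustrative choices.

```latex
s(t) = \sum_{k=1}^{K} A_k(t)\,\cos\bigl[\theta_k(t)\bigr] + N(t),
\qquad
\omega_k(t) = \frac{d}{dt}\,\theta_k(t)
```

Under this reading, each term of the sum is one primary component (or formant), and $N(t)$ collects whatever the $K$ components fail to model.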
Other methods which characterize speech with phase-derived features are, for example, the EIH (Ensemble Interval Histogram) (see O. Ghitza, "Auditory Models and Human Performances in Tasks Related to Speech Coding and Speech Recognition", IEEE Trans. Speech Audio Proc., 2(1):pp. 115-132, 1994), the SBS (In-Synchrony Bands Spectrum) (see O. Ghitza, "Auditory Nerve Representation Criteria For Speech Analysis/Synthesis", IEEE Trans. ASSP, 35(6):pp. 736-740, June 1987), and the IFD (Instantaneous-Frequency Distribution) (see D. H. Friedman, "Instantaneous-Frequency Distribution Vs. Time: An Interpretation of the Phase Structure of Speech", IEEE Proc. ICASSP, pp. 1121-1124, 1985). These methods are derived from (nonplace/temporal) models of the human auditory nerve system.
In addition, the wavelet transform (WT) is a widely used time-frequency tool for signal processing which has proved to be well adapted for extracting the modulation laws of isolated or substantially distinct primary components. The WT performed with a complex analysis wavelet is known to carry relevant information in its modulus as well as in its phase. The information contained in the modulus is similar to the power spectrum derived parameters. The phase is partially independent of the amplitude level of the input signal. Practical considerations and intrinsic limitations, however, limit the direct application of the WT for speech recognition purposes.
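The statement that a complex analysis wavelet carries information in both modulus and phase can be illustrated with a small sketch. The Morlet wavelet, its center frequency parameter of 6.0, the direct-summation evaluation at a single time-scale point, and the 500 Hz test tone are all assumptions chosen for illustration; no particular wavelet is specified in the text above.

```python
import cmath
import math

def morlet(u, omega0=6.0):
    """Complex Morlet analysis wavelet (normalization constant omitted)."""
    return cmath.exp(1j * omega0 * u) * math.exp(-0.5 * u * u)

def cwt_coefficient(samples, rate_hz, center_s, scale_s):
    """Wavelet coefficient at one (time, scale) point by direct summation."""
    total = 0j
    for i, x in enumerate(samples):
        u = (i / rate_hz - center_s) / scale_s
        if abs(u) < 5.0:  # the Gaussian envelope is negligible beyond this
            total += x * morlet(u).conjugate()
    return total / (rate_hz * math.sqrt(scale_s))

rate = 8000
tone_hz = 500.0
signal = [math.cos(2.0 * math.pi * tone_hz * i / rate) for i in range(800)]
# Scale at which the wavelet's center frequency matches the tone.
scale = 6.0 / (2.0 * math.pi * tone_hz)

quiet = cwt_coefficient(signal, rate, 0.05, scale)
loud = cwt_coefficient([4.0 * x for x in signal], rate, 0.05, scale)
```

Scaling the input by a factor of four scales the modulus `abs(loud)` by the same factor, while `cmath.phase(loud)` matches `cmath.phase(quiet)`. This simple gain case is the sense in which the phase does not depend on the amplitude level; the text above claims only partial independence in general.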
Parallelisms between properties of the wavelet transform of primary components and algorithmic representations of speech signals derived from auditory nerve models like the EIH have led to the introduction of "synchrosqueezing" measures: a novel transformation of the time-scale plane obtained by a quasi-continuous wavelet transform into a time-frequency plane (i.e., synchrosqueezed plane) (see, e.g., "Robust Speech and Speaker Recognition Using Instantaneous Frequencies and Amplitudes Obtained With Wavelet-Derived Synchrosqueezing Measures", Program on Spline Functions and the Theory of Wavelets, Montreal, Canada, March 1996, Centre de Recherches Mathematiques, Universite de Montreal (invited paper)).
On the other hand, as stated above, in automatic speech recognition, cepstral features have imposed themselves quasi-universally as the acoustic characteristics of speech utterances. The cepstrum can be seen as an explicit function of the formants and other primary components of the modulation model. Two main classes of cepstrum extraction have been used intensively: the LPC-derived cepstrum and the FFT cepstrum. The second approach has become dominant, usually with Mel-binning. Accordingly, a method for extracting spectral features that builds on these conventional methods to construct feature vectors providing increased robustness for speech recognition systems is highly desirable.
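The Mel-binning step mentioned above can be sketched as follows. The Hz-to-Mel formula used here is one of several common conventions, and the triangular weighting, the six bands, and the function names are likewise assumptions rather than details taken from this text.

```python
import math

def hz_to_mel(f_hz):
    """One common Hz-to-Mel mapping (other conventions exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_bands, low_hz, high_hz):
    """Edges in Hz of n_bands triangular bands, equally spaced in Mel."""
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]

def mel_bin_energies(power_spectrum, rate_hz, n_bands=6):
    """Sum a power spectrum (bins from 0 Hz to rate_hz / 2) into Mel bands."""
    n_bins = len(power_spectrum)
    edges = mel_band_edges(n_bands, 0.0, rate_hz / 2.0)
    energies = []
    for b in range(n_bands):
        left, center, right = edges[b], edges[b + 1], edges[b + 2]
        total = 0.0
        for k, p in enumerate(power_spectrum):
            f = k * (rate_hz / 2.0) / (n_bins - 1)
            if left < f <= center:
                total += p * (f - left) / (center - left)    # rising slope
            elif center < f < right:
                total += p * (right - f) / (right - center)  # falling slope
        energies.append(total)
    return energies

edges = mel_band_edges(6, 0.0, 4000.0)
energies = mel_bin_energies([1.0] * 129, 8000)
```

The band edges are uniformly spaced on the Mel scale and therefore grow wider in Hz toward high frequencies. In an FFT cepstrum pipeline, the logarithm of each band energy would then typically be passed to a cosine transform to obtain the cepstral coefficients.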