1. Field of the Invention
This invention relates to speech recognition and more particularly relates to combining matrix and vector quantization to provide robust speech recognition.
2. Description of the Related Art
Speech is perhaps the most important communication method available to mankind. It is also a natural method for man-machine communication. Man-machine communication by voice offers a whole new range of information/communication services which can extend man's capabilities, serve his social needs, and increase his productivity. Speech recognition is a key element in establishing man-machine communication by voice, and, as such, speech recognition is an important technology with tremendous potential for widespread use in the future.
Voice communication between man and machine benefits from an efficient speech recognition interface. Speech recognition interfaces are commonly implemented as Speaker-Dependent (SD)/Speaker-Independent (SI) Isolated Word Speech Recognition (IWSR)/continuous speech recognition (CSR) systems. The SD/SI IWSR/CSR system provides, for example, a beneficial voice command interface for hands free telephone dialing and interaction with voice store and forwarding systems. Such technology is particularly useful in an automotive environment for safety purposes.
However, to be useful, speech recognition must generally be very accurate in correctly recognizing (classifying) the input signal 101 with a satisfactory probability of accuracy. Difficulty in correct recognition arises particularly when operating in an acoustically noisy environment. Recognition accuracy may be severely and unfavorably impacted under realistic environmental conditions where speech is corrupted by various levels of acoustic noise.
FIG. 1 generally characterizes a speech recognition process by the speech recognition system 100. A microphone transducer 102 picks up an input signal 101 and provides to signal preprocessor 104 an electronic signal representation of the composite input signal 101. The input signal 101 is an acoustic waveform of a spoken input, typically a word, or a connecting string of words. The signal preprocessor 104 may, for example, filter the input signal 101, and a feature extractor 106 extracts selected information from the input signal 101 to characterize the signal with, for example, cepstral frequencies or line spectral pair frequencies (LSPs).
Referring to FIG. 2, more specifically, feature extraction in operation 106 is basically a data-reduction technique whereby a large number of data points (in this case samples of the input signal 101 recorded at an appropriate sampling rate) are transformed into a smaller set of features which are "equivalent", in the sense that they faithfully describe the salient properties of the input signal 101. Feature extraction is generally based on a speech production model which typically assumes that the vocal tract of a speaker can be represented as the concatenation of lossless acoustic tubes (not shown) which, when excited by excitation signals, produces a speech signal. Samples of the speech waveform are assumed to be the output of a time-varying filter that approximates the transmission properties of the vocal tract. It is reasonable to assume that the filter has fixed characteristics over a time interval of the order of 10 to 30 milliseconds (ms). Thus, short-time input signal 101 portion of input signal 101 may be represented by a linear, time-invariant all pole filter designed to model the spectral envelope of the signal in each time frame. The filter may be characterized within a given interval by an impulse response and a set of coefficients.
Feature extraction in operation 106 using linear predictive (LP) speech production models has become the predominant technique for estimating basic speech parameters such as pitch, formants, spectra, and vocal tract area functions. The LP model allows for linear predictive analysis which basically approximates an input signal 101 as a linear combination of past speech samples. By minimizing the sum of the squared differences (over a finite interval) between actual speech samples and the linearly predicted ones, a unique set of prediction filter coefficients can be determined. The predictor coefficients are weighting coefficients used in the linear combination of past speech samples. The LP coefficients are generally updated very slowly with time, for example, every 10-30 ms, to represent the changing vocal tract. LP prediction coefficients are calculated using a variety of well-known procedures, such as autocorrelation and covariance procedures, to minimize the difference between the actual input signal 101 and a predicted input signal 101 often stored as a spectral envelope reference pattern. The LP prediction coefficients can be easily transformed into several different representations including cepstral coefficients and line spectrum pair (LSP) frequencies. Details of LSP theory can be found in N. Sugamura, "Speech Analysis and Synthesis Methods Developed at ECL in NTT-from LPC to LSP", Speech Communication 5, Elsevier Science Publishers, B. V., pp. 199-215 (1986).
Final decision-logic classifier 108 utilizes the extracted information to classify the represented input signal 101 to a database of representative input signal 101. Speech recognition classifying problems can be treated as a classical pattern recognition problem. Fundamental ideas from signal processing, information theory, and computer science can be utilized to facilitate isolated word recognition and simple connected-word sequences recognition.
FIG. 2 illustrates a more specific speech recognition system 200 based on pattern recognition as used in many IWSR type systems. The extracted features representing input signal 01 are segmented into short-term input signal 101 frames and considered to be stationary within each frame for 10 to 30 msec duration. The extracted features may be represented by a P-dimensional vector and compared with predetermined, stored reference patterns 208 by the pattern similarity operation 210. Similarity between the input signal 101 pattern and the stored reference patterns 208 is determined in pattern similarity operation 210 using well-known vector quantization processes. The vector quantization process yields spectral distortion or distance measures to quantify the score of fitness or closeness between the representation of input signal 101 and each of the stored reference patterns 208.
The decision rule operation 212 receives the distance measures and determines which of the reference patterns 208 the input signal 101 most closely represents. In a "hard" decision making process, input signal 101 is matched to one of the reference patterns 208. This one-to-one "hard decision" ignores the relationship of the input signal 101 to all the other reference patterns 208. Fuzzy methods have been introduced to provide a better match between vector quantized frames of input signal 101 and reference patterns 208. In a "soft" or "fuzzy" decision making process, input signal 101 is related to one or more reference patterns 208 by weighting coefficients.
Matrix quantization has also been used to introduce temporal information about input signal 101 into decision rule operation 212. Fuzzy analysis methods have also been incorporated into matrix quantization processes, as described in Xydeas and Cong, "Robust Speech Recognition In a Car Environment", Proceeding of the DSP95 International Conference on Digital Signal Processing, Jun. 26-28, 1995, Limassol, Cyprus. Fuzzy matrix quantization allows for "soft" decision using interframe information related to the "evolution" of the short-term spectral envelopes of input signal 101.
Despite conventional speech recognition progress, research and developement continues to focus on more efficient speech recognition systems with higher speech recognition accuracy.