1. Field of the Invention
This invention relates to speech recognition and more particularly relates to combining matrix and vector quantization with selective enhanced processing and neural network postprocessing to provide efficient and robust speech recognition.
2. Description of the Related Art
Speech is perhaps the most important communication method available to mankind. It is also a natural method for man-machine communication. Man-machine communication by voice offers a whole new range of information/communication services which can extend man's capabilities, serve his social needs, and increase his productivity. Speech recognition is a key element in establishing man-machine communication by voice, and, as such, speech recognition is an important technology with tremendous potential for widespread use in the future.
Voice communication between man and machine benefits from an efficient speech recognition interface. Speech recognition interfaces are commonly implemented as Speaker-Dependent (SD)/Speaker-Independent (SI) Isolated Word Speech Recognition (IWSR)/continuous speech recognition (CSR) systems. The SD/SI IWSR/CSR system provides, for example, a beneficial voice command interface for hands free telephone dialing and interaction with voice store and forwarding systems. Such technology is particularly useful in an automotive environment for safety purposes.
However, to be useful, speech recognition must generally be very accurate in correctly recognizing (classifying) an input signal with a satisfactory probability of accuracy. Difficulty in correct recognition arises particularly when operating in an acoustically noisy environment. Recognition accuracy may be severely and unfavorably impacted under realistic environmental conditions where speech is corrupted by various levels of acoustic noise.
FIG. 1 generally characterizes a speech recognition process by the speech recognition system 100. A microphone transducer 102 picks up an input signal 101 and provides to signal preprocessor 104 an electronic signal representation of input signal 101. The input signal 101 is an acoustic waveform of a spoken input, typically a word, or a connecting string of words. The signal preprocessor 104 may, for example, filter the input signal 101, and a feature extractor 106 extracts selected information from the input signal 101 to characterize the signal using, for example, cepstral frequencies or line spectral pair frequencies (LSPs).
Referring to FIG. 2, feature extraction in operation 106 is basically a data-reduction technique whereby a large number of data points (in this case samples of the input signal 101 recorded at an appropriate sampling rate) are transformed into a smaller set of features which are "equivalent", in the sense that they faithfully describe the salient properties of the input signal 101. Feature extraction is generally based on a speech production model which typically assumes that the vocal tract of a speaker can be represented as the concatenation of lossless acoustic tubes (not shown) which, when excited by excitation signals, produce a speech signal. Samples of the speech waveform are assumed to be the output of a time-varying filter that approximates the transmission properties of the vocal tract. It is reasonable to assume that the filter has fixed characteristics over a time interval on the order of 10 to 30 milliseconds. Thus, short-time samples of input signal 101 may be represented by a linear, time-invariant all pole filter designed to model the spectral envelope of the input signal 101 in each time frame. The filter may be characterized within a given interval by an impulse response and a set of coefficients.
Feature extraction in operation 106 using linear predictive (LP) speech production models has become the predominant technique for estimating basic speech parameters such as pitch, formants, spectra, and vocal tract area functions. The LP model allows for linear predictive analysis which basically approximates input signal 101 as a linear combination of past speech samples. By minimizing the sum of the squared differences (over a finite interval) between actual speech samples and the linearly predicted ones, a unique set of prediction filter coefficients can be determined. The predictor coefficients are weighting coefficients used in the linear combination of past speech samples. The LP coefficients are generally updated very slowly with time, for example, every 10-30 milliseconds, to represent the changing states of the vocal tract. LP prediction coefficients are calculated using a variety of well-known procedures, such as autocorrelation and covariance procedures, to minimize the difference between the actual input signal 101 and a predicted input signal 101. The LP prediction coefficients are often stored as a spectral envelope reference pattern and can be easily transformed into several different representations including cepstral coefficients and line spectrum pair (LSP) frequencies. Details of LSP theory can be found in N. Sugamura, "Speech Analysis and Synthesis Methods Developed at ECL in NTT-from LPC to LSP", Speech Communication 5, Elsevier Science Publishers, B. V., pp. 199-215 (1986).
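As an illustration of the autocorrelation procedure mentioned above, the following minimal sketch estimates predictor coefficients for one short-time frame using the Levinson-Durbin recursion. The function names `autocorrelation` and `levinson_durbin` are hypothetical and chosen only for this example; windowing, pre-emphasis, and the conversion to cepstral or LSP parameters are omitted.

```python
def autocorrelation(frame, max_lag):
    """Return the autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations by the Levinson-Durbin recursion.

    The predictor is s_hat[n] = sum over k of a[k] * s[n - k], k = 1..order.
    Returns (a[1..order], residual prediction error energy).
    """
    a = [0.0] * (order + 1)   # a[0] is unused
    err = r[0]                # prediction error of the order-0 predictor
    for i in range(1, order + 1):
        if err == 0.0:        # silent or perfectly predicted frame
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err         # reflection coefficient for this order
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= (1.0 - k * k)  # error shrinks as the order grows
    return a[1:], err
```

Minimizing the squared prediction error over the frame leads to a Toeplitz system in the autocorrelation values, which is why the recursion can solve for the coefficients in order-squared time rather than by general matrix inversion.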
Final decision-logic classifier 108 utilizes the extracted feature information to classify the represented input signal 101 against a database of representative input signals. Speech recognition classification can be treated as a classical pattern recognition problem. Fundamental ideas from signal processing, information theory, and computer science can be utilized to facilitate isolated word recognition and simple connected-word sequence recognition.
FIG. 2 illustrates a more specific speech recognition system 200 based on pattern recognition as used in many IWSR type systems. The extracted features representing input signal 101 are segmented into short-term input signal 101 frames and considered to be stationary within each frame for 10 to 30 msec duration. The extracted features may be represented by a D-dimensional vector and compared with predetermined, stored reference patterns 208 by the pattern similarity operation 210. Similarity between the input signal 101 pattern and the stored reference patterns 208 is determined in pattern similarity operation 210 using well-known vector quantization processes. The vector quantization process yields spectral distortion or distance measures to quantify the score of fitness or closeness between the representation of input signal 101 and each of the stored reference patterns 208.
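The pattern similarity operation described above may be sketched as follows. This is an illustrative example only: it assumes one codebook of reference vectors per vocabulary word and a plain squared Euclidean distance, whereas a practical system may use a spectrally weighted distance measure. The names `squared_distance`, `quantize_frame`, and `classify` are hypothetical.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two D-dimensional vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def quantize_frame(frame, codebook):
    """Return (index, distortion) of the codeword closest to the frame."""
    distances = [squared_distance(frame, codeword) for codeword in codebook]
    best = min(range(len(codebook)), key=distances.__getitem__)
    return best, distances[best]


def classify(frames, codebooks):
    """Pick the word whose codebook yields the lowest average distortion.

    frames:    list of D-dimensional feature vectors, one per frame
    codebooks: dict mapping each word to its list of reference vectors
    """
    scores = {}
    for word, codebook in codebooks.items():
        total = sum(quantize_frame(f, codebook)[1] for f in frames)
        scores[word] = total / len(frames)
    best_word = min(scores, key=scores.get)
    return best_word, scores
```

The per-word average distortion plays the role of the distance measure passed on to the decision rule operation.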
The decision rule operation 212 receives the distance measures and determines which of the reference patterns 208 the input signal 101 most closely represents. In a "hard" decision making process, input signal 101 is matched to one of the reference patterns 208. This one-to-one "hard decision" ignores the relationship of the input signal 101 to all the other reference patterns 208. Fuzzy methods have been introduced to provide a better match between vector quantized frames of input signal 101 and reference patterns 208. In a "soft" or "fuzzy" decision making process, input signal 101 is related to one or more reference patterns 208 by weighting coefficients.
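The difference between the two decision processes may be illustrated with a short sketch. The weighting below borrows the membership formula of fuzzy c-means clustering, which is an assumption made for illustration and not necessarily the weighting used by any particular fuzzy quantizer; the names `hard_decision`, `fuzzy_memberships`, and the `fuzziness` parameter are hypothetical.

```python
def hard_decision(distances):
    """Index of the single closest reference pattern."""
    return min(range(len(distances)), key=distances.__getitem__)


def fuzzy_memberships(distances, fuzziness=2.0):
    """Weighting coefficients relating the input to every reference pattern.

    distances are squared distances to each reference pattern.
    fuzziness must exceed 1; values near 1 approach a hard decision.
    The returned coefficients are non-negative and sum to 1.
    """
    if fuzziness <= 1.0:
        raise ValueError("fuzziness must be greater than 1")
    exact = [i for i, d in enumerate(distances) if d == 0.0]
    if exact:  # an exact match takes all of the membership
        return [1.0 / len(exact) if i in exact else 0.0
                for i in range(len(distances))]
    exponent = 1.0 / (fuzziness - 1.0)
    inverse = [(1.0 / d) ** exponent for d in distances]
    total = sum(inverse)
    return [v / total for v in inverse]
```

For squared distances of 1.0 and 3.0 with the default fuzziness, the hard decision selects the first pattern only, while the fuzzy weighting assigns 0.75 to the first pattern and 0.25 to the second, so the relationship to the second pattern is retained.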
Matrix quantization has also been used to introduce temporal information about input signal 101 into decision rule operation 212. Fuzzy analysis methods have also been incorporated into matrix quantization processes, as described in Xydeas and Cong, "Robust Speech Recognition In a Car Environment", Proceeding of the DSP95 International Conference on Digital Signal Processing, Jun. 26-28, 1995, Limassol, Cyprus. Fuzzy matrix quantization allows for "soft" decisions using interframe information related to the "evolution" of the short-term spectral envelopes of input signal 101.
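A minimal sketch of how matrix quantization captures temporal information is given below. It assumes non-overlapping groups of consecutive frames and a distance equal to the average squared Euclidean distance between corresponding frames; both choices, and the names `group_frames`, `matrix_distance`, and `quantize_matrix`, are assumptions made for illustration.

```python
def group_frames(frames, n):
    """Split frames into consecutive matrices of n frames each.

    A trailing group with fewer than n frames is dropped.
    """
    return [frames[i:i + n] for i in range(0, len(frames) - n + 1, n)]


def matrix_distance(matrix, codeword):
    """Average squared distance between corresponding frames."""
    total = sum(
        sum((a - b) ** 2 for a, b in zip(frame, ref))
        for frame, ref in zip(matrix, codeword)
    )
    return total / len(matrix)


def quantize_matrix(matrix, codebook):
    """Return (index, distortion) of the closest codeword matrix."""
    distances = [matrix_distance(matrix, c) for c in codebook]
    best = min(range(len(codebook)), key=distances.__getitem__)
    return best, distances[best]
```

Because each codeword spans several frames in a fixed order, two inputs containing the same frames in a different order map to different codewords, which a frame-by-frame vector quantizer could not distinguish.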
Despite conventional speech recognition achievements, research and development continues to focus on more efficient speech recognition systems with higher speech recognition accuracy.
In one embodiment, vector quantization operates on a single frame of input signal frequency parameters and, at least generally, does not incorporate temporal signal information into the vector quantization operation. However, vector quantization performs particularly well when temporal information is scarce or non-existent, such as with short input signal duration. Matrix quantization operates on multiple input signal frames and, thus, utilizes both temporal and frequency information about the input signal. However, errors may be introduced into matrix quantization operations when operating on a short duration input signal. Thus, although matrix quantization generally leads to a higher recognition accuracy than vector quantization, vector quantization can compensate for matrix quantization errors that may occur when operating on brief duration input signals. Additionally, signal features may be divided or split by, for example, frequency subbands to allow for differential processing to, for example, target enhanced processing on the more greatly affected subbands. Split matrix and split vector quantization techniques may be used to more efficiently and more accurately classify the input signal. Furthermore, additional speech classifiers such as hidden Markov models may be trained and their stochastic output data may serve as input data to a further speech classifier such as a neural network. Respective hidden Markov models may be designed using quantization data as the observation sequences and a probability algorithm such as the Viterbi algorithm to determine likelihood probabilities.
In one embodiment, a new hybrid speech recognition system combines Matrix Quantization (MQ) and Vector Quantization (VQ) with Hidden Markov Models (HMMs) and neural network postprocessing to efficiently utilize processing resources and improve speech recognition performance. This MQ/HMM/NN_VQ/HMM/NN system exploits the "evolution" of speech short-term spectral envelopes with error compensation from VQ/HMM processes. Additionally, the neural network, which in one embodiment is a multi-layer perceptron type neural network, further enhances recognition accuracy. Acoustic noise may affect particular frequency domain subbands. In one embodiment, split matrix and split vector quantizers exploit localized noise by efficiently allocating enhanced processing technology to target noise-affected input signal parameters and minimize noise influence. The enhanced processing technology employs, for example, a weighted LSP and signal energy related distance measure in an LBG algorithm. In another embodiment, matrix and vector quantizers are utilized to process incoming speech data without splitting frequency subbands. In another embodiment, a variety of input data may be provided to the neural network to efficiently maximize recognition accuracy. In a further embodiment, 'hard' decisions, i.e., non-fuzzy decisions, are utilized by the respective quantizers to reduce processing resource demand while continuing to use other enhanced recognition resources to achieve high percentage speech recognition accuracy.
In one embodiment, multiple speech processing subsystems are employed to provide initial quantization data to respective speech classifiers. Output data from the speech classifiers may be combined in such a way as to compensate for quantization errors introduced by the speech processing subsystems. In another embodiment, one of the speech processing subsystems includes a vector quantizer which provides quantization information to a speech classifier having hidden Markov models. Another speech processing subsystem includes a matrix quantizer which provides quantization information to another speech classifier having hidden Markov models. Output data from the respective hidden Markov models respectively associated with the vector and matrix quantizers may be mixed using any of a variety of criteria and provided to a neural network for enhanced recognition accuracy.
In another embodiment of the present invention, a speech recognition system includes a vector quantizer to receive first parameters of an input signal and to generate a first quantization observation sequence and a first speech classifier to receive the first quantization observation sequence from the vector quantizer and to generate first respective speech classification output data. The speech recognition system further includes a matrix quantizer to receive second parameters of the input signal and to generate a second quantization observation sequence, a second speech classifier to receive the second quantization observation sequence from the matrix quantizer and to generate second respective speech classification output data, and a mixer to combine corresponding first and second respective speech classification data to generate third respective speech classification data and to generate output data from the first, second, and third speech classification data. The speech recognition system also includes a neural network to receive output data from the mixer and to determine fourth respective speech classification output data.
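The role of the mixer may be illustrated with a short sketch. It assumes that each classifier outputs one log-likelihood score per vocabulary word and that corresponding scores are combined by a weighted sum; the weight `alpha`, the weighted-sum rule, and the names `mix_scores` and `network_input` are assumptions made for illustration, since the mixing criterion may be any of a variety of criteria.

```python
def mix_scores(vq_scores, mq_scores, alpha=1.0):
    """Combine corresponding per-word scores from the two classifiers.

    vq_scores: scores from the classifier fed by the vector quantizer
    mq_scores: scores from the classifier fed by the matrix quantizer
    alpha:     hypothetical weight applied to the matrix quantizer scores
    """
    if len(vq_scores) != len(mq_scores):
        raise ValueError("both classifiers must score the same vocabulary")
    return [alpha * m + v for v, m in zip(vq_scores, mq_scores)]


def network_input(vq_scores, mq_scores, alpha=1.0):
    """Build the mixer output: first, second, and combined (third) data."""
    combined = mix_scores(vq_scores, mq_scores, alpha)
    return list(vq_scores) + list(mq_scores) + combined
```

For a vocabulary of u words, the resulting vector has 3u entries, which in this sketch would be presented to the input layer of the neural network.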
In another embodiment of the present invention, a method includes the steps of processing first parameters of the input signal to relate the first parameters to first reference data wherein the first parameters include frequency and time domain information, generating first output data relating the first parameters to the first reference data, and processing second parameters of the input signal to relate the second parameters to second reference data wherein the second parameters include frequency domain information. The method further includes the steps of generating second output data relating the second parameters to the second reference data, combining the first output data and second output data into third output data to compensate for errors in the first output data, and providing the first, second, and third output data to a neural network to classify the input signal.