1. Field of the Invention
This invention relates to speech recognition and more particularly relates to determining and providing frequency and mean compensated frequency input data to respective quantizer(s) and backend processors to provide efficient and robust speech recognition.
2. Description of the Related Art
Speech is perhaps the most important communication method available to mankind. It is also a natural method for man-machine communication. Man-machine communication by voice offers a whole new range of information/communication services which can extend man's capabilities, serve his social needs, and increase his productivity. Speech recognition is a key element in establishing man-machine communication by voice, and, as such, speech recognition is an important technology with tremendous potential for widespread use in the future.
Voice communication between man and machine benefits from an efficient speech recognition interface. Speech recognition interfaces are commonly implemented as Speaker-Dependent (SD)/Speaker-Independent (SI) Isolated Word Speech Recognition (IWSR)/continuous speech recognition (CSR) systems. The SD/SI IWSR/CSR system provides, for example, a beneficial voice command interface for hands free telephone dialing and interaction with voice store and forwarding systems. Such technology is particularly useful in an automotive environment for safety purposes.
However, to be useful, speech recognition must generally be very accurate, correctly recognizing (classifying) an input signal with a satisfactory probability. Correct recognition is particularly difficult when operating in an acoustically noisy environment. Recognition accuracy may be severely degraded under realistic environmental conditions where speech is corrupted by various levels of acoustic noise.
FIG. 1 generally characterizes a speech recognition process by the speech recognition system 100. A microphone transducer 102 picks up an input signal 101 and provides to signal preprocessor 104 an electronic signal representation of input signal 101. The input signal 101 is an acoustic waveform of a spoken input, typically a word, or a connecting string of words. The signal preprocessor 104 may, for example, filter the input signal 101, and a feature extractor 106 extracts selected information from the input signal 101 to characterize the signal using, for example, cepstral frequencies or line spectral pair frequencies (LSPs).
Referring to FIG. 2, feature extraction in operation 106 is basically a data-reduction technique whereby a large number of data points (in this case samples of the input signal 101 recorded at an appropriate sampling rate) are transformed into a smaller set of features which are “equivalent”, in the sense that they faithfully describe the salient properties of the input signal 101. Feature extraction is generally based on a speech production model which typically assumes that the vocal tract of a speaker can be represented as the concatenation of lossless acoustic tubes (not shown) which, when excited by excitation signals, produce a speech signal. Samples of the speech waveform are assumed to be the output of a time-varying filter that approximates the transmission properties of the vocal tract. It is reasonable to assume that the filter has fixed characteristics over a time interval on the order of 10 to 30 milliseconds. The short-time samples of input signal 101 may be represented by a linear, time-invariant all-pole filter designed to model the spectral envelope of the input signal 101 in each time frame. The filter may be characterized within a given interval by an impulse response and a set of coefficients.
Feature extraction in operation 106 using linear predictive (LP) speech production models has become the predominant technique for estimating basic speech parameters such as pitch, formants, spectra, and vocal tract area functions. The LP model allows for linear predictive analysis which basically approximates input signal 101 as a linear combination of past speech samples. By minimizing the sum of the squared differences (over a finite interval) between actual speech samples and the linearly predicted ones, a unique set of prediction filter coefficients can be determined. The predictor coefficients are weighting coefficients used in the linear combination of past speech samples. The LP coefficients are generally updated very slowly with time, for example, every 10-30 milliseconds, to represent the changing states of the vocal tract. LP prediction coefficients are calculated using a variety of well-known procedures, such as autocorrelation and covariance procedures, to minimize the difference between the actual input signal 101 and a predicted input signal 101. The LP prediction coefficients are often stored as a spectral envelope reference pattern and can be easily transformed into several different representations including cepstral coefficients and line spectrum pair (LSP) frequencies. Details of LSP theory can be found in N. Sugamura, “Speech Analysis and Synthesis Methods Developed at ECL in NTT - from LPC to LSP”, Speech Communication 5, Elsevier Science Publishers, B. V., pp. 199-215 (1986).
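The autocorrelation procedure named above can be illustrated with a minimal sketch. It is not taken from this document: the function names `autocorrelation` and `levinson_durbin` and the sample frame are hypothetical, and the Levinson-Durbin recursion is used here as one well-known way to solve the resulting normal equations, assuming the predictor convention in which a sample is estimated as a weighted sum of the preceding samples.

```python
def autocorrelation(samples, max_lag):
    """Short-time autocorrelation r[0..max_lag] of one frame of samples."""
    n = len(samples)
    return [sum(samples[t] * samples[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for the LP coefficients.

    Returns (a, err), where the predicted sample is
    s_hat[n] = a[0]*s[n-1] + a[1]*s[n-2] + ... + a[order-1]*s[n-order]
    and err is the remaining prediction error energy.
    """
    a = [0.0] * (order + 1)        # a[0] is unused; a[k] weights s[n-k]
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:             # degenerate (e.g. all-zero) frame
            break
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / err            # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= (1.0 - k_i * k_i)
    return a[1:], err


# Hypothetical frame of samples, for illustration only.
frame = [0.0, 0.5, 0.9, 1.0, 0.8, 0.4, -0.1, -0.5, -0.8, -0.9]
coeffs, residual = levinson_durbin(autocorrelation(frame, 2), 2)
```

In a recognizer of the kind described, such coefficients would be recomputed for every 10 to 30 millisecond frame and then converted to LSP frequencies; that conversion is not shown here.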
Final decision-logic classifier 108 utilizes the extracted feature information to classify the represented input signal 101 against a database of representative input signals. Speech recognition classification can be treated as a classical pattern recognition problem. Fundamental ideas from signal processing, information theory, and computer science can be utilized to facilitate recognition of isolated words and simple connected-word sequences.
FIG. 2 illustrates a more specific speech recognition system 200 based on pattern recognition as used in many IWSR type systems. The extracted features representing input signal 101 are segmented into short-term input signal 101 frames and considered to be stationary within each frame for 10 to 30 msec duration. The extracted features may be represented by a D-dimensional vector and compared with predetermined, stored reference patterns 208 by the pattern similarity operation 210. Similarity between the input signal 101 pattern and the stored reference patterns 208 is determined in pattern similarity operation 210 using well-known vector quantization processes. The vector quantization process yields spectral distortion or distance measures to quantify the score of fitness or closeness between the representation of input signal 101 and each of the stored reference patterns 208.
The decision rule operation 212 receives the distance measures and determines which of the reference patterns 208 the input signal 101 most closely represents. In a “hard” decision making process, input signal 101 is matched to one of the reference patterns 208. This one-to-one “hard decision” ignores the relationship of the input signal 101 to all the other reference patterns 208. Fuzzy methods have been introduced to provide a better match between vector quantized frames of input signal 101 and reference patterns 208. In a “soft” or “fuzzy” decision making process, input signal 101 is related to one or more reference patterns 208 by weighting coefficients.
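The difference between the two decision styles can be sketched as follows. This is an illustration under stated assumptions, not the method of any cited system: the squared Euclidean distance, the fuzzy c-means style membership formula, the `fuzziness` parameter, and all names and values are hypothetical choices made for the example.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def hard_decision(x, codebook):
    """Index of the single closest reference pattern."""
    distances = [squared_distance(x, c) for c in codebook]
    return min(range(len(codebook)), key=distances.__getitem__)


def fuzzy_decision(x, codebook, fuzziness=2.0):
    """Weighting coefficients over all reference patterns (they sum to 1).

    Assumed form: weight_k is proportional to (1 / d_k) ** (1 / (fuzziness - 1)),
    which requires fuzziness > 1.
    """
    distances = [squared_distance(x, c) for c in codebook]
    # An exact match takes the full weight; this also avoids dividing by zero.
    if any(d == 0.0 for d in distances):
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 1.0 / (fuzziness - 1.0)
    inverse = [(1.0 / d) ** exponent for d in distances]
    total = sum(inverse)
    return [v / total for v in inverse]


# Hypothetical two-dimensional reference patterns and input vector.
reference_patterns = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
features = [0.2, 0.0]
best = hard_decision(features, reference_patterns)
weights = fuzzy_decision(features, reference_patterns)
```

The hard decision keeps only `best`, while the fuzzy decision retains a weight for every reference pattern, so information about the second and third closest patterns remains available to later stages.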
Matrix quantization has also been used to introduce temporal information about input signal 101 into decision rule operation 212. Fuzzy analysis methods have also been incorporated into matrix quantization processes, as described in Xydeas and Cong, “Robust Speech Recognition In a Car Environment”, Proceedings of the DSP95 International Conference on Digital Signal Processing, Jun. 26-28, 1995, Limassol, Cyprus. Fuzzy matrix quantization allows for “soft” decisions using interframe information related to the “evolution” of the short-term spectral envelopes of input signal 101.
Input signal corruption by acoustical noise has long been responsible for difficulties in input signal recognition accuracy. In many environments, such as in a traveling automobile, such noise is almost always present and impedes conventional speech recognition techniques. Despite conventional speech recognition achievements, research and development continues to focus on more efficient speech recognition systems with higher speech recognition accuracy.
In some acoustically noisy environments, the acoustical noise remains generally constant in the frequency domain especially over relatively short periods of time such as the time duration of a single spoken word. Such noise characterization is particularly true in, for example, an automobile environment.
Speech recognition processing characterizes a speech input signal with various frequency parameters such as LSPs. In an acoustically noisy environment, the LSPs will also generally be corrupted by noise. However, when the noise frequency spectrum is generally constant, noise corruption of the input signal frequency parameters may be minimized by comparing each ith frequency parameter, as derived from one of TO sampled frames of a speech input signal, to a corresponding ith mean frequency parameter derived from all the ith frequency parameters from all TO frames.
For example, in a vehicular environment, such as an automobile or airplane, noise is generally constant in the frequency domain during the duration of a spoken word input signal. Thus, when each sampled frame of the input signal is characterized by D order frequency parameters, the mean frequency parameter for the ith order over TO frames is $\bar{s}_i$, where:

$$\bar{s}_i = \frac{1}{\mathit{TO}} \sum_{j=1}^{\mathit{TO}} s_i(j)$$

where j=1, 2, . . . , TO and i=1, 2, . . . , D. A mean compensated ith frequency parameter from the jth frame, $s_i^{(m)}(j)$, equals $s_i(j)$ minus $\bar{s}_i$, where $s_i(j)$ is the ith frequency parameter for the jth frame. Thus, to the extent that the noise frequency contribution affecting the ith frequency parameter remains generally constant over TO frames, the mean compensated frequency parameters for each frame of the input signal are generally noise free.
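The mean compensation just described can be sketched in a few lines. The function name `mean_compensate` and the toy data are hypothetical, and the example idealizes "generally constant" noise as a fixed additive offset on each frequency parameter, which is an assumption made only for illustration.

```python
def mean_compensate(frames):
    """Subtract the per-order mean over all frames from every frame.

    frames: list of TO frames, each a list of D frequency parameters
    (for example LSP frequencies). Returns (means, compensated).
    """
    if not frames:
        raise ValueError("at least one frame is required")
    num_frames = len(frames)       # TO in the text
    order = len(frames[0])         # D in the text
    # Mean of the ith frequency parameter over all TO frames.
    means = [sum(frame[i] for frame in frames) / num_frames
             for i in range(order)]
    # Mean compensated parameter: ith parameter of frame j minus the ith mean.
    compensated = [[frame[i] - means[i] for i in range(order)]
                   for frame in frames]
    return means, compensated


# Toy data (assumed): three frames of D=2 parameters, and the same frames
# shifted by a constant offset standing in for stationary noise.
clean = [[0.10, 0.40], [0.20, 0.50], [0.30, 0.60]]
noisy = [[value + 0.05 for value in frame] for frame in clean]
clean_means, compensated_clean = mean_compensate(clean)
noisy_means, compensated_noisy = mean_compensate(noisy)
```

Under the constant-offset assumption, `compensated_clean` and `compensated_noisy` agree up to floating-point rounding, because the offset moves each mean by the same amount that it moves each frame.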
Mean compensated frequency parameters such as LSP coefficients may be used as input data to an input signal classifier. For example, in one embodiment, a matrix quantizer and a vector quantizer receive mean compensated LSP coefficients during training and recognition. Furthermore, in another embodiment multiple matrix and vector quantizers may be used with one matrix and vector quantizer pair receiving mean compensated LSP coefficients and a second pair receiving LSP coefficients to, for example, apply the same or dissimilar processing techniques to enhance recognition accuracy. Furthermore, back end classifiers such as hidden Markov models and/or neural networks may also be employed to enhance recognition accuracy.
In one embodiment, when employing multiple matrix and vector quantizer sets with respective hidden Markov model back end classifier groups and providing mean compensated LSP coefficients to one group of matrix and vector quantizers and non-mean compensated LSP coefficients to another such group, the close similarity between initial state probabilities and state transition probabilities for HMMs receiving data from a common quantizer type may be capitalized upon. For example, during training or estimation of hidden Markov models for the nth vocabulary word, for each quantizer, only the probability distribution need be determined and stored in a memory of a computer system independently for each type of quantizer, e.g., matrix quantizer types and vector quantizer types. Alternatively, separate hidden Markov models may be independently developed for each quantizer. During recognition, a maximum likelihood algorithm, such as the Viterbi algorithm, that is used to determine the respective probabilities that each hidden Markov model generated a particular quantizer-provided observation sequence may be modified to also capitalize on the close similarities of initial state probabilities and state transition probabilities for each type of quantizer.
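One reading of the parameter sharing described above is sketched below: two quantizers of the same type reuse a single set of initial state and state transition probabilities, and only the observation probability distribution is held separately for each. This reading, the function name `viterbi_log_likelihood`, and every numeric value are assumptions for illustration; the Viterbi recursion itself is the standard log-domain form for a discrete HMM.

```python
import math


def viterbi_log_likelihood(pi, A, B, observations):
    """Log probability of the single best state path of a discrete HMM.

    pi[s]   : initial probability of state s
    A[s][t] : probability of a transition from state s to state t
    B[s][o] : probability of observing codeword index o in state s
    """
    def log(p):
        return math.log(p) if p > 0.0 else float("-inf")

    n_states = len(pi)
    delta = [log(pi[s]) + log(B[s][observations[0]]) for s in range(n_states)]
    for obs in observations[1:]:
        delta = [max(delta[s] + log(A[s][t]) for s in range(n_states))
                 + log(B[t][obs])
                 for t in range(n_states)]
    return max(delta)


# Hypothetical two-state left-to-right word model. pi and A are stored once
# and shared; only the observation distributions differ per quantizer.
shared_pi = [1.0, 0.0]
shared_A = [[0.5, 0.5],
            [0.0, 1.0]]
B_plain = [[0.9, 0.1], [0.2, 0.8]]        # quantizer fed LSP coefficients
B_compensated = [[0.7, 0.3], [0.1, 0.9]]  # quantizer fed mean compensated LSPs

codeword_sequence = [0, 1]
score_plain = viterbi_log_likelihood(shared_pi, shared_A, B_plain, codeword_sequence)
score_compensated = viterbi_log_likelihood(shared_pi, shared_A, B_compensated, codeword_sequence)
```

Storing `shared_pi` and `shared_A` once per quantizer type reduces memory, and a modified recognizer could in principle reuse the transition terms across the models that share them; that optimization is not shown here.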
In one embodiment, matrix quantizers are used in conjunction with vector quantizers to improve recognition accuracy. Vector quantization operates on a single frame of input signal frequency parameters and, at least generally, does not incorporate temporal signal information into the vector quantization operation. However, vector quantization performs particularly well when temporal information is scarce or non-existent, such as with short input signal duration. Matrix quantization operates on multiple input signal frames and, thus, utilizes both temporal and frequency information about the input signal. However, errors may be introduced into matrix quantization operations when operating on a short duration input signal. Thus, although matrix quantization generally leads to a higher recognition accuracy than vector quantization, vector quantization can compensate for matrix quantization errors that may occur when operating on brief duration input signals.
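The contrast between the two quantizer types can be made concrete with a small sketch. The names, the squared-distance measure, the summed frame-wise matrix distance, and the choice to drop an incomplete trailing block are all assumptions made for the example, not details taken from this document.

```python
def frame_distance(x, y):
    """Squared Euclidean distance between two frames of parameters."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def matrix_distance(block, codeword):
    """Distance between two blocks of frames, summed frame by frame."""
    return sum(frame_distance(f, c) for f, c in zip(block, codeword))


def nearest(item, codebook, distance):
    """Index of the codebook entry closest to item."""
    return min(range(len(codebook)), key=lambda k: distance(item, codebook[k]))


def vector_quantize(frames, vector_codebook):
    """One codeword index per frame; frame order plays no role."""
    return [nearest(f, vector_codebook, frame_distance) for f in frames]


def matrix_quantize(frames, matrix_codebook, frames_per_matrix):
    """One codeword index per block of consecutive frames.

    Trailing frames that do not fill a block are dropped (an assumption).
    """
    usable = len(frames) - len(frames) % frames_per_matrix
    blocks = [frames[i:i + frames_per_matrix]
              for i in range(0, usable, frames_per_matrix)]
    return [nearest(b, matrix_codebook, matrix_distance) for b in blocks]


# Hypothetical one-parameter frames: a rise followed by a fall.
frames = [[0.0], [1.0], [1.0], [0.0]]
vector_codebook = [[0.0], [1.0]]
matrix_codebook = [[[0.0], [1.0]],   # "rising" pattern
                   [[1.0], [0.0]]]   # "falling" pattern
vq_indices = vector_quantize(frames, vector_codebook)
mq_indices = matrix_quantize(frames, matrix_codebook, 2)
```

The matrix codewords distinguish a rise from a fall, which the per-frame vector codewords cannot, while an input shorter than one block yields no matrix codeword at all and must rely on the vector quantizer.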
Additionally, signal features may be divided or split by, for example, frequency subbands to allow for differential processing to, for example, target enhanced processing on subbands more affected by noise. Such split matrix and split vector quantization techniques may be used to more efficiently and more accurately classify the input signal. Furthermore, additional speech classifiers such as hidden Markov models may be trained, and their stochastic output data may serve as input data to a further speech classifier such as a neural network. Respective hidden Markov models may be designed using quantization data as the observation sequences, and a probability algorithm such as the Viterbi algorithm may be used to determine likelihood probabilities.
In one embodiment, a new hybrid speech recognition system utilizes frequency parameters and mean compensated frequency parameters in combination with matrix quantization (MQ) and vector quantization (VQ) with Hidden Markov Models (HMMs) to efficiently utilize processing resources and improve speech recognition performance. In another embodiment, a neural network is provided with data generated from the hidden Markov models to further enhance recognition accuracy. This MQ/HMM_VQ/HMMx2_NN system exploits the “evolution” of speech short-term spectral envelopes with error compensation from VQ/HMM processes. Additionally, the neural network, which in one embodiment is a multi-layer perceptron type neural network, further enhances recognition accuracy. Acoustic noise may affect particular frequency domain subbands. In one embodiment, split matrix and split vector quantizers exploit localized noise by efficiently allocating enhanced processing technology to target noise-affected input signal parameters and minimize noise influence. The enhanced processing technology employs, for example, a weighted LSP and signal energy related distance measure in an LBG algorithm. In another embodiment, matrix and vector quantizers are utilized to process incoming speech data without splitting frequency subbands. In another embodiment, a variety of input data may be provided to the neural network to efficiently maximize recognition accuracy. In a further embodiment, ‘hard’ decisions, i.e., non-fuzzy decisions, are utilized by the respective quantizers to reduce processing resource demand while continuing to use other enhanced recognition resources to achieve high speech recognition accuracy.
In one embodiment, multiple speech processing subsystems receiving frequency parameters and mean compensated frequency parameters are employed to provide initial quantization data to respective speech classifiers. Output data from the speech classifiers may be combined in such a way as to compensate for quantization errors introduced by the speech processing subsystems. In another embodiment, one of the speech processing subsystems includes a vector quantizer which provides quantization information to a speech classifier having hidden Markov models. Another speech processing subsystem includes a matrix quantizer which provides quantization information to another speech classifier having hidden Markov models. Output data from the respective hidden Markov models respectively associated with the vector and matrix quantizers may be mixed using any of a variety of criteria and provided to, for example, a neural network for enhanced recognition accuracy.
In one embodiment of the present invention, a signal recognition system includes a frequency parameter mean compensation module to receive frequency parameters of an input signal and to generate mean compensated frequency parameters from the received input signal frequency parameters, a first quantizer to receive the input signal frequency parameters and to quantize the input signal frequency parameters, and a second quantizer to receive the input signal mean compensated frequency parameters and to quantize the input signal mean compensated frequency parameters. The signal recognition system further includes a backend processor to receive the quantized input signal frequency parameters and the quantized input signal mean compensated frequency parameters and to generate an input signal classification therefrom.
In another embodiment of the present invention, a method includes the steps of sampling an input signal, characterizing the sampled input signal with frequency parameters, generating mean compensated frequency parameters from the frequency parameters, and providing the frequency parameters to a first quantizer. The method further includes the steps of providing the mean compensated frequency parameters to a second quantizer, quantizing the frequency parameters with the first quantizer to generate first quantization data, quantizing the mean compensated frequency parameters with the second quantizer to generate second quantization data, and providing the first and second quantization data to a backend processor to classify the input signal.