1. Field of the Invention
This invention relates to speech recognition systems and more particularly to speech recognition systems having a speech input signal quantizer with a single, robust codebook front end to one or more speech classifiers.
2. Description of the Related Art
Speech is perhaps the most important communication method available to mankind. It is also a natural method for man-machine communication. Man-machine communication by voice offers a whole new range of information/communication services which can extend man's capabilities, serve his social needs, and increase his productivity. Speech recognition is a key element in establishing man-machine communication by voice, and, as such, speech recognition is an important technology with tremendous potential for widespread use in the future.
Voice communication between man and machine benefits from an efficient speech recognition interface. Speech recognition interfaces are commonly implemented as Speaker-Dependent (SD)/Speaker-Independent (SI) Isolated Word Speech Recognition (IWSR)/Continuous Speech Recognition (CSR) systems. The SD/SI IWSR/CSR system provides, for example, a beneficial voice command interface for hands-free telephone dialing and interaction with voice store-and-forward systems. Such technology is particularly useful in an automotive environment for safety purposes.
However, to be useful, a speech recognition system must generally recognize (classify) the speech input signal correctly with a satisfactorily high probability. Correct recognition is particularly difficult in an acoustically noisy environment. Recognition accuracy may be severely degraded under realistic environmental conditions where speech is corrupted by various levels of acoustic noise.
FIG. 1 generally characterizes a speech recognition process by the speech recognition system 100. A microphone transducer 102 picks up a speech input signal and provides to signal preprocessor 104 an electronic signal representation of the speech input signal 101. The speech input signal 101 is an acoustic waveform of a spoken input, typically a word or a connected string of words. The signal preprocessor 104 may, for example, filter the speech input signal 101, and a feature extractor 106 extracts selected information from the speech input signal 101 to characterize the signal with, for example, cepstral frequencies or line spectral pair frequencies (LSPs).
Referring to FIG. 2, more specifically, feature extraction in operation 106 is basically a data-reduction technique whereby a large number of data points (in this case samples of the speech input signal 101 recorded at an appropriate sampling rate) are transformed into a smaller set of features which are "equivalent", in the sense that they faithfully describe the salient properties of the speech input signal 101. Feature extraction is generally based on a speech production model which typically assumes that the vocal tract of a speaker can be represented as the concatenation of lossless acoustic tubes (not shown) which, when excited by excitation signals, produces a speech signal. Samples of the speech waveform are assumed to be the output of a time-varying filter that approximates the transmission properties of the vocal tract. It is reasonable to assume that the filter has fixed characteristics over a time interval of the order of 10 to 30 milliseconds (ms). Thus, a short-time speech input signal portion of speech input signal 101 may be represented by a linear, time-invariant all pole filter designed to model the spectral envelope of the signal in each time frame. The filter may be characterized within a given interval by an impulse response and a set of coefficients.
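The short-time assumption above can be illustrated with a minimal sketch that cuts a sampled signal into fixed-length frames. The function name `frame_signal`, the 20 ms default, and the use of non-overlapping frames are assumptions made for illustration only; practical systems often use overlapping, windowed frames.

```python
def frame_signal(samples, sample_rate_hz, frame_ms=20.0):
    """Split a list of samples into consecutive, non-overlapping frames.

    frame_ms is assumed to lie in the 10 to 30 ms range over which the
    vocal tract filter is treated as fixed. Trailing samples that do not
    fill a whole frame are dropped.
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    last_start = len(samples) - frame_len
    return [samples[start:start + frame_len]
            for start in range(0, last_start + 1, frame_len)]
```

For example, at a hypothetical sampling rate of 8000 Hz a 20 ms frame holds 160 samples, so each frame is later modeled by its own set of filter coefficients.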
Feature extraction in operation 106 using linear predictive (LP) speech production models has become the predominant technique for estimating basic speech parameters such as pitch, formants, spectra, and vocal tract area functions. The LP model allows for linear predictive analysis which basically approximates a speech input signal 101 as a linear combination of past speech samples. By minimizing the sum of the squared differences (over a finite interval) between actual speech samples and the linearly predicted ones, a unique set of prediction filter coefficients can be determined. The predictor coefficients are weighting coefficients used in the linear combination of past speech samples. The LP coefficients are generally updated very slowly with time, for example, every 10-30 ms, to represent the changing vocal tract. LP prediction coefficients are calculated using a variety of well-known procedures, such as autocorrelation and covariance procedures, to minimize the difference between the actual speech input signal 101 and a predicted speech input signal 101 often stored as a spectral envelope reference pattern. The LP prediction coefficients can be easily transformed into several different representations including cepstral coefficients and line spectrum pair (LSP) frequencies. Details of LSP theory can be found in N. Sugamura, "Speech Analysis and Synthesis Methods Developed at ECL in NTT-from LPC to LSP", Speech Communication 5, Elsevier Science Publishers, B. V., pp. 199-215 (1986).
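As an illustration of the autocorrelation procedure mentioned above, the following sketch solves for the predictor coefficients of one frame with the Levinson-Durbin recursion and then converts them to cepstral coefficients. It is one common way to carry out these steps, not necessarily the procedure used in any particular system. The function names are hypothetical, and the sign convention assumed is that a sample is predicted as the sum of a[k] times the sample k steps in the past.

```python
def autocorrelation(frame, max_lag):
    """Return autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for the predictor.

    Returns (a, error), where a[k-1] weights the sample k steps in the
    past and error is the remaining squared prediction error.
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:             # silent or degenerate frame
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error              # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= (1.0 - k * k)
    return a[1:], error


def lpc_to_cepstrum(a):
    """Convert predictor coefficients to the first len(a) cepstral
    coefficients of the all-pole model 1 / (1 - sum a[k] z^-k)."""
    c = []
    for n in range(1, len(a) + 1):
        acc = a[n - 1]
        for k in range(1, n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c


# A decaying exponential is predicted almost exactly from one past sample.
frame = [0.9 ** t for t in range(200)]
coeffs, residual = levinson_durbin(autocorrelation(frame, 2), 2)
cepstrum = lpc_to_cepstrum(coeffs)
```

In this toy case the first predictor coefficient comes out close to 0.9 and the second close to zero, since each sample is 0.9 times the previous one.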
Final decision-logic classifier 108 utilizes the extracted information to classify the represented speech input signal against a database of representative speech input signals. Speech recognition classification can be treated as a classical pattern recognition problem. Fundamental ideas from signal processing, information theory, and computer science can be utilized to facilitate isolated word recognition and the recognition of simple connected-word sequences.
FIG. 2 illustrates a more specific speech recognition system 200 based on pattern recognition as used in many IWSR type systems. The extracted features representing speech input signal 101 are segmented into short-term speech input signal frames of 10 to 30 ms duration and are considered to be stationary within each frame. The extracted features may be represented by a P-dimensional vector and compared with predetermined, stored reference patterns 208 by the pattern similarity operation 210. Similarity between the speech input signal 101 pattern and the stored reference patterns 208 is determined in pattern similarity operation 210 using well-known vector quantization processes. The vector quantization process yields spectral distortion or distance measures that quantify how closely the representation of speech input signal 101 fits each of the stored reference patterns 208.
Several types of spectral distance measures have been studied in conjunction with speech recognition, including LSP based distance measures such as the LSP Euclidean distance measure ($d_{LSP}$) and the weighted LSP Euclidean distance measure ($d_{WLSP}$). They are defined by

$$d_{LSP}(f_R, f_S) = \sqrt{\sum_{i=1}^{P} \left[f_R(i) - f_S(i)\right]^2}$$

and

$$d_{WLSP}(f_R, f_S) = \sqrt{\sum_{i=1}^{P} w(i)\left[f_R(i) - f_S(i)\right]^2}$$

where $f_R(i)$ and $f_S(i)$ are the ith LSPs of the reference and speech vectors, respectively. The factor $w(i)$ is the weight assigned to the ith LSP and P is the order of the LPC filter. The weight factor w(i) is defined as:

$$w(i) = \left[P(f_S(i))\right]^{r}$$
where P(f) is the LPC power spectrum associated with the speech vector as a function of frequency, f, and r is an empirical constant which controls the relative weights given to different LSPs. In the weighted Euclidean distance measure, the weight assigned to a given LSP is the value of the LPC power spectrum at that LSP frequency raised to the power r.
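The two distance measures can be sketched as follows, assuming the usual root-sum-of-squares form of a Euclidean distance. The function names are hypothetical, and the LPC power spectrum values at the speech LSP frequencies are taken as given inputs rather than computed here.

```python
import math


def lsp_distance(f_ref, f_speech):
    """Unweighted Euclidean distance between two LSP vectors."""
    return math.sqrt(sum((fr - fs) ** 2 for fr, fs in zip(f_ref, f_speech)))


def lsp_weights(power_at_speech_lsps, r):
    """Weights w(i) = [P(f_S(i))] ** r.

    power_at_speech_lsps holds the LPC power spectrum evaluated at each
    speech LSP frequency (assumed to be supplied by the caller);
    r is the empirical constant.
    """
    return [p ** r for p in power_at_speech_lsps]


def weighted_lsp_distance(f_ref, f_speech, weights):
    """Weighted Euclidean distance between two LSP vectors."""
    return math.sqrt(sum(w * (fr - fs) ** 2
                         for w, fr, fs in zip(weights, f_ref, f_speech)))
```

With r equal to zero every weight is 1 and the weighted measure reduces to the unweighted one; larger values of r give more influence to LSPs located where the power spectrum is high.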
The decision rule operation 212 receives the distance measures and determines which of the reference patterns 208 the speech input signal 101 most closely represents. In a "hard" decision making process, speech input signal 101 is matched to one of the reference patterns 208. This one-to-one "hard decision" ignores the relationship of the speech input signal 101 to all the other reference patterns 208. Fuzzy methods have been introduced to provide a better match between vector quantized frames of speech input signal 101 and reference patterns 208. In a "soft" or "fuzzy" decision making process, speech input signal 101 is related to one or more reference patterns 208 by weighting coefficients.
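The difference between the two decision styles can be sketched as follows. The hard rule simply picks the closest reference pattern. For the soft rule, the weighting formula below is the membership function used in fuzzy c-means clustering, chosen here as an assumption for illustration; the text above does not specify how the weighting coefficients are computed. The function names and the `fuzziness` parameter are hypothetical.

```python
def hard_decision(distances):
    """Index of the single closest reference pattern."""
    return min(range(len(distances)), key=lambda i: distances[i])


def fuzzy_memberships(distances, fuzziness=2.0):
    """Weighting coefficients relating the input to every pattern.

    Uses the fuzzy c-means membership formula (an assumption);
    fuzziness must be greater than 1. The coefficients sum to 1 and
    closer patterns receive larger coefficients.
    """
    exact = [i for i, d in enumerate(distances) if d == 0.0]
    if exact:
        # An exact match takes all of the weight.
        return [1.0 / len(exact) if i in exact else 0.0
                for i in range(len(distances))]
    exponent = 2.0 / (fuzziness - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
            for d_i in distances]
```

For distances of 1.0 and 2.0 with the default setting, the hard rule selects the first pattern outright, while the soft rule assigns coefficients of 0.8 and 0.2, retaining the information that the second pattern is also fairly close.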
Matrix quantization has also been used to introduce temporal information about speech input signal 101 into decision rule operation 212. Fuzzy analysis methods have also been incorporated into matrix quantization processes, as described in Xydeas and Cong, "Robust Speech Recognition In a Car Environment", Proceeding of the DSP95 International Conference on Digital Signal Processing, Jun. 26-28, 1995, Limassol, Cyprus. Fuzzy matrix quantization allows for "soft" decision using interframe information related to the "evolution" of the short-term spectral envelopes of speech input signal 101. Large reference codeword databases generally offer improved recognition accuracy over conventional speech recognition systems but also increase the amount of data processing time required by a speech recognition system. Thus, implementation of such a speech recognition system with satisfactory speech recognition accuracy can require high performance processing capabilities with a significant amount of database memory for storing reference codebooks. The cost of such speech recognition systems may limit availability to certain consumer markets such as the automotive voice communication market. Reducing the database size may increase processing speed while reducing recognition accuracy.
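How matrix quantization brings temporal information into the comparison can be sketched as follows. Each codeword is a small matrix whose rows are feature vectors of consecutive frames, so the order of the frames affects the match. The averaged squared Euclidean distance and the function names are assumptions for illustration, not the measure used in the cited work.

```python
def matrix_distance(speech_matrix, reference_matrix):
    """Mean squared Euclidean distance between corresponding frames.

    Both arguments are lists of equal-length feature vectors, one row
    per consecutive frame.
    """
    if len(speech_matrix) != len(reference_matrix):
        raise ValueError("matrices must span the same number of frames")
    total = 0.0
    for speech_row, ref_row in zip(speech_matrix, reference_matrix):
        total += sum((s - r) ** 2 for s, r in zip(speech_row, ref_row))
    return total / len(speech_matrix)


def nearest_matrix_codeword(speech_matrix, codebook):
    """Index of the codeword matrix closest to the input matrix."""
    return min(range(len(codebook)),
               key=lambda i: matrix_distance(speech_matrix, codebook[i]))


# Two codewords holding the same frames in opposite temporal order.
codebook = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
speech = [[0.1, 0.0], [0.9, 1.0]]
best = nearest_matrix_codeword(speech, codebook)
```

The two codewords contain identical frames, so a frame-by-frame vector quantizer could not tell them apart; the matrix comparison selects the first one because the input follows the same evolution over time. The search cost also grows with the number of codewords, which reflects the trade-off between codebook size and processing time noted above.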
Accordingly, a need exists to provide satisfactory robust speech recognition accuracy while using fewer processing and memory resources.