The present invention relates to speech signal processing. More particularly, the present invention relates to a novel method and apparatus for realizing a distributed implementation of a standard voice recognition system.
Voice recognition represents one of the most important techniques to endow a machine with simulated intelligence to recognize user or user-voiced commands and to facilitate human interface with the machine. It also represents a key technique for human speech understanding. Systems that employ techniques to recover a linguistic message from an acoustic speech signal are called voice recognizers (VR). A voice recognizer is composed of an acoustic processor, which extracts a sequence of information-bearing features (vectors) necessary for VR from the incoming raw speech, and a word decoder, which decodes this sequence of features (vectors) to yield a meaningful and desired output format, such as a sequence of linguistic words corresponding to the input utterance. To increase the performance of a given system, training is required to equip the system with valid parameters. In other words, the system needs to learn before it can function optimally.
The acoustic processor represents a front end speech analysis subsystem in a voice recognizer. In response to an input speech signal, it provides an appropriate representation to characterize the time-varying speech signal. It should discard irrelevant information such as background noise, channel distortion, speaker characteristics and manner of speaking. An efficient acoustic feature furnishes voice recognizers with higher acoustic discrimination power. The most useful characteristic is the short time spectral envelope. In characterizing the short time spectral envelope, the two most commonly used spectral analysis techniques are linear predictive coding (LPC) and filter-bank based spectral analysis models. However, it is readily shown (as discussed in Rabiner, L. R. and Schafer, R. W., Digital Processing of Speech Signals, Prentice Hall, 1978) that LPC not only provides a good approximation to the vocal tract spectral envelope, but is considerably less expensive in computation than the filter-bank model in all digital implementations. Experience has also demonstrated that the performance of LPC based voice recognizers is comparable to or better than that of filter-bank based recognizers (Rabiner, L. R. and Juang, B. H., Fundamentals of Speech Recognition, Prentice Hall, 1993).
Referring to FIG. 1, in an LPC based acoustic processor, the input speech is provided to a microphone (not shown) and converted to an analog electrical signal. This electrical signal is then digitized by an A/D converter (not shown). The digitized speech signals are passed through preemphasis filter 2 in order to spectrally flatten the signal and to make it less susceptible to finite precision effects in subsequent signal processing. The preemphasis filtered speech is then provided to segmentation element 4 where it is segmented or blocked into either temporally overlapped or nonoverlapped blocks. The frames of speech data are then provided to windowing element 6 where framed DC components are removed and a digital windowing operation is performed on each frame to lessen the blocking effects due to the discontinuity at frame boundaries. The most commonly used window function in LPC analysis is the Hamming window, w(n), defined as:

$$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$$
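The front-end steps described above (preemphasis, blocking into frames, DC removal and Hamming windowing) can be sketched as follows. This is only an illustrative sketch: the first-order preemphasis form, the coefficient `alpha=0.95`, and the function names are assumptions that do not come from the text; only the window formula is taken from it.

```python
import math

def preemphasize(samples, alpha=0.95):
    """First-order preemphasis y[n] = x[n] - alpha * x[n-1].
    The filter form and alpha are assumed, not specified in the text."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out

def hamming(N):
    """Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1."""
    if N == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def frame_and_window(samples, frame_len, hop):
    """Block the signal into frames (overlapping when hop < frame_len),
    remove each frame's DC component, then apply the Hamming window."""
    w = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        mean = sum(frame) / frame_len
        frames.append([(x - mean) * w[n] for n, x in enumerate(frame)])
    return frames
```

Calling `frame_and_window(preemphasize(x), frame_len, hop)` with `hop` smaller than `frame_len` gives temporally overlapped blocks; `hop == frame_len` gives nonoverlapped ones.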
The windowed speech is provided to LPC analysis element 8. In LPC analysis element 8, autocorrelation functions are calculated from the windowed samples, and the corresponding LPC parameters are obtained directly from the autocorrelation functions.
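One common way to obtain LPC parameters from autocorrelation values is the Levinson-Durbin recursion. The text does not name the procedure used in LPC analysis element 8, so the sketch below is an assumption for illustration; the function names and the predictor sign convention are likewise hypothetical.

```python
def autocorrelation(frame, order):
    """r[k] = sum_n frame[n] * frame[n + k] for k = 0..order."""
    N = len(frame)
    return [sum(frame[n] * frame[n + k] for n in range(N - k))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for predictor coefficients
    a[1..order], where x[n] is approximated by sum_k a[k] * x[n - k].
    Returns (a, prediction_error_energy); a[0] is unused and left at 0."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: stop early
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

For an autocorrelation sequence of the form 1, 0.5, 0.25 (a first-order process), the recursion returns a first coefficient of 0.5 and a second coefficient of 0, which is a quick sanity check on the implementation.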
Generally speaking, the word decoder translates the acoustic feature sequence produced by the acoustic processor into an estimate of the speaker's original word string. This is accomplished in two steps: acoustic pattern matching and language modeling. Language modeling can be avoided in applications of isolated word recognition. The LPC parameters from LPC analysis element 8 are provided to acoustic pattern matching element 10 to detect and classify possible acoustic patterns, such as phonemes, syllables, words, etc. The candidate patterns are provided to language modeling element 12, which models the rules of syntactic constraints that determine what sequences of words are grammatically well formed and meaningful. Syntactic information can be a valuable guide to voice recognition when acoustic information alone is ambiguous. Based on language modeling, the VR sequentially interprets the acoustic feature matching results and provides the estimated word string.
Both the acoustic pattern matching and language modeling in the word decoder require a mathematical model, either deterministic or stochastic, to describe the speaker's phonological and acoustic-phonetic variations. The performance of a speech recognition system is directly related to the quality of these two models. Among the various classes of models for acoustic pattern matching, template-based dynamic time warping (DTW) and stochastic hidden Markov modeling (HMM) are the two most commonly used. However, it has been shown that the DTW based approach can be viewed as a special case of the HMM based one, which is a parametric, doubly stochastic model. HMM systems are currently the most successful speech recognition algorithms. The doubly stochastic property in HMM provides better flexibility in absorbing acoustic as well as temporal variations associated with speech signals. This usually results in improved recognition accuracy. Concerning the language model, a stochastic model called the k-gram language model, which is detailed in F. Jelinek, "The Development of an Experimental Discrete Dictation Recognizer", Proc. IEEE, vol. 73, pp. 1616-1624, 1985, has been successfully applied in practical large vocabulary voice recognition systems. In the small vocabulary case, a deterministic grammar has been formulated as a finite state network (FSN) in the application of an airline reservation and information system (see Rabiner, L. R. and Levinson, S. E., A Speaker-Independent, Syntax-Directed, Connected Word Recognition System Based on Hidden Markov Model and Level Building, IEEE Trans. on ASSP, Vol. 33, No. 3, June 1985).
Statistically, in order to minimize the probability of recognition error, the voice recognition problem can be formalized as follows: with acoustic evidence observation O, the operations of voice recognition are to find the most likely word string W* such that
$$W^{*} = \arg\max_{W} P(W \mid O) \tag{1}$$
where the maximization is over all possible word strings W. In accordance with Bayes rule, the posteriori probability P(W|O) in the above equation can be rewritten as:

$$P(W \mid O) = \frac{P(W)\,P(O \mid W)}{P(O)} \tag{2}$$
Since P(O) is irrelevant to recognition, the word string estimate can be obtained alternatively as:
$$W^{*} = \arg\max_{W} P(W)\,P(O \mid W) \tag{3}$$
Here P(W) represents the a priori probability that the word string W will be uttered, and P(O|W) is the probability that the acoustic evidence O will be observed given that the speaker uttered the word sequence W. P(O|W) is determined by acoustic pattern matching, while the a priori probability P(W) is defined by the language model used.
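The decision rule above can be illustrated with a toy example in which the language model prior overrides a slightly better acoustic score. The candidate word strings and all probability values below are hypothetical, chosen only to show the mechanics; the function name `map_decode` is likewise an assumption.

```python
import math

def map_decode(log_prior, log_likelihood):
    """Return the word string W maximizing log P(W) + log P(O|W).

    Both arguments map each candidate word string to a log probability.
    P(O) is omitted because it is the same for every candidate."""
    return max(log_prior, key=lambda W: log_prior[W] + log_likelihood[W])

# Hypothetical scores for two competing hypotheses (illustrative values only).
log_prior = {
    ("call", "home"): math.log(0.02),    # P(W) from a language model
    ("call", "hone"): math.log(0.0001),
}
log_likelihood = {
    ("call", "home"): math.log(0.30),    # P(O|W) from acoustic matching
    ("call", "hone"): math.log(0.35),
}

best = map_decode(log_prior, log_likelihood)
acoustic_only = max(log_likelihood, key=log_likelihood.get)
```

Working in the log domain turns the product P(W)P(O|W) into a sum, which avoids numerical underflow when many small probabilities are multiplied over a long utterance.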
In connected word recognition, if the vocabulary is small (less than 100), a deterministic grammar can be used to rigidly govern which words can logically follow other words to form legal sentences in the language. The deterministic grammar can be incorporated in the acoustic matching algorithm implicitly to constrain the search space of potential words and to reduce the computation dramatically. However, when the vocabulary size is either medium (greater than 100 but less than 1000) or large (greater than 1000), the probability of the word sequence, $W = (w_1, w_2, \ldots, w_n)$, can be obtained by stochastic language modeling. From simple probability theory, the prior probability, P(W), can be decomposed as

$$P(W) = P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \tag{4}$$
where $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ is the probability that $w_i$ will be spoken given that the word sequence $(w_1, w_2, \ldots, w_{i-1})$ precedes it. The choice of $w_i$ depends on the entire past history of input words. For a vocabulary of size V, it requires $V^i$ values to specify $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ completely. Even for a medium vocabulary size, this requires a formidable number of samples to train the language model. An inaccurate estimate of $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ due to insufficient training data will degrade the results of the original acoustic matching.
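To give a sense of scale, the count of values can be worked out for an assumed vocabulary size (the figure V = 1000 is chosen here only for illustration, as the boundary between the medium and large cases mentioned earlier):

$$V = 1000,\ i = 3:\quad V^{i} = 1000^{3} = 10^{9}, \qquad V = 1000,\ i = 5:\quad V^{i} = 1000^{5} = 10^{15}$$

The number of values grows by a factor of V with every additional word of history, which is why full-history estimates quickly become untrainable.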
A practical solution to the above problems is to assume that $w_i$ depends only on the $(k-1)$ preceding words, $w_{i-1}, w_{i-2}, \ldots, w_{i-k+1}$. A stochastic language model can then be completely described in terms of $P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-k+1})$, from which the k-gram language model is derived. Since most word strings will never occur in the language if k > 3, the unigram (k=1), bigram (k=2) and trigram (k=3) are the most powerful stochastic language models that take grammar into consideration statistically. Language modeling contains both syntactic and semantic information that is valuable in recognition, but these probabilities must be trained from a large collection of speech data. When the available training data are relatively limited, such that some k-grams may never occur in the data, $P(w_i \mid w_{i-2}, w_{i-1})$ can be estimated directly from the bigram probability $P(w_i \mid w_{i-1})$. Details of this process can be found in F. Jelinek, "The Development of an Experimental Discrete Dictation Recognizer", Proc. IEEE, vol. 73, pp. 1616-1624, 1985. In connected word recognition, the whole word model is used as the basic speech unit, while in continuous voice recognition, subword units, such as phonemes, syllables or demisyllables, may be used as the basic speech unit. The word decoder is modified accordingly.
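The idea of falling back from a trigram to a bigram when data are sparse can be sketched with simple relative-frequency counts. This is a deliberately crude illustration, not the procedure of the cited reference: no discounting or smoothing is applied, and the function names and the tiny corpus in the usage note are hypothetical.

```python
from collections import Counter

def train_counts(sentences):
    """Count unigrams, bigrams and trigrams in a list of tokenized sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for words in sentences:
        for i, w in enumerate(words):
            uni[w] += 1
            if i >= 1:
                bi[(words[i - 1], w)] += 1
            if i >= 2:
                tri[(words[i - 2], words[i - 1], w)] += 1
    return uni, bi, tri

def trigram_prob(w2, w1, w, uni, bi, tri):
    """Estimate P(w | w2, w1) by relative frequency. If the trigram was never
    observed, fall back to the bigram estimate P(w | w1). Denominators are raw
    context counts and no discounting is applied, so the values are not
    guaranteed to sum to one over w."""
    if tri[(w2, w1, w)] > 0:
        return tri[(w2, w1, w)] / bi[(w2, w1)]
    if uni[w1] > 0:
        return bi[(w1, w)] / uni[w1]
    return 0.0
```

With a toy corpus such as `[["call", "home", "now"], ["dial", "home"]]`, the trigram ("call", "home", "now") is scored from trigram counts, while ("dial", "home", "now"), which never occurs, is scored from the bigram ("home", "now") instead of being assigned zero.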
Conventional voice recognition systems integrate the acoustic processor and the word decoder without taking into account their separability, the limitations of application systems (such as power consumption, memory availability, etc.) and communication channel characteristics. This motivates the interest in devising a distributed voice recognition system with these two components appropriately separated.
The present invention is a novel and improved distributed voice recognition system, in which (i) the front end acoustic processor can be LPC based or filter bank based; (ii) the acoustic pattern matching in the word decoder can be based on hidden Markov model (HMM), dynamic time warping (DTW) or even neural networks (NN); and (iii) for the connected or continuous word recognition purpose, the language model can be based on deterministic or stochastic grammars. The present invention differs from the usual voice recognizer in improving system performance by appropriately separating the components: feature extraction and word decoding. As demonstrated in the following examples, if LPC based features, such as cepstrum coefficients, are to be sent over a communication channel, a transformation between LPC and LSP can be used to alleviate the noise effects on the feature sequence.