Speech recognition is a process by which an unknown speech utterance (usually in the form of a digital PCM signal) is identified. Generally, speech recognition is performed by comparing the features of an unknown utterance to the features of known words or word strings.
Features of known words or word strings are determined with a process known as training. Through training, one or more samples of known words or strings (training speech) are examined and their features (or characteristics) recorded as reference patterns (or recognition unit models) in a database of a speech recognizer. Typically, each recognition unit model represents a single known word. However, recognition unit models may represent speech of other lengths such as subwords (e.g., phones, which are the acoustic manifestation of linguistically-based phonemes). Recognition unit models may be thought of as building blocks for words and strings of words, such as phrases or sentences.
To recognize an unknown utterance, a speech recognizer extracts features from the utterance to characterize it. The features of the unknown utterance are referred to as a test pattern. The recognizer then compares combinations of one or more recognition unit models in the database to the test pattern of the unknown utterance. A scoring technique is used to provide a relative measure of how well each combination of recognition unit models matches the test pattern. The unknown utterance is recognized as the words associated with the combination of one or more recognition unit models which most closely matches the unknown utterance.
There are many types of speech recognizers, e.g., template-based recognizers and hidden Markov model (HMM) recognizers. Recognizers trained using first-order statistics based on known word samples (e.g., spectral means of such samples) to build recognition unit models are known as template-based recognizers. Typically, scoring is accomplished with a time registration technique, such as dynamic time warping (DTW). DTW provides an optimal time alignment between recognition unit models (templates) and test patterns by locally shrinking or expanding the time axes of the templates and the pattern until one optimally matches the other. DTW scoring reflects an overall distance between optimally aligned templates and the test pattern. The template or sequence thereof having the lowest score among all such templates or sequences (i.e., the shortest distance between itself and the test pattern) identifies the test pattern.
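By way of illustration only (this sketch is not part of the text above), the DTW comparison between a template and a test pattern can be written as a small dynamic program; the function name, the Euclidean local distance, and the simple three-move step pattern are assumptions made for the example:

```python
import math

# Illustrative sketch of dynamic time warping (DTW): accumulate local
# frame-to-frame distances along the optimal time alignment between a
# template and a test pattern, each a list of feature vectors.

def dtw_distance(template, pattern):
    """Accumulated local distance along the optimal time alignment."""
    T, P = len(template), len(pattern)
    INF = float("inf")

    def dist(a, b):
        # Local (frame-level) distance between two feature vectors.
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    # D[i][j]: best accumulated distance aligning template[:i+1] to pattern[:j+1].
    D = [[INF] * P for _ in range(T)]
    D[0][0] = dist(template[0], pattern[0])
    for i in range(T):
        for j in range(P):
            if i == 0 and j == 0:
                continue
            prev = min(
                D[i - 1][j] if i > 0 else INF,                # shrink pattern time
                D[i][j - 1] if j > 0 else INF,                # shrink template time
                D[i - 1][j - 1] if i > 0 and j > 0 else INF,  # advance both axes
            )
            D[i][j] = prev + dist(template[i], pattern[j])
    return D[T - 1][P - 1]
```

In this sketch, the template (or sequence of templates) yielding the lowest accumulated distance would identify the test pattern.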
Recognizers trained using both first and second order statistics (i.e., spectral means and variances) of known speech samples are known as HMM recognizers. Each recognition unit model in this type of recognizer is an N-state statistical model (an HMM) which reflects these statistics. Each state of an HMM corresponds in some sense to the statistics associated with the temporal events of samples of a known word or subword. An HMM is characterized by a state transition matrix, A (which provides a statistical description of how new states may be reached from old states), and an observation probability matrix, B (which provides a description of which spectral features are likely to be observed in a given state). Scoring of a test pattern reflects the probability of the occurrence of the sequence of features of the test pattern given a particular model. Scoring across all models may be provided by efficient dynamic programming techniques, such as Viterbi scoring. The HMM or sequence thereof which indicates the highest probability of the sequence of features in the test pattern occurring identifies the test pattern.
In addition to template- and HMM-based recognizers, other recognizers include those which use neural networks as recognition unit models.
Generally, the performance of speech recognizers is closely associated with the effectiveness of the techniques used to train their recognition unit models. Conventional training of, for example, HMM speech recognizers is based on the principle of statistical data fitting, which concerns increasing the likelihood that a particular HMM will match the statistics of known recognition unit samples. The success of conventional HMM training is conditioned on the availability of large amounts of training speech and on a proper choice of HMMs. Often, the amount of available training speech is limited, and the assumptions the chosen HMMs make about the speech production process are inaccurate. As a consequence, likelihood-based training of HMMs may not be very effective. This deficiency of conventional training methods is due to the lack of a direct relation between training and recognition error rate. To illustrate the deficiency, a conventional HMM-based speech recognizer will now be considered in greater detail.
In conventional HMM-based speech recognizers, a continuous speech utterance waveform is blocked into frames, and a discrete sequence of feature vectors, $X = \{x_0, x_1, \ldots, x_{T(x)}\}$, is extracted, where $T(x)$ corresponds to the total number of frames in the speech signal (the input speech utterance may be identified with its feature vector sequence $X$ without confusion).
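The blocking step may be sketched as follows; the particular frame length and hop size are assumptions for illustration (e.g., 25 ms windows advanced by 10 ms at a 16 kHz sampling rate), not values given in the text:

```python
# Illustrative sketch: block a 1-D PCM sample stream into overlapping
# frames; a feature vector would then be extracted from each frame to
# form the feature vector sequence X. Frame sizes are assumptions.

def block_into_frames(samples, frame_len=400, hop=160):
    """Split a PCM sample sequence into overlapping fixed-length frames."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames
```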
In the framework of the HMM, the input speech feature vector sequence $X$ is modeled as a noisy observation of the outcomes from a certain discrete-time Markov chain from time $t = 1, \ldots, T(x)$. Every possible state transition sequence during the time $t = 1, \ldots, T(x)$ constitutes a path through the trellis determined by this Markov chain. The probability density function of observing vector $x$ in the $j$-th state of the $i$-th word HMM (the observation probability density) is
$$b_j^i(x) = \sum_{k=1}^{K} c_{j,k}^i \, \mathcal{N}\!\left(x;\, \mu_{j,k}^i,\, \Sigma_{j,k}^i\right),$$
which is a mixture of Gaussian distributions, where the $c_{j,k}^i$ are mixture weights which satisfy the criterion
$$\sum_{k=1}^{K} c_{j,k}^i = 1, \qquad c_{j,k}^i \geq 0.$$
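A minimal sketch of such an observation probability density follows, assuming diagonal covariances for simplicity; the function name and the flat list-based model representation are illustrative choices, not part of the text:

```python
import math

# Illustrative sketch: the observation probability density of a feature
# vector x in one HMM state, modeled as a mixture of diagonal-covariance
# Gaussians whose weights sum to one.

def gaussian_mixture_density(x, weights, means, variances):
    """b(x) = sum_k c_k * N(x; mu_k, sigma_k^2), diagonal covariances."""
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture weights must sum to 1"
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        # Log-determinant and quadratic form of one diagonal Gaussian.
        log_det = sum(math.log(2.0 * math.pi * v) for v in var)
        quad = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
        total += c * math.exp(-0.5 * (log_det + quad))
    return total
```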
The optimal path under Viterbi scoring is the one that attains the highest log-likelihood score. If $\Theta^i$ denotes the optimal path of the input utterance $X$ in the $i$-th word HMM $\lambda_i$, then the log-likelihood score of the input utterance $X$ along its optimal path in the $i$-th model $\lambda_i$, $g_i(X, \lambda_i)$, can be written as
$$g_i(X, \lambda_i) = \sum_{t=1}^{T(x)} \left[ \log a_{\theta_{t-1}^i \theta_t^i} + \log b_{\theta_t^i}^i(x_t) \right],$$
where $\theta_t^i$ is the corresponding state at time $t$ along the optimal path $\Theta^i$, $x_t$ is the corresponding observation vector at time $t$, $T(x)$ is the number of frames in the input utterance $X$, and $a_{\theta_{t-1}^i \theta_t^i}$ is the state transition probability from state $\theta_{t-1}^i$ to state $\theta_t^i$.
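The maximization over paths can be carried out by the usual Viterbi recursion in the log domain; the following is an illustrative sketch only, and the model representation (initial probabilities, a transition matrix, and an observation log-density callback) is an assumption made for the example:

```python
import math

# Illustrative sketch of Viterbi scoring in the log domain: compute the
# log-likelihood of a feature sequence along its optimal state path, i.e.
# the best sum of log transition and log observation terms over all paths.

def viterbi_log_score(X, pi, A, log_b):
    """Best log-likelihood over all state paths for observation sequence X.

    pi[j]       : initial probability of state j
    A[i][j]     : transition probability from state i to state j
    log_b(j, x) : log observation probability of vector x in state j
    """
    n_states = len(pi)
    # delta[j]: best log score of any path ending in state j at the current time.
    delta = [math.log(pi[j]) + log_b(j, X[0]) if pi[j] > 0 else float("-inf")
             for j in range(n_states)]
    for x in X[1:]:
        delta = [
            max(delta[i] + math.log(A[i][j])
                for i in range(n_states) if A[i][j] > 0)
            + log_b(j, x)
            for j in range(n_states)
        ]
    return max(delta)
```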
In the recognition part of an HMM-based isolated word recognizer using Viterbi scoring, the input utterance is first processed, and the log-likelihood score of the input utterance $X$ at each word model along its optimal path is evaluated. The recognizer classifies the input utterance as the $i$-th word $W_i$ if and only if $i = \arg\max_j g_j(X, \lambda_j)$. If the recognition error count function for the $i$-th word is defined as
$$e_i(X) = \begin{cases} 0 & \text{if } i = \arg\max_j g_j(X, \lambda_j) \\ 1 & \text{otherwise,} \end{cases}$$
then the goal of training an HMM is to reduce the expected error rate
$$L = E_X\!\left[ e_i(X) \right],$$
where the expectation is with respect to $X$. In practice, the training result is often measured by an empirical error rate for a given set of training speech samples $\{X_n,\ n = 1, 2, \ldots, N\}$:
$$L_N = \frac{1}{N} \sum_{n=1}^{N} e_i(X_n). \tag{7}$$
However, direct minimization of the empirical error rate function (7) has several serious deficiencies. It is numerically difficult to optimize, because the classification error count function is not a continuous function. The empirical error rate function does not distinguish near-miss and barely-correct cases; this may impair recognizer performance on an independent test data set. Viterbi scoring also adds a complexity here, since the form and the value of the empirical error rate function vary with the segmentation determined by the HMM parameters. A set of numerically optimal HMM parameters based on the current segmentation does not maintain its optimality under a different segmentation, unless a good convergence result can be proved. It has been a problem of standing interest to find a training method that directly minimizes the recognition error rate and is consistent with the HMM framework using Viterbi scoring.
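The argmax decision rule and the empirical error rate it induces can be sketched as follows; the function names and the generic `score` callback (standing in for the per-model log-likelihood $g_j$) are assumptions for illustration:

```python
# Illustrative sketch: a recognizer classifies X as word i if and only if
# i = argmax_j g_j(X, lambda_j); the empirical error rate is the fraction
# of labeled training samples this rule misclassifies. Note the 0/1 error
# count is not a continuous function of the model parameters.

def classify(X, models, score):
    """Index of the model with the highest score for utterance X."""
    return max(range(len(models)), key=lambda j: score(X, models[j]))

def empirical_error_rate(samples, labels, models, score):
    """Fraction of labeled samples the argmax rule misclassifies."""
    errors = sum(1 for X, i in zip(samples, labels)
                 if classify(X, models, score) != i)
    return errors / len(samples)
```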
The objective of continuous speech recognition is to identify (i.e., recognize) an underlying word sequence (i.e., string) from an input speech utterance. As discussed above, recognition is performed with use of a set of recognition unit models. Recently, significant research effort has been concentrated on the problem of how to select and represent these speech recognition unit models for use in continuous speech recognition.
One assumption generally made in continuous speech recognition is that a fluently spoken word string may be adequately represented by a linear concatenation of speech recognition unit models (of, e.g., words or subwords) according to a lexical transcription of the words in the string. Conventionally, this has meant a concatenation of recognition unit models trained directly from segmented training tokens (such as words). Segmentation of training speech into tokens corresponding to recognition unit models is generally unreliable, calling into question the validity of the recognition unit models themselves. However, building a model for each possible string, as distinct from models describing components of strings, has been avoided since the number of possible word strings (as distinct from words or subwords) is large.
Training of recognition unit models in isolation is not consistent with the way in which continuous speech recognition is done. In continuous speech recognition, scoring is performed on a string level. That is, recognition is predicated on how well a concatenation of recognition unit models compares with an entire unknown word string. One concatenation of models will be selected over another based on how well each of the concatenations compares in the aggregate to the unknown string. This aggregate comparison may be referred to as a global score of the concatenation. Thus, should a continuous speech recognizer make an error, it does so based on comparisons made at a global or "string" level, not at a "word" or "subword" (i.e., "substring") level. Yet, it is the substring level at which these recognition unit models have been trained. Recognition unit models which have been trained at a substring level may not be well suited for use at a string level.
Because of this discrepancy in training and recognition philosophy, as well as the difficulty in accurately locating and segmenting word or subword boundaries, continuous speech recognizer performance may be less than desirable.