The objective of continuous speech recognition is to identify (i.e., recognize) an underlying word sequence from an input speech utterance. Recognition is performed with the use of a set of speech recognition patterns or models (hereinafter, models). These basic speech recognition models are the building blocks for words and strings of words, such as phrases or sentences. Recently, significant research effort has been concentrated on the problem of how to select and represent these basic speech recognition units for use in continuous speech recognition.
One conventional approach to the continuous speech recognition problem is that of statistical pattern recognition using acoustic recognition models, such as templates or hidden Markov models (HMMs). Based on a lexical description of a vocabulary, acoustic speech recognition models are prescribed and their parameters are then statistically determined through a process known as training. The basic models may reflect vocabulary words or subwords (such as phones that are the acoustic manifestation of linguistically-based phonemes). One assumption generally made in this approach to continuous speech recognition is that a fluently spoken sequence of words, i.e., a word string, may be adequately represented by a linear concatenation of the basic speech recognition models (of words or subwords) according to the lexical transcription of the words in the string. Conventionally, this has meant a concatenation of speech recognition models estimated directly from training tokens (such as words). A concatenation of acoustic recognition models forms a model of the word string and is a type of word string model. In continuous speech recognition, multiple string models are hypothesized for a given recognition task. Each such string model is compared with a continuous utterance to be recognized. The closeness of each comparison is indicated by a recognition score. The string model which most closely compares with the continuous utterance "recognizes" the utterance.
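The concatenate-and-compare procedure described above may be illustrated by the following minimal sketch. It assumes template-type acoustic models and uses dynamic time warping as the comparison; the one-dimensional feature values, the `TEMPLATES` table, and the function names are hypothetical and chosen only for illustration. An HMM-based recognizer would replace the distance computation with a likelihood computation.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical word templates: one short sequence of 1-D feature values per word.
TEMPLATES: Dict[str, List[float]] = {
    "one": [1.0, 1.0, 2.0],
    "two": [5.0, 6.0],
    "three": [9.0, 8.0, 9.0],
}


def build_string_model(words: Sequence[str]) -> List[float]:
    """Form a word string model by linearly concatenating the word templates."""
    model: List[float] = []
    for word in words:
        model.extend(TEMPLATES[word])
    return model


def dtw_distance(utterance: Sequence[float], model: Sequence[float]) -> float:
    """Dynamic time warping distance between an utterance and a string model."""
    inf = float("inf")
    prev = [inf] * (len(model) + 1)
    prev[0] = 0.0  # empty utterance aligned with empty model
    for x in utterance:
        curr = [inf] * (len(model) + 1)
        for j, y in enumerate(model, start=1):
            cost = abs(x - y)
            # Extend the cheapest of the three allowed alignment moves.
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[len(model)]


def recognize(
    utterance: Sequence[float], hypotheses: Sequence[Tuple[str, ...]]
) -> Tuple[Tuple[str, ...], float]:
    """Score every hypothesized string model and return the best one.

    The recognition score is the negated distance, so a higher score
    indicates a closer comparison.
    """
    best_words: Tuple[str, ...] = ()
    best_score = float("-inf")
    for words in hypotheses:
        score = -dtw_distance(utterance, build_string_model(words))
        if score > best_score:
            best_words, best_score = words, score
    return best_words, best_score
```

For example, `recognize([1.0, 1.1, 2.0, 5.2, 6.0], [("one", "two"), ("two", "one")])` selects `("one", "two")`, because that concatenation aligns most closely with the utterance as a whole.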
Another conventional approach to continuous speech recognition is to augment word string models with non-acoustic recognition models. These non-acoustic recognition models include, inter alia, language models, phonetic-based models, semantic models, syntactic models, and other knowledge sources (e.g., pitch, energy, speaking rate, duration, etc.). In such an approach, a word string may be modeled as a combination of acoustic models, language models, etc. Recognition scores from individual models are incorporated into an overall string model recognition score. The incorporation of scores into a string model recognition score is accomplished by, for example, a weighted sum of individual recognition scores from individual string models.
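The weighted-sum incorporation of scores mentioned above may be sketched as follows. The score values, the combining weights, and the hypothesis strings are hypothetical and serve only to show how the choice of weights can change which string model receives the highest overall recognition score.

```python
from typing import Dict, Tuple


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the individual recognition scores for one string model."""
    return sum(weights[name] * value for name, value in scores.items())


def best_hypothesis(
    hypotheses: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> Tuple[str, float]:
    """Return the hypothesis with the highest overall string model score."""
    best = max(hypotheses, key=lambda h: combined_score(hypotheses[h], weights))
    return best, combined_score(hypotheses[best], weights)


# Hypothetical log-domain scores from an acoustic model and a language model.
HYPOTHESES: Dict[str, Dict[str, float]] = {
    "recognize speech": {"acoustic": -12.0, "language": -3.0},
    "wreck a nice beach": {"acoustic": -11.5, "language": -9.0},
}
```

With weights `{"acoustic": 1.0, "language": 0.5}`, the overall scores are -13.5 and -16.0, so "recognize speech" is selected. With the language weight set to 0.0, the acoustic score alone decides and "wreck a nice beach" is selected instead.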
Conventionally, the training of individual recognition models is performed on an individualized basis. In acoustic model training, for example, training speech is segmented into individual word or subword training tokens. Individual acoustic models are therefore trained with training tokens which have been isolated from a longer training utterance. Moreover, acoustic and other models are trained individually, while the parameters used to combine such models for purposes of recognition may be selected heuristically, separate and apart from the training of other models.
All of this individualized training belies the fact that such models will be used together for purposes of continuous speech recognition. That is, continuous speech recognition is predicated on how well a combination of models (i.e., a string model) compares with an entire unknown string. One combination of models will be selected over another based on how well each string model compares to the unknown string in the aggregate. This aggregate comparison may be referred to as a global score of the combination. Thus, should a continuous speech recognizer make an error, it does so based on comparisons made at a global or string level, not at the individualized levels at which models or other information sources have been trained. Because of the level "discrepancy" between training and recognition philosophy, continuous speech recognition performance may be less than desired.