Speech recognition is a process by which an unknown speech utterance (usually in the form of a digital PCM signal) is identified. Generally, speech recognition is performed by comparing the features of an unknown utterance to the features of known words or word strings.
The features of known words or word strings are determined with a process known as "training". Through training, one or more samples of known words or strings (training speech) are examined and their features (or characteristics) recorded as reference patterns (or recognition models) in a database of a speech recognizer. Typically, each recognition model represents a single known word. However, recognition models may represent speech of other lengths such as subwords (e.g., phones, which are the acoustic manifestation of linguistically-based phonemes). Recognition models may be thought of as building blocks for words and strings of words, such as phrases or sentences.
To recognize an utterance in a process known as "testing", a speech recognizer extracts features from the utterance to characterize it. The features of the unknown utterance are referred to as a test pattern. The recognizer then compares combinations of one or more recognition models in the database to the test pattern of the unknown utterance. A scoring technique is used to provide a relative measure of how well each combination of recognition models matches the test pattern. The unknown utterance is recognized as the words associated with the combination of one or more recognition models which most closely matches the unknown utterance.
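The testing procedure described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each recognition model is a single fixed-length feature vector and that the score is the negative Euclidean distance; the names `distance`, `recognize`, and `reference_patterns`, and the numeric values, are hypothetical and do not come from any particular recognizer.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(test_pattern, reference_patterns):
    """Score every reference pattern against the test pattern.

    The score is the negative distance, so a higher score means a closer match.
    Returns the best-matching label together with all scores.
    """
    scores = {
        label: -distance(test_pattern, ref)
        for label, ref in reference_patterns.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores

# Hypothetical database built during training: one pattern per known word.
reference_patterns = {
    "yes": [1.0, 0.2, 0.1],
    "no": [0.1, 0.9, 0.4],
}

# Hypothetical test pattern extracted from an unknown utterance.
best, scores = recognize([0.9, 0.3, 0.1], reference_patterns)
```

A practical recognizer would compare sequences of feature vectors against combinations of models rather than single vectors, but the structure is the same: score every candidate, then select the best.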
Recognizers trained using both first- and second-order statistics (i.e., spectral means and variances) of known speech samples are known as hidden Markov model (HMM) recognizers. Each recognition model in this type of recognizer is an N-state statistical model (an HMM) that reflects these statistics. Each state of an HMM corresponds in some sense to the statistics associated with the temporal events of samples of a known word or subword. An HMM is characterized by a state transition matrix, A (which provides a statistical description of how new states may be reached from old states), and an observation probability matrix, B (which provides a description of which spectral features are likely to be observed in a given state). The score of a test pattern reflects the probability that the sequence of features in the test pattern occurs given a particular model. Scoring across all models may be performed by efficient dynamic programming techniques, such as Viterbi scoring. The HMM, or sequence of HMMs, that yields the highest probability for the sequence of features in the test pattern identifies the test pattern.
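Viterbi scoring can be sketched as follows, under simplifying assumptions that are not part of the description above: observations are discrete symbols (so that B is a table of per-state symbol probabilities), each model also has an initial state distribution `pi`, and all probabilities are handled in the log domain. The two toy models and their numbers are hypothetical. Note that the Viterbi score is the probability of the single best state path, which is commonly used as an approximation to the total probability of the observations.

```python
import math

def logs(rows):
    """Element-wise log of a table of probabilities; log(0) is taken as -inf."""
    return [[math.log(p) if p > 0 else -math.inf for p in row] for row in rows]

def viterbi_log_score(obs, log_pi, log_A, log_B):
    """Log-probability of the best state path for a discrete observation sequence."""
    n_states = len(log_pi)
    # delta[s]: best log-score of any path that ends in state s at the current frame.
    delta = [log_pi[s] + log_B[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        delta = [
            max(delta[r] + log_A[r][s] for r in range(n_states)) + log_B[s][symbol]
            for s in range(n_states)
        ]
    return max(delta)

# Hypothetical 2-state left-to-right models over the symbols {0, 1}.
log_pi = logs([[1.0, 0.0]])[0]
log_A = logs([[0.6, 0.4], [0.0, 1.0]])
models = {
    "a": logs([[0.9, 0.1], [0.2, 0.8]]),  # tends to emit 0 first, then 1
    "b": logs([[0.1, 0.9], [0.8, 0.2]]),  # tends to emit 1 first, then 0
}

test_pattern = [0, 0, 1, 1]
scores = {
    name: viterbi_log_score(test_pattern, log_pi, log_A, log_B)
    for name, log_B in models.items()
}
best = max(scores, key=scores.get)  # the model with the highest score identifies the pattern
```

Working with log-probabilities avoids numerical underflow on long observation sequences, since products of many small probabilities become sums.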
HMMs for automatic speech recognition (ASR) rely on high-dimensional feature vectors to summarize the short-time acoustic properties of speech. Though front ends vary from one speech recognizer to another, the spectral information in each frame of speech is typically codified in a feature vector with thirty or more dimensions. In most systems, these vectors are modeled, conditioned on the hidden state, by mixtures of Gaussian probability density functions (PDFs). In that case, the correlations between different features are represented in two ways: implicitly, by the use of two or more mixture components, and explicitly, by the non-diagonal elements in each covariance matrix. Naturally, these strategies for modeling correlations, implicit versus explicit, involve tradeoffs in accuracy, speed and memory.
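The two ways of representing correlations can be illustrated with a small sketch, restricted to two features for readability. The function names and all numeric values are hypothetical. A single diagonal-covariance Gaussian cannot distinguish a point where both features move together from one where they move in opposite directions, whereas a full covariance matrix (explicit modeling) or a two-component diagonal mixture (implicit modeling) can.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance (one variance per feature)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gauss_full_2d(x, mean, cov):
    """Log-density of a 2-D Gaussian with full covariance [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form d^T inv(cov) d, using the closed-form inverse of a 2x2 matrix.
    quad = (c * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)

def log_mixture_diag(x, weights, means, variances):
    """Log-density of a mixture of diagonal Gaussians (log-sum-exp over components)."""
    terms = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

together, opposite = [1.0, 1.0], [1.0, -1.0]
zero = [0.0, 0.0]

# Single diagonal Gaussian: both points receive the same density.
diag_gap = log_gauss_diag(together, zero, [1.0, 1.0]) - log_gauss_diag(opposite, zero, [1.0, 1.0])

# Explicit: off-diagonal covariance of 0.8 favors features that move together.
cov = [[1.0, 0.8], [0.8, 1.0]]
full_gap = log_gauss_full_2d(together, zero, cov) - log_gauss_full_2d(opposite, zero, cov)

# Implicit: two diagonal components placed along the direction of correlation.
mix = ([0.5, 0.5], [[1.0, 1.0], [-1.0, -1.0]], [[0.5, 0.5], [0.5, 0.5]])
mix_gap = log_mixture_diag(together, *mix) - log_mixture_diag(opposite, *mix)
```

Here `diag_gap` is zero while `full_gap` and `mix_gap` are positive, which is the sense in which both strategies capture the correlation that a single diagonal Gaussian misses.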
Currently, most HMM-based recognizers do not include any explicit modeling of correlations; that is to say, conditioned on the hidden states, acoustic features are modeled by mixtures of Gaussian PDFs with diagonal covariance matrices. One reason for this practice is that the use of full covariance matrices imposes a heavy computational burden, making it difficult to achieve real-time speech recognition. Also, one rarely has enough training data to reliably estimate full covariance matrices. Some of these disadvantages can be overcome by parameter-tying (e.g., sharing the covariance matrices across different states or models), as described in the article by Bellegarda, J., and Nahamoo, D., entitled "Tied mixture continuous parameter modeling for speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing 38:2033-2045 (1990). But parameter-tying has its own drawbacks: it considerably complicates the training procedure, and it requires some artistry to know which states should and should not be tied.
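The estimation burden mentioned above can be made concrete by counting free parameters. The sketch below assumes a feature dimension of 30 (the lower end of the range given earlier) and a hypothetical mixture of 8 components; mixture weights are left out of the count, and the function names are illustrative only.

```python
def params_diag(dim):
    """Free parameters of one diagonal Gaussian: dim means plus dim variances."""
    return 2 * dim

def params_full(dim):
    """Free parameters of one full-covariance Gaussian.

    dim means plus the dim * (dim + 1) / 2 distinct entries of a symmetric matrix.
    """
    return dim + dim * (dim + 1) // 2

def params_tied_full(dim, n_components):
    """Total when n_components means share a single full covariance matrix."""
    return n_components * dim + dim * (dim + 1) // 2

dim, n_components = 30, 8  # assumed sizes, for illustration only

total_diag = n_components * params_diag(dim)      # 480
total_full = n_components * params_full(dim)      # 3960
total_tied = params_tied_full(dim, n_components)  # 705
```

Under these assumptions, full covariance matrices require roughly eight times as many parameters as diagonal ones, while tying one full covariance across all components brings the total much closer to the diagonal case. Evaluation cost follows a similar pattern: the quadratic form for a full covariance grows with the square of the dimension, whereas the diagonal case grows linearly.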