The task of automatic speech recognition (ASR) essentially consists of decoding a word sequence from a continuous speech signal. To achieve reasonable levels of performance, past ASR systems have constrained the permissible speech input in order to simplify the decoding task. Typical constraints are (i) speaker dependency, i.e., training the system for each individual speaker, (ii) word quantity, i.e., limiting the system vocabulary to a small number of words or requiring input to be isolated words only, and (iii) read speech (as opposed to also permitting spontaneous speech), or some combination of (i) through (iii). Recently, however, state-of-the-art systems have been able to achieve reasonable performance levels for speaker-independent, continuous/spontaneous speech systems, operating with vocabularies of greater than 5,000 words.
A block diagram of the major components of a typical ASR system 10 is shown in FIG. 1. Typically, the samples of the continuous speech signal 12 are first processed by a signal processor 14 to form a discrete sequence of observation vectors 18. The components of the observation vectors are the acoustic attributes that have been chosen to represent the signal 12. Examples of commonly chosen attributes are Discrete Fourier Transform based spectral coefficients or auditory model parameters. Each observation vector 18 is called a frame of speech, and the sequence of T frames forms the signal representation, O={o.sub.1, o.sub.2, . . . , o.sub.T }. Acoustic and language models 20, 22 are then used to score the frame sequence O, search a lexicon and hypothesize word sequences. The models 20, 22 and the search and scoring procedure 24 are highly implementation dependent.
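By way of illustration only, the conversion of a sampled signal into a sequence of observation vectors may be sketched as follows. The frame length, hop size, number of coefficients and all function names are assumptions chosen for the example and are not taken from FIG. 1; a practical signal processor would typically also apply windowing and use a fast transform.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames of frame_len samples."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


def dft_magnitudes(frame, n_coeffs):
    """Return the magnitudes of the first n_coeffs DFT bins of one frame."""
    n = len(frame)
    mags = []
    for k in range(n_coeffs):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def observation_sequence(samples, frame_len=64, hop=32, n_coeffs=8):
    """Build O = {o_1, ..., o_T}: one spectral vector per frame (assumed sizes)."""
    return [dft_magnitudes(f, n_coeffs)
            for f in frame_signal(samples, frame_len, hop)]


# Toy input: a sinusoid completing 4 cycles per 64-sample frame.
signal = [math.sin(2 * math.pi * 4 * t / 64) for t in range(256)]
O = observation_sequence(signal)
print(len(O), len(O[0]))  # number of frames T, vector dimension
```

In this toy case the energy of every frame concentrates in the fifth coefficient (bin index 4), which is the kind of acoustic attribute an observation vector is intended to expose.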
As the number of words in the lexicon 26 becomes large, the task of training individual word models becomes prohibitive. Consequently an intermediate level of representation is generally used. A common representation involves describing the pronunciation of a word in terms of phonemes. A phoneme is an abstract linguistic unit. Changing a phoneme changes the meaning of a word. For example, if the phoneme /p/ in the word "pit" is changed to a /b/, the word becomes "bit". A small number of phonemes can be used to describe all the words in a given language (English consists of roughly 40 phonemes). By representing word pronunciations as a sequence of phonemes, the number of acoustic models and the required training data can be drastically reduced.
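A pronunciation lexicon of this kind may be sketched, purely as an illustration, as a mapping from words to phoneme sequences. The five-word vocabulary and the phoneme symbols ("p", "ih", "t", "b") are hypothetical choices for the example.

```python
# Hypothetical toy lexicon: each word is described by a phoneme sequence.
LEXICON = {
    "pit": ["p", "ih", "t"],
    "bit": ["b", "ih", "t"],
    "tip": ["t", "ih", "p"],
    "pip": ["p", "ih", "p"],
    "bib": ["b", "ih", "b"],
}


def phoneme_inventory(lexicon):
    """Return the sorted set of phonemes needed to describe every word."""
    return sorted({p for pron in lexicon.values() for p in pron})


def differing_positions(pron_a, pron_b):
    """Return the positions at which two equal-length pronunciations differ."""
    return [i for i, (a, b) in enumerate(zip(pron_a, pron_b)) if a != b]


print(len(LEXICON), "words share", len(phoneme_inventory(LEXICON)), "phoneme models")
print(differing_positions(LEXICON["pit"], LEXICON["bit"]))
```

Five words are covered by four phoneme models, and "pit" and "bit" differ only in their first phoneme. With a realistic vocabulary the ratio of words to models becomes far larger.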
Phonemes can be realized in a variety of acoustically distinct manners depending on the phonetic context (e.g., syllable position, neighboring phones), the stress, the speaker, and other factors. The actual acoustic realization of a phoneme is known as a phone. This distinction between a phoneme and a phone is an important one. The different acoustic realizations of the same phoneme do not affect the meaning of a word. An example of this often occurs in the word "butter" where the phoneme /t/ is frequently realized in American English as a "flap" (a particular phone). The acoustic variability that can occur when realizing the same phoneme is part of what makes the task of identifying a phoneme so challenging. The standard convention is to use / / to indicate a phoneme and [ ] to indicate a phone.
The acoustic models are generally trained to recognize some set of phones (the exact set being a design decision). The task of decoding a phone sequence is known as "phonetic recognition," and the resulting output is known as a phonetic transcription. The phonetic transcription may or may not be mapped to a string of phonemes, but regardless, it is of fundamental importance to the ASR task since it is the foundation upon which the word string search is based. Virtually all modern, state-of-the-art speech systems utilize phonetic models as a basis for recognition.
Phonetic recognition methods tend to fall into two categories. The first, and most widely used, is "frame" based. Each observation frame in the sequence O={o.sub.1, . . . , o.sub.T } receives a score for each phonetic model in the system. There is no presegmentation of the signal into larger units. An example of a frame-based phonetic recognition method is the Hidden Markov Model (HMM). An HMM consists of a set of states connected to each other via transition probabilities. While occupying a state, observations are generated randomly from a probability density function. The transition probabilities and output distributions together constitute an HMM. The key assumption inherent in an HMM is that the observations are independent, given the state sequence up to the current time.
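As a minimal sketch of how such a model scores a frame sequence, the following computes the likelihood of a sequence of scalar observations with the standard forward recursion. One-dimensional Gaussian output densities, the two-state left-to-right topology and all numerical values are assumptions made for the example. The recursion relies directly on the independence assumption stated above: each frame contributes one output-density factor that depends only on the state occupied.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian output distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def forward_likelihood(obs, initial, trans, means, variances):
    """Return P(O | model), summed over all state sequences (forward recursion)."""
    n = len(initial)
    # alpha[s]: joint probability of the frames seen so far and being in state s.
    alpha = [initial[s] * gaussian_pdf(obs[0], means[s], variances[s])
             for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n))
            * gaussian_pdf(o, means[s], variances[s])
            for s in range(n)
        ]
    return sum(alpha)


# Hypothetical two-state left-to-right model.
initial = [1.0, 0.0]
trans = [[0.6, 0.4],
         [0.0, 1.0]]
means, variances = [0.0, 3.0], [1.0, 1.0]

matching = [0.1, -0.2, 2.9, 3.1]     # low values followed by high values
mismatched = [3.0, 3.1, 0.0, 0.1]    # order reversed relative to the model
print(forward_likelihood(matching, initial, trans, means, variances) >
      forward_likelihood(mismatched, initial, trans, means, variances))
```

For long utterances a practical implementation would work in the log domain or rescale alpha at each frame to avoid numerical underflow.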
Thus HMM's handle certain temporal aspects of the speech problem in an elegant manner. The variability of durations over a phone training set is handled automatically by the fact that the state sequence can be as long or short as necessary. Another advantage of the HMM approach is that it does not require an explicit temporal alignment, or segmentation, of the speech signal. Since each frame in an utterance receives its own score, the likelihood scores for alternative segmentations can be directly compared to each other. The alignment which results in the best score for the entire utterance is then chosen. Finally, an efficient technique, the Baum-Welch reestimation algorithm, exists for training HMM's.
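The selection of the best-scoring alignment may be illustrated with the Viterbi algorithm, sketched below in the log domain. This is a hedged example only: the two-state model, the one-dimensional Gaussian output densities and every function name are assumptions, and the state index is used as a stand-in for a phone label.

```python
import math

NEG_INF = float("-inf")


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def safe_log(p):
    """Logarithm that maps a zero probability to negative infinity."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi(obs, initial, trans, means, variances):
    """Return the best state sequence (alignment) and its log score."""
    n = len(initial)
    score = [safe_log(initial[s]) + log_gauss(obs[0], means[s], variances[s])
             for s in range(n)]
    backptrs = []
    for o in obs[1:]:
        new_score, ptr = [], []
        for s in range(n):
            # Best predecessor of state s for the current frame.
            best_r = max(range(n), key=lambda r: score[r] + safe_log(trans[r][s]))
            new_score.append(score[best_r] + safe_log(trans[best_r][s])
                             + log_gauss(o, means[s], variances[s]))
            ptr.append(best_r)
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    best = score[state]
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best


initial = [1.0, 0.0]
trans = [[0.7, 0.3],
         [0.0, 1.0]]
means, variances = [0.0, 3.0], [1.0, 1.0]

long_first, _ = viterbi([0.1, -0.2, 0.3, 2.9, 3.1], initial, trans, means, variances)
short_first, _ = viterbi([0.0, 3.0, 3.1, 2.9, 3.0, 3.2], initial, trans, means, variances)
print(long_first)   # [0, 0, 0, 1, 1]
print(short_first)  # [0, 1, 1, 1, 1, 1]
```

The same model accommodates a first unit lasting three frames in one utterance and one frame in the other, and the segment boundary falls out of the best path rather than being supplied in advance.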
In HMM's, temporal correlations are represented implicitly through the statistics of the state sequence and are not modelled explicitly. However, it has been demonstrated that significant temporal correlations do exist. See V. Digalakis, "Segment-Based Stochastic Models of Spectral Dynamics for Continuous Speech Recognition", Ph. D. Thesis, Boston University, 1992. Also see W. Goldenthal and J. Glass, "Modelling Spectral Dynamics for Vowel Classification," Proc. Eurospeech 93, pp. 289-292, Berlin, Germany, (September 1993), incorporated herein by reference.
There have also been attempts to explicitly model the dynamics of the acoustic attributes within an HMM framework. Generally this has been done, with some success, by incorporating first (and possibly second) order differences of the acoustic parameters in the observation vector. Other approaches are segmental HMM's proposed by Russell and Marcus and state-conditioned trend functions used by Deng. See "A Segmental HMM for Speech Pattern Modelling", by M. Russell in Proceedings of the ICASSP 93, pages 499-502, Minneapolis, Minn. April 1993; "Phonetic Recognition in a Segment-Based HMM" by J. Marcus in Proceedings of the ICASSP 93, pages 479-482 Minneapolis, Minn. April 1993; and "A Generalized Hidden Markov Model With State-Conditioned Trend Functions of Time for the Speech Signal" by L. Deng, Signal Processing 27, Vol. 1, pages 65-78 April 1992. None of these approaches has gained general acceptance within the community or been shown to generate results superior to more traditional HMM's.
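The incorporation of first order differences into the observation vector may be sketched as follows. The simple difference between adjacent frames and the zero difference assigned to the first frame are assumptions of this example; practical systems often estimate the difference over a wider window of frames.

```python
def append_deltas(frames):
    """Append first-order differences (o_t - o_{t-1}) to each observation vector."""
    augmented = []
    for t, frame in enumerate(frames):
        # Assumption: the first frame has no predecessor, so its difference is zero.
        prev = frames[t - 1] if t > 0 else frame
        delta = [c - p for c, p in zip(frame, prev)]
        augmented.append(list(frame) + delta)
    return augmented


frames = [[1.0, 2.0], [1.5, 1.0], [3.0, 1.0]]
print(append_deltas(frames))
# [[1.0, 2.0, 0.0, 0.0], [1.5, 1.0, 0.5, -1.0], [3.0, 1.0, 1.5, 0.0]]
```

Each augmented vector has twice the original dimension. The model itself is unchanged, so the dynamics are captured only locally through the added components.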
A second type of phonetic recognition method involves a "segment" based approach. These methods hypothesize start and end times of larger units within the speech signal which generally represent individual phonetic units of speech. An example of a segment-based method is the Stochastic Segment Model (SSM). SSM's are a segment-based approach that attempts both to model the spectral dynamics of a phonetic unit and to capture the temporal correlation within a phonetic segment. However, SSM's impose a very high dimensionality on the Gaussian probability density functions used to estimate the correlations (on the order of 112 to 140). As a consequence, no implementation of this method has yet successfully incorporated the temporal correlation information. In fact, an implementation utilizing only the temporal correlations performed slightly worse than an implementation which assumed complete statistical independence. See S. Roucos, M. Ostendorf, H. Gish, A. Derr, "Stochastic Segment Modelling Using the Estimate-Maximize Algorithm", in Proceedings ICASSP 88, pages 127-130, April 1988.
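One way to appreciate the burden of such dimensionality is to count the free parameters of a single Gaussian density. The short calculation below is offered as an illustration and is not drawn from the cited reference; it compares a full covariance matrix, which is needed to represent correlations, with a diagonal covariance matrix, which assumes independence among components.

```python
def gaussian_param_count(dim, full_covariance=True):
    """Free parameters of one Gaussian: mean plus (symmetric) covariance."""
    mean_params = dim
    cov_params = dim * (dim + 1) // 2 if full_covariance else dim
    return mean_params + cov_params


for dim in (112, 140):
    print(dim,
          gaussian_param_count(dim),
          gaussian_param_count(dim, full_covariance=False))
# 112 6440 224
# 140 10010 280
```

Each phonetic model would therefore require thousands of parameters to be estimated from the training segments of that unit alone, compared with a few hundred under an independence assumption.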
As between segment-based and frame-based methods, segment-based systems offer the potential advantage of being able to accurately capture segment level dynamics as well as directly modelling temporal correlations within the segment. Also, segment level features, such as segment duration, are easily incorporated. The advantage of a frame-based system is that each frame receives its own score and the scores for different transcription candidates are directly comparable. In a segment-based framework, it can be difficult to compare utterance likelihoods which propose different numbers of segments. Also, a frame-based system tends to have a computational advantage since the segmentation step does not have to be explicitly performed.
Further, other methods for phonetic recognition include template-based approaches, statistical approaches and, more recently, approaches based on dynamic modeling and neural networks. A recurrent error propagation neural network approach has been used with the TIMIT speech corpus. See T. Robinson, "Several Improvements to a Recurrent Error Propagation Phone Recognition System", Technical Report CUED/F-INFENG/TR. 82, 1991. An inherent drawback of neural networks is the large amount of time needed to train the models.