1. Field of the Invention
The invention relates to a method for recognizing an input pattern which is derived from a continual physical quantity. The invention also relates to a system for recognizing a time-sequential input pattern, which is derived from a continual physical quantity.
2. Description of the Related Art
Recognition of a time-sequential input pattern which is derived from a continual physical quantity, such as speech or images, is becoming increasingly important. In particular, speech recognition has recently been widely applied to areas such as telephone and telecommunications (various automated services), office and business systems (data entry), manufacturing (hands-free monitoring of manufacturing processes), medicine (annotation of reports), games (voice input), voice control of car functions, and voice control used by disabled people. For continuous speech recognition, the following signal processing steps are commonly used, as illustrated in FIG. 1 [refer to L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, February 1989]:
Feature analysis: the speech input signal is spectrally and/or temporally analyzed to calculate a representative vector of features (observation vector o). Typically, the speech signal is digitized (e.g., sampled at a rate of 6.67 kHz) and pre-processed, for instance, by applying pre-emphasis. Consecutive samples are grouped (blocked) into frames corresponding to, for instance, 32 msec. of speech signal. Successive frames partially overlap by, for instance, 16 msec. Often, the Linear Predictive Coding (LPC) spectral analysis method is used to calculate, for each frame, a representative vector of features (observation vector o). The feature vector may, for instance, have 24, 32 or 63 components (the feature space dimension).
Unit matching system: the observation vectors are matched against an inventory of speech recognition units. Various forms of speech recognition units may be used. Some systems use linguistically based sub-word units, such as phones, diphones or syllables, as well as derivative units, such as fenenes and fenones. Other systems use a whole word or a group of words as a unit. The so-called hidden Markov model (HMM) is widely used to stochastically model speech signals. Using this model, each unit is typically characterized by an HMM whose parameters are estimated from a training set of speech data. For large vocabulary speech recognition systems involving, for instance, 10,000 to 60,000 words, a limited set of, for instance, 40 sub-word units is usually used, since it would require a lot of training data to adequately train an HMM for larger units. The unit matching system matches the observation vectors against all sequences of speech recognition units and provides the likelihoods of a match between the vectors and a sequence. Constraints can be placed on the matching, for instance by:
Lexical decoding: if sub-word units are used, a pronunciation lexicon describes how words are constructed of sub-word units. The possible sequence of sub-word units, investigated by the unit matching system, is then constrained to sequences in the lexicon.
Syntactical analysis: further constraints are placed on the unit matching system so that the paths investigated are those corresponding to speech units which comprise words (lexical decoding) and for which the words are in a proper sequence as specified by a word grammar.
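By way of illustration only, the pre-emphasis and frame-blocking steps of the feature analysis can be sketched as follows. The function names, the pre-emphasis coefficient of 0.95 and the 8 kHz sampling rate are assumptions chosen for the example (8 kHz gives whole sample counts for 32 msec. frames with a 16 msec. shift); they are not taken from the cited reference.

```python
from typing import List, Sequence


def pre_emphasize(samples: Sequence[float], coeff: float = 0.95) -> List[float]:
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1] (coeff is an assumed value)."""
    if not samples:
        return []
    out = [float(samples[0])]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - coeff * prev)
    return out


def block_into_frames(samples: Sequence[float], frame_len: int, frame_shift: int) -> List[List[float]]:
    """Group consecutive samples into frames; successive frames overlap by frame_len - frame_shift."""
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += frame_shift
    return frames


# Assumed example values: 8 kHz sampling, 32 msec. frames, 16 msec. shift.
SAMPLE_RATE_HZ = 8000
FRAME_LEN = SAMPLE_RATE_HZ * 32 // 1000    # 256 samples per frame
FRAME_SHIFT = SAMPLE_RATE_HZ * 16 // 1000  # 128 samples between frame starts

if __name__ == "__main__":
    signal = pre_emphasize([float(n % 50) for n in range(2048)])
    frames = block_into_frames(signal, FRAME_LEN, FRAME_SHIFT)
    print(len(frames), "frames of", FRAME_LEN, "samples")
```

Each resulting frame would then be passed to a spectral analysis such as LPC to obtain one observation vector per frame; that step is omitted here.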
A discrete Markov process describes a system which at any time is in one of a set of N distinct states. At regular times, the system changes state according to a set of probabilities associated with the state. A special form of a discrete Markov process is shown in FIG. 2. In this so-called left-right model, the states proceed from left to right (or stay the same). This model is widely used for modeling speech, where the properties of the signal change over time. The model states can be seen as representing sounds. The number of states in a model for a sub-word unit could, for instance, be five or six, in which case a state corresponds, on average, to an observation interval. The model of FIG. 2 allows a state to stay the same, which can be associated with slow speaking. Alternatively, a state can be skipped, which can be associated with fast speaking (in FIG. 2, up to twice the average rate). The output of the discrete Markov process is the set of states at each instant of time, where each state corresponds to an observable event. For speech recognition systems, the concept of discrete Markov processes is extended to the case where an observation is a probabilistic function of the state. This results in a doubly stochastic process. The underlying stochastic process of state changes is hidden (the hidden Markov model, HMM) and can only be observed through a stochastic process that produces the sequence of observations.
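As a minimal sketch of such a left-right model, the following builds a transition matrix in which each state may stay the same, advance to the next state, or skip one state, and then samples a state sequence from it. The probability values and function names are assumptions made for the example; FIG. 2 does not specify numerical values.

```python
import random
from typing import List


def left_right_transitions(n_states: int, p_stay: float, p_next: float, p_skip: float) -> List[List[float]]:
    """Transition matrix of a left-right model: stay, move one state right, or skip one state."""
    a = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        weights = {i: p_stay}
        if i + 1 < n_states:
            weights[i + 1] = p_next
        if i + 2 < n_states:
            weights[i + 2] = p_skip
        total = sum(weights.values())  # renormalize near the right edge of the model
        for j, w in weights.items():
            a[i][j] = w / total
    return a


def sample_state_sequence(a: List[List[float]], length: int, rng: random.Random) -> List[int]:
    """Draw a state sequence starting in the leftmost state."""
    state = 0
    seq = [state]
    for _ in range(length - 1):
        state = rng.choices(range(len(a)), weights=a[state])[0]
        seq.append(state)
    return seq


if __name__ == "__main__":
    a = left_right_transitions(5, p_stay=0.5, p_next=0.3, p_skip=0.2)
    print(sample_state_sequence(a, 12, random.Random(0)))
```

Because all entries to the left of the diagonal are zero, every sampled sequence is non-decreasing, which mirrors the left-to-right progression described above.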
For speech, the observations represent continuous signals. The observations can be quantized to discrete symbols chosen from a finite alphabet of, for instance, 32 to 256 vectors. In such a case, a discrete probability density can be used for each state of the model. In order to avoid the degradation associated with quantizing, many speech recognition systems use continuous mixture densities. Generally, the densities are derived from log-concave or elliptically symmetric densities, such as Gaussian (normal distribution) or Laplacian densities. During training, the training data (training observation sequences) is segmented into states using an initial model. This gives, for each state, a set of observations, referred to as training observation vectors or reference vectors. Next, the reference vectors for each state are clustered. Depending on the complexity of the system and the amount of training data, there may, for instance, be between 32 and 120 elementary clusters for each state. Each elementary cluster has its own probability density, referred to as a reference probability density. The resulting mixture density for the state is then a weighted sum of the reference probability densities for that state.
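The weighted sum of reference probability densities can be illustrated with a short sketch. It assumes Gaussian reference densities with diagonal covariance, which is only one of the density families mentioned above; the function names and the example values are hypothetical.

```python
import math
from typing import Sequence


def gaussian_density(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Gaussian density with diagonal covariance (an assumed form of a reference density)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def mixture_density(x, weights, means, variances) -> float:
    """Mixture density of a state: weighted sum of its reference probability densities."""
    return sum(
        w * gaussian_density(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


if __name__ == "__main__":
    # Two elementary clusters in a 2-dimensional feature space (illustrative values).
    weights = [0.7, 0.3]
    means = [[0.0, 0.0], [2.0, 1.0]]
    variances = [[1.0, 1.0], [0.5, 2.0]]
    print(mixture_density([0.5, 0.2], weights, means, variances))
```

In practice such computations are usually carried out in the log domain to avoid underflow for high-dimensional vectors; the direct form is kept here for readability.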
To recognize a single speech recognition unit (e.g., word or sub-word unit) from a speech signal (observation sequence), for each speech recognition unit the likelihood is calculated that it produced the observation sequence. The speech recognition unit with maximum likelihood is selected. To recognize larger sequences of observations, a leveled approach is used. Starting at the first level, likelihoods are calculated as before. Whenever the last state of a model is reached, a switch is made to a higher level, repeating the same process for the remaining observations. When the last observation has been processed, the path with the maximum likelihood is selected and the path is backtracked to determine the sequence of involved speech recognition units.
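The selection of a single unit by maximum likelihood, including the backtracking of the best path, can be sketched as follows. This is a simplified illustration under stated assumptions: it uses the Viterbi (best-path) likelihood, discrete observation symbols, a path that starts in the first state and ends in the last state, and hypothetical function names. It does not implement the leveled approach for longer observation sequences.

```python
import math

NEG_INF = float("-inf")


def _log(p: float) -> float:
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi(obs, trans, emit):
    """Best-path log-likelihood and state path for one unit model.

    trans[i][j] is the transition probability, emit[j][o] the probability of
    symbol o in state j. The path starts in state 0 and ends in the last state.
    """
    n = len(trans)
    score = [NEG_INF] * n
    score[0] = _log(emit[0][obs[0]])
    back = []
    for o in obs[1:]:
        new_score = [NEG_INF] * n
        ptr = [0] * n
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + _log(trans[i][j]))
            new_score[j] = score[best_i] + _log(trans[best_i][j]) + _log(emit[j][o])
            ptr[j] = best_i
        score = new_score
        back.append(ptr)
    # Backtrack from the last state to recover the state sequence.
    state = n - 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return score[n - 1], path


def recognize(obs, models):
    """Select the unit whose model gives the maximum likelihood for the observations."""
    return max(models, key=lambda name: viterbi(obs, *models[name])[0])


if __name__ == "__main__":
    trans = [[0.5, 0.5], [0.0, 1.0]]
    models = {
        "a": (trans, [[0.9, 0.1], [0.1, 0.9]]),
        "b": (trans, [[0.1, 0.9], [0.9, 0.1]]),
    }
    print(recognize([0, 0, 1, 1], models))
```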
The likelihood calculation involves calculating, in each state, the likelihood of the observation (feature vector) for each reference probability density for that state. Particularly in large vocabulary speech recognition systems using continuous observation density HMMs with, for instance, 40 sub-word units, 5 states per sub-word unit and 64 clusters per state, this implies 12,800 likelihood calculations (40 x 5 x 64) for, for instance, 32-dimensional vectors. These calculations are repeated for each observation. Consequently, the likelihood calculation may consume 50%-75% of the computing resources.
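The arithmetic behind this figure can be checked with a few lines. The per-second estimate assumes one observation every 16 msec., i.e. the frame shift implied by 32 msec. frames overlapping by 16 msec.; that rate, like the function name, is an assumption made for the example.

```python
def likelihood_workload(units: int, states_per_unit: int, clusters_per_state: int,
                        dims: int, frame_shift_msec: float) -> dict:
    """Count the reference-density evaluations implied by the example figures."""
    densities = units * states_per_unit * clusters_per_state
    observations_per_second = 1000.0 / frame_shift_msec
    return {
        "densities_per_observation": densities,
        # One per-component term for each dimension of each density.
        "component_terms_per_observation": densities * dims,
        "densities_per_second": densities * observations_per_second,
    }


if __name__ == "__main__":
    for key, value in likelihood_workload(40, 5, 64, 32, 16).items():
        print(key, value)
```

With the example values, each observation requires 12,800 density evaluations, each over 32 vector components.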
It is known from EP-A-627-726 to reduce the percentage of time required for the likelihood calculation by organizing the reference probability densities in a tree structure and performing a tree search. At the lowest level of the tree (level one), each of the leaf nodes corresponds to an actual reference probability density. As described earlier, a reference probability density represents an elementary cluster of reference vectors. At level two of the tree, each non-leaf node corresponds to a cluster probability density, which is derived from all reference probability densities corresponding to leaf nodes in branches below the non-leaf node. As such, a level two non-leaf node represents a cluster of clusters of reference vectors. This hierarchical clustering is repeated for successively higher levels until, at the highest level of the tree, one non-leaf node (the root node) represents all reference vectors. During pattern recognition, for each input observation vector, a tree search is performed starting at one level below the root. For each node at this level, the corresponding cluster probability density is used to calculate the likelihood of the observation vector. One or more nodes with maximum likelihood are selected. For these nodes, the same process is repeated one level lower. In this manner, a number of leaf nodes are finally selected, for which the corresponding reference probability density is used to calculate the likelihood of the observation vector. For each leaf node which is not selected, the likelihood is approximated by the likelihood of its mother node which was last selected.