1. Field of the Invention
The invention relates to a method for recognising an input pattern which is derived from a continual physical quantity; said method comprising:
accessing said physical quantity and therefrom generating a plurality of input observation vectors, representing said input pattern;
locating among a plurality of reference patterns a recognised reference pattern, which corresponds to said input pattern; at least one reference pattern being a sequence of reference units; each reference unit being represented by at least one associated reference vector $\mu_a$ in a set $\{\mu_a\}$ of reference vectors; said locating comprising selecting for each input observation vector o a subset $\{\mu_s\}$ of reference vectors from said set $\{\mu_a\}$ and calculating vector similarity scores between said input observation vector o and each reference vector $\mu_s$ of said subset $\{\mu_s\}$.
input means for accessing said physical quantity and therefrom generating a plurality of input observation vectors, representing said input pattern;
a reference pattern database for storing a plurality of reference patterns; at least one reference pattern being a sequence of reference units; each reference unit being represented by at least one associated reference vector $\mu_a$ in a set $\{\mu_a\}$ of reference vectors;
a localizer for locating among the reference patterns stored in said reference pattern database a recognised reference pattern, which corresponds to said input pattern; said locating comprising selecting for each input observation vector o a subset $\{\mu_s\}$ of reference vectors from said set $\{\mu_a\}$ and calculating vector similarity scores between said input observation vector o and each reference vector $\mu_s$ of said subset $\{\mu_s\}$; and
output means for outputting said recognised pattern.
Feature analysis: the speech input signal is spectrally and/or temporally analyzed to calculate a representative vector of features (observation vector o). Typically, the speech signal is digitised (e.g. sampled at a rate of 6.67 kHz) and pre-processed, for instance by applying pre-emphasis. Consecutive samples are grouped (blocked) into frames, corresponding to, for instance, 32 msec. of speech signal. Successive frames partially overlap, for instance by 16 msec. Often the Linear Predictive Coding (LPC) spectral analysis method is used to calculate for each frame a representative vector of features (observation vector o). The feature vector may, for instance, have 24, 32 or 63 components (the feature space dimension).
Unit matching system: the observation vectors are matched against an inventory of speech recognition units. Various forms of speech recognition units may be used. Some systems use linguistically based sub-word units, such as phones, diphones or syllables, as well as derivative units, such as fenenes and fenones. Other systems use a whole word or a group of words as a unit. The so-called hidden Markov model (HMM) is widely used to stochastically model speech signals. Using this model, each unit is typically characterised by an HMM, whose parameters are estimated from a training set of speech data. For large vocabulary speech recognition systems involving, for instance, 10,000 to 60,000 words, usually a limited set of, for instance, 40 sub-word units is used, since it would require a lot of training data to adequately train an HMM for larger units. The unit matching system matches the observation vectors against all sequences of speech recognition units and provides the likelihoods of a match between the vector and a sequence.
Constraints can be placed on the matching, for instance by:
in that said method comprises quantising each reference vector $\mu_a$ to a quantised reference vector $R(\mu_a)$, and
in that selecting the subset $\{\mu_s\}$ of reference vectors comprises for each input observation vector o the steps of:
calculating the $L_r$-norm $\|\mu_a\|_r$ of each vector $\mu_a$, and
for each input observation vector o:
calculating a difference vector by assigning to each component of said difference vector the binary XOR value of the corresponding components of $S(o)$ and $S(\mu_a)$;
determining a difference number by calculating how many components in said difference vector have the value one, and
using said difference number as the Hamming distance.
in that said method comprises constructing a table specifying for each N-dimensional vector, with components having a binary value of zero or one, a corresponding number, indicating how many components have the value one; and
in that determining said difference number comprises locating said difference vector in said table and using said corresponding number as the Hamming distance. By counting in advance the number of one elements in a vector and storing this in a table, the performance is increased further.
after selecting said subset of reference vectors for an input observation vector o, ensuring that each reference unit is represented by at least one reference vector in said subset, by adding for each reference unit, which is not represented, a representative reference vector to said subset. The accuracy of the recognition is improved if each reference unit is represented in the subset.
after ensuring that each reference unit is represented by at least one reference vector in said subset, choosing for each reference unit said representative reference vector by selecting as the representative reference vector the reference vector from the subset, which represents said reference unit and has a smallest distance to said input observation vector o. Since the observation vectors tend to change gradually, a reference vector which was found to be the best representation of a reference unit for a specific observation vector is a good candidate for supplementing the subset for a subsequent observation vector.
in that selecting the subset $\{\mu_s\}$ of reference vectors comprises for each input observation vector o the steps of:
in that said localizer calculates said distances $d(R(o), R(\mu_a))$ by, for each input observation vector o:
in that determining said difference number comprises locating said difference vector in said table and using said corresponding number as the Hamming distance.
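By way of illustration only, the Hamming distance calculation by XOR and table lookup, followed by the supplementing of the subset, may be sketched as follows. The sketch makes several assumptions that are not taken from the text above: the binary vector S(x) is taken to hold one bit per component, set when the component is negative; the bits are packed into an integer; the table covers 8-bit patterns and is applied byte by byte rather than holding an entry for every N-dimensional vector; and the representative added for an unrepresented unit is simply its nearest reference vector for the current observation. All names are hypothetical.

```python
# Table built once in advance: number of one-bits in every 8-bit pattern.
BYTE_ONES = [bin(pattern).count("1") for pattern in range(256)]


def sign_bits(vector):
    """Pack a vector into an int; bit i is 1 if component i is negative (assumed S)."""
    bits = 0
    for i, component in enumerate(vector):
        if component < 0:
            bits |= 1 << i
    return bits


def hamming(a_bits, b_bits):
    """XOR gives the difference vector; the table counts its one-bits per byte."""
    difference = a_bits ^ b_bits
    count = 0
    while difference:
        count += BYTE_ONES[difference & 0xFF]
        difference >>= 8
    return count


def select_subset(obs_bits, ref_bits, ref_unit, size):
    """Return indices of the `size` nearest references, plus one reference
    for every reference unit that would otherwise be unrepresented."""
    order = sorted(range(len(ref_bits)),
                   key=lambda i: hamming(obs_bits, ref_bits[i]))
    subset = order[:size]
    covered = {ref_unit[i] for i in subset}
    for i in order[size:]:  # remaining references, nearest first
        if ref_unit[i] not in covered:
            subset.append(i)
            covered.add(ref_unit[i])
    return subset
```

Here `ref_unit[i]` names the reference unit to which reference vector `i` belongs, so that `select_subset` can check which units are still missing after the nearest vectors have been chosen.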
The invention also relates to a system for recognising a time-sequential input pattern, which is derived from a continual physical quantity; said system comprising:
2. Description of the Related Art
Recognition of a time-sequential input pattern, which is derived from a continual physical quantity, such as speech or images, is becoming increasingly important. In particular, speech recognition has recently been widely applied to areas such as telephone and telecommunications (various automated services), office and business systems (data entry), manufacturing (hands-free monitoring of manufacturing processes), medical (annotating of reports), games (voice input), voice-control of car functions and voice-control used by disabled people. For continuous speech recognition, the following signal processing steps are commonly used, as illustrated in FIG. 1 [refer to L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, February 1989]:
Lexical decoding: if sub-word units are used, a pronunciation lexicon describes how words are constructed of sub-word units. The possible sequence of sub-word units, investigated by the unit matching system, is then constrained to sequences in the lexicon.
Syntactical analysis: further constraints are placed on the unit matching system so that the paths investigated are those corresponding to speech units which comprise words (lexical decoding) and for which the words are in a proper sequence as specified by a word grammar.
quantising said input observation vector o to a quantised observation vector $R(o)$;
calculating for said quantised observation vector $R(o)$ distances $d(R(o), R(\mu_a))$ to each quantised reference vector $R(\mu_a)$; and
using said distance $d(R(o), R(\mu_a))$ as said measure of dissimilarity between said input observation vector o and said reference vector $\mu_a$. By quantising the vectors, the complexity of the vectors is reduced, making it possible to effectively calculate the distance between the quantised observation vector and the quantised reference vectors. This distance between the quantised vectors, which can be seen as an estimate of the distance between the actual vectors, is used to select the subset.
calculating the $L_r$-norm $\|o\|_r$ of the vector o; and
calculating a Hamming distance $H(S(o), S(\mu_a))$ of the vector $S(o)$ to each vector $S(\mu_a)$.
quantising said input observation vector o to a quantised observation vector $R(o)$;
calculating for said quantised observation vector $R(o)$ distances $d(R(o), R(\mu_a))$ to each quantised reference vector $R(\mu_a)$; and
using said distance $d(R(o), R(\mu_a))$ as said measure of dissimilarity between said input observation vector o and said reference vector $\mu_a$.
calculating the $L_r$-norm $\|o\|_r$ of the vector o and a Hamming distance $H(S(o), S(\mu_a))$ of the vector $S(o)$ to each vector $S(\mu_a)$, and
combining the $L_r$-norm $\|o\|_r$ and the Hamming distance $H(S(o), S(\mu_a))$ with the $L_r$-norm $\|\mu_a\|_r$ stored in said reference pattern database.
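By way of illustration only, one possible way of combining the two norms with the Hamming distance may be sketched as follows. The quantiser is not defined in this passage, so the sketch assumes one: R(x) keeps the sign of each component of x and gives every component the same magnitude $\|x\|_r / N^{1/r}$, which preserves the $L_r$-norm. Under that assumption, components whose signs agree differ by the difference of the two magnitudes and components whose signs disagree differ by their sum, so the distance follows from the two norms and the Hamming distance alone. All function names are hypothetical.

```python
def lr_norm(x, r):
    """L_r-norm of a vector."""
    return sum(abs(c) ** r for c in x) ** (1.0 / r)


def sign_vector(x):
    """Assumed S(x): -1 for negative components, +1 otherwise."""
    return [-1 if c < 0 else 1 for c in x]


def quantise(x, r):
    """Assumed R(x): same sign pattern as x, one shared magnitude per vector."""
    scale = lr_norm(x, r) / len(x) ** (1.0 / r)
    return [scale * s for s in sign_vector(x)]


def combined_distance(norm_o, norm_mu, hamming, n, r):
    """L_r distance between R(o) and R(mu) from two norms and a Hamming distance."""
    a = norm_o / n ** (1.0 / r)   # shared magnitude of R(o)
    b = norm_mu / n ** (1.0 / r)  # shared magnitude of R(mu)
    agree = (n - hamming) * abs(a - b) ** r   # components with equal signs
    differ = hamming * (a + b) ** r           # components with opposite signs
    return (agree + differ) ** (1.0 / r)
```

Because the norm of each reference vector can be computed once and stored, only the norm of the observation vector and one Hamming distance per reference vector remain to be calculated for each observation.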
A discrete Markov process describes a system which at any time is in one of a set of N distinct states. At regular times the system changes state according to a set of probabilities associated with the state. A special form of a discrete Markov process is shown in FIG. 2. In this so-called left-right model, the states proceed from left to right (or stay the same). This model is widely used for modelling speech, where the properties of the signal change over time. The model states can be seen as representing sounds. The number of states in a model for a sub-word unit could, for instance, be five or six, in which case a state, on average, corresponds to an observation interval. The model of FIG. 2 allows a state to stay the same, which can be associated with slow speaking. Alternatively, a state can be skipped, which can be associated with speaking fast (in FIG. 2 up to twice the average rate). The output of the discrete Markov process is the set of states at each instant of time, where each state corresponds to an observable event. For speech recognition systems, the concept of discrete Markov processes is extended to the case where an observation is a probabilistic function of the state. This results in a doubly stochastic process. The underlying stochastic process of state changes is hidden (the hidden Markov model, HMM) and can only be observed through a stochastic process that produces the sequence of observations.
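By way of illustration only, a left-right model of the kind described, in which a state may stay the same, advance to the next state or skip one state, may be sketched as follows. The probabilities 0.5, 0.3 and 0.2 used in the usage note are arbitrary example values, not values from FIG. 2, and the function names are hypothetical.

```python
import random


def left_right_transitions(n_states, p_stay, p_next, p_skip):
    """Build a transition matrix; row i holds the probabilities of leaving state i."""
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        moves = {i: p_stay, i + 1: p_next, i + 2: p_skip}
        allowed = {j: p for j, p in moves.items() if j < n_states}
        total = sum(allowed.values())
        for j, p in allowed.items():
            matrix[i][j] = p / total  # renormalise near the final state
    return matrix


def sample_path(matrix, length, rng):
    """Draw a state sequence of the given length, starting in state 0."""
    state, path = 0, [0]
    for _ in range(length - 1):
        state = rng.choices(range(len(matrix)), weights=matrix[state])[0]
        path.append(state)
    return path
```

For example, `sample_path(left_right_transitions(5, 0.5, 0.3, 0.2), 10, random.Random(0))` yields a sequence that never moves to the left and never advances by more than two states in one step.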
For speech, the observations represent continuous signals. The observations can be quantised to discrete symbols chosen from a finite alphabet of, for instance, 32 to 256 vectors. In such a case a discrete probability density can be used for each state of the model. In order to avoid degradation associated with quantising, many speech recognition systems use continuous observation densities. Generally, the densities are derived from log-concave or elliptically symmetric densities, such as Gaussian (normal distribution) or Laplacian densities. During training, the training data (training observation sequences) is segmented into states using an initial model. This gives for each state a set of observations. Next, the observation vectors for each state are clustered. Depending on the complexity of the system and the amount of training data, there may, for instance, be between 32 and 120 clusters for each state. Each cluster has its own density, such as a Gaussian density. The density is represented by a reference vector, such as a mean vector. The resulting observation density for the state is then a weighted sum of the cluster densities.
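By way of illustration only, the observation density of a state as a weighted sum of cluster densities may be sketched as follows. The sketch assumes Gaussian cluster densities with a single variance shared by all components and all clusters, which is a simplification chosen here for brevity; the function names are hypothetical.

```python
import math


def gaussian_log_density(obs, mean, variance):
    """Log of a Gaussian density with the same variance in every dimension."""
    n = len(obs)
    squared = sum((o - m) ** 2 for o, m in zip(obs, mean))
    return -0.5 * (n * math.log(2 * math.pi * variance) + squared / variance)


def state_log_likelihood(obs, weights, means, variance=1.0):
    """Log of the weighted sum of the cluster densities of one state."""
    terms = [math.log(w) + gaussian_log_density(obs, m, variance)
             for w, m in zip(weights, means)]
    peak = max(terms)  # log-sum-exp keeps the sum numerically stable
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

Each entry of `means` plays the role of a reference vector, and the squared distance of the observation to that vector is the dominant cost of evaluating the density.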
To recognise a single speech recognition unit (e.g. word or sub-word unit) from a speech signal (observation sequence), for each speech recognition unit the likelihood is calculated that it produced the observation sequence. The speech recognition unit with maximum likelihood is selected. To recognise larger sequences of observations, a levelled approach is used. Starting at the first level, likelihoods are calculated as before. Whenever the last state of a model is reached a switch is made to a higher level, repeating the same process for the remaining observations. When the last observation has been processed, the path with the maximum likelihood is selected and the path is backtraced to determine the sequence of involved speech recognition units.
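By way of illustration only, the recognition of a single speech recognition unit by maximum likelihood may be sketched as follows. The sketch scores each unit by its best state path, which is one common choice and is an assumption here; it further assumes that a path starts in the first state and ends in the last state, and it does not cover the levelled approach for longer sequences. All names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def log_or_neg_inf(p):
    """Logarithm that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0 else NEG_INF


def best_path_log_likelihood(trans, emit):
    """trans[p][s]: probability of moving from state p to s;
    emit[t][s]: likelihood of observation t in state s."""
    n_states = len(trans)
    score = [NEG_INF] * n_states
    score[0] = log_or_neg_inf(emit[0][0])  # paths start in the first state
    for t in range(1, len(emit)):
        score = [
            max(score[p] + log_or_neg_inf(trans[p][s]) for p in range(n_states))
            + log_or_neg_inf(emit[t][s])
            for s in range(n_states)
        ]
    return score[-1]  # paths end in the last state


def recognise(trans_by_unit, emit_by_unit):
    """Return the unit whose model gives the observation sequence the highest score."""
    return max(trans_by_unit,
               key=lambda unit: best_path_log_likelihood(trans_by_unit[unit],
                                                         emit_by_unit[unit]))
```

The entries of `emit` are the per-state observation likelihoods, so every call evaluates the state densities for every observation, which is where the distance calculations discussed next arise.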
The likelihood calculation involves calculating in each state the distance of the observation (feature vector) to each reference vector, which represents a cluster. Particularly in large vocabulary speech recognition systems using continuous observation density HMMs, with, for instance, 40 sub-word units, 5 states per sub-word unit and 64 clusters per state, this implies 12800 distance calculations between, for instance, 32-dimensional vectors. These calculations are repeated for each observation. Consequently, the likelihood calculation may consume 50%-75% of the computing resources. It is known from E. Bocchieri, "Vector quantization for the efficient computation of continuous density likelihoods", Proceedings of ICASSP, 1993, pp. 692-695, to select for each observation vector o a subset of densities (and corresponding reference vectors) and calculate the likelihood of the observation vector for the subset. The likelihoods of the densities that are not part of the selected subset are approximated. According to the known method, during training all densities are clustered into neighbourhoods. A vector quantiser, consisting of one codeword for each neighbourhood, is also defined. For each codeword a subset of densities, which are near the codeword, is defined. This definition of subsets is done in advance, for instance during the training of the system. During recognition, for each observation vector a subset is selected from the predefined subsets by quantising the observation vector to one of the codewords and using the subset defined for the codeword as the subset of densities for which the likelihood of the observation is calculated. The disadvantage of this approach is that the subsets are statically defined based on the given reference vectors. Particularly for an observation vector which is near boundaries of the predetermined subsets, the selected subset may actually contain many reference vectors which are further from the observation vector than reference vectors in neighbouring subsets. Therefore, to achieve a low pattern error rate, the selected subset needs to be relatively large.
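By way of illustration only, the known approach with statically defined subsets, and its disadvantage near a boundary, may be sketched as follows. The sketch is a simplified reading of the description above rather than the cited method itself: each predefined subset simply holds the reference vectors nearest to its codeword, and the Euclidean distance is assumed. The one-dimensional values in the usage note are invented example data and all names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def build_static_subsets(codewords, references, size):
    """Done once in advance: for each codeword, keep its `size` nearest references."""
    subsets = []
    for codeword in codewords:
        order = sorted(range(len(references)),
                       key=lambda i: squared_distance(codeword, references[i]))
        subsets.append(order[:size])
    return subsets


def select_static_subset(observation, codewords, subsets):
    """During recognition: quantise the observation to its nearest codeword
    and return the subset predefined for that codeword."""
    nearest = min(range(len(codewords)),
                  key=lambda i: squared_distance(observation, codewords[i]))
    return subsets[nearest]
```

With codewords `[[0.0], [10.0]]`, references `[[-1.0], [1.0], [4.9], [5.1], [9.0], [11.0]]` and a subset size of 3, the observation `[4.8]` is quantised to the first codeword and receives the subset `[0, 1, 2]`. The reference vector `[5.1]` lies much closer to the observation than `[-1.0]` or `[1.0]` but belongs to the neighbouring subset and is therefore not considered.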