This invention relates to recognition of an input pattern which is typically representative of either discrete words or connected words. More particularly, this invention relates to a pattern recognition method and to a pattern recognition device.
Various discrete or connected word recognition devices are in practical use. Representative among such pattern recognition devices are one that uses a dynamic programming (DP) algorithm and one that uses continuous mixture density hidden Markov models (HMM).
According to the dynamic programming algorithm, a best match is located between an input pattern represented by a time sequence of input pattern feature vectors and a plurality of reference patterns, each represented by a stored sequence of reference pattern feature vectors. The best match is decided by finding the shortest of the pattern distances or the greatest of the pattern similarities between the input pattern and the reference patterns. In finding either the shortest pattern distance or the greatest pattern similarity, the time axis of the input pattern time sequence and the corresponding axis of each reference pattern sequence are mapped onto each other by a warping function. Details of the dynamic programming algorithm are described in the Japanese language (transliterated according to ISO 3602) by Nakagawa-Seiiti in a book entitled "Kakuritu Moderu ni yoru Onsei Ninsiki" (Speech Recognition by Probability Models) and published in 1988 by the Institute of Electronics, Information, and Communication Engineers of Japan.
Briefly, the dynamic programming algorithm proceeds in principle as follows, in the manner described on pages 18 to 20 of the Nakagawa book. An input pattern X and a reference pattern B are represented by:
$$X = x^{1}, x^{2}, \ldots, x^{t}, \ldots, x^{T} \tag{1}$$
and
$$B = b^{1}, b^{2}, \ldots, b^{j}, \ldots, b^{J}, \tag{2}$$
where $x^{t}$ represents an input pattern feature vector at an input pattern time instant t, $b^{j}$ represents a reference pattern feature vector at a reference pattern time instant j, T represents the input pattern length, and J represents the reference pattern length.
In general, such reference patterns have different reference pattern lengths, and the input pattern length differs from the reference pattern lengths. In order to calculate the pattern distance between the input pattern and each reference pattern, whose feature vectors are used time sequentially at consecutive reference pattern time instants, time correspondence must be established between the input and the reference pattern time instants. Each reference pattern time instant j is consequently related, for example, to an input pattern time instant t by a warping or mapping function:
$$j = j(t).$$
Representing the pattern distance by D[X, B], a minimization problem is solved: ##EQU1## where d(t, j) represents a vector distance between the input and the reference pattern feature vectors $x^{t}$ and $b^{j}$. Usually, a Euclidean distance:
$$\|x^{t} - b^{j}\|^{2} \tag{3}$$
is used as the vector distance.
The minimization problem is solved by calculating, under an initial condition:
$$g(1, 1) = d(1, 1),$$
a recurrence formula: ##EQU2## where g(t, j) is often called an accumulated distance. In the recurrence formula, the reference pattern time instant is consecutively varied from 1 up to J for each input pattern time instant, which is consecutively varied from 1 up to T. The minimum distance is given by an ultimate accumulated distance g(T, J). Various other recurrence formulae and manners of calculating such a recurrence formula are known.
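The calculation of the accumulated distance can be sketched as follows. This is a minimal illustration only, not the formula of the Nakagawa book: the recurrence g(t, j) = d(t, j) + min{g(t-1, j), g(t-1, j-1)} is an assumption chosen as one of the known recurrence formulae, and the names `vector_distance` and `dp_pattern_distance` are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def vector_distance(x: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors, as in Equation (3)."""
    return sum((xi - bi) ** 2 for xi, bi in zip(x, b))


def dp_pattern_distance(X: Sequence[Vector], B: Sequence[Vector]) -> float:
    """Return the ultimate accumulated distance g(T, J) between X and B.

    Assumed recurrence (illustrative, one of many known variants):
        g(t, j) = d(t, j) + min(g(t-1, j), g(t-1, j-1))
    With this variant g(T, J) is infinite when T < J, because the
    reference index can advance by at most one per input frame.
    """
    T, J = len(X), len(B)
    inf = float("inf")
    # prev[j] holds g(t-1, j); indices are 0-based in the code.
    prev = [inf] * J
    prev[0] = vector_distance(X[0], B[0])  # initial condition g(1, 1) = d(1, 1)
    for t in range(1, T):
        cur = [inf] * J
        for j in range(J):
            best = prev[j]
            if j > 0:
                best = min(best, prev[j - 1])
            if best < inf:
                cur[j] = vector_distance(X[t], B[j]) + best
        prev = cur
    return prev[J - 1]
```

Only two rows of g(t, j) are kept in this sketch, because each input pattern time instant depends only on the preceding one under the assumed recurrence.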
Each reference pattern represents a dictionary word, a phoneme, a part of a syllable, a concatenation of words, a concatenation of spoken letters or numerals, or the like. For each input pattern feature vector, the vector distances are calculated a number of times given by a two-factor product of (the number of reference patterns) × (the reference pattern lengths).
It is possible to compress the reference patterns and to reduce this number of times of calculation by vector quantization in the manner described in the Nakagawa book, pages 26 to 27. More particularly, similar reference pattern feature vectors are represented by a common representation at a certain reference pattern time instant. Several sequences of reference pattern feature vectors are thereby converted into a sequence of codes:
$$B = c^{1}, c^{2}, \ldots, c^{j}, \ldots, c^{J},$$
where $c^{j}$ represents a code book number given for the reference pattern feature vectors by a code book:
$$\{b(1), b(2), \ldots, b(k), \ldots, b(K)\} \tag{4}$$
which is used to represent several reference pattern feature vectors approximately by a code book vector $b(c^{j})$. When the vector quantization is resorted to, the vector distances need to be calculated only K times at each input pattern time instant t.
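The saving brought about by the code book can be illustrated by the following sketch, in which the K vector distances are calculated once for an input pattern feature vector and are thereafter merely looked up for every code of every reference pattern. The code book contents, the code sequence, and the function names are hypothetical values chosen for illustration, and code book numbers are 0-based in the code.

```python
def frame_distance_table(x, codebook):
    """Calculate d(x, b(k)) once for each of the K code book vectors."""
    return [sum((xi - bi) ** 2 for xi, bi in zip(x, b)) for b in codebook]


def distances_for_reference(table, codes):
    """Look up the vector distance for each code c^j of one reference pattern."""
    return [table[c] for c in codes]


# Hypothetical code book with K = 3 two-dimensional code book vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
# Hypothetical reference pattern stored as a sequence of code book numbers.
reference_codes = [0, 1, 1, 2]

# K distance calculations for one input pattern feature vector.
table = frame_distance_table([1.0, 0.0], codebook)
# No further distance calculation is needed, whatever the reference pattern length.
per_instant = distances_for_reference(table, reference_codes)
```

The table is shared by all reference patterns at the same input pattern time instant, so that the number of distance calculations does not grow with the number or the lengths of the reference patterns.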
On the other hand, the hidden Markov models are described in the Nakagawa book, pages 40 to 46, 55 to 60, and 69 to 74 and are used to describe the reference patterns by introduction of a statistical idea in order to cope with various fluctuations in voice patterns. Parameters of the hidden Markov models are transition probability and output probability parameters. The transition probability parameters represent time sequential fluctuations of the voice patterns. The output probability parameters represent tone fluctuations of the voice patterns and are given by either a discrete probability distribution expression or a continuous probability distribution expression.
It is believed that the continuous probability distribution expression is superior to the discrete probability distribution expression. This is because the latter is adversely influenced by quantization errors. In the former, use is made of continuous mixture densities or distributions, each obtained by summing a plurality of element multi-dimensional Gaussian distributions with weights. It is possible to calculate the transition and the output probability parameters in advance from training data by a forward-backward algorithm known in the art.
When the hidden Markov models are used, processes are as follows for recognition of the input pattern represented by Equation (1). It will be assumed that the output probability distribution expression is represented by the continuous mixture distributions. Denoting a transition probability by $a_{ji}$, where i and j represent states of the hidden Markov models, a weight for mixture by $\lambda_{im}$, where m represents an element number given to elements used in mixtures of the output probability distributions, and an average vector of each element Gaussian distribution by $\mu_{im}$, a forward probability $\alpha(i, t)$ is calculated by a recurrence formula: ##EQU3## for i=1, 2, . . . , I and t=1, 2, . . . , T, where I represents a final state. In the equation for the forward probability, a factor is given by: ##EQU4## where $\Sigma_{im}$ represents a covariance matrix: ##EQU5## n representing the dimension of the Gaussian distributions.
For the input pattern, an ultimate forward probability $\alpha(I, T)$ gives a pattern likelihood P(X). At each input pattern time instant, a frame likelihood is given by calculating $N[x; \mu_{im}, \Sigma_{im}]$ in accordance with Equation (5) a number of times given by a three-factor product of (the number of hidden Markov models) × (the number of states of each hidden Markov model) × (the number of mixtures).
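The forward probability calculation for one hidden Markov model can be sketched as follows. The sketch is illustrative only and rests on several assumptions that are not stated in the text above: the covariance matrices are taken to be diagonal, the model is taken to start in the first state, and the recurrence is taken in the common form in which the forward probabilities of the preceding time instant are weighted by $a_{ji}$, summed over j, and multiplied by the mixture output probability of state i. All function and parameter names are hypothetical.

```python
import math


def gaussian(x, mean, var):
    """Gaussian density N[x; mu, Sigma] with a diagonal covariance (assumption)."""
    n = len(x)
    log_det = sum(math.log(v) for v in var)
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return math.exp(-0.5 * (n * math.log(2.0 * math.pi) + log_det + quad))


def output_probability(x, weights, means, variances):
    """Mixture density of one state: weighted sum of element Gaussians."""
    return sum(
        w * gaussian(x, mu, var)
        for w, mu, var in zip(weights, means, variances)
    )


def pattern_likelihood(X, trans, weights, means, variances):
    """Return the ultimate forward probability alpha(I, T) for input pattern X.

    trans[j][i] is the transition probability a_ji; weights[i][m],
    means[i][m] and variances[i][m] describe mixture element m of state i.
    """
    num_states = len(trans)
    # Assumed initial condition: all probability mass in the first state.
    alpha = [1.0] + [0.0] * (num_states - 1)
    for x in X:
        b = [
            output_probability(x, weights[i], means[i], variances[i])
            for i in range(num_states)
        ]
        alpha = [
            sum(alpha[j] * trans[j][i] for j in range(num_states)) * b[i]
            for i in range(num_states)
        ]
    return alpha[num_states - 1]
```

For long input patterns the product of many small densities underflows, so practical implementations usually carry out the same recurrence with logarithms or with scaling.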
In the manner described in the foregoing, the amount of calculation in a conventional pattern recognition device increases with the number of reference patterns and with their pattern lengths when the dynamic programming algorithm is used without the vector quantization. The amount of calculation increases with an increase in the number of quantization levels (the code book size K) when the vector quantization is resorted to. The amount of calculation increases also in a conventional pattern recognition device wherein the hidden Markov models are used when the number of states of the hidden Markov models and the number of mixtures are increased. Due to the increase in the amount of calculation in either case, the conventional pattern recognition device has been bulky and expensive. If the amount of calculation is suppressed to a low value, the conventional pattern recognition device does not operate with satisfactory precision and accuracy.
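The three amounts of calculation per input pattern time instant can be compared by the following short sketch. The function names and the figures used in the example are hypothetical and serve only to show how the counts scale; they are not measurements of any device.

```python
def dp_distance_count(reference_lengths):
    """Vector distances per input frame for DP matching without quantization.

    Equals (number of reference patterns) x (reference pattern length)
    when all reference patterns have the same length.
    """
    return sum(reference_lengths)


def vq_distance_count(codebook_size):
    """Vector distances per input frame when vector quantization is used."""
    return codebook_size


def hmm_gaussian_count(num_models, states_per_model, mixtures_per_state):
    """Gaussian density evaluations per input frame for mixture density HMMs."""
    return num_models * states_per_model * mixtures_per_state


# Hypothetical figures: 100 reference patterns of 50 frames each, a code book
# of 256 vectors, and 100 models with 5 states and 4 mixtures per state.
dp_count = dp_distance_count([50] * 100)
vq_count = vq_distance_count(256)
hmm_count = hmm_gaussian_count(100, 5, 4)
```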