HMM (Hidden Markov Model) and DP (Dynamic Programming) matching are known methods for speech recognition, and both are frequently used as its basic technologies. An important problem in such systems, in realizing expansion of vocabulary, continuous speech recognition and the like, is to reduce the amount of calculation without deteriorating recognition performance. A method using vector quantization has already been proposed as a way of solving this problem, and the present application concerns its improvement. Before entering the subject, a general explanation will first be given of HMM and DP matching and of how the technology of vector quantization is used.
HMM is considered to be a model for generating time-sequential signals in accordance with a stochastic property. In speech recognition using HMM, an HMM r is provided for each recognition unit r (=1, . . . ,R) such as a word, a syllable, a phoneme or the like (hereinafter represented by a word) to be recognized. When a vector series Y=(y.sub.1,y.sub.2, . . . ,y.sub.T) (y.sub.t : a vector observed at a time point t) is observed, the degree of occurrence of Y is calculated from each HMM r, and the word corresponding to the HMM having the maximum degree is rendered the result of recognition.
FIG. 1 shows an example of HMM. A circle mark .smallcircle. designates a state of a system to be modeled by HMM, an arrow mark .fwdarw. designates a direction of transition of a state and q.sub.i designates a state i. Assume that a transition from a state i to a state j occurs with a probability a.sub.ij. The model is called a Markov chain when only the states and their transition probabilities are defined. By contrast, in HMM, a vector is assumed to be generated on each state transition and .omega..sub.ij (y) is defined as a degree of occurrence of a vector y in accordance with a state transition q.sub.i .fwdarw.q.sub.j. In many cases y is regarded as occurring in accordance not with the state transition but with the state, which is expressed as .omega..sub.ij (y)=.omega..sub.ii (y)=.omega..sub.i (y) or .omega..sub.ij (y)=.omega..sub.jj (y)=.omega..sub.j (y). The explanation in this application assumes that y occurs in accordance with the state. Here, the structure of HMM, the state transition probabilities and the occurrence probabilities of vectors are determined such that the behavior of the object to be modeled by HMM (a speech pattern such as a word when used in speech recognition) is explained as correctly as possible. FIG. 2 is an example of a constitution of HMM which is frequently used in speech recognition.
When HMM is defined, a degree L(Y.vertline..lambda.) whereby an observation vector series Y occurs from a model (designated by .lambda.) can be calculated as follows. ##EQU1## where X=(x.sub.1,x.sub.2, . . . ,x.sub.T+1) designates a state series and .pi..sub.i designates a probability of being in state i at t=1. In this model x.sub.t .epsilon.{1,2, . . . ,J,J+1} and x.sub.T+1 =J+1 is a final state. In the final state only a transition thereto occurs and there is no occurrence of a vector.
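As an illustration only, a sum over all state series of this kind can be evaluated without enumerating every X by the forward recursion. The sketch below assumes that the degree of a single state series is the product of .pi. for the first state, the occurrence degrees .omega. of the visited states and the transition probabilities, ending with a transition into the final state J+1. The names `pi`, `a`, `omega` and the array layout are hypothetical and are not taken from the specification; states are indexed from 0 in the code.

```python
import itertools
import numpy as np

def forward_likelihood(pi, a, omega):
    """Sum over all state series (assumed form of the likelihood).

    pi:    (J,)     probability of being in state i at t=1
    a:     (J, J+1) transition probabilities; column J is the final state
    omega: (T, J)   omega[t, i] = degree of occurrence of y_t in state i
    """
    T, J = omega.shape
    alpha = pi * omega[0]                      # paths of length 1
    for t in range(1, T):
        # extend every partial path by one transition and one emission
        alpha = (alpha @ a[:, :J]) * omega[t]
    return float(alpha @ a[:, J])              # last transition into the final state

def brute_force_likelihood(pi, a, omega):
    """Same quantity by explicit enumeration; only usable for tiny models."""
    T, J = omega.shape
    total = 0.0
    for X in itertools.product(range(J), repeat=T):
        p = pi[X[0]]
        for t in range(T):
            nxt = X[t + 1] if t + 1 < T else J
            p *= omega[t, X[t]] * a[X[t], nxt]
        total += p
    return total
```

The enumeration grows as J to the power T, whereas the recursion needs on the order of T times J squared operations, which is why recursive evaluation is used in practice.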
HMM is broadly classified into a continuous type and a discrete type. In the continuous type, .omega..sub.i (y) is a continuous function such as a probability density function of y in which a degree of occurrence of y.sub.t is given as a value of .omega..sub.i (y) when y=y.sub.t. A parameter specifying .omega..sub.i (y) for each state i is defined and the degree of occurrence of y.sub.t at a state i is calculated by substituting y.sub.t into .omega..sub.i (y). For example, when .omega..sub.i (y) is given by a normal distribution of multiple dimensions,
.omega..sub.i (y)=N(y;.mu..sub.i,.SIGMA..sub.i) (Equation 2)
where parameters of the function specified by the state i are .mu..sub.i and .SIGMA..sub.i.
In the discrete type, an occurrence probability b.sub.im of a label m .epsilon.{1,2, . . . M} into which y.sub.t is to be transformed by vector quantization is stored in a table for each state i. The degree of occurrence of y.sub.t in the state i is b.sub.im when the label into which y.sub.t is transformed is m.
The vector quantization is performed by using a code book. In the code book, when the size is defined as M, feature vectors collected as learning samples are clustered into clusters 1,2, . . . ,M and representative vectors (also designated as average vector, centroid, code vector etc.) .mu..sub.m of clusters m (=1,2, . . . ,M) are stored in a searchable form. The L.B.G. algorithm is a well-known clustering method. The vector quantization of y.sub.t is performed by transforming y.sub.t into the label of the centroid most proximate thereto. Accordingly, the degree of occurrence of y.sub.t at the state i is given by the following equation. ##EQU2## d(y.sub.t,.mu..sub.m) is a distance between y.sub.t and .mu..sub.m, for which various distances including a Euclidean distance may be considered.
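The quantization step and the table lookup of the discrete type can be sketched as follows. This is a minimal illustration assuming a Euclidean distance; the function names and the 0-based labels are hypothetical choices for the example (the text numbers labels from 1 to M).

```python
import numpy as np

def quantize(y, codebook):
    """Return the label (0-based) of the centroid nearest to y."""
    d = np.linalg.norm(codebook - y, axis=1)   # Euclidean distance to each centroid
    return int(np.argmin(d))

def discrete_occurrence(y, codebook, b_i):
    """Degree of occurrence of y in state i: b_i[m] for the nearest label m."""
    return b_i[quantize(y, codebook)]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
b_i = np.array([0.2, 0.5, 0.3])                # label probabilities of one state
y = np.array([0.9, 1.2])
label = quantize(y, codebook)
value = discrete_occurrence(y, codebook, b_i)
```

Once the label is known, the occurrence degree for every state is a single table read, which is the source of the small amount of calculation of the discrete type.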
FIG. 3 is a block diagram of a speech recognition device using a discrete type HMM. A feature extraction unit 301 transforms input speech signals into feature vectors at constant intervals of time (designated as frames), for example, every 10 msec, by a well-known method such as a filter bank, Fourier transformation, LPC analysis etc. Accordingly, the input speech signals are transformed into a series of feature vectors Y=(y.sub.1,y.sub.2, . . . ,y.sub.T). T designates the number of frames. A code book 302 holds the representative vector corresponding to each label in a form searchable by the label. A vector quantizing unit 303 encodes each vector of the vector series Y into the label corresponding to the nearest representative vector registered in the code book. A parameter estimation unit 304 estimates, from learning samples, parameters of the HMMs corresponding to respective words of a recognition vocabulary. That is, in forming an HMM corresponding to a word r, the structure of the HMM (the number of states and its transition rule) is suitably determined first. Thereafter, the state transition probabilities and the occurrence probabilities of the labels occurring in accordance with the states in the above-mentioned model are calculated such that the degree of occurrence of the label series obtained by pronouncing the word r a number of times becomes as high as possible. An HMM storing unit 305 stores the HMMs thus provided for respective words. A likelihood calculation unit 306 calculates, for each model stored in the HMM storing unit 305, the likelihood of the model with respect to the label series of the unknown input speech to be recognized. A determination unit 307 determines, as the recognition result, the word corresponding to the model giving the maximum likelihood among those obtained by the likelihood calculation unit 306. In FIG. 3 broken lines designate the flow of signals in forming HMMs.
In the continuous HMM, the degrees of occurrence of observation vectors in respective states are given by the defined probability density function. Although the accuracy is higher than that of the discrete type, a large amount of calculation is necessary. In contrast, the discrete type HMM has the advantage that the amount of calculation is very small, since the likelihood of a model with respect to the observation label series can be calculated merely by reading the occurrence probabilities b.sub.im of the labels m (=1, . . . ,M) at respective states from a storing device that stores them in relation to the labels. However, there is a drawback in that the recognition accuracy is worse than that of the continuous type HMM due to the error associated with the quantization. To reduce this error it is necessary to increase the number of labels M (enlarge the size of the code book). However, with the increase, the number of learning samples necessary to obtain the model becomes very large. When the number of learning samples is insufficient, the estimated value of the above-mentioned b.sub.im may frequently be 0, with which a correct estimation cannot be performed.
For example, consider a case where there is a word speech pronouncing "Osaka" in the above-mentioned recognition vocabulary, and a model corresponding thereto is formed. Speech samples corresponding to the word "Osaka" pronounced by a number of speakers are transformed into a feature vector series and respective feature vectors are transformed into labels as mentioned above. In this way the respective speech samples with respect to the above-mentioned word "Osaka" are transformed into a label series corresponding thereto. A discrete type HMM corresponding to the word "Osaka" is formed from the obtained labels series by estimating the parameters {a.sub.ij, b.sub.im } of a HMM such that the likelihood with respect to these label series is maximized. The well-known Baum-Welch method or the like can be used in the estimation.
In this case not all the labels existing in the code book are necessarily included in the label series of the learning samples corresponding to the word "Osaka". The occurrence probability of a label which does not appear in the label series of the learning samples is estimated to be "0" in the procedure of learning the model corresponding to "Osaka." Accordingly, when a label that is not included in the label series used in forming the above-mentioned model of "Osaka" happens to be present in the label series corresponding to the word speech of "Osaka" pronounced at recognition time (this is sufficiently probable when the number of learning samples is small), the degree of occurrence of the label series of "Osaka" pronounced at recognition time from the above-learned model of "Osaka" becomes "0". Even in such a case, however, although the label is different, the feature vectors before transformation into labels are considerably close to those of the speech samples used in learning the model, and at the level of vectors they would be recognized adequately as "Osaka". The vectors should naturally resemble each other since the same word is pronounced. However, it is sufficiently probable that, when vectors with only a small difference lie in the vicinity of a boundary between clusters, they are transformed into totally different labels. It is easy to see that this adversely influences recognition accuracy. The larger the size M of the code book and the smaller the number of learning samples, the more frequently such a problem is caused.
One method for removing the drawback is a HMM based on fuzzy vector quantization, which we call FVQ/HMM. A multiplication type FVQ/HMM described in the Institute of Electronics, Information and Communication Engineers Technical Report SP93-27 (June, 1993) is worthy of notice as one showing excellent capability.
FIG. 4 is a block diagram explaining a general principle of FVQ/HMM. In the figure, broken lines show a flow of signals in forming HMM. A feature extraction unit 401 is similar to the part 301 in FIG. 3. A code book 402 is similar to the part 302 in FIG. 3. A membership degree calculation unit 403 transforms the feature vectors into membership degree vectors. The membership degree vector is a vector whose elements are the membership degrees of the feature vector at each time point with respect to each cluster. Defining the feature vector at a time point t as y.sub.t, the above-mentioned clusters as C.sub.1, . . . ,C.sub.M and the membership degree of y.sub.t with respect to C.sub.m as u.sub.tm, y.sub.t is transformed into the membership degree vector u.sub.t =(u.sub.t1, . . . ,u.sub.tM).sup.T. Hereinafter, in this application a vector is a column vector and a superscript T designates transpose. Various definitions may be considered for u.sub.tm ; for example, it may be defined as follows. (J. C. Bezdek: "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York (1981).) ##EQU3## In this equation F>1 designates the fuzziness, for which the following holds. ##EQU4## where .delta..sub.ij is the Kronecker delta in which .delta..sub.ij =1 if i=j and .delta..sub.ij =0 if i.noteq.j. Equation 5 signifies the following. If the label of the cluster corresponding to the centroid that is nearest to y.sub.t is designated as o.sub.t, then when F.fwdarw.1 the membership degree of y.sub.t to that cluster is 1 and the membership degrees to the other clusters are 0 (which is normal vector quantization). When F.fwdarw..infin., the membership degree of y.sub.t to every cluster is 1/M, which maximizes the fuzziness.
When a posterior probability of C.sub.m with respect to y.sub.t can be calculated by a means such as a neural net or others, the posterior probability can be rendered the definition of the membership degree (hereinafter, both "posterior probability" and "membership degree" are designated as "membership degree").
In fact, the above-mentioned membership degree u.sub.tm is not calculated for all the clusters, for reasons mentioned later, but only for the K clusters from the one having the minimum d(y.sub.t,.mu..sub.m) to the one having the K-th smallest (K-nearest neighbor). That is, the elements forming the above-mentioned membership degree vector u.sub.t have values calculated by Equation 4 for the K clusters having the largest membership degrees, and all other values are rendered 0. Numeral 404 designates a parameter estimation unit. An HMM storing unit 405 stores HMMs, one for each recognition unit such as a word or a syllable to be recognized. A likelihood calculation unit 406 calculates the likelihood of each of the above-mentioned HMMs with respect to input speech from the membership degree vector series provided as the output of the above-mentioned membership degree calculation unit 403, that is, a degree L.sup.r of occurrence of the feature vector series y.sub.1, . . . ,y.sub.T from the above-mentioned HMM r (r=1, . . . ,R). A determination unit 407 calculates the following equation and renders r* the recognition result. ##EQU5##
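The K-nearest-neighbor membership calculation can be sketched as below. The sketch assumes one common form of the fuzzy membership, in which u.sub.tm is proportional to d(y.sub.t,.mu..sub.m) raised to the power -1/(F-1), with a squared Euclidean distance; the exact exponent and distance of Equation 4 are not reproduced in this text, so both are assumptions, as are the function and parameter names.

```python
import numpy as np

def membership_vector(y, codebook, F=1.5, K=None):
    """Membership degrees of y to each cluster, nonzero only for the K nearest.

    Assumed form: u_m proportional to d_m ** (-1 / (F - 1)), normalized so
    that the K retained memberships sum to 1.
    """
    d = np.sum((codebook - y) ** 2, axis=1)    # squared Euclidean distance (assumption)
    M = len(d)
    K = M if K is None else K
    nearest = np.argsort(d)[:K]                # labels of the K smallest distances
    u = np.zeros(M)
    dk = d[nearest]
    if dk[0] == 0.0:
        # y coincides with a centroid: give it (or tied centroids) full membership
        zero = nearest[dk == 0.0]
        u[zero] = 1.0 / len(zero)
        return u
    inv = dk ** (-1.0 / (F - 1.0))
    u[nearest] = inv / inv.sum()
    return u

codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
y = np.array([0.1, 0.2])
u_top2 = membership_vector(y, codebook, F=2.0, K=2)
```

With F close to 1 the result approaches a one-hot vector (ordinary vector quantization), and with a very large F it approaches 1/M for every retained cluster, matching the limiting behavior described for the fuzziness.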
The likelihood calculation unit 406 calculates the likelihood L.sup.r corresponding to the recognition unit r with respect to r=1, . . . ,R in accordance with Equation 1. Then various HMMs are defined in accordance with the above-mentioned definition of .omega..sub.i (y.sub.t). In the multiplication type FVQ/HMM shown here .omega..sub.i (y.sub.t) is defined as follows. ##EQU6##
In multiplicative form, ##EQU7## As mentioned above, the addition or multiplication with respect to m in Equation 7 is performed only with regard to the K clusters having the highest membership degrees; in this case, Equation 7 becomes the following Equation 8 (hereinafter, the explanation will be given in the addition style). ##EQU8## where h(k) designates the cluster for which y.sub.t has the k-th highest membership degree. When the membership degree is defined by Equation 4, Equation 4 is calculated for the K smallest values of d(y.sub.t,.mu..sub.m). In this case, u.sub.t,h(1) + . . . +u.sub.t,h(K) =1, and u.sub.t,h(K+1) = . . . =u.sub.t,h(M) =0. The addition in Equation 7 is restricted to the K clusters having the highest membership degrees, as in Equation 8, not only to reduce the amount of calculation but also for the following reason.
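As a hedged illustration of the addition style restricted to K clusters, the sketch below assumes that the logarithm of .omega..sub.i (y.sub.t) is the membership-weighted sum of log b.sub.im over the K clusters with the highest membership degrees, which is equivalent in multiplicative form to the product of b.sub.im raised to the power u.sub.tm. The function name and array layout are hypothetical.

```python
import numpy as np

def log_omega_topk(u_t, log_b_i, K):
    """Assumed addition-style occurrence degree for one state i.

    u_t:     (M,) membership degrees of y_t (zero outside the K nearest clusters)
    log_b_i: (M,) stored values of log b_im for state i
    """
    top = np.argsort(u_t)[::-1][:K]            # h(1), ..., h(K)
    return float(np.dot(u_t[top], log_b_i[top]))

u_t = np.array([0.7, 0.2, 0.1, 0.0])
b_i = np.array([0.5, 0.25, 0.125, 0.125])
value = log_omega_topk(u_t, np.log(b_i), K=3)
```

When u.sub.t is a one-hot vector and K=1, the result reduces to log b.sub.im of the single nearest label, that is, the discrete type.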
The FVQ type shows a higher recognition rate than the discrete type because of an interpolation effect on the learning samples in estimating parameters. For example, consider a case where the probabilities of occurrence of cluster A and cluster B in a state i are estimated from learning samples. In the discrete type, a vector to be quantized that lies near the boundary is classified to A if it is on the side of A, however close it is to B, and to B if it is on the side of B. Accordingly, even if A and B occur in the population at the same rate, the learning samples may be biased, for example when many of the vectors classified into A happen to lie in the neighborhood of the boundary between A and B. It is then probable that the probability of occurrence of A is estimated as larger than the probability of occurrence of B. Such a bias of learning data easily occurs when the number of learning data is small relative to the size of the code book. The recognition rate deteriorates when the learning samples and the evaluation data are independent of each other, because the bias is not necessarily in agreement with the trend of the evaluation data.
In the case of the FVQ type, not only A but B as well is assumed to occur in accordance with the membership degrees of the vectors and the probabilities of occurrence of these are calculated thereby. Therefore, with regard to the above-mentioned learning samples, although the probability of occurrence of A is estimated a little higher, the probability of occurrence of B is also estimated in accordance with the membership degree and an extreme estimation error as in the discrete type is not caused. That is, it can be said that an interpolation is performed with regard to the learning samples by the FVQ type; in other words, the number of learning samples is effectively increased. This is the reason that the recognition rate of the FVQ type is superior to the recognition rate of the discrete type, especially when the size of the code book is large.
The interpolation performed by the FVQ type to compensate for an insufficient number of learning samples increases the number of learning samples only equivalently, from the given learning samples themselves, and is therefore somewhat different from an actual increase in the number of learning samples. Accordingly, it is quite possible that, when the size of the code book decreases, the number of learning samples per cluster relatively increases and the estimation accuracy of b.sub.im is sufficiently improved. In that case the recognition rate of the discrete type, which performs no interpolation, may become equal to or higher than that of the FVQ type with poor interpolation, depending on the manner of interpolation.
The degree of the interpolation is influenced by the value of K as well as the size of the code book or the fuzziness. With K approaching K=1, that is, approaching that of the discrete type, the influence of the interpolation becomes small. The influence of the interpolation becomes large with an increase in K. Therefore, the degree of interpolation can be controlled by K when the fuzziness is fixed. That is, an unnecessary increase in K does not necessarily imply improvement. Thus, there is an optimum value for K in accordance with the size of a code book. This optimal value arises in view of maximizing the amount of improvement of the recognition rate by the FVQ type over that of the discrete type. According to experiments, in recognizing 100 city names by an unspecified speaker, the optimum value was K=6 with regard to the size of a code book of 256 and the optimum value was K=3 with regard to the size of a code book of 16.
In this way, compared with the discrete type, the FVQ type requires K additional membership degree calculations and K additional product-sum operations, since Equation 8 must be calculated at recognition time. In return, its recognition rate is improved over that of the discrete type and is equal to or better than that of the continuous type. Further, the amount of calculation is considerably reduced compared to that of the continuous type.
A method called the Forward-Backward method is used for calculating Equation 1. However, to reduce the amount of calculation, the Viterbi method, which calculates the maximum value with regard to X as an approximate solution of Equation 1, is often used. It is normal practice to use this method in the form of addition by taking logarithms.
That is, the calculation is performed by the following equation where L' is the likelihood. ##EQU9## Equation 9 can be calculated efficiently by the Dynamic Programming method. That is, L' is calculated by the following equations recurrently with respect to t=2, . . . ,T with .phi..sub.i (1)=log .pi..sub.i. ##EQU10## This is called the Viterbi method. Often, the Baum-Welch method (Forward-Backward method) is used in forming a model and the Viterbi method is used in recognition, since there is no significant difference between recognition results using L and L'. In the case of the multiplication type FVQ/HMM, when the Viterbi method is used in recognition, b.sub.im is used only in the form of log b.sub.im. Therefore, by storing not b.sub.im but log b.sub.im, logarithmic calculation becomes unnecessary and only the product-sum calculation needs to be executed in calculating Equation 7 or Equation 8.
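A log-domain Viterbi recursion of this kind can be sketched as follows. The indexing is one choice consistent with the initial value .phi..sub.i (1)=log .pi..sub.i: the occurrence degree of a state is added when the state is left, and the last step is a transition into the final state J+1. This indexing, the array layout and the function names are assumptions for illustration and are not taken from the equations of the specification.

```python
import itertools
import numpy as np

def viterbi_log_likelihood(log_pi, log_a, log_omega):
    """Maximum over state series of the log path score (assumed indexing).

    log_pi:    (J,)     log initial probabilities
    log_a:     (J, J+1) log transition probabilities; column J is the final state
    log_omega: (T, J)   log occurrence degree of y_t in each state
    """
    T, J = log_omega.shape
    phi = np.array(log_pi, dtype=float)        # phi_i(1) = log pi_i
    for t in range(T - 1):
        # phi_j(t+1) = max_i [phi_i(t) + log omega_i(y_t) + log a_ij]
        scores = (phi + log_omega[t])[:, None] + log_a[:, :J]
        phi = scores.max(axis=0)
    return float(np.max(phi + log_omega[T - 1] + log_a[:, J]))

def brute_force_log_max(log_pi, log_a, log_omega):
    """Same maximum by enumerating every state series; for checking only."""
    T, J = log_omega.shape
    best = -np.inf
    for X in itertools.product(range(J), repeat=T):
        s = log_pi[X[0]]
        for t in range(T):
            nxt = X[t + 1] if t + 1 < T else J
            s += log_omega[t, X[t]] + log_a[X[t], nxt]
        best = max(best, s)
    return float(best)
```

Because every quantity is already a logarithm, the inner loop uses only additions and comparisons, which is the saving obtained by storing log b.sub.im in place of b.sub.im.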
Next, an explanation will be given of DP matching. The most basic method is performed by pattern matching among feature vector series. FIG. 5 shows a conventional example. Numeral 51 designates a feature extraction unit which is similar to the part 301 in FIG. 3. A reference pattern storing unit 53 stores reference patterns corresponding to words. The reference patterns are previously registered in the reference pattern storing unit 53 in correspondence with words to be recognized as ones transformed into feature vector series in the feature extraction unit 51. A broken line in FIG. 5 designates a connection used in the registering. The connection shown by the broken line portion is released at recognition time. A pattern matching unit 52 performs a matching calculation of each reference pattern stored in the reference pattern storing unit 53 with an input pattern and calculates a distance (or similarity degree) between the input pattern and each reference pattern. A determination unit 54 finds a word corresponding to a reference pattern and providing a minimum value (maximum value) of a distance (or similarity degree) between the above-mentioned input pattern and each reference pattern.
A more specific explanation will be given as follows. In this example an explanation will be given of calculating a "distance" between patterns. (In a case based on "similarity degree", "distance" is replaced by "similarity degree" and "minimum value" is replaced by "maximum value".) Now, assume that a feature vector outputted at a time point t from the feature extraction unit 51 is defined as y.sub.t, the input pattern series is defined as Y=(y.sub.1,y.sub.2, . . . ,y.sub.T) and a reference pattern corresponding to a word r is defined by the following equation.
Y.sup.(r) =(y.sup.(r).sub.1,y.sup.(r).sub.2, . . . , y.sup.(r).sub.J.sup.(r)) (Equation 12)
Further, assume that a distance from Y to Y.sup.(r) is defined as D.sup.(r) and a distance between y.sub.t and y.sup.(r).sub.j is defined as d.sup.(r) (t,j) (in the multiplication style these are designated by D.sub.2.sup.(r) and d.sub.2.sup.(r) (t,j), and in the addition style by D.sub.1.sup.(r) and d.sub.1.sup.(r) (t,j)). The following equations are calculated. ##EQU11## where X=(x(1),x(2), . . . ,x(K)) and X*=(x*(1),x*(2), . . . ,x*(K)).
The recognition result is shown by the following equation. ##EQU12##
In Equation 13, x(k)=(t(k),j(k)) is a k-th lattice point on a matching path X made from Y and Y.sup.(r) in a lattice graph (t,j) and w(x(k)) is a weighting coefficient on the above-mentioned distance at the lattice point x(k).
Hereinafter, a parallel discussion is established with respect to both multiplication style and addition style, and it is easy to transform addition style into expression in multiplication style (d.sub.1.sup.(r) (t,j)=log d.sub.2.sup.(r) (t,j),D.sub.1.sup.(r) =log D.sub.2.sup.(r) etc.). However as it is general to use addition style, an explanation is mainly given of addition style here (accordingly, suffixes 1 and 2 are omitted). Multiplication style is indicated only if necessary.
When a point series x(k.sub.1), . . . ,x(k.sub.2) from x(k.sub.1) to x(k.sub.2) is defined as X(k.sub.1, k.sub.2), x(K) is designated as x(K)=(t(K),j(K))=(T,J), Equation 13 signifies that a distance D.sup.(r) between Y and Y.sup.(r) is defined as a minimum value with respect to X(1,K) of an accumulation of weighted distances between respective feature vectors of an input pattern Y and a reference pattern Y.sup.(r) which correspond along a point series X(1,K). The calculation of Equation 13 can efficiently be performed by using the Dynamic Programming method by properly selecting weighting coefficients w(x(k)), which is called DP matching.
To perform DP, the optimality principle must hold. That is, "a partial process of an optimum process is also an optimum process" should be established. In other words, the following recurrent Equation 16 should be established with regard to the following Equation 15, which considerably reduces the amount of calculation. ##EQU13##
The optimum process from a point x(1) to a point p.sub.o =x(k) is to find the point series (optimum point series) minimizing .phi.'(p.sub.o,X(1,k)), where .phi.'(p.sub.o,X(1,k)) is the weighted accumulation distance along a point series X(1,k)=(x(1), . . . ,x(k)=p.sub.o). When the optimum point series is rendered X*(1,k)=(x*(1), . . . ,x*(k-1),x*(k)=p.sub.o) and .phi.'(p.sub.o,X*(1,k)) is rendered .phi.(p.sub.o), the above-mentioned optimality principle is established if the optimum point series from the point x(1) to the point x*(k-1) agrees with the point series from the point x*(1) to the point x*(k-1) on the point series X*(1,k). In other words, when the point series minimizing .phi.(x(k-1))+w(p.sub.o)d.sup.(r) (p.sub.o) among the optimum point series with x(1) as a start point and x(k-1) as a finish point is rendered X*(1,k-1)=(x*(1), . . . ,x*(k-1)), the point series up to x(k-1) in the optimum point series from x(1) to x(k)=p.sub.o agrees with X*(1,k-1). Accordingly, when the optimum point series from various x(1) as start points to various x(k-1) as finish points are known, and accordingly .phi.(x(k-1)) is known with regard to various x(k-1), the optimum point series from various x(1) to a specific x(k)=p.sub.o and the weighted accumulated distance along that optimum point series can be calculated by Equation 16. That is, the weighted minimum accumulation distance .phi.(x(k)) from the point x(1) to the point x(k) can be calculated in accordance with Equation 16 by using the weighted minimum accumulation distance .phi.(x(k-1)) obtained at the preceding step, and D.sup.(r) =.phi.(x(K)) is calculated recurrently with .phi.(x(1))=w(x(1))d.sup.(r) (x(1)) as an initial value. Therefore, the weighted minimum accumulation distance is obtained with an amount of calculation far smaller than that of calculating the accumulation distances along all the allowable paths.
Here, examples of weighting coefficients are provided which can establish Equation 16, ##EQU14##
That is, when the weighting coefficient is specified by Equation 17 etc., the optimality principle is established and the dynamic programming method is applicable. (1) indicates a case where the total sum of the weighting coefficients is equal to a length of an input pattern (a number of frames), (2) indicates a case where the total sum of the weighting coefficients is equal to a length of a reference pattern and (3) indicates a case where the total sum of the weighting coefficients is equal to a sum of the input pattern length and the reference pattern length.
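The recursive calculation can be sketched for case (1), in which the weights sum to the input length. The sketch assumes that every step advances the input frame by one, so each weight is 1, and that a lattice point (t,j) may be reached from (t-1,j), (t-1,j-1) or (t-1,j-2). This step pattern is a simplification chosen for illustration and is not the path restriction of any figure in the specification; the function names are likewise hypothetical.

```python
import numpy as np

def dp_matching(Y, R):
    """Weighted minimum accumulated distance between input Y (T, D) and
    reference R (J, D), normalized by the sum of weights (here T)."""
    T, J = len(Y), len(R)
    # local city-block distance d(t, j) for every lattice point
    d = np.abs(Y[:, None, :] - R[None, :, :]).sum(axis=2)
    phi = np.full((T, J), np.inf)
    phi[0, 0] = d[0, 0]                        # initial value phi(x(1))
    for t in range(1, T):
        for j in range(J):
            # best predecessor among the assumed allowed steps
            best = min(phi[t - 1, j2] for j2 in (j, j - 1, j - 2) if j2 >= 0)
            phi[t, j] = best + d[t, j]         # weight w = 1 for every step
    return phi[T - 1, J - 1] / T

def recognize(Y, references):
    """Return the key of the reference pattern with the minimum distance."""
    return min(references, key=lambda r: dp_matching(Y, references[r]))

references = {
    "a": np.array([[0.0], [1.0], [2.0]]),
    "b": np.array([[5.0], [5.0], [5.0]]),
}
Y = np.array([[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]])  # "a" spoken twice as slowly
result = recognize(Y, references)
```

Since the sum of weights equals T for every reference pattern under this choice, the division by T does not change which reference is nearest and could be omitted when only the ranking is needed.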
When example (1) of Equation 17 is used, the following Equation 18 is derived as a specific example of the recursive Equation 16. ##EQU15## where .phi.(1, 1)=d.sup.(r) (1, 1) and D.sup.(r) =.phi.(x(K))=.phi.(T, J.sup.(r)).
By successively calculating Equation 18 with respect to t=1, . . . ,T and j=1, . . . ,J.sup.(r), Equation 13, that is, D.sup.(r) can be calculated. In this case, paths connectable to x(k) are restricted as shown in FIG. 6. That is, the path up to a point (t,j) passes one of three routes: point (t-2,j-1).fwdarw.point (t-1,j).fwdarw.point (t,j), point (t-1,j-1).fwdarw.point (t,j) and point (t-1,j-1).fwdarw.point (t,j), wherein the numerical value on each path designates the weighting coefficient applied when that path is selected. In this case w(x(1))+ . . . +w(x(K)) is equal to the number of input frames T. Accordingly, the denominator of Equation 14 remains constant irrespective of the reference pattern. Therefore, when determining which reference pattern is nearest to the input pattern, it is not necessary to normalize by w(x(1))+ . . . +w(x(K)). In this case, a Euclidean distance, or a city block distance as a further simplification, or the like is often used as d.sup.(r) (t,j).
In the above-mentioned matching calculation, the distance calculation or the similarity calculation between the feature vectors needs the largest amount of calculation. Especially, when the number of words is increased, the amount of calculation is increased in proportion thereto requiring time and posing a practical problem. A so-called "SPLIT method" using vector quantization has been devised to reduce the burden. (SPLIT: Word Recognition System Using Strings of Phoneme-Like Templates). (Sugamura, Furui, "Speech Recognition of Large Vocabulary Words by Phoneme-Like Reference Pattern", Trans. IEICE, vol. "D", J65-D, no.8, pp 1041-1048 (August, 1982)).
FIG. 7 is a block diagram showing the conventional example. A feature extraction unit 71 is similar to that in FIG. 3. A code book 73 stores representative vectors with M labels in a form searchable by the labels. A quantization unit 74 transforms the output feature vectors y.sub.t of the feature extraction unit 71 into the labels of the clusters whose centroids are nearest to y.sub.t by using the code book 73. A word dictionary 77 stores reference patterns of the word speech to be recognized, which have been transformed into label series by the above-mentioned operation. The label is also called a phoneme-like. When the phoneme-like at a k-th frame of the reference pattern of a word r is defined as s.sup.(r).sub.k, each word to be recognized is registered as a phoneme-like series in the form shown in the figure. J.sup.(r) designates the final frame (accordingly, the number of frames) of the reference pattern of the word r. Broken lines in the figure designate connections which are used only in the registering operation of the words to be recognized. A distance matrix calculation unit 72 calculates the distance from each output vector of the feature extraction unit 71 to the centroid of each cluster, transforms the output vector into a vector with these distances as elements, and thereby transforms the feature vector series into a distance vector series, that is, a distance matrix. Specifically, in the distance matrix 75, y.sub.t is transformed into a distance vector (d(y.sub.t,.mu..sub.1), d(y.sub.t,.mu..sub.2), . . . ,d(y.sub.t, .mu..sub.M)).sup.T whose elements are the distances d(y.sub.t, .mu..sub.m) between the feature vector y.sub.t at a frame t and the centroid .mu..sub.m of a cluster C.sub.m (d.sub.tm in FIG. 7). The distance is defined by the following equation when a city block distance is used. ##EQU16## where y.sub.tk is a k-th element of a vector y.sub.t and .mu..sub.mk is a k-th element of the centroid vector .mu..sub.m of C.sub.m.
A matching unit 76 matches the distance matrix that is the output of the distance matrix calculation unit 72 with each word of the word dictionary and calculates the distance between them. Specifically, when s.sup.(r).sub.j =C.sub.m, a distance d.sup.(r) (t,j) between y.sub.t and s.sup.(r).sub.j is defined by the following equation, with which Equation 18 is calculated.
d.sup.(r) (t,j)=d(y.sub.t,.mu..sub.m) (Equation 20)
That is, the difference between FIG. 7 and FIG. 5 lies in that d(y.sub.t,.mu..sub.m), which has been calculated beforehand and is obtained by referring to the distance matrix, is used in FIG. 7 in place of d.sup.(r) (t,j) in the conventional example of FIG. 5. Hence, the calculation can be performed quite similarly by using DP. A determination unit 78 calculates Equation 14 and finally provides a recognition result. In this case the denominator of Equation 14 has the same value as in the case of FIG. 5, and it is not necessary to normalize by it since w(x(1))+ . . . +w(x(K))=T, as explained for the example of FIG. 5.
In the case of the conventional example of FIG. 5, the amount of calculation of the distance between y.sub.t and y.sup.(r).sub.j increases with the number of recognition words. However, in the case of the conventional example of FIG. 7, the amount of calculation of d.sup.(r) (t,j) remains unchanged irrespective of an increase in words since, once the distance matrix 75 has been calculated, the distance between y.sub.t and a phoneme-like is obtained only by referring to the distance matrix 75.
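The replacement of per-word distance calculation by table lookup can be sketched as follows, using the city block distance. The function names and the 0-based labels are hypothetical choices for the example.

```python
import numpy as np

def distance_matrix(Y, codebook):
    """City-block distances d(y_t, mu_m) for all frames and centroids, shape (T, M)."""
    return np.abs(Y[:, None, :] - codebook[None, :, :]).sum(axis=2)

def local_distances(dist, labels):
    """d(t, j) for one word: column lookup by the phoneme-like label of frame j."""
    return dist[:, labels]                     # shape (T, J); no new distance computation

Y = np.array([[0.0, 0.0], [1.0, 2.0]])                     # two input frames
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])  # M = 3 centroids
word = np.array([2, 0, 0])                                 # phoneme-like series of one word
dist = distance_matrix(Y, codebook)
d_word = local_distances(dist, word)
```

The distance matrix is computed once per input and shared by every word in the dictionary, so adding words adds only lookups.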
For example, consider a case where 100 words are recognized with an average of 50 frames per word and 10 dimensions per feature vector. In the case of FIG. 5, the number of reference pattern vectors for which the distance to y.sub.t must be calculated is on the order of 50.times.100=5000. When the distance is a Euclidean distance, the number of multiplication operations is the number of distance calculations with respect to y.sub.t multiplied by 10, that is, 50,000. In the case of FIG. 7, the distance calculation is performed between y.sub.t and each centroid vector in the code book. Accordingly, when the number of clusters is M=256, 256 distance calculations suffice irrespective of the number of recognition words, and the number of multiplication operations is 2560. The latter is about 1/20 of the former.
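The operation counts in this example can be checked with a few lines of arithmetic. The variable names are illustrative only.

```python
words = 100            # size of the recognition vocabulary
frames_per_word = 50   # average reference pattern length
dims = 10              # feature vector dimension
M = 256                # code book size

direct_multiplications = words * frames_per_word * dims   # FIG. 5 style, per input frame
split_multiplications = M * dims                          # FIG. 7 style, per input frame
ratio = split_multiplications / direct_multiplications
```

The ratio is 0.0512, which is the "about 1/20" stated above, and it shrinks further as the vocabulary grows because only the direct count depends on the number of words.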
Further, although the explanation here has described the input feature vector series as being transformed into the distance matrix, the distance vector (d.sub.t1, . . . ,d.sub.tM).sup.T actually becomes unnecessary once its comparison with the phoneme-likes s.sup.(r).sub.j (r=1, . . . ,R; j=1, . . . ,J.sup.(r)) of the respective reference patterns has been finished. Accordingly, when the calculation of the distance vector and the calculation of the recurrent equation are performed for all the reference patterns at every input frame, d(y.sub.t,.mu..sub.m) need not be stored as a matrix. For example, in the case of Equation 18, only the distance vectors for two frames, the current frame and the frame immediately before it, need to be stored, which reduces the amount of required storage.
The above-mentioned FVQ/HMM achieves a recognition rate which is equal to or better than that of the continuous type HMM and the amount of calculation is far smaller than that of the continuous type. However, in performing word spotting, the definition of .omega..sub.i (y.sub.t) cannot be the same as it is in the above-mentioned FVQ/HMM.
Further, although in the above-mentioned SPLIT method a considerably small amount of calculation is required compared with the amount required by a method that directly matches spectra, the problem of causing deterioration in recognition accuracy exists.