This invention relates to a similarity calculator for use in a pattern recognition system in calculating a similarity measure between two or more patterns according to the technique of dynamic programming.
Among pattern recognition systems, those capable of recognizing spoken words, called speech recognition systems, are in wide demand as input devices, for example, for supplying data to electronic digital computers and control data to automatic classification apparatus, and have already been developed into commercial products. Speech recognition systems are classified into two classes. Those of one class are known as continuous speech recognition systems and are capable of carrying out the recognition even when a plurality of words are continuously spoken as a word sequence. Those of the other class are capable of carrying out the recognition only when the words are spoken word by word. The continuous speech recognition systems are superior to the speech recognition systems of the other class because the words are easier to pronounce and the speed of operation is higher. It is, however, not readily feasible to make a speech recognition system recognize a sequence of continuously spoken words. In fact, a greater number of process steps is necessary for recognition of continuously spoken words, as will presently be described. The continuous speech recognition systems have therefore been bulky and expensive.
An example of the continuous speech recognition systems that are already in practical use is disclosed in U.S. Pat. No. 4,059,725 issued to Hiroaki Sakoe, the present applicant and assignor to the present assignee. In order to facilitate an understanding of the instant invention, that system will briefly be described in the following.
A continuous speech recognition system of the type revealed in the referenced patent recognizes a sequence of one or more spoken words with reference to a predetermined number N of individually spoken words, which are preliminarily supplied to the system as reference words. The word sequence is supplied to the system as an input pattern A given by a time sequence of first through I-th input pattern feature vectors a.sub.i (i=1, 2, . . . , I) as: EQU A=a.sub.1, a.sub.2, . . . , a.sub.I. (1)
The reference words are selected to cover the words to be recognized by the system and are memorized in the system as first through N-th reference patterns B.sup.c (c=1, 2, . . . , N). An n-th reference pattern B.sup.n (n being representative of each of c) is given by a time sequence of first through J.sup.n -th reference pattern feature vectors b.sub.j.sup.n (j.sup.n =1, 2, . . . , J.sup.n) as: EQU B.sup.n =b.sub.1.sup.n, b.sub.2.sup.n, . . . , b.sub.J.sup.n. (2)
Merely for simplicity of notation, the vectors will be denoted by the corresponding usual letters, such as a.sub.i and b.sub.j.sup.n, and the affixes c and n will be omitted unless the more rigorous expressions are needed. The feature vectors a.sub.i and b.sub.j are derived by sampling the input pattern A and the reference patterns B at equally spaced sampling instants. The input and the reference pattern feature vectors a.sub.i and b.sub.j can therefore be understood to be arranged along the respective time axes i and j at equal intervals.
For the system disclosed in the above-referenced patent, a fragmentary pattern A(u, m) is defined by: EQU A(u, m)=a.sub.u+1, a.sub.u+2, . . . , a.sub.m,
where u and m are called a start and an end point of the fragmentary pattern A(u, m). The fragmentary pattern A(u, m) is named a partial pattern in the patent being referred to, as are the partial patterns that will be described in the following. Obviously: EQU 0.ltoreq.u<m.ltoreq.I.
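The indexing convention of the fragmentary pattern can be illustrated by a short Python sketch. The function name `fragmentary_pattern` and the storage of the feature vectors in a list are assumptions made only for this illustration. With a.sub.1 through a.sub.I held at list indices 0 through I-1, the 1-based range u+1 through m corresponds to the slice from u to m.

```python
def fragmentary_pattern(a, u, m):
    """Return A(u, m) = a_{u+1}, ..., a_m from the input pattern A.

    `a` holds the feature vectors a_1..a_I at list indices 0..I-1
    (an assumed storage layout), so the 1-based range u+1..m is a[u:m].
    """
    if not 0 <= u < m <= len(a):
        raise ValueError("require 0 <= u < m <= I")
    return a[u:m]


# Toy input pattern with I = 5 one-dimensional feature vectors.
A = [[0.1], [0.2], [0.3], [0.4], [0.5]]
print(fragmentary_pattern(A, 1, 4))  # the vectors a_2, a_3, a_4
```

The fragmentary pattern A(u, m) therefore always contains m-u feature vectors, and A(0, I) is the input pattern A itself.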
A group of similarity measures S(A(u, m), B.sup.c) is calculated between each fragmentary pattern A(u, m) and the reference patterns B.sup.c. A similarity measure S(A(u, m), B.sup.n) between the fragmentary pattern A(u, m) and each reference pattern B.sup.n is calculated according to, for example: ##EQU1## where j(i) represents a monotonically increasing function for mapping or warping the reference pattern time axis j to the input pattern time axis i. Inasmuch as the first and the last feature vectors a.sub.u+1 and a.sub.m of the fragmentary pattern A(u, m) should be mapped to the first and the last feature vectors b.sub.1 and b.sub.J of the reference pattern B.sup.n under consideration: EQU j(u+1)=1
and EQU j(m)=J.
In Equation (3), s(i, j) represents the scalar product of an i-th input pattern feature vector a.sub.i and a j-th reference pattern feature vector b.sub.j of the reference pattern B.sup.n. Namely: EQU s(i, j)=(a.sub.i .multidot.b.sub.j)
or EQU s(i, j.sup.n)=(a.sub.i .multidot.b.sub.j.sup.n).
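The scalar product s(i, j) can be written out as a minimal Python sketch. The function name `scalar_product` and the representation of a feature vector as a list of numbers are assumptions for illustration only.

```python
def scalar_product(a_i, b_j):
    """Scalar product of an input and a reference feature vector."""
    if len(a_i) != len(b_j):
        raise ValueError("feature vectors must have the same dimension")
    # One multiplication per vector component, followed by addition.
    return sum(x * y for x, y in zip(a_i, b_j))


print(scalar_product([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6
```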
A partial similarity S<u, m> and a partial recognition result n<u, m> are calculated according to: EQU S<u, m>=max.sub.c S(A(u, m), B.sup.c)
and EQU n<u, m>=arg max.sub.c S(A(u, m), B.sup.c),
for each similarity measure group S(A(u, m), B.sup.c). Groups of similarity measures S(A(u, m), B.sup.c)'s are successively calculated for the end point m under consideration, with the start point u varied throughout an interval U(m) given by: EQU m-J-r.ltoreq.u.ltoreq.m-J+r, (4)
where r represents a predetermined integer that is selected to be about 30% of the shortest of the reference pattern lengths or durations (min J.sup.c) and is called a window length or width in the art. A group of partial similarities S<u, m>'s and another group of partial recognition results n<u, m>'s are calculated for such groups of similarity measures S(A(u, m), B.sup.c)'s. With the end point m successively shifted towards the input pattern end point I, partial similarities S<u, m>'s and partial recognition results n<u, m>'s of various groups are stored in memories at addresses specified by u and m. Where convenient, the groups of similarity measures S(A(u, m), B.sup.c)'s will be called a set of similarity measures, and the fragmentary patterns A(u, m)'s that have the common end point m and start points u's predetermined relative to the end point m (more specifically, varied throughout the interval U(m)) will be called a set of fragmentary patterns.
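The interval U(m) of Equation (4) and the selection of the partial similarity and the partial recognition result can be sketched in Python as follows. The function names `start_interval` and `partial_result`, the dictionary keyed by the reference pattern number c, and the clipping of the start points to 0 ≤ u < m are assumptions made for this illustration.

```python
def start_interval(m, J, r):
    """Start points u with m-J-r <= u <= m-J+r, clipped to 0 <= u < m."""
    lo = max(0, m - J - r)
    hi = min(m - 1, m - J + r)
    return range(lo, hi + 1)


def partial_result(similarities):
    """Map {c: S(A(u, m), B^c)} to the pair (S<u, m>, n<u, m>)."""
    n = max(similarities, key=similarities.get)  # arg max over c
    return similarities[n], n


# End point m = 30, reference duration J = 20, window width r = 10.
print(list(start_interval(30, 20, 10)))
# Hypothetical similarity measures for three reference patterns.
print(partial_result({1: 0.4, 2: 0.9, 3: 0.7}))
```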
On the other hand, it is possible to represent the input pattern A by various concatenations of partial patterns. An x-th partial pattern in each concatenation is a fragmentary pattern A(u, m) having the start and the end points at points u(x-1) and u(x). The number of partial patterns in each concatenation will be designated by y. When the number y is equal to unity, the concatenation is the input pattern A per se. The point u(x) and others for each concatenation are called x-th and like segmentation points. The first through the y-th partial patterns of each concatenation are thus represented by A(u(0), u(1)) or A(0, u(1)), A(u(1), u(2)), . . . , A(u(x-1), u(x)), . . . , and A(u(y-1), u(y)) or A(u(y-1), I). One of the concatenations would be a concatenation of those of the reference patterns B which are representative of the actually spoken word sequence. The segmentation points for such a partial pattern concatenation are named optimum segmentation points and denoted by u(x) (x=1, 2, . . . , y). The zeroth and the y-th optimum segmentation points u(0) and u(y) are the input pattern start and end points 0 and I.
For each partial pattern concatenation, a sum of the memorized partial similarities S<u(x-1), u(x)>'s is calculated. The optimum segmentation points u(x)'s are the segmentation points that give a maximum of such sums, namely: ##EQU2## Referring to the memorized partial recognition results n<u, m>'s by the optimum segmentation points u(x)'s, the word sequence is recognized as a concatenation of optimum ones of the reference patterns n<u(x-1), u(x)>'s.
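The maximization over the segmentation points can be sketched in Python as follows. The sketch uses a simple dynamic programming table T, where T[m] is the best sum for the part of the input pattern ending at point m. The table T, the function name `best_segmentation`, and the dictionaries S and n keyed by (u, m) are assumptions for illustration and are not taken from the referenced patent.

```python
def best_segmentation(I, S, n):
    """Maximise the sum of S<u(x-1), u(x)> over segmentations of 0..I.

    S and n map (u, m) pairs to the memorized partial similarity and
    partial recognition result; pairs that were never stored are absent.
    Returns (maximum sum, list of recognised reference pattern numbers).
    """
    NEG = float("-inf")
    T = [NEG] * (I + 1)      # T[m]: best sum for the prefix ending at m
    back = [None] * (I + 1)  # back[m]: start point of the last partial pattern
    T[0] = 0.0
    for m in range(1, I + 1):
        for u in range(m):
            if (u, m) in S and T[u] != NEG and T[u] + S[(u, m)] > T[m]:
                T[m] = T[u] + S[(u, m)]
                back[m] = u
    if T[I] == NEG:
        raise ValueError("no segmentation covers the whole input pattern")
    words, m = [], I
    while m > 0:             # trace the optimum segmentation points back
        u = back[m]
        words.append(n[(u, m)])
        m = u
    return T[I], words[::-1]


# Hypothetical memorized values for an input pattern with I = 4.
S = {(0, 1): 0.2, (0, 2): 1.0, (0, 4): 2.0, (1, 4): 1.0, (2, 4): 1.5}
n = {(0, 1): 1, (0, 2): 3, (0, 4): 5, (1, 4): 2, (2, 4): 7}
print(best_segmentation(4, S, n))
```

In this toy case the concatenation A(0, 2), A(2, 4) gives the largest sum, so the optimum segmentation points are 0, 2, and 4.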
It is very desirable to apply the known technique of dynamic programming to calculation of Equation (3). As will later be described in detail with reference to one of nearly ten figures of the accompanying drawing, the algorithm for the dynamic programming is given by a recurrence formula for a recurrence value g(i, j), which is called a recurrence coefficient in the patent cited hereinabove. The recurrence formula may be: ##EQU3## For each value of the end point m, the recurrence formula (5) is calculated from j=J, successively through (J-1), (J-2), . . . , and 2, to j=1. The initial condition is: EQU g(m, J)=s(m, J).
It is sufficient that the value of i be varied only in a window defined by: EQU j+m-J-r.ltoreq.i.ltoreq.j+m-J+r. (6)
A subset of similarity measures S(A(u, m), B.sup.n)'s for the end point m and the reference pattern B.sup.n, both under consideration, and for various start points u's in the interval U(m) defined by Equation (4) is thereby calculated in parallel according to: EQU S(A(u, m), B.sup.n)=g(u+1, 1).
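The backward calculation described above can be sketched in Python as follows. Because recurrence formula (5) is not reproduced in this text, the sketch assumes the form g(i, j) = s(i, j) + max[g(i+1, j), g(i+1, j+1), g(i+1, j+2)]. That form, the function name `similarity_subset`, and the dictionary that holds g are assumptions for illustration only. The initial condition, the window, and the read-out g(u+1, 1) follow the text.

```python
def similarity_subset(a, b, m, r):
    """Return {u: S(A(u, m), B)} for the start points u in U(m).

    a: input feature vectors a_1..a_I at list indices 0..I-1
    b: reference feature vectors b_1..b_J at list indices 0..J-1
    """
    J = len(b)
    NEG = float("-inf")

    def s(i, j):  # scalar product of a_i and b_j, 1-based indices
        return sum(x * y for x, y in zip(a[i - 1], b[j - 1]))

    g = {(m, J): s(m, J)}  # initial condition g(m, J) = s(m, J)
    for j in range(J, 0, -1):            # j = J, J-1, ..., 1
        lo = max(1, j + m - J - r)       # window: j+m-J-r <= i <= j+m-J+r
        hi = min(m, j + m - J + r)
        for i in range(hi, lo - 1, -1):
            if (i, j) == (m, J):
                continue
            # Assumed recurrence: best of three predecessors at i+1.
            best = max(g.get((i + 1, j + k), NEG) for k in range(3))
            if best != NEG:
                g[(i, j)] = s(i, j) + best

    u_lo = max(0, m - J - r)
    u_hi = min(m - 1, m - J + r)
    # S(A(u, m), B) = g(u+1, 1) for every reachable start point u.
    return {u: g[(u + 1, 1)] for u in range(u_lo, u_hi + 1) if (u + 1, 1) in g}


# Toy data: six input vectors and three reference vectors, all equal to [1.0].
a = [[1.0]] * 6
b = [[1.0]] * 3
print(similarity_subset(a, b, m=6, r=1))
```

A single backward pass over j for one end point m yields the similarity measures for all start points u in the interval U(m) at once, which is the parallel calculation mentioned in the text.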
A similarity calculator according to the present invention is for carrying out the dynamic programming of the type exemplified above. In the cited patent, the technique of dynamic programming is applied also to maximization of the above-identified sum as regards the segmentation points u(x)'s and the number y of the partial patterns A(u(x-1), u(x))'s. With this, real time processing is rendered possible for each subgroup of partial similarities S<u, m>'s and partial recognition results n<u, m>'s derived for each end point m and memorized in the memories with the addresses specified only by the start points u's. The latter technique of dynamic programming is, however, outside the scope of the instant invention and will not be described any further.
During calculation of each similarity measure subset S(A(u, m), B.sup.n)'s according to the recurrence formula (5), the value of j is varied from J down to 1 as described. For each value of j, the value of i is varied in the window (6). When the reference pattern duration J and the window width r are equal to 20 and 10, respectively, it is necessary to calculate the recurrence formula (5) about four hundred times for each end point m. It is therefore mandatory to calculate the recurrence formula (5) as many as forty thousand times to derive each subgroup of partial similarities S<u, m>'s and partial recognition results n<u, m>'s when the number N of reference patterns B or words is about 100. For real time processing, the end point m must be shifted towards the input pattern end point I at the sampling period, which is usually about 10 milliseconds. The recurrence formula (5) must therefore be calculated at a rate of four times per microsecond.
In connection with the above, it is worthwhile to note that a major part of calculation of the recurrence formula (5) is for the scalar product s(i, j). This is because each feature vector a.sub.i or b.sub.j is of about ten dimensions, namely, consists of about ten vector components. Multiplication must therefore be carried out about ten times, followed by addition, for calculation of each scalar product s(i, j). On the other hand, comparison is repeated three times for maximization in the recurrence formula (5). The rest of calculation is to add the scalar product s(i, j) and the maximum. The multiplication must therefore be carried out at as high a rate as forty times per microsecond.
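The rates quoted above can be checked with a few lines of Python. The variable names are assumptions for illustration. The count of 2r+1 values of i per value of j is read from the window (6), which gives 420 rather than exactly 400 recurrence calculations per reference pattern, in line with the rounded figures of the text.

```python
J, r = 20, 10        # reference pattern duration and window width
N = 100              # number of reference patterns
period_us = 10_000   # sampling period of 10 milliseconds, in microseconds
dims = 10            # vector components per feature vector

window_points = 2 * r + 1                  # values of i per value of j
per_reference = J * window_points          # "about four hundred"
per_end_point = N * per_reference          # "as many as forty thousand"
recurrences_per_us = per_end_point / period_us       # about four
multiplications_per_us = recurrences_per_us * dims   # about forty

print(per_reference, per_end_point)
print(recurrences_per_us, multiplications_per_us)
```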
It has therefore been unavoidable that the similarity calculator for use in a continuous speech recognition system of the type revealed in the above-cited patent should comprise elements that are operable at high speed and are accordingly expensive. Alternatively, the similarity calculator has had to carry out the calculation in parallel. In any event, the similarity calculator has been bulky and expensive.
Incidentally, the input pattern feature vectors a.sub.i are supplied to a calculator for the scalar product s(i, j) from an input pattern buffer. In order to calculate the recurrence formula (5) in the window (6) with the value of j varied from J down to 1 for each end point m, it has been necessary to make the input pattern buffer hold a plurality of input pattern feature vectors a.sub.i at a time, for example, about (J+r) in number.
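The figure of about (J+r) buffered vectors can be checked by enumerating the values of i that the window (6) touches for one end point m. In the following Python sketch the function name `required_input_frames` is hypothetical, and the upper limit of i is capped at m on the assumption that no feature vector beyond the end point is used.

```python
def required_input_frames(m, J, r):
    """Return (lowest i, highest i, count) of the a_i used for end point m."""
    needed = set()
    for j in range(1, J + 1):
        lo = max(1, j + m - J - r)
        hi = min(m, j + m - J + r)  # assumed cap at the end point m
        needed.update(range(lo, hi + 1))
    return min(needed), max(needed), len(needed)


# J = 20 and r = 10 as in the numerical example of the text, with m = 100.
print(required_input_frames(100, 20, 10))
```

The lowest index is reached at j=1 and equals m-J-r+1, and the highest is m, so J+r consecutive feature vectors must be available at a time.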