This invention relates to a system for recognizing an input string of words which are substantially continuously uttered in compliance with a regular grammar. The system according to this invention is operable by resorting to a dynamic programming technique.
A connected word recognition system has a wide field of application and is in practical use in recognizing continuously uttered or spoken words. The continuously uttered words may, for example, be computer programs, sentences in business documents, and directions for airplane or ship control. It is already known to restrict an input string or chain of such words by a regular grammar or syntax in order to raise the accuracy of recognition. Implementation of the syntactical restriction as a connected word recognition system of high performance is, however, not so easy as will presently become clear.
According to U.S. Pat. No. 4,326,101 issued to Hiroaki Sakoe, the present applicant and assignor to the present assignee, an input string or chain of words selected from a word set and substantially continuously uttered in compliance with a regular grammar, is supplied to a connected word recognition system as an input pattern A. On selecting the words, repetition is allowed according to the grammar. When the input pattern A has an input pattern length or duration I in terms of frame periods, the pattern A is represented by an input sequence of first through I-th input pattern feature vectors a_1 to a_I which are time sequentially arranged in first through I-th frame periods, respectively, as:

A = a_1, a_2, . . . , a_I.
Merely for simplicity of denotation, the vectors will be denoted in the following by ordinary letters, such as a, rather than by boldface symbols.
The word set is preliminarily selected so as to cover various input strings. The word set consists of a plurality of words, which are called reference words. It is possible to identify or designate the reference words by consecutive natural numbers. It will be assumed that the word set consists of first through N-th reference words 1 to N. An optional reference word will be referred to as an n-th reference word n.
The first through the N-th reference words 1 to N are memorized in a reference pattern memory as first through N-th reference patterns B^1 to B^N. An n-th reference pattern B^n representative of the n-th reference word n, is given by first through J-th reference pattern feature vectors b_1^n to b_J^n as:

B^n = b_1^n, b_2^n, . . . , b_J^n.

Depending on the circumstances, the affix "n" will be omitted. It will be presumed that the first through the N-th reference patterns B's have a common reference pattern length J merely for brevity of description and that the first through the J-th reference pattern feature vectors b_1 to b_J are successively arranged according to utterance of the n-th reference word n, namely, so as to represent the variation of the n-th reference pattern B with time.
The input string is correctly recognized, by using a finite-state automaton, as an optimum one of word concatenations, each of which is a string of words selected from the word set and concatenated in compliance with the grammar. A result of recognition is given by the optimum concatenation. In the following, a finite-state automaton will be referred to merely as an automaton.
In an article contributed in Japanese by Hiroaki Sakoe to a technical report published July 1980 by the Institute of Electronics and Communication Engineers of Japan, an automaton α is defined by:

α = <K, Σ, Δ, p_0, F>,
in which K represents a set of states p's. Like the reference words, the states p's will be identified by consecutive natural numbers, such as from 1 up to the total number of states. In this event, the state set K is more specifically represented by {p | p = 1, 2, . . .}. Σ represents a word set {n | n = 1, 2, . . . , N} of reference words 1 through N of the type described above. Δ represents a state transition table {(p, q, n)}, where a combination (p, q, n) represents a transition rule or state transition from a state p to a state q which is in the state set. Furthermore, p_0 represents an initial state and F, a set of final states at which the word concatenations can end.

The state p at which the n-th reference word n starts, is called a start state. The state q at which the n-th reference word n ends, is named an end state. The start and the end states of such a state pair may or may not be different from each other and need not be consecutively numbered among the natural numbers. An end state of a reference word is usually a start state of another reference word unless the end state in question is an element of the final state set F. The initial state is a start state of at least one predetermined reference word and will be denoted by 0 (zero).
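As an illustration, such an automaton might be represented in Python as follows. The particular states, words, and transition triples below are hypothetical examples chosen for this sketch, not taken from the patent:

```python
# Illustrative finite-state automaton alpha = <K, Sigma, Delta, p0, F>.
# States and reference words are numbered with consecutive natural
# numbers; the transition triples in Delta are hypothetical examples.

K = {0, 1, 2}            # state set (0 is the initial state)
Sigma = {1, 2, 3}        # word set: reference words 1 through N = 3
Delta = {
    (0, 1, 1),           # word 1 starts at state 0 and ends at state 1
    (1, 2, 2),           # word 2 starts at state 1 and ends at state 2
    (1, 1, 3),           # word 3 loops on state 1 (start state = end state)
}
p0 = 0                   # initial state
F = {2}                  # set of final states

# For each end state q, collect the (start state, word) pairs that reach it:
starts_of = {}
for (p, q, n) in Delta:
    starts_of.setdefault(q, []).append((p, n))
```

The triple (1, 1, 3) is an example of a loop in the transition table, the case that becomes important at the end of this section.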
Reverting to the above-specified Sakoe patent, a fragmentary or partial pattern A(u, m) of the input pattern A is defined by:

A(u, m) = a_(u+1), a_(u+2), . . . , a_m,

where u and m are called a start and an end point or period and are selected so that 0 ≤ u < m ≤ I. If used in connection with the whole input pattern A, the start and the end points are an initial point or period 0 and a final point or period I.
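The fragmentary pattern so defined amounts to a contiguous slice of the input sequence. A minimal sketch, with an illustrative five-frame pattern:

```python
A = ["a1", "a2", "a3", "a4", "a5"]   # illustrative input pattern, I = 5

def fragment(A, u, m):
    """Fragmentary pattern A(u, m) = a_(u+1), a_(u+2), ..., a_m,
    with 0 <= u < m <= I.  A is stored 0-indexed, so the i-th
    feature vector a_i is A[i - 1] and the slice is A[u:m]."""
    assert 0 <= u < m <= len(A)
    return A[u:m]
```

With u = 0 and m = I the fragment is the whole input pattern A.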
A local distance D(A(u, m), B^n) between the fragmentary pattern A(u, m) and the n-th reference pattern B^n will be denoted by D(u, m, n). Attention will be directed to a group of fragmentary patterns A(u, m)'s which have a common end point m. For convenience of the following description, the natural number for each input pattern feature vector will be denoted by i and called a first natural number. The natural number for each reference pattern feature vector will be designated by j and named a second natural number. The distance is used as a similarity measure.
It is possible to calculate a group of local distances D(u, m, n)'s between each reference pattern B and the fragmentary pattern group A(u, m)'s by resorting to a dynamic programming technique. The expression "dynamic programming" is ordinarily abbreviated to DP. The local distance group D(u, m, n)'s is obtained by iteratively calculating a distance recurrence formula, which may be:

g(i, j) = d(i, j) + min [g(i + 1, j), g(i + 1, j + 1), g(i + 1, j + 2)], (1)

where g(i, j) is herein named a new recurrence value; g(i + 1, j), g(i + 1, j + 1), and g(i + 1, j + 2) are called previous recurrence values; and d(i, j) represents an elementary distance ||a_i - b_j|| between an i-th input pattern feature vector a_i and a j-th reference pattern feature vector b_j.
Formula (1) is calculated, starting at an initial condition:

g(m, J) = d(m, J),

with the second natural number j successively varied from J down to 1 and by using the first natural numbers i's in an adjustment window:

j + m - J - r ≤ i ≤ j + m - J + r,

where r represents a predetermined positive integer called a window width or length in the art. The local distance group is given by:

D(u, m, n) = g(u + 1, 1),

for the start points u's in:

m - J - r ≤ u ≤ m - J + r. (2)
Calculation of Formula (1) is repeated for the respective reference patterns B's. Groups of local distances thereby obtained will be again denoted by D(u, m, n)'s.
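Under the definitions above, the backward calculation of Formula (1) inside the adjustment window can be sketched in Python as follows, for one end point m and one reference pattern. The Euclidean elementary distance and the 1-indexed, None-padded data layout are assumptions of this sketch:

```python
import math

def local_distances(A, B, m, r):
    """Formula (1): backward DP from the end point m of the input
    pattern A against one reference pattern B, restricted to an
    adjustment window of width r.  Returns {u: D(u, m)} for the start
    points u of Formula (2), i.e. m - J - r <= u <= m - J + r.
    A and B are 1-indexed here by padding index 0 with None."""
    I, J = len(A) - 1, len(B) - 1

    def d(i, j):                           # elementary distance ||a_i - b_j||
        return math.dist(A[i], B[j])

    INF = float("inf")
    g = [[INF] * (J + 3) for _ in range(I + 2)]   # extra rows/columns stay INF
    g[m][J] = d(m, J)                      # initial condition g(m, J) = d(m, J)
    for j in range(J, 0, -1):              # j varied from J down to 1
        for i in range(j + m - J + r, j + m - J - r - 1, -1):  # window
            if i < 1 or i > m or (i == m and j == J):
                continue
            prev = min(g[i + 1][j], g[i + 1][j + 1], g[i + 1][j + 2])
            if prev < INF:
                g[i][j] = d(i, j) + prev   # Formula (1)
    return {u: g[u + 1][1]                 # D(u, m) = g(u + 1, 1)
            for u in range(m - J - r, m - J + r + 1)
            if 0 <= u < m and g[u + 1][1] < INF}
```

Repeating this call for each reference pattern B^n yields the local distance group D(u, m, n)'s for the common end point m.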
By resorting to a DP technique once more, an extremum recurrence formula is introduced as:

T(m, q) = min over n, p, and u of [T(u, p) + D(u, m, n)], (3)

where (p, q, n) ∈ Δ. In Formula (3), T(m, q) and T(u, p) will be named a new and a previous extremum. An initial condition:

T(0, 0) = 0,
is set for Formula (3) in a first table memory, in which new extrema are successively stored for use, for the time being, as previous extrema. It is possible to calculate Formula (3) while local distances are calculated between each fragmentary pattern A(u, m) and the respective reference patterns B's.
Concurrently, the following substitution process is carried out:

N(m, q) = n, P(m, q) = p, U(m, q) = u, (4)

where n, p, and u represent those particular ones of the reference words n's, start states p's, and start points u's which give a new extremum. The particular word, start state, and start point are stored in second through fourth table memories, respectively.
Formulae (3) and (4) are calculated with the end point m varied from 1 up to I, whereupon a final extremum T(I, q) is stored in the first table memory. Final values N(I, q), P(I, q), and U(I, q) are stored in the second through the fourth table memories. The result of recognition is obtained as will later be described.
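The outer loop of Formulae (3) and (4) over the end points m and end states q might be sketched as below. The function local_dist, standing for the precomputed local distance group D(u, m, n) of Formula (1), is a hypothetical helper of this sketch:

```python
def recognize(I, Delta, final_states, local_dist):
    """Formulae (3) and (4): for each end point m = 1 .. I and each end
    state q, minimize T(u, p) + D(u, m, n) over all transitions
    (p, q, n) in Delta and start points u, recording the minimizing
    word, start state, and start point in the table memories N, P, U.
    local_dist(u, m, n) is assumed to return D(u, m, n), or None when
    u lies outside the window of Formula (2)."""
    INF = float("inf")
    T = {(0, 0): 0.0}                      # initial condition T(0, 0) = 0
    N, P, U = {}, {}, {}                   # second through fourth table memories
    for m in range(1, I + 1):
        for (p, q, n) in Delta:
            for u in range(0, m):
                Tu = T.get((u, p))         # previous extremum T(u, p)
                D = local_dist(u, m, n)    # local distance D(u, m, n)
                if Tu is None or D is None:
                    continue
                if Tu + D < T.get((m, q), INF):
                    T[(m, q)] = Tu + D     # Formula (3): new extremum
                    N[(m, q)] = n          # Formula (4): particular word,
                    P[(m, q)] = p          # particular start state,
                    U[(m, q)] = u          # and particular start point
    # best final extremum T(I, q) over the final state set
    best_q = min(final_states, key=lambda q: T.get((I, q), INF))
    return T, N, P, U, best_q
```

Tracing back through U and N from (I, best_q) would then recover the optimum word concatenation, as described later.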
It is possible to understand that the natural numbers i and j represent instants along first and second time axes i and j. A pair of instants (i, j) represents a grid or lattice point on an i-j plane.
It is to be noted that a considerable amount of calculation is necessary for Formula (1) and consequently for Formulae (3) and (4). This is because Formula (1) must be calculated for each end point m and for each reference pattern B by referring to a number of grid points even though the adjustment window is used. Incidentally, the process defined by Formulae (1) and (3) is to determine a new extremum T(m, q) at each instant m and for each end state q from the first and the second terms enclosed with a pair of brackets on the right-hand side of Formula (3). The second term represents groups of local distances D(u, m, n)'s between every one of the reference patterns B's and a group of fragmentary patterns A(u, m)'s which have a common end point at the m-th instant m and a plurality of start points at previous instants u's related to that instant m according to Formula (2). The first term represents a plurality of previous extrema T(u, p)'s decided at the previous instants u's for a plurality of start states p's which are related to that end state q and the reference patterns B's by (p, q, n) ∈ Δ.
On the other hand, an article was contributed by Cory S. Myers et al to IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume ASSP-29, No. 2 (April 1981), pages 284-297, under the title of "A Level Building Dynamic Time Warping Algorithm for Connected Word Recognition." Briefly speaking, the algorithm is for effectively carrying out the process defined hereinabove by Formulae (1) and (3).
For this purpose, a distance recurrence formula is iteratively calculated for each start state p of each reference word n. The recurrence formula may be:

g(i, j) = d(i, j) + min [g(i - 1, j), g(i - 1, j - 1), g(i - 1, j - 2)]. (5)
As will later be described with reference to one of twelve figures of the accompanying drawing, Formula (5) is calculated under a boundary condition:

g(u, 0) = T(u, p),
for the start state p under consideration. The start points u's are selected by provisionally assuming a range of the first natural numbers i's. The second natural number j is successively varied from 1 up to J. Until the second natural number j reaches a final point J of the reference word n in question, the first natural number i is varied from the start points u's towards those end points m's of the fragmentary patterns A(u, m)'s which will be called ultimate points.
Each time when a new recurrence value g(i, j) is calculated, the following substitution process is carried out under another initial condition:

h(u, 0) = u,

for a pointer or path value h(i, j):

h(i, j) = h(i - 1, j), h(i - 1, j - 1), or h(i - 1, j - 2), (6)

according as the previous recurrence value g(i - 1, j), g(i - 1, j - 1), or g(i - 1, j - 2) minimizes the second term on the right-hand side of Formula (5). The meaning of the pointer h(i, j) will later become clear.
When Formulae (5) and (6) are calculated up to the final point J, ultimate recurrence values g(m, J) and ultimate pointers h(m, J) are obtained. Inasmuch as each ultimate value g(m, J) or h(m, J) is obtained for a start state p and a reference word n having that start state p and inasmuch as the J-th reference pattern feature vector b_J corresponds to the end state q defined by a combination (p, q, n), it is possible to denote the values g(m, J) and h(m, J) by g_p^n(m, q) and h_p^n(m, q) depending on the circumstances. It is to be noted here that both p and q should be understood to represent natural numbers assigned, as p-th and q-th states in the state set K, to the start and the end states p and q of a state pair (p, q) of the n-th reference word n.
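One such level-building pass, Formulae (5) and (6) under the stated boundary and initial conditions, might be sketched as follows for one reference pattern B and one start state p. The Euclidean elementary distance, the 1-indexed padded data layout, and the provisional start-point range are assumptions of this sketch:

```python
import math

def level_pass(A, B, T_row, start_us):
    """Formulae (5) and (6): forward DP for one reference pattern B
    from one start state p.  T_row[u] supplies the boundary condition
    g(u, 0) = T(u, p); start_us is the provisionally assumed range of
    start points u.  Returns {m: (g(m, J), h(m, J))} of ultimate
    recurrence values and ultimate pointers for each reachable
    ultimate point m.  A and B are 1-indexed by padding index 0."""
    I, J = len(A) - 1, len(B) - 1
    INF = float("inf")
    g = [[INF] * (J + 1) for _ in range(I + 1)]
    h = [[None] * (J + 1) for _ in range(I + 1)]
    for u in start_us:
        g[u][0] = T_row[u]                 # boundary condition g(u, 0) = T(u, p)
        h[u][0] = u                        # initial condition h(u, 0) = u
    for j in range(1, J + 1):              # j varied from 1 up to J
        for i in range(min(start_us) + 1, I + 1):
            # Formula (5): choose the best of the previous recurrence values
            cands = [(g[i - 1][j], h[i - 1][j]),
                     (g[i - 1][j - 1], h[i - 1][j - 1])]
            if j >= 2:
                cands.append((g[i - 1][j - 2], h[i - 1][j - 2]))
            best_g, best_h = min(cands, key=lambda c: c[0])
            if best_g < INF:
                g[i][j] = math.dist(A[i], B[j]) + best_g
                h[i][j] = best_h           # Formula (6): carry the pointer along
    return {m: (g[m][J], h[m][J]) for m in range(1, I + 1) if g[m][J] < INF}
```

The returned pointer h(m, J) records the start point u at which the best path to (m, J) entered this level, which is what makes the later traceback possible.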
Such ultimate recurrence values g_p^n(m, q) and ultimate pointers h_p^n(m, q) are calculated for the respective reference words n's and for the start states p's which satisfy (p, q, n) ∈ Δ. Thereafter, an extremum:

T(m, q) = min over n and p of [g_p^n(m, q)], (7)
is calculated. At the same time, values N(m, q), P(m, q), and U(m, q) are decided according to:

N(m, q) = n, P(m, q) = p, U(m, q) = h_p^n(m, q), (8)

where n and p represent the particular reference word and start state of the type described heretofore. The pointer h_p^n(m, q) for the particular reference word n and the particular start state p will be called a particular pointer and be briefly denoted by h(m, q).
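The minimization of Formulae (7) and (8) over the words and start states can then be sketched as below; the layout of the ultimate-value map is an assumption carried over from the preceding sketch:

```python
def combine(ultimate):
    """Formulae (7) and (8): for each (m, q), minimize the ultimate
    recurrence values g_p^n(m, q) over the words n and start states p
    with (p, q, n) in Delta, recording the particular word, start
    state, and pointer.  `ultimate` is assumed to map (m, q, p, n)
    to a pair (g_p^n(m, q), h_p^n(m, q)) produced by the level pass."""
    T, N, P, U = {}, {}, {}, {}
    INF = float("inf")
    for (m, q, p, n), (g_val, h_val) in ultimate.items():
        if g_val < T.get((m, q), INF):
            T[(m, q)] = g_val              # Formula (7): extremum
            N[(m, q)] = n                  # Formula (8): particular word,
            P[(m, q)] = p                  # particular start state,
            U[(m, q)] = h_val              # and particular pointer
    return T, N, P, U
```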
The algorithm satisfactorily reduces the amount of calculation. It is, however, impossible to carry out the algorithm when the transition table Δ of the automaton α includes a loop, as will later be described. This is a serious defect of the Myers et al algorithm. In contrast, the connected word recognition system revealed in the above-referenced Sakoe patent is capable of dealing with loops.