This invention relates to a continuous speech recognition system for automatically recognizing, by the use of the technique of dynamic programming, a word sequence substantially continuously spoken in compliance with a regular grammar, or the grammar of regular languages, known in the art.
A continuous speech recognition system is advantageous for use as a device for supplying data and/or program words to an electronic digital computer and a device for supplying control data to various apparatus. An example of the continuous speech recognition systems that are already in practical use is disclosed in U.S. Pat. No. 4,059,725 issued to Hiroaki Sakoe, the present applicant and assignor to the present assignee. In order to facilitate an understanding of the instant invention, the system will briefly be described at first.
A continuous speech recognition system of the type revealed in the referenced patent recognizes a sequence of one or more spoken words with reference to a predetermined number N of individually pronounced words, which are preliminarily supplied to the system as reference words. The word sequence is supplied to the system as an input pattern A defined by a time sequence of first through I-th input pattern feature vectors a.sub.i (i=1, 2, . . . , I) as: EQU A=a.sub.1, a.sub.2, . . . , a.sub.I. (1)
The reference words are selected to cover the words to be recognized by the system and are memorized in the system as first through N-th reference patterns B.sup.c (c=1, 2, . . . , N). An n-th reference pattern B.sup.n (n being representative of each of c) is given by a time sequence of first through J.sup.n -th reference pattern feature vectors b.sub.j.sup.n (j.sup.n =1, 2, . . . , J.sup.n) as: EQU B.sup.n =b.sub.1.sup.n, b.sub.2.sup.n, . . . , b.sub.J.sup.n. (2)
Merely for simplicity of denotation, the vectors will be denoted by the corresponding usual letters, such as a.sub.i and b.sub.j.sup.n, and the affixes c and n will be omitted unless it is desirable to resort to the more rigorous expressions for some reason or another. The input pattern feature vectors a.sub.i are derived by sampling the input pattern A at equally spaced successive instants i. Similarly, the reference pattern feature vectors b.sub.j are arranged at equally spaced sampling instants j. It is therefore possible to understand that the input and the reference pattern feature vectors a.sub.i and b.sub.j are on the respective time axes i and j. Where convenient, the reference patterns will be referred to simply as the reference words.
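The notation above translates directly into code. The following Python sketch shows the assumed data layout only; the vectors, their dimensions, and the variable names are invented for illustration and are not taken from the patent.

```python
# A minimal sketch of the data layout described in the text: the input
# pattern and the reference patterns are time sequences of feature
# vectors, represented here as plain tuples of numbers.

# Input pattern A = a_1, a_2, ..., a_I of Equation (1):
A = [(0.1, 0.2), (0.3, 0.1), (0.5, 0.4), (0.2, 0.0)]
I = len(A)                                  # input pattern length I

# Reference patterns B^1, ..., B^N of Equation (2); the length J^n of
# each reference pattern depends on the reference word it represents:
B = {
    1: [(0.1, 0.2), (0.4, 0.3)],
    2: [(0.5, 0.4), (0.2, 0.0), (0.1, 0.1)],
}
J = {n: len(b_n) for n, b_n in B.items()}   # reference pattern lengths J^n
```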
A fragmentary pattern A(u, m) is defined by: EQU A(u, m)=a.sub.u+1, a.sub.u+2, . . . , a.sub.m,
where u and m are called a start and an end point of the fragmentary pattern A(u, m). It is possible to select each of the successive instants i as the end point m. Usually, the start point u is a previous instant that precedes the end point m in a sequence of successive instants i. The fragmentary pattern A(u, m) is named a partial pattern in the referenced patent, without any difference from the partial pattern that will be described in the following.
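In code, a fragmentary pattern is simply a slice of the input pattern. The sketch below (the function name is mine) maps the 1-based indexing of the text onto Python's 0-based slicing.

```python
def fragmentary_pattern(A, u, m):
    """Return A(u, m) = a_{u+1}, a_{u+2}, ..., a_m.

    The text numbers feature vectors from 1; a Python slice A[u:m]
    yields exactly the (u+1)-th through m-th vectors in that numbering.
    """
    return A[u:m]

# Example: with start point u = 1 and end point m = 4,
# A(1, 4) consists of the 2nd through 4th feature vectors.
example = fragmentary_pattern(['a1', 'a2', 'a3', 'a4', 'a5'], 1, 4)
```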
At any rate, the fragmentary pattern A(u, m) is for comparison with each reference pattern B.sup.n. Inasmuch as a length or duration J.sup.n of the reference pattern B.sup.n is dependent on the reference word for that reference pattern B.sup.n, it is necessary to carry out the comparison with a plurality of fragmentary patterns selected from the input pattern A. It is convenient for application of the dynamic programming technique to the comparison to temporarily set the end point m at each of the successive instants i and to vary the start point u relative to the end point m. It is sufficient that the start point u be varied within an interval defined by: EQU m-J.sup.n -r.ltoreq.u.ltoreq.m-J.sup.n +r, (3)
where r represents a predetermined integer that may be about 30% of the reference pattern length J.sup.n. The integer r is known as a window length or width in the art. Such fragmentary patterns having a common end point m and a plurality of start points u's in the interval (3) will herein be called a group of fragmentary patterns and designated by A(u, m)'s. In other words, the group of fragmentary patterns A(u, m)'s is defined, by each instant m and the previous instants u's, as those parts of the input pattern feature vector sequence which consist of (u+1)-th through m-th input pattern feature vectors a.sub.u+1 's to a.sub.m.
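The interval (3) can be written as a small helper. The function name is mine, and the clipping of u to valid previous instants (0 .ltoreq. u &lt; m) is my addition for the sketch; the interval itself is taken from the text.

```python
def candidate_start_points(m, J_n, r):
    """Start points u satisfying interval (3):
        m - J^n - r <= u <= m - J^n + r,
    clipped so that u is a valid previous instant, 0 <= u < m."""
    lo = max(0, m - J_n - r)
    hi = min(m - 1, m - J_n + r)
    return list(range(lo, hi + 1))

# Example: end point m = 10, reference length J^n = 6, window r = 2
# gives start points u = 2, 3, 4, 5, 6.
starts = candidate_start_points(10, 6, 2)
```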
In order to quantitatively carry out the comparison, a group of similarity measures D(u, m, c) or D(A(u, m), B.sup.c) is calculated between each group of fragmentary patterns A(u, m)'s and every one of the reference patterns B.sup.c. It is convenient for this purpose to individually calculate a subgroup of similarity measures D(u, m, n)'s between the group of fragmentary patterns A(u, m)'s and each reference pattern B.sup.n. An elementary similarity measure D(u, m, n) between the fragmentary pattern A(u, m) and the reference pattern B.sup.n may be defined by: EQU D(u, m, n)=min.sub.j(i) [.SIGMA..sub.i=u+1.sup.m d(i, j(i))], (4) where j(i) represents a monotonically increasing function for mapping or warping the reference pattern time axis j to the input pattern time axis i. The first and the last feature vectors a.sub.u+1 and a.sub.m of the fragmentary pattern A(u, m) should be mapped to the first and the last feature vectors b.sub.1.sup.n and b.sub.J.sup.n of the reference pattern B.sup.n under consideration, respectively. In Equation (4), d(i, j) represents the Euclidean distance between an i-th input pattern feature vector a.sub.i and a j.sup.n -th reference pattern feature vector b.sub.j.sup.n. That is: EQU d(i, j)=.vertline.a.sub.i -b.sub.j.sup.n .vertline..
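The elementary distance d(i, j) is an ordinary Euclidean distance between two feature vectors, which may be sketched as follows (the function name is mine):

```python
import math

def d(a_i, b_j):
    """Euclidean distance d(i, j) = |a_i - b_j^n| between an input
    pattern feature vector and a reference pattern feature vector."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a_i, b_j)))
```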
A partial similarity D&lt;u, m&gt; and a partial recognition result N&lt;u, m&gt; are calculated according to: EQU D&lt;u, m&gt;=min.sub.c [D(u, m, c)] and EQU N&lt;u, m&gt;=argmin.sub.c [D(u, m, c)], for each similarity measure group D(u, m, c). With the end point m successively shifted towards the input pattern end point I, partial similarities D&lt;u, m&gt;'s and partial recognition results N&lt;u, m&gt;'s are calculated and stored in memories at addresses specified by m and u.
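The pair of a partial similarity and a partial recognition result is a minimum and its minimizing reference word, which can be sketched as below. The dictionary layout and the names are assumptions of this sketch only.

```python
def partial_recognition(D, u, m, N):
    """Return (D<u, m>, N<u, m>): the smallest similarity measure in the
    group D(u, m, c) over c = 1, ..., N, and the reference word c that
    attains it.  D maps (u, m, c) triples to similarity measures."""
    best_c = min(range(1, N + 1), key=lambda c: D[(u, m, c)])
    return D[(u, m, best_c)], best_c

# Hypothetical measures for one (u, m) pair and N = 3 reference words:
D = {(2, 5, 1): 3.0, (2, 5, 2): 1.5, (2, 5, 3): 2.2}
sim, word = partial_recognition(D, 2, 5, 3)
```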
It is possible to represent the input pattern A by various concatenations of partial patterns A(u(x-1), u(x)). A y-th partial pattern A(u(y-1), u(y)) in each concatenation is a fragmentary pattern A(u, m) having the start and the end points u and m at points or instants u(y-1) and u(y). The start point u(y-1) is the end point of a (y-1)-th partial pattern A(u(y-2), u(y-1)) in that concatenation. End points u(x) of x-th partial patterns A(u(x-1), u(x)) in a concatenation will be called x-th segmentation points of that concatenation.
The number of partial patterns A(u(x-1), u(x)) in a partial pattern concatenation will be designated by k. When the number k is equal to unity, the concatenation is the input pattern A per se. In general, a partial pattern concatenation consists of first through k-th partial patterns A(u(0), u(1)) or A(0, u(1)), A(u(1), u(2)), . . . , A(u(y-1), u(y)), . . . , and A(u(k-1), u(k)) or A(u(k-1), I).
One such partial pattern concatenation would be identical with a concatenation of those of the reference patterns B.sup.c which are representative of the word sequence under consideration. The number of partial patterns in such a partial pattern concatenation will be named an optimum number and denoted by k. The segmentation points for the partial pattern concatenation are called first through k-th optimum segmentation points u(x) (x=1, 2, . . . , k). The zeroth and the k-th optimum segmentation points u(0) and u(k) are the input pattern start and end points 0 and I.
For each partial pattern concatenation, a sum of the memorized partial similarities D&lt;u(x-1), u(x)&gt; is calculated. The optimum segmentation points u(x) are determined as a set of segmentation points u(x), k in number, that gives a minimum of such sums. Namely: EQU min.sub.k,u(x) [.SIGMA..sub.x=1.sup.k D&lt;u(x-1), u(x)&gt;], where u(0)=0 and u(k)=I.
Optimum partial recognition results n&lt;u(x-1), u(x)&gt; are selected by the use of the optimum segmentation points u(x) of the optimum number k from the memorized partial recognition results N&lt;u, m&gt;'s. A concatenation of the optimum partial recognition results n&lt;u(x-1), u(x)&gt; gives the result of recognition of the word sequence under consideration as an optimum concatenation of optimum ones of the reference words n(x), where x=1, 2, . . . , and k.
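The minimization over all concatenations, and the selection of the optimum partial recognition results, can be sketched as a one-dimensional dynamic program over end points followed by a trace-back. This is an illustrative sketch under assumed names and data layout, not the patented implementation; the hypothetical partial similarities below are invented.

```python
def recognize(I, partial_sim, partial_word):
    """Minimize the sum of partial similarities over all partial pattern
    concatenations by the dynamic program
        T(0) = 0,   T(m) = min over u of [T(u) + D<u, m>],
    then trace back the optimum segmentation points and concatenate the
    optimum partial recognition results.  partial_sim[(u, m)] = D<u, m>
    and partial_word[(u, m)] = N<u, m>."""
    INF = float('inf')
    T = [0.0] + [INF] * I          # T[m]: best sum ending at instant m
    back = [None] * (I + 1)        # back[m]: minimizing start point u
    for m in range(1, I + 1):
        for u in range(m):
            if (u, m) in partial_sim and T[u] + partial_sim[(u, m)] < T[m]:
                T[m] = T[u] + partial_sim[(u, m)]
                back[m] = u
    # Trace back from the input pattern end point I to the start point 0.
    words, m = [], I
    while m > 0:
        u = back[m]
        words.append(partial_word[(u, m)])
        m = u
    return T[I], words[::-1]

# Hypothetical partial similarities and recognition results for I = 4:
partial_sim = {(0, 2): 1.0, (2, 4): 1.0, (0, 4): 3.0, (0, 1): 0.4, (1, 4): 2.0}
partial_word = {(0, 2): 'A', (2, 4): 'B', (0, 4): 'C', (0, 1): 'D', (1, 4): 'E'}
total, words = recognize(4, partial_sim, partial_word)
```

Here the two-word segmentation 0-2-4 wins with a sum of 2.0 against the one-word reading (3.0) and the 0-1-4 segmentation (2.4).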
As described in the referenced patent, the algorithm for calculating Equation (4) according to the technique of dynamic programming is given by a recurrence formula for a recurrence coefficient g(i, j). The recurrence formula may be: EQU g(i, j)=d(i, j)+min [g(i+1, j), g(i+1, j+1), g(i+1, j+2)]. (5) For each end point m, the recurrence formula (5) is calculated from j=J, successively through (J-1), (J-2), . . . , and 2, down to j=1. The initial condition is: EQU g(m, J)=d(m, J).
It is sufficient that the value of i be varied only within a window defined by: EQU j+m-J.sup.n -r.ltoreq.i.ltoreq.j+m-J.sup.n +r.
The subgroup of similarity measures D(u, m, n)'s for the end point m and the reference pattern B.sup.n, both under consideration, and for various start points u's in the interval (3), is thereby calculated according to: EQU D(u, m, n)=g(u+1, 1).
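Taken together, the recurrence (5), the window on i, and the read-out D(u, m, n)=g(u+1, 1) can be sketched as follows. The particular three-term slope constraint is the one written in the recurrence above; the function names, list layout, and toy data are assumptions of this sketch, not the patented implementation.

```python
import math

def d(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity_subgroup(A, B_n, m, r):
    """Compute the subgroup D(u, m, n) = g(u+1, 1) for start points u in
    the interval m - J - r <= u <= m - J + r, using the backward recurrence
        g(i, j) = d(i, j) + min[g(i+1, j), g(i+1, j+1), g(i+1, j+2)]
    with initial condition g(m, J) = d(m, J), j running from J down to 1
    and i confined to the window j + m - J - r <= i <= j + m - J + r.
    Indices i and j follow the 1-based numbering of the text."""
    J = len(B_n)
    INF = float('inf')
    g = {(m, J): d(A[m - 1], B_n[J - 1])}   # initial condition
    for j in range(J, 0, -1):
        lo = max(1, j + m - J - r)
        hi = min(m, j + m - J + r)
        for i in range(hi, lo - 1, -1):     # i descending: g(i+1, j) is ready
            if i == m and j == J:
                continue                    # set by the initial condition
            best = min(g.get((i + 1, j), INF),
                       g.get((i + 1, j + 1), INF),
                       g.get((i + 1, j + 2), INF))
            if best < INF:
                g[(i, j)] = d(A[i - 1], B_n[j - 1]) + best
    return {u: g[(u + 1, 1)]
            for u in range(max(0, m - J - r), m - J + r + 1)
            if (u + 1, 1) in g}

# A toy input pattern whose last two vectors equal the reference pattern,
# so the start point u = 2 yields a perfect (zero-distance) match:
A = [(0.0,), (1.0,), (2.0,), (3.0,)]
B1 = [(2.0,), (3.0,)]
subgroup = similarity_subgroup(A, B1, m=4, r=1)
```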
In the cited patent, the technique of dynamic programming is applied also to minimization of the above-described sums with respect to the numbers k's of partial patterns A(u(x-1), u(x)) in various concatenations and to the segmentation points u(x) in each of such concatenations. The latter technique of dynamic programming, as modified for implementation of the present invention, will become clear as the description proceeds.
On the other hand, recent trends in the development of continuous speech recognition systems are towards systems operable as automata for recognizing word sequences pronounced as regular languages. Systems operable as automata are described, for instance, by S. E. Levinson in "The Bell System Technical Journal," Vol. 57, No. 5 (May-June 1978), pages 1627-1644, under the title of "The Effects of Syntactic Analysis on Word Recognition Accuracy."
It is possible to understand that a continuous speech recognition system revealed in patent application Ser. No. 58,598 filed July 18, 1979, by Hiroaki Sakoe, now U.S. Pat. No. 4,286,115, the instant applicant and assignor to the present assignee, is an approach to an automaton capable of recognizing regular language word sequences. The system disclosed in the referenced patent application is effective in raising the accuracy of recognition. The system is, however, operable only for recognition of word sequences for which transition of states of a finite-state automaton can take place along a single chain, as will again be discussed later with reference to one of the nearly a dozen figures of the accompanying drawing. In other words, the system is operable according to a specific state transition diagram alone.
Incidentally, finite-state automata per se are described in "Computation: Finite and Infinite Machines," authored by Marvin Minsky and published 1967 by Prentice-Hall, Englewood Cliffs, N.J., pages 11-29.