The present invention relates to a continuous speech recognition apparatus, and particularly to an improvement in speech recognition accuracy affected by the detection of the starting and ending time points of continuously uttered speech.
In order to recognize continuously uttered speech, a method has conventionally been used in which a connected reference pattern, obtained by connecting a plurality of word reference patterns, is matched with an input pattern (continuous speech) by use of dynamic programming. The ordinal position of a word reference pattern within the connected reference pattern is hereinafter called a "digit". Words, syllables, semiwords, or clauses may be used as the reference patterns.
This method assumes that the start and end points of the input pattern have been determined in advance from the power or spectrum of the input speech, but in many cases these points are detected incorrectly because of a change in the signal-to-noise (S/N) ratio or the effect of noise. When such a detection error occurs, a silent portion may be added to the start or end of a word pattern in the input pattern, or the start or end of a word pattern in the input pattern may be cut off, which is likely to cause mistaken recognition.
In order to reduce the effect of this kind of speech detection error, a method is described on pages 1318 to 1325 of "Connected Spoken Digit Recognition by O(n)DP Matching" in The Transactions of The Institute of Electronics and Communication Engineers of Japan Vol. J66-D, No. 11 (1983), in which the start and end points of the input speech are not predetermined; instead, the input pattern and the reference patterns are matched from the vicinity of the start point ($i_{s1}$ to $i_{s2}$) to the vicinity of the end point ($i_{e1}$ to $i_{e2}$).
A conventional method will first be discussed before this method is explained.
A speech pattern A produced by continuous utterance is expressed as
$$A = a_1, a_2, \ldots, a_i, \ldots, a_I \tag{1}$$
which is called an input pattern. On the other hand, a reference pattern
$$B^n = b_1^n, b_2^n, \ldots, b_j^n, \ldots, b_{J^n}^n \tag{2}$$
is prepared for each word n, which is called a word reference pattern. DP-matching is performed between the input pattern A and a connected speech reference pattern $C = B^{n_1}, B^{n_2}, \ldots, B^{n_X}$, obtained by connecting X word reference patterns, to calculate a measure of difference (distance) between the two patterns. This difference is called the "dissimilarity measure". The word sequence giving the minimum dissimilarity measure D(A, C) shown by the following equation is taken as the recognition result. ##EQU1## wherein the minimum dissimilarity measure is determined by a dynamic programming algorithm described below. This algorithm is the so-called VLB (Clockwise DP) algorithm.
The initial condition is: ##EQU2## The recurrence formula (6) is calculated successively for each digit from i=1 to I, based on the boundary conditions shown by equation (5), wherein T(i,p) denotes the cumulative dissimilarity up to the p-th digit when calculated to the i-th frame of the input pattern. This is called the digit dissimilarity measure. G(p,n,j) denotes the cumulative dissimilarity measure up to the j-th frame of word n on the (p+1)th digit, and is called the temporary dissimilarity measure. For the n-th word of the reference pattern on the (p+1)th digit, under the boundary conditions: ##EQU3## the following recurrence equations are calculated from j=1 to $J^n$ (from the start point to the end point of the n-th reference pattern), ##EQU4## where j is the j' that gives the minimum G(p,n,j') on the right side of equation (6), H(p,n,j) indicates the start point of word n on the (p+1)th digit and is called the temporary start point indicator, and H'(p,n,j) is the value of H(p,n,j) at the preceding frame of the input pattern. Having been obtained in this manner, g(j) and H(p,n,j) are stored as G(p,n,j) and H(p,n,j), respectively. Here d(j) is the distance between the feature vector $a_i$ at input pattern time (frame) i and the feature vector $b_j^n$ at time (frame) j of the n-th reference pattern; it can be determined, for example, as the Chebyshev distance: ##EQU5##
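By way of illustration only, the frame distance d(j) may be sketched as follows. The sketch uses the standard definition of the Chebyshev distance (the largest absolute difference between corresponding vector components); the exact expression intended by the equation above is not reproduced here and may differ. The function name `chebyshev_distance` is hypothetical.

```python
def chebyshev_distance(a, b):
    """Standard Chebyshev distance: largest absolute component difference.

    `a` plays the role of the input feature vector a_i and `b` the role of
    the reference feature vector b_j^n. Both names are illustrative.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    return max(abs(x - y) for x, y in zip(a, b))


# One input frame compared with one reference frame (3-dimensional features).
print(chebyshev_distance([1.0, 4.0, 2.0], [2.0, 1.0, 2.0]))
```

Any other frame distance can be substituted without changing the structure of the recurrence, since the recurrence only requires a non-negative distance per frame pair.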
For the sake of minimization at the boundary of words, equation (9) is calculated: ##EQU6## wherein L(i,p+1) is the p-th digit start point when calculated to the i-th frame of the input pattern.
That is to say, the recurrence equation (6) is calculated for each pair (p,n) on one digit along the reference pattern time axis. This calculation is performed to the end point i=I along the input pattern time axis.
Recognition results of the input pattern are obtained according to the following procedure: ##EQU7## If $p \neq 0$, the processing of equation (11) is repeated under the conditions p=p-1 and i=l. If p=0, the processing is completed.
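The conventional procedure described above, comprising the frame-by-frame recurrence over each pair (p,n), the minimization at word boundaries, and the backward tracing of the result, may be sketched as follows. This is a minimal sketch under stated assumptions and not a reproduction of the equations: features are scalars, the frame distance is an absolute difference, the local path is assumed to allow j' in {j, j-1, j-2}, the number of digits is fixed in advance, and the table `L` stores the input frame at which the preceding digit ended. The function name `recognize` and all variable names are hypothetical.

```python
import math


def recognize(inp, words, num_digits):
    """Connected-word DP matching with fixed start (i=1) and end (i=I) points.

    inp: list of scalar features; words: dict mapping word name -> template.
    Returns (total dissimilarity, list of recognized word names).
    """
    INF = math.inf
    I = len(inp)
    names = list(words)
    # T[i][p]: best cumulative dissimilarity with p digits ending at frame i.
    T = [[INF] * (num_digits + 1) for _ in range(I + 1)]
    N = [[None] * (num_digits + 1) for _ in range(I + 1)]  # word ending digit p
    L = [[0] * (num_digits + 1) for _ in range(I + 1)]     # frame where digit p-1 ended
    T[0][0] = 0.0  # assumed initial condition: nothing matched yet
    # G[(p, n)][j]: (cost, entry frame) for word n on digit p+1 at the previous frame.
    G = {(p, n): [(INF, 0)] * (len(words[n]) + 1)
         for p in range(num_digits) for n in names}

    for i in range(1, I + 1):
        for p in range(num_digits):
            for n in names:
                ref = words[n]
                prev = G[(p, n)]
                # Boundary condition: the word may be entered from digit p.
                prev[0] = (T[i - 1][p], i - 1)
                cur = [(INF, 0)] * (len(ref) + 1)
                for j in range(1, len(ref) + 1):
                    # Assumed local path: come from j, j-1 or j-2 at frame i-1.
                    cost, start = min(prev[max(j - 2, 0): j + 1])
                    cur[j] = (cost + abs(inp[i - 1] - ref[j - 1]), start)
                G[(p, n)] = cur
                # Minimization at the word boundary (end of the template).
                if cur[-1][0] < T[i][p + 1]:
                    T[i][p + 1] = cur[-1][0]
                    N[i][p + 1] = n
                    L[i][p + 1] = cur[-1][1]

    # Trace back from the last frame, one digit at a time.
    result, i, p = [], I, num_digits
    while p > 0:
        result.append(N[i][p])
        i, p = L[i][p], p - 1
    return T[I][num_digits], result[::-1]


words = {"a": [1.0, 1.0], "b": [5.0, 5.0]}
print(recognize([1.0, 1.0, 5.0, 5.0], words, 2))
```

In this sketch only the previous input frame of each G(p,n,j) column is kept, which reflects the frame-synchronous order of calculation described above; the start and end points are fixed at the first and last frames of the input.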
As shown in equation (3), the dissimilarity measure between the input pattern A and the reference pattern C increases by the distance d(i,j) between them for each one-frame step of the calculation along the input pattern axis. As shown in equation (4), this calculation starts from a digit dissimilarity measure T(0,0) of zero at the 0-th frame, to which the dissimilarity obtained at the first frame of the input pattern is added. The accumulation then proceeds to the I-th frame at the end of the input pattern. In this case, as shown in FIG. 1, the DP-matching calculation is started after the start (i=1) and end (i=I) points have been set in advance.
On the other hand, in the method described in the reference, the initial value of the digit dissimilarity measure at the temporary starting point i=1 is assumed to be T(0,0) and, if the start point is $i \neq 1$, the initial value is set to $T(i,0) = d\delta \times i$ in the vicinity of the temporary starting point, $i_{S1} \le i \le i_{S2}$, as a penalty for deviation from the start point i=1. This allows an unfixed start point in the matching, so that the start point may lie anywhere in the range $i_{S1} \le i \le i_{S2}$.
However, if $d\delta = 0$, the initial value becomes zero throughout the vicinity of the temporary starting point $i_{S1} \le i \le i_{S2}$. Because the number of times d(i,j) is added decreases as the start point approaches $i_{S2}$, the matching is then biased toward later start points. Conversely, if $d\delta = \infty$, the only permitted starting point is i=1, and the start point is no longer free. It is therefore necessary to set $d\delta$ close to the average value of d(i,j), but this average depends upon the speaker and the words used, which makes it difficult to determine $d\delta$ appropriately.
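The sensitivity to the penalty value described above can be illustrated with a small sketch. It is a simplified illustration under stated assumptions, not the method of the reference: a single reference pattern is matched, features are scalars, the frame distance is an absolute difference, the local path allows j' in {j, j-1}, the end point is fixed at the last input frame, and skipping s leading frames costs the penalty value times s. The function name `match_with_free_start` and the parameter names `max_skip` and `d_delta` are hypothetical.

```python
import math


def match_with_free_start(inp, ref, max_skip, d_delta):
    """Return (total cost, number of leading input frames skipped).

    The match may begin after skipping s frames, 0 <= s <= max_skip, at an
    entry cost of d_delta * s, and must end on the last frame of both patterns.
    """
    INF = math.inf
    J = len(ref)

    def entry(s):
        if s > max_skip:
            return INF
        return 0.0 if s == 0 else d_delta * s  # avoids inf * 0 = nan

    # prev[j] = (cost, skipped) at the previous input frame; index 0 is the
    # entry state "s frames skipped, reference pattern not yet started".
    prev = [(entry(0), 0)] + [(INF, 0)] * J
    for i in range(1, len(inp) + 1):
        cur = [(entry(i), i)] + [(INF, 0)] * J
        for j in range(1, J + 1):
            cost, skipped = min(prev[j], prev[j - 1])
            cur[j] = (cost + abs(inp[i - 1] - ref[j - 1]), skipped)
        prev = cur
    return prev[J]


ref = [1.0, 2.0, 3.0]
inp = [1.25, 1.25, 2.25, 3.25]  # every frame is 0.25 away from its template frame
for d_delta in (0.0, 0.25, math.inf):
    print(d_delta, match_with_free_start(inp, ref, max_skip=2, d_delta=d_delta))
```

In this example every input frame belongs to the word, yet with a zero penalty the best path skips the first frame simply because one fewer distance term is accumulated. With an infinite penalty the start is pinned to the first frame, and with a penalty equal to the per-frame distance (0.25 here) the two alternatives cost the same, which is why the penalty must track the average of d(i,j).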