The present invention relates to a continuous speech recognition method and device for automatically recognizing a plurality of concatenated words such as numerals and for producing an output in accordance with the recognized content.
Speech recognition devices have been considered to be effective means for performing man/machine communication. However, most of the devices which have been developed so far have a disadvantage in that only isolated or discrete words can be recognized, so that data input speed is very low. In order to solve the above problem, a continuous speech pattern device which uses a two-level dynamic programming (to be referred to as a two-level DP) algorithm is described in Japanese Patent Disclosure No. 51-104204. In the principle of this algorithm, pattern strings obtained by concatenating several reference patterns in all possible orders are defined as reference pattern strings of continuous speech. The input pattern as a whole is mapped onto the reference pattern strings. The number of reference patterns and their arrangement are determined so as to maximize the overall similarity measure between the input pattern and the reference pattern strings, and speech recognition is thereby performed. In practice, maximization is achieved in two stages: maximization over individual words and maximization over word strings. Both maximizations can be performed utilizing the DP algorithm.
The two-level DP algorithm will be described in detail below. Let a feature vector .alpha..sub.i be EQU .alpha..sub.i =(a.sub.1i, a.sub.2i, . . ., a.sub.Qi) (1)
then, a speech pattern A is defined as a time series of .alpha..sub.i : EQU A=(.alpha..sub.1, .alpha..sub.2, .alpha..sub.3, . . . , .alpha..sub.i, . . . , .alpha..sub.I) (2)
where I is the duration of the speech pattern A, and Q is the number of components of each feature vector. The speech pattern A is regarded as the input pattern.
Assume that N reference patterns B.sup.n (n=1, 2, . . ., N) are defined as a set of words to be recognized. Each reference pattern B.sup.n has J.sub.n feature vectors as follows: EQU B.sup.n =(.beta..sub.1.sup.n, .beta..sub.2.sup.n, . . ., .beta..sub.j.sup.n, . . ., .beta..sub.Jn.sup.n) (3)
where the feature vector .beta..sub.j.sup.n is a vector of the same form as the feature vector .alpha..sub.i, as follows: EQU .beta..sub.j.sup.n =(b.sub.1j.sup.n, b.sub.2j.sup.n, . . ., b.sub.Qj.sup.n) (4)
The partial pattern of the input pattern A which has a starting point l and an endpoint m on the time base i can be expressed as follows: EQU A.sub.(l, m) =(.alpha..sub.l, .alpha..sub.l+1, . . ., .alpha..sub.i, . . ., .alpha..sub.m) (5)
for 1 .ltoreq. l < m .ltoreq. I
Between the partial pattern A.sub.(l, m) and the reference pattern B.sup.n, a function j(i) which establishes a correspondence between the time base i of the input pattern and the time base j of the reference pattern is optimally determined. Partial matching is performed wherein the maximum value S(A.sub.(l, m), B.sup.n) of the sum of similarity measures s(.alpha..sub.i, .beta..sub.j.sup.n) (to be referred to as s.sub.n (i, j)) between the vectors paired by i and j(i) is computed by the DP algorithm. In the first stage, while the starting point l and the endpoint m are sequentially changed, a partial similarity measure S<l, m> is determined as the maximum value of S(A.sub.(l, m), B.sup.n) over n, together with a partial determination result n<l, m>, which is the value of n providing that maximum. Overall matching is performed in the second stage, wherein the number Y of words included in the input pattern and the (Y-1) boundaries l.sub.(1), l.sub.(2), . . ., l.sub.(Y-1) are optimally determined so as to maximize the sum of the partial similarity measures over continuous and nonoverlapping durations. The sum is given by the following relation: EQU S<1, I>=S<1, l.sub.(1) >+S<l.sub.(1) +1, l.sub.(2) >+ . . . +S<l.sub.(Y-1) +1, I> (6)
The boundaries l.sub.(1), l.sub.(2), . . ., l.sub.(Y-1) and the partial determination result n<l, m> determine the word sequence n<1, l.sub.(1) >, n<l.sub.(1) +1, l.sub.(2) >, . . ., n<l.sub.(Y-1) +1, I>.
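The second-stage overall matching described above can be illustrated by a minimal Python sketch. It assumes that the first stage has already produced the partial similarity measures S<l, m> and the partial determination results n<l, m>, held here in two dictionaries keyed by (l, m) with 1-based inclusive frame indices. The function name overall_matching, the dictionary layout, and the toy values in the usage example are hypothetical and are not taken from the cited disclosure.

```python
import math

def overall_matching(S, n, I):
    """Second-stage DP sketch (hypothetical helper, not the cited device).

    S[(l, m)]: partial similarity measure for input frames l..m (1-based, inclusive).
    n[(l, m)]: partial determination result (best word index) for the same span.
    Returns the maximized sum S<1, I> and the word sequence that achieves it.
    """
    best = [-math.inf] * (I + 1)   # best[m] = best total similarity for frames 1..m
    back = [0] * (I + 1)           # back[m] = starting point l of the last word ending at m
    best[0] = 0.0
    for m in range(1, I + 1):
        for l in range(1, m + 1):
            if (l, m) in S and best[l - 1] + S[(l, m)] > best[m]:
                best[m] = best[l - 1] + S[(l, m)]
                back[m] = l
    if best[I] == -math.inf:
        raise ValueError("no continuous, nonoverlapping segmentation covers frames 1..I")
    words = []
    m = I
    while m > 0:                   # trace the boundaries back from the end of the input
        l = back[m]
        words.append(n[(l, m)])
        m = l - 1
    words.reverse()
    return best[I], words

# Toy values, assumed purely for illustration.
S = {(1, 2): 1.0, (3, 5): 2.0, (1, 5): 2.5, (1, 3): 0.5, (4, 5): 1.0}
n = {(1, 2): 3, (3, 5): 7, (1, 5): 1, (1, 3): 2, (4, 5): 9}
total, words = overall_matching(S, n, 5)
```

In this toy case the segmentation into frames 1 to 2 and frames 3 to 5 gives the largest sum, so both the number of words Y and the boundary l.sub.(1) fall out of the same maximization.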
The similarity measure is defined by way of a function which maps the time base i of the input pattern A onto the time base j of a reference pattern B, in order to correct deviation between the time bases of the input pattern A given by relation (2) and the reference pattern B given by relation (3), as follows: EQU j=j(i) (7)
Assume that the similarity measure s(i, j) is exemplified by the following relation: ##EQU1## The similarity measure between the input pattern A and the reference pattern B is given as follows: ##EQU2## It is impractical to obtain the maximum value of relation (9) by computing all the possibilities for j=j(i). Instead, the DP algorithm is utilized as follows. Let the initial conditions be: ##EQU3## g(i, j) is computed in a range of i=2 to I and j=1 to J by the following recursive relation: ##EQU4## Therefore, S(A, B) of relation (9) is given by: EQU S(A, B) =g(I, J) (12)
In practice, the deviation of the time base does not exceed 50%, so that only the hatched region in FIG. 1, which is bounded by lines 11 and 12 on either side of a line 15 indicated by "i=j", need be considered. Therefore, recursive relation (11) need only be applied in the range: EQU j-r .ltoreq.i .ltoreq.j+r (13)
The above hatched region is called an adjustment window.
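The DP computation restricted to the adjustment window can be sketched in Python as follows. Because the similarity measure, the initial conditions and the recursive relation themselves are not reproduced in this text, the sketch makes its own assumptions: s(i, j) is taken to be an inner product of the two feature vectors, the initial condition is g(1, 1)=s(1, 1), and g(i, j) is obtained from the best of g(i-1, j), g(i-1, j-1) and g(i-1, j-2). These choices, and the function name dp_similarity, are illustrative only and may differ from the relations of the cited disclosure.

```python
import math

def dp_similarity(A, B, r):
    """Window-limited DP matching sketch (assumed recursion, see lead-in).

    A: input pattern, a list of I feature vectors.
    B: reference pattern, a list of J feature vectors.
    r: half-width of the adjustment window, j - r <= i <= j + r.
    Returns g(I, J), or -inf if (I, J) cannot be reached inside the window.
    """
    I, J = len(A), len(B)
    NEG = -math.inf

    def s(i, j):
        # Assumed similarity measure: inner product of the i-th and j-th vectors.
        return sum(a * b for a, b in zip(A[i - 1], B[j - 1]))

    # g is indexed 1..I and 1..J; row 0 and column 0 stay at -inf as guards.
    g = [[NEG] * (J + 1) for _ in range(I + 1)]
    g[1][1] = s(1, 1)                       # assumed initial condition
    for i in range(2, I + 1):
        # Only cells inside the adjustment window are evaluated.
        for j in range(max(1, i - r), min(J, i + r) + 1):
            prev = max(g[i - 1][j],
                       g[i - 1][j - 1],
                       g[i - 1][j - 2] if j >= 2 else NEG)
            if prev > NEG:
                g[i][j] = s(i, j) + prev
    return g[I][J]

# Matching a three-frame pattern against itself with r = 1.
pattern = [[1, 0], [0, 1], [1, 1]]
score = dp_similarity(pattern, pattern, 1)
```

For each i the inner loop visits at most (2 * r+1) values of j, which is the source of the factor (2 * r+1) in the computation counts.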
For one starting point l, the partial similarity measures are obtained for the endpoints m in the range indicated by reference numeral 14 in FIG. 1. The hatched region in FIG. 1 thus corresponds to about "(2 * r+1) * J.sub.n " computations for one starting point.
When relation (13) is used as a condition for the alignment range of time bases i and j, the total number of computations C.sub.1 of the similarity measure s(i, j) by the DP algorithm is approximated as follows, even if only the partial similarity measure "S<l, m>" is to be obtained in the first stage: EQU C.sub.1 =(2 * r+1) * J * I * N (14)
where I is the duration of the input pattern, N is the number of reference patterns, and J is the average duration of the reference patterns. For the second stage, data of the partial similarity measure "S<l, m>" and the partial determination result "n<l, m>" must be stored. The storage capacity M.sub.1 is obtained by the following approximation: EQU M.sub.1 =(2 * r +1) * I * 2 (15)
If the following conditions are given: EQU I=120, N=50, J=35 and r=12 (16)
then ##EQU5## In order to manufacture a real time speech recognition device which provides a recognition result within 0.5 seconds after an utterance is completed, a total of "5,250,000" computations must be completed within 2.3 seconds (=0.5 +120.times.0.015), provided that the durations I and J are each measured in frames of 15 msec and the full period from the start of the utterance to the response is used for computation. Thus, high speed computation of about 0.4 .mu.sec for each computation is required. Even if parallel processing is performed, a large scale device is needed, resulting in high cost.
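The figures above can be checked with a short Python calculation that evaluates relations (14) and (15) under conditions (16). The variable names are chosen here for illustration, and the storage figure is simply the value obtained from relation (15).

```python
# Conditions (16)
I, N, J, r = 120, 50, 35, 12

C1 = (2 * r + 1) * J * I * N      # relation (14): number of similarity computations
M1 = (2 * r + 1) * I * 2          # relation (15): storage for S<l, m> and n<l, m>

frame_sec = 0.015                 # 15 msec per frame
budget_sec = 0.5 + I * frame_sec  # utterance duration plus the 0.5 second response allowance
per_computation_usec = budget_sec / C1 * 1e6

print(C1, M1, round(budget_sec, 3), round(per_computation_usec, 3))
```

The window width 2 * r+1 is 25, so C.sub.1 is 25 * 35 * 120 * 50 = 5,250,000 and M.sub.1 is 25 * 120 * 2 = 6,000, while 2.3 seconds divided by 5,250,000 computations gives roughly 0.44 .mu.sec per computation, consistent with the figure of about 0.4 .mu.sec.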