This invention relates to a speech recognition apparatus, and more particularly to a speech recognition apparatus capable of clearly recognizing the speech of a talker, as distinguished from noise, in an environment where much noise is present.
In the speech recognition means that have been put into practical use, pattern matching is performed by comparing an input pattern of the utterance of a talker with the reference patterns of registered words. When the input pattern matches a reference pattern, it is recognized as the registered word.
The pattern matching thus far used will be outlined.
If a parameter representing the feature of voice at time point "i" is designated by vector $a_i$, the input pattern A is expressed by the time series of the feature vectors:

$$A = (a_1, a_2, \ldots, a_i, \ldots, a_I) \tag{1}$$
where I is a parameter for the time duration of input speech pattern A.
Supposing that the reference pattern of word "n" as previously registered is $B^n$ ($n = 1, \ldots, N$), the reference pattern of registered word "n" is made up of $J_n$ feature vectors, each of the same kind as the feature vector $a_i$ of the input pattern, and is mathematically expressed:

$$B^n = (b_1^n, b_2^n, \ldots, b_j^n, \ldots, b_{J_n}^n) \tag{2}$$
In general, time duration $I$ of input pattern A is not necessarily equal to time duration $J_n$ of reference pattern $B^n$. For this reason, for the actual matching, a function $j(i)$ is first formed, which optimally maps the time base "i" of the input pattern onto the time base "j" of the reference pattern. Then, a maximum value $S(A, B^n)$ of the sum of the vector similarity measures $s(a_i, b_{j(i)}^n)$, as defined by the time bases $i$ and $j(i)$, is computed for each "n". The reference pattern providing the largest of these maximum values is judged to correspond to the registered word with the highest similarity measure for the input pattern A, and that word is selected as the recognized word.
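The matching outlined above can be illustrated by the following minimal sketch. It is not taken from any particular apparatus. The cosine similarity used for $s$, the constraint that $j(i)$ starts at the first reference frame, ends at the last one and advances by 0, 1 or 2 frames per input frame, and the names `cosine`, `similarity` and `recognize` are all assumptions made for illustration only.

```python
import math

def cosine(a, b):
    """Assumed vector similarity s(a, b): cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def similarity(A, B, s=cosine):
    """S(A, B): maximum over warping functions j(i) of the sum of s(A[i], B[j(i)]).

    Assumed constraints: j maps the first input frame to the first reference
    frame, the last to the last, and advances by 0, 1 or 2 per input frame.
    Returns -inf when no such j(i) exists (reference too long for the input).
    """
    neg = float("-inf")
    I, J = len(A), len(B)
    # g[i][j] is the best accumulated similarity with input frame i mapped to j.
    g = [[neg] * J for _ in range(I)]
    g[0][0] = s(A[0], B[0])
    for i in range(1, I):
        for j in range(J):
            best = max(g[i - 1][k] for k in range(max(0, j - 2), j + 1))
            if best > neg:
                g[i][j] = best + s(A[i], B[j])
    return g[I - 1][J - 1]

def recognize(A, references):
    """Return the registered word whose reference pattern maximizes S(A, B^n)."""
    scores = {word: similarity(A, B) for word, B in references.items()}
    return max(scores, key=scores.get), scores

# Toy two-dimensional feature vectors (hypothetical data).
references = {
    "up": [[1, 0], [1, 0], [0, 1]],
    "down": [[0, 1], [0, 1], [1, 0]],
}
A = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]  # I = 5, longer than either reference
word, scores = recognize(A, references)
print(word, scores)
```

In this toy case the input is longer than both reference patterns, and the warping function absorbs the difference in duration before the per-word maxima are compared.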
In some cases, noise is present immediately before and after the meaningful voice, or a sound unrelated to the meaningful voice, such as a lisp of the talker, is input before and after the utterance. In such cases, a high speech recognition performance cannot be obtained by a simple pattern matching process based on the reference pattern $B^n$ and the input pattern A.
To cope with this problem, there is known, for example, the "Speech Recognition Apparatus" disclosed in Japanese Patent Disclosure No. S58-181099. In this speech recognition apparatus, a correlation between the input speech signals from two speech input means is computed to distinguish the understandable voice from the noise contained in the input voice, and the understandable input voice is obtained from the result.
However, this recognition means requires two speech input means. Further, it is designed on the assumption that the noise is input equally to the two speech input means; local noise, for example, is not taken into account. Therefore, this means is not only complicated in construction but also still involves the following problem in improving the speech recognition performance.
In an environment where noise is contained in the input speech, it is essentially difficult to completely separate the voice from the noise and to extract only the voice. Therefore, an error made at the extraction stage in distinguishing the meaningful voice from the noise can cause a recognition error.
The difficulty of speech recognition in a noisy environment will be described in further detail.
Suppose that the input pattern containing noise is given by the expression (1) above. Of the input pattern, the partial pattern corresponding to the understandable voice is expressed as the partial pattern with the starting point at time point $i = l$ and the endpoint at time point $i = m$, and is mathematically expressed by

$$A_{(l,m)} = (a_l, a_{l+1}, \ldots, a_i, \ldots, a_m) \tag{3}$$

$$(1 \le l < m \le I)$$
The input pattern A as shown in FIG. 8B includes the partial patterns composed of only noise, which are mathematically expressed:

$$A_{(1,l-1)} = (a_1, a_2, \ldots, a_{l-1}) \tag{4}$$

$$A_{(m+1,I)} = (a_{m+1}, a_{m+2}, \ldots, a_I) \tag{5}$$
The input pattern A with noise is expressed:

$$A = A_{(1,l-1)} \oplus A_{(l,m)} \oplus A_{(m+1,I)} \tag{6}$$

The operator $\oplus$ means merely to arrange the feature vectors of each partial pattern time sequentially. Therefore, the input pattern expressed by expression (6) is the same as that of expression (1).
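The decomposition of the input pattern into partial patterns can be checked with the short sketch below. The helper names `partial` and `concat`, the frame labels, and the values of $l$, $m$ and $I$ are hypothetical and chosen only for illustration. Frames are numbered from 1, with both ends of a partial pattern inclusive, as in the expressions above.

```python
def partial(A, start, end):
    """Partial pattern A_(start,end): frames start..end, 1-indexed, inclusive."""
    return list(A[start - 1:end])

def concat(*parts):
    """Arrange the feature vectors of each partial pattern time sequentially."""
    out = []
    for part in parts:
        out.extend(part)
    return out

# Hypothetical input pattern with I = 10 frames; labels stand in for vectors.
A = [f"a{i}" for i in range(1, 11)]
l, m = 4, 7  # assumed start and end points of the understandable voice

head = partial(A, 1, l - 1)        # leading noise-only part
voice = partial(A, l, m)           # part corresponding to the voice
tail = partial(A, m + 1, len(A))   # trailing noise-only part

print(head, voice, tail)
print(concat(head, voice, tail) == A)
```

Because the three partial patterns share no frames and together cover every frame from 1 to $I$, arranging them in time order reproduces the original input pattern.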
When the similarity measure between the input pattern A and the reference pattern $B^n$ shown in FIG. 8A is computed by the conventional pattern matching, the pattern used for the matching contains the partial patterns $A_{(1,l-1)}$ and $A_{(m+1,I)}$, which are composed of only noise, and therefore differs from the reference pattern. Consequently, the similarity measure obtained is essentially small.
Even if the improvement of Japanese Patent Disclosure No. S58-181099 is made, it is impossible to separate exactly the partial pattern $A_{(l,m)}$ corresponding to the understandable voice. The voice can only be separated approximately, as indicated by the partial pattern shown in FIG. 8C and expressed by:

$$A_{(l-2,m-2)} = (a_{l-2}, a_{l-1}, \ldots, a_{m-3}, a_{m-2}) \tag{7}$$
The partial pattern $A_{(l-2,m-2)}$ of the input pattern, separated as shown in expression (7), excludes most of the partial patterns $A_{(1,l-1)}$ and $A_{(m+1,I)}$ composed of only noise. However, this partial pattern, which is subjected to the matching with the reference pattern $B^n$, still contains a part $A_{(l-2,l-1)}$ of the noise-only partial pattern. Further, it does not contain a part $A_{(m-1,m)}$ of the partial pattern corresponding to the voice. Therefore, even if that improvement is made, an optimum matching cannot be obtained, and the lowering of the similarity measure is unavoidable. Such lowering of the similarity measure adversely affects the result of the pattern matching with each reference pattern $B^n$. Therefore, the possibility of erroneous recognition is increased, hindering the improvement of the speech recognition performance.
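The lowering of the similarity measure described above can be reproduced on toy data with the sketch below. It is an illustration under stated assumptions, not a model of any actual apparatus. The frame similarity (1 minus the squared Euclidean distance, so that dissimilar frames contribute negatively), the warping constraint (fixed end points, advance of 0, 1 or 2 reference frames per input frame), the one-hot feature vectors, and the values $l = 4$, $m = 9$, $I = 11$ are all assumed.

```python
def s(a, b):
    """Assumed frame similarity: 1 minus the squared Euclidean distance."""
    return 1.0 - sum((x - y) ** 2 for x, y in zip(a, b))

def similarity(A, B):
    """Maximum over j(i) (steps of 0, 1 or 2; end points fixed) of sum s(A[i], B[j(i)])."""
    neg = float("-inf")
    prev = [neg] * len(B)
    prev[0] = s(A[0], B[0])
    for a in A[1:]:
        cur = [neg] * len(B)
        for j in range(len(B)):
            best = max(prev[max(0, j - 2):j + 1])
            if best > neg:
                cur[j] = best + s(a, B[j])
        prev = cur
    return prev[-1]

def one_hot(k, dim=7):
    """Toy feature vector with a single non-zero component."""
    return [1.0 if d == k else 0.0 for d in range(dim)]

def partial(A, start, end):
    """Partial pattern A_(start,end): frames start..end, 1-indexed, inclusive."""
    return A[start - 1:end]

NOISE = one_hot(6)                        # hypothetical noise frame
B = [one_hot(k) for k in range(6)]        # reference pattern: six voice frames
l, m = 4, 9                               # assumed true voice interval
A = [NOISE] * (l - 1) + B + [NOISE] * 2   # input pattern, I = 11

exact = similarity(partial(A, l, m), B)            # A_(l,m), voice only
full = similarity(A, B)                            # whole input, noise included
shifted = similarity(partial(A, l - 2, m - 2), B)  # A_(l-2,m-2), as in expression (7)
print(exact, full, shifted)
```

With this toy data, both the whole input pattern and the shifted partial pattern score lower than the exactly separated voice. The shifted pattern suffers twice: its leading noise frames contribute negatively, and the missing final voice frames force a poor alignment at the end point. The specific values depend entirely on the assumed data and similarity measure.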