The present invention relates to a speech recognizing system for recognizing speech through similarity computation involving pruning on the basis of a threshold and, more particularly, to a speech recognizing system capable of reducing the number of reference patterns of words to be recognized.
As a means for reducing the processing effort in this type of speech recognition, a beam search process is well known in the art. The beam search process aims at processing effort reduction by pruning out reference patterns which are low in similarity to input speech and dispensing with a recognizing process on these low similarity reference patterns. The beam search process is detailed in Sakoe et al., "High rate DP Matching by Combined Frame Synchronization, Beam Search and Vector Quantization", Trans. of IECE of Japan, Vol. J71-D, No. 9, pp. 1650-1659, 1988-10 (hereinafter referred to as Literature 1). Literature 1 shows a pruning process in which each similarity is compared with a predetermined threshold, and only reference patterns with similarities higher than the threshold are left.
Murakami et al, "Expansion of Words to Speech Recognition and Free Speaking Recognition Utilizing Trigram", Technical Report of IECE of Japan, SP 93-127, 1994-01 (hereinafter referred to as Literature 2), shows a pruning process, in which a predetermined number of reference patterns with higher similarities are left. Literature 2 also shows another pruning process, in which search is performed for a threshold which gives a predetermined number of reference patterns that remain.
FIG. 6 is a block diagram showing the basic construction of a prior art speech recognition system (shown in Literature 1).
In this speech recognition system, a speech waveform is inputted as input speech data from an input terminal 301 to a speech analyzer 302 for its conversion to a feature vector series representing its acoustical features. Reference patterns representing acoustical features of words to be recognized are cumulatively stored in a reference pattern memory 303. Partial reference patterns (or branches) which are subjects of similarity computation and the prevailing accumulation likelihood are stored in a temporary pattern memory 304. A predetermined threshold is stored in a threshold memory 305.
A similarity computing unit 308 computes acoustical feature similarities between the input feature parameters and those branches, i.e., partial reference patterns, stored in the temporary pattern memory 304 whose similarities are higher than the threshold stored in the threshold memory 305. When similarity computations with all input feature parameters have been completed, or when the branches stored in the temporary pattern memory 304 have all come to belong to a single word, a determining unit 309 determines the branch stored in the temporary pattern memory 304 having the highest cumulative similarity as the result of recognition, and outputs this recognition result to an output terminal 310.
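The fixed-threshold pruning described for FIG. 6 can be illustrated by the following minimal sketch. It is not taken from Literature 1; the function name `prune_by_threshold`, the branch identifiers, and the similarity values are hypothetical and chosen only for illustration.

```python
def prune_by_threshold(branches, threshold):
    """Keep only branches whose cumulative similarity exceeds a fixed threshold.

    `branches` maps a (hypothetical) branch id to its cumulative similarity.
    """
    return {bid: sim for bid, sim in branches.items() if sim > threshold}


# Illustrative values only; real similarities depend on the acoustic model.
branches = {"a": 0.91, "b": 0.42, "c": 0.77, "d": 0.15}
survivors = prune_by_threshold(branches, threshold=0.5)
```

With a fixed threshold, the number of surviving branches depends entirely on the similarity values of the current frame and is not controlled directly.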
FIG. 7 is a block diagram showing a basic construction of another prior art speech recognition system (shown in Literature 2).
In this speech recognition system, a speech waveform is inputted as input speech data from an input terminal 401 to a speech analyzer 402 for its conversion to a feature vector series representing its acoustical features. Reference patterns representing acoustical features of words to be recognized are cumulatively stored in a reference pattern memory 403. Partial reference patterns (or branches) as subjects of similarity computation and the prevailing accumulation likelihood, are stored in a temporary pattern memory 404.
A similarity sorter 405 sorts the branches, i.e., the partial reference patterns, stored in the temporary pattern memory 404 in descending order of cumulative similarity. A similarity computing unit 408 computes acoustical feature similarities between the input feature parameters and a predetermined number of branches taken in order from the one of the highest similarity, as sorted by the similarity sorter 405 and stored in the temporary pattern memory 404, and updates the branches stored in the temporary pattern memory 404 and the accumulation likelihood. When similarity computations with all input feature parameters have been completed, or when the branches stored in the temporary pattern memory 404 have all come to belong to a single word, a determining unit 409 determines the branch stored in the temporary pattern memory 404 having the highest cumulative similarity as the result of recognition, and outputs the recognition result to an output terminal 410.
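The sort-based pruning described for FIG. 7 can be illustrated by the following minimal sketch. It is an illustration under assumed data structures, not the implementation of Literature 2; `prune_top_n` and the sample values are hypothetical.

```python
def prune_top_n(branches, n):
    """Keep the n branches with the highest cumulative similarity.

    Sorting all N branches costs on the order of N log N comparisons.
    """
    ranked = sorted(branches.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


# Illustrative values only.
branches = {"a": 0.91, "b": 0.42, "c": 0.77, "d": 0.15}
survivors = prune_top_n(branches, n=2)
```

Here the number of surviving branches is fixed at n, but every frame requires a full sort of the branch set.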
FIG. 8 is a block diagram showing a basic construction of a further prior art speech recognizer (shown in Literature 2).
In this speech recognizer, a speech waveform is inputted as input speech data from an input terminal 501 to a speech analyzer 502 for its conversion to a feature vector sequence representing its acoustical features. Reference patterns representing acoustical features of words to be recognized are stored in a reference pattern memory 503. Partial reference patterns (or branches) as subjects of similarity computation and the prevailing accumulation likelihood are stored in a temporary pattern memory 504.
A threshold searcher 505 searches for a threshold which leaves a predetermined number of branches, i.e., partial reference patterns, among those stored in the temporary pattern memory 504. A similarity computing unit 508 computes acoustical feature similarities between the input feature parameters and only those branches stored in the temporary pattern memory 504 whose similarities are higher than the threshold obtained in the threshold searcher 505, thereby updating the temporary pattern memory 504. When similarity computations with all input feature parameters have been completed, or when the branches stored in the temporary pattern memory 504 have all come to belong to a single word, a determining unit 509 determines the branch stored in the temporary pattern memory 504 having the highest cumulative similarity as the result of recognition, and outputs the recognition result to an output terminal 510.
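The threshold search described for FIG. 8 can be illustrated by the following minimal sketch, which assumes that the search is a bisection over the threshold value. The source does not specify the search procedure in this detail; `search_threshold`, its parameters, and the sample values are hypothetical.

```python
def search_threshold(similarities, target, lo, hi, iterations=30):
    """Bisect for a threshold that leaves at most `target` branches.

    Assumes more than `target` similarities exceed `lo` and at most
    `target` exceed `hi`. Each step counts the branches above the midpoint.
    """
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        count = sum(1 for s in similarities if s > mid)
        if count > target:
            lo = mid  # too many branches survive: raise the threshold
        else:
            hi = mid  # few enough survive: try a lower threshold
    return hi


# Illustrative values only.
similarities = [0.91, 0.42, 0.77, 0.15]
threshold = search_threshold(similarities, target=2, lo=0.0, hi=1.0)
survivors = sorted(s for s in similarities if s > threshold)
```

No sorting of the branches is needed, but the search must be repeated whenever the set of cumulative similarities changes.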
In the speech recognition system shown in FIG. 6, a predetermined threshold is given. In this case, it is impossible to control the number of branches that are left in the beam after pruning. This gives rise to a problem that the number of branches after pruning may be so large as to make real-time operation difficult, or so small that branches of the correct word are pruned away.
In the speech recognition system shown in FIG. 7, denoting the number of branches to be sorted by N, computational effort of the order of N log N is necessary. Therefore, as N is increased, the processing effort necessary for the sorting is increased to result in an excessive process time.
In the speech recognition system shown in FIG. 8, the threshold can be searched for efficiently by a two-branched (binary) search. In this case, the processing effort, i.e., computational effort, necessary for obtaining the threshold is reduced compared to the case of sorting all the branches as shown in FIG. 7. Nevertheless, computational effort of the order of log N is necessary. Therefore, as N is increased, the processing effort necessary for the threshold search is increased to result in an excessive process time.
In the above speech recognition systems, the threshold cannot be changed so as to give a desired number of branches. Therefore, it is impossible to obtain accurate and quick speech recognition with less computational effort.