1. Field of the Invention
The present invention relates to an apparatus for recognizing time series signals, such as human speech and other acoustic signals.
2. Description of the Background Art
Conventionally, recognition of time series signals, such as speech recognition, has been achieved basically by first performing so-called segmentation, in which a word boundary is detected in the time series signal, and then looking for a match between a reference pattern in a speech recognition dictionary and a word feature parameter extracted from the signal within the detected word boundary. Several speech recognition methods fall within this category of the prior art, including DP matching, HMM (Hidden Markov Model), and the Multiple Similarity (partial space) method.
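As a rough illustration of the matching step in this conventional approach, the following Python sketch applies DP matching (dynamic time warping) to a segment that is assumed to have already been cut out at a detected word boundary. The function names, the local path constraint, the Euclidean frame distance, and the toy one-dimensional feature vectors are all assumptions made for illustration and are not taken from the prior art described here.

```python
import math


def dtw_distance(a, b):
    """Accumulated DP-matching distance between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the best accumulated distance aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance (assumed)
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(segment, dictionary):
    """Return the dictionary word whose reference pattern is closest to the segment."""
    return min(dictionary, key=lambda word: dtw_distance(segment, dictionary[word]))


# Hypothetical reference patterns and an input segment (toy values).
dictionary = {
    "yes": [(0.0,), (1.0,), (2.0,), (1.0,)],
    "no": [(2.0,), (2.0,), (0.0,), (0.0,)],
}
segment = [(0.0,), (1.0,), (1.0,), (2.0,), (1.0,)]
best_word = recognize(segment, dictionary)
```

Because the segment is compared only after it has been cut out, any error in the detected word boundary is carried directly into the matching result.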
However, in more realistic noisy environments, there has been a practical problem that as many recognition errors are due to failure to detect the appropriate word boundary as are due to false pattern matching.
Namely, the word boundary has conventionally been detected using energy or pitch frequency as a parameter, so that highly accurate recognition can be achieved in tests performed in a quiet experiment room, but the recognition rate decreases drastically in more practical locations of use, such as inside offices, cars, stations, or factories.
To cope with this problem, a speech recognition method called the word spotting (continuous pattern matching) method has been proposed, in which the word boundary is taken to be flexible rather than fixed, but this method is associated with another kind of recognition error problem.
This can be seen from the diagram of FIG. 1, in which an example of a time series of the energy of a signal is depicted along with indications of three different noise levels. As shown in FIG. 1, the word boundary for this signal, indicated as the intervals (S1, E1), (S2, E2), and (S3, E3) for the noise levels N1, N2, and N3, respectively, becomes progressively narrower as the noise level increases from N1 to N2 and then to N3. However, the speech recognition dictionary is usually prepared by using word feature vectors obtained with specific word boundaries and a specific noise level, so that when such a conventional speech recognition dictionary is used with the word spotting method, matching against a word feature vector obtained from an unfixed word boundary for speech mixed with noise at a low signal/noise ratio becomes troublesome, and many recognition errors occur.
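The narrowing of the word boundary with rising noise level can be illustrated by a minimal Python sketch. It assumes a simple detector that takes the first and last frames whose energy exceeds the noise level as the start and end points; the function name, the energy values, and the numeric noise levels are hypothetical and are not read from FIG. 1.

```python
def detect_boundary(energy, noise_level):
    """Return (start, end) frame indices of the span above noise_level, or None."""
    above = [i for i, e in enumerate(energy) if e > noise_level]
    if not above:
        return None  # the whole signal is buried in the noise
    return above[0], above[-1]


# Hypothetical frame energies of a single word (arbitrary units).
energy = [1, 2, 4, 7, 9, 10, 9, 7, 4, 2, 1]

# Hypothetical noise levels standing in for N1 < N2 < N3.
boundaries = {
    name: detect_boundary(energy, level)
    for name, level in [("N1", 1.5), ("N2", 3.0), ("N3", 6.0)]
}
print(boundaries)
# {'N1': (1, 9), 'N2': (2, 8), 'N3': (3, 7)}
```

A feature vector extracted from the interval found at the highest noise level covers fewer frames than one extracted at the lowest, which is why a dictionary prepared at one boundary and one noise level matches poorly at another.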
On the other hand, for speech recognition methods using a fixed word boundary, there is a learning system for a speech recognition dictionary in which speech variations are taken into account artificially. No effective learning system is known for the word spotting method, however, so that the word spotting method has been plagued by the problem of excessive recognition errors.
Thus, although a sufficiently high recognition rate has been obtainable in experiments conducted by an experienced experimenter in a favorable noiseless environment, such as an experiment room, a low recognition rate has resulted in more practical noisy environments with an inexperienced speaker because of errors in word boundary detection. This has been a major obstacle to the realization of a practical speech recognition system. Furthermore, the speech recognition dictionary and word boundary detection have been developed rather independently of each other, so that no effective learning system has been known for speech recognition methods using an unfixed word boundary, such as the word spotting method.
It is also to be noted that these problems are relevant not only to speech recognition, but also to the recognition of other time series signals, such as vibrations or various sensor signals.