The present invention relates to a speech recognition apparatus and, more particularly, to a speech recognition apparatus which can approximate the movable range of the feature vector at each time point, and execute a distance calculation such that the optimal combination of all the combinations within the range is obtained as a distance value.
According to a conventional matching method for speech recognition, input speech is converted into time-series data of feature vectors of one type, and standard speech is analyzed by the same method as that used for the input speech, converted into feature vectors of one type, and stored as a standard pattern. The distances between these vectors are then calculated by using a matching method such as DP matching, which allows nonlinear expansion/contraction in the time axis direction. The category of the standard pattern which exhibits the minimum distance is output as the recognition result. At each time point in matching, the distance or similarity between one feature vector of the input speech and one feature vector of a standard pattern is calculated.
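The conventional matching described above can be illustrated by the following minimal sketch. It assumes a Euclidean local distance and the simplest symmetric DP recursion with three predecessors; the function names (euclidean, dp_matching_distance, recognize) and the dictionary layout of the standard patterns are hypothetical and are not taken from any particular apparatus.

```python
import math


def euclidean(a, b):
    """Local distance between one input vector and one standard-pattern vector."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dp_matching_distance(inp, ref):
    """Accumulated distance between two feature-vector sequences.

    The recursion lets either sequence stretch or shrink along the time
    axis, which is the nonlinear expansion/contraction mentioned above.
    """
    n, m = len(inp), len(ref)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(inp[i - 1], ref[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # input advances
                                 cost[i][j - 1],      # standard pattern advances
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(inp, standard_patterns):
    """Return the category whose standard pattern gives the minimum distance."""
    return min(standard_patterns,
               key=lambda cat: dp_matching_distance(inp, standard_patterns[cat]))
```

Note that each cell of the cost table uses exactly one input vector and one standard-pattern vector, which is the one-to-one comparison at each time point described above.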
According to this method, however, since the voices generated by different speakers greatly vary in feature in many cases even if the contents of utterances are the same, high performance cannot be obtained with respect to speech from a speaker different from standard speakers. Even with respect to speech from a single speaker, stable performance cannot be obtained when the feature of the speech changes due to the speaker's physical condition, a psychological factor, and the like. To solve this problem, a method called a multi-template scheme has been used.
A multi-template is designed such that voices from a plurality of standard speakers are converted into a plurality of feature vectors to form standard patterns, so that the feature of the standard pattern at each time point is expressed by a plurality of feature vectors. The distance calculation is performed by one of the following: a so-called Viterbi algorithm, which obtains the distances or similarities for all the combinations of one input feature vector and the plurality of feature vectors of the standard patterns at each time point and uses the optimal one of the obtained distances or similarities; a so-called Baum-Welch algorithm, which expresses the feature at each time point by the weighted sum of all the distances or similarities; a semi-continuous scheme; or the like.
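The two ways of combining multi-template distances at one time point can be sketched as follows. This is an illustrative simplification under stated assumptions: a Euclidean local distance is used, the weights are supplied by the caller, and the names best_match_distance and weighted_sum_distance are hypothetical. It shows only the per-time-point combination, not a complete Viterbi or Baum-Welch procedure.

```python
import math


def euclidean(a, b):
    """Distance between one input vector and one template vector."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match_distance(x, templates):
    """Viterbi-style combination: keep only the optimal (smallest) distance."""
    return min(euclidean(x, t) for t in templates)


def weighted_sum_distance(x, templates, weights):
    """Baum-Welch-style combination: weighted sum over all template distances."""
    return sum(w * euclidean(x, t) for w, t in zip(weights, templates))
```

In both cases the input vector is compared only with a finite set of template vectors at each time point.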
In the conventional distance calculating methods for speech recognition, even in the method using a multi-template, differences between voices are expressed by only discrete points in a space, and distances or similarities are calculated from only finite combinations of the points. In many cases, these methods cannot satisfactorily express all events that continuously change, and hence high recognition performance cannot be obtained.
For example, such events include speech recognition performed in the presence of large ambient noise. In the presence of ambient noise, noise spectra are added to an input speech spectrum in an additive manner. In addition, the level of the noise varies at the respective time points, and hence cannot be predicted. In a known conventional scheme for solving this problem, standard pattern speech is formed at a finite number of SNRs (signal-to-noise ratios), and speech recognition is performed by using multi-templates with different SNR conditions.
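The formation of multi-templates under different SNR conditions can be illustrated by the following sketch. It assumes that spectra are represented as lists of power values per frequency bin, that the SNR is defined as the ratio of total speech power to total noise power in decibels, and that noise is added bin by bin; the names noise_gain_for_snr, add_noise and make_snr_templates are hypothetical.

```python
def noise_gain_for_snr(speech_power, noise_power, snr_db):
    """Gain applied to the noise power so that speech/noise equals snr_db."""
    return speech_power / (noise_power * 10 ** (snr_db / 10))


def add_noise(speech_spec, noise_spec, snr_db):
    """Add a scaled noise power spectrum to a speech power spectrum."""
    gain = noise_gain_for_snr(sum(speech_spec), sum(noise_spec), snr_db)
    return [s + gain * n for s, n in zip(speech_spec, noise_spec)]


def make_snr_templates(speech_spec, noise_spec, snr_list_db):
    """One noisy template per SNR condition (a finite set of discrete points)."""
    return {snr: add_noise(speech_spec, noise_spec, snr) for snr in snr_list_db}
```

For instance, templates formed at 0 dB and 10 dB do not coincide with the same speech observed at 5 dB, so an input at an SNR lying between the prepared conditions leaves a residual distance to every template.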
Since the SNR of input speech can take infinitely many values and is difficult to predict, it is theoretically impossible to solve the above problem by using a finite number of templates. It is seemingly possible to express a continuous change by a sufficient number of discrete points so as to improve the calculation precision to such an extent that the error can be approximately neglected. In terms of the cost of data collection, however, it is practically impossible to collect enough voices from many speakers to cover all SNR conditions in all noise environments. Even if such data collection were possible, the memory capacity and the amount of distance calculation required to express continuous events by many points would be enormous. In this case, therefore, an inexpensive apparatus cannot be provided.
Other events in which the feature of speech continuously changes include the case of a so-called Lombard effect in which speech generated in the presence of noise itself changes when the speaker hears the noise, and a case in which the features of voices from an extremely large number of speakers change.