In recent years, an increasing number of information terminal apparatuses such as PCs (personal computers), PDAs (personal digital assistants), mobile phones, and remote controllers have been provided with a speech input function. This has enabled users to input commands or keywords by speech. That is, such an information terminal apparatus makes it possible to identify, from an unknown speech inputted by a user, a keyword that the user wishes to input. Accurately and quickly specifying the position of a keyword is one of the important issues in the field of speech recognition techniques.
Non-patent Document 1 proposes a variable frame rate technique, applied to speech recognition, whose object is to quickly identify an input speech by eliminating speech wave frames having very similar features. In Non-patent Document 1, speech features are combined by defining an appropriate threshold value by mathematical derivation, and a speech feature vector sequence including a phonological feature structure is obtained. However, setting such a threshold value is very difficult, and the value directly affects the precision of recognition. Further, the method proposed in Non-patent Document 1 uses nonlinear matching, and therefore requires a large amount of calculation in the process of identifying a keyword.
In Non-patent Document 2, speech features are combined by calculating distances between vectors in a feature vector space and by defining an appropriate threshold value, and a speech feature vector sequence including a phonological feature structure is obtained. However, such a combination is targeted at a speech of a specific speaker. Therefore, representative feature points representing an identical phonological feature include feature information on a large number of speakers, and variations are large. This makes it necessary to resample a speech trace in a subsequent matching process, which increases the complexity of recognition. Further, Non-patent Document 2 does not provide a good solution for problems with resampling techniques, which makes it difficult to ensure the precision of recognition. Furthermore, the amount of calculation required to calculate the distances between vectors is very large, and the combination of features makes it very difficult to set an appropriate threshold value. Moreover, the setting of such a threshold value directly determines whether a speech trace including a phonological feature structure is estimated correctly. These factors prevent an increase in the accuracy of subsequent matching based on a speech feature space trace.
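The threshold sensitivity described above can be illustrated with a minimal sketch. It is not the method of Non-patent Document 1 or 2; it only assumes a simple rule in which consecutive feature vectors lying within a Euclidean distance `threshold` of the first vector of the current run are combined into their mean. The function names `merge_similar_frames` and `_mean`, the two-dimensional feature vectors, and the threshold values are all hypothetical and chosen only for illustration.

```python
import math


def _mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def merge_similar_frames(frames, threshold):
    """Combine runs of consecutive, similar feature vectors.

    Assumed rule (illustrative only): a frame joins the current run when its
    Euclidean distance to the first frame of that run is below `threshold`;
    otherwise the run is closed and replaced by its mean vector.
    """
    if not frames:
        return []
    trace = []
    run = [frames[0]]
    for frame in frames[1:]:
        if math.dist(run[0], frame) < threshold:
            run.append(frame)          # similar enough: extend the run
        else:
            trace.append(_mean(run))   # close the run with one representative
            run = [frame]
    trace.append(_mean(run))
    return trace


# Hypothetical 2-dimensional feature vectors for six consecutive frames.
frames = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [1.0, 1.0], [1.1, 1.0], [3.0, 3.0]]

for threshold in (0.15, 0.5, 5.0):
    trace = merge_similar_frames(frames, threshold)
    print(threshold, len(trace))
```

With these hypothetical frames, the length of the resulting trace changes with each of the three threshold values, which is the kind of dependence on the threshold that the passage identifies as a source of difficulty.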
Further, the technique disclosed in Non-patent Document 2 for establishing a keyword template also uses the combining method to estimate a keyword speech feature space trace. The content of a keyword is designed for a specific recognition task region. Specifically, the keyword speech trace is not generated from a learning corpus covering plural application regions; therefore, it is difficult to directly apply the keyword speech trace to an unspecified speaker region. In cases where the task region is changed, it is necessary to produce a new keyword speech template. Therefore, in the technique disclosed in Non-patent Document 2, the keyword speech trace template lacks general versatility and is difficult to apply in practice.
Because of the foregoing problems, the methods proposed by Non-patent Documents 1 and 2 cannot be applied in practice to information terminal apparatuses. It is therefore necessary to rapidly locate a keyword in an input speech and to reduce the amount of calculation.
[Non-patent Document 1] “Application of Variable Frame Rate Techniques to Speech Recognition”, Sun, F., Hu, G., and Yu, X., Journal of Shanghai Jiaotong University, Vol. 32, No. 8, August 1998
[Non-patent Document 2] “Keyword Spotting Method Based on Speech Feature Trace Matching”, Wu, Y. and Liu, B., Proceedings of the Second Conference on Machine Learning and Cybernetics, Nov. 2 to 5, 2003