Speech recognition techniques for detecting specific keywords from speech signals obtained by recording, e.g., conversations or speeches have been conventionally employed. Such speech recognition techniques use, e.g., HMMs (Hidden Markov Models) as acoustic models. In particular, a GMM-HMM has been proposed in which the output probability of each phoneme for features of input speech in each state of the HMM is calculated on the basis of a mixture of normal distributions (GMM: Gaussian Mixture Model) (see, e.g., A. J. Kishan, “ACOUSTIC KEYWORD SPOTTING IN SPEECH WITH APPLICATIONS TO DATA MINING,” Ph.D. Thesis, Queensland University of Technology, 2005 (to be referred to as non-patent literature 1 hereinafter)).
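The output probability described above can be illustrated with a minimal sketch, not taken from the cited literature: each HMM state holds a diagonal-covariance Gaussian mixture, and the emission score of a feature vector is the log-sum of the weighted component densities. All function and parameter names here are illustrative.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log output probability of feature vector x under one HMM state's
    diagonal-covariance Gaussian mixture (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    log_probs = []
    for w, mu, var in zip(weights, means, variances):
        # Log density of a diagonal multivariate normal component.
        log_det = np.sum(np.log(2.0 * np.pi * var))
        maha = np.sum((x - mu) ** 2 / var)
        log_probs.append(np.log(w) - 0.5 * (log_det + maha))
    # Log-sum-exp over mixture components for numerical stability.
    m = max(log_probs)
    return m + np.log(sum(np.exp(lp - m) for lp in log_probs))
```

In a full recognizer, one such mixture is trained per HMM state, and this value plays the role of the state's emission probability during decoding.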
The technique disclosed in non-patent literature 1 is referred to as word spotting, which postulates that a speech signal also contains spoken words other than the keywords to be detected. Therefore, in this technique, for keywords to be detected, a triphone GMM-HMM obtained by learning a GMM-HMM for each combination of the phoneme of interest and the preceding and succeeding phonemes is used for likelihood calculation of a maximum-likelihood phoneme string. For the other spoken words, a monophone GMM-HMM obtained by learning a GMM-HMM for each phoneme of interest independently of the preceding and succeeding phonemes is used for likelihood calculation of a maximum-likelihood phoneme string.
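A common way to combine the two models in such filler-model keyword spotting is a duration-normalized log-likelihood ratio: frames scored under the keyword (triphone) model are compared against the same frames scored under the filler (monophone) model, and the keyword is reported when the ratio exceeds a tuned threshold. The following is a generic sketch of that scoring step, not the exact procedure of non-patent literature 1; all names and the threshold value are illustrative.

```python
import numpy as np

def keyword_score(frame_ll_keyword, frame_ll_filler):
    """Duration-normalized log-likelihood ratio for keyword spotting.

    frame_ll_keyword : per-frame log-likelihoods from the keyword (triphone) model
    frame_ll_filler  : per-frame log-likelihoods from the filler (monophone) model
    """
    kw = np.asarray(frame_ll_keyword, dtype=float)
    fi = np.asarray(frame_ll_filler, dtype=float)
    return float(np.mean(kw - fi))

def spot(frame_ll_keyword, frame_ll_filler, threshold=0.0):
    # Report a keyword hit when the ratio exceeds the tuned threshold.
    return keyword_score(frame_ll_keyword, frame_ll_filler) > threshold
```

Normalizing by duration keeps scores of long and short keyword hypotheses comparable, which is why the mean rather than the sum of per-frame differences is used.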
Techniques which use neural networks have been proposed to improve the recognition accuracy in speech recognition techniques which use HMMs as acoustic models (see, e.g., U.S. Patent Application Publication No. 2012/0065976).
In the technique disclosed in U.S. Patent Application Publication No. 2012/0065976, a DBN (Deep Belief Network) (also referred to as a DNN (Deep Neural Network; to be referred to as a DNN hereinafter)) is used instead of a GMM to calculate the output probability of each phoneme in each state of the HMM. In other words, a feature vector including a plurality of features calculated from a speech signal is input to the DNN to calculate the output probability of each phoneme in each state of the HMM. This technique then obtains the product of the calculated output probability and the state transition probability for each phoneme in accordance with the HMM to calculate a likelihood for a maximum-likelihood phoneme string.
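The final step above, combining per-state output probabilities with state transition probabilities to find a maximum-likelihood string, is typically realized with Viterbi decoding in the log domain. The sketch below assumes the per-frame emission scores (standing in for DNN output probabilities) have already been computed; it is a generic illustration, not the specific implementation of the cited publication.

```python
import numpy as np

def viterbi_log(log_emissions, log_trans, log_init):
    """Maximum-likelihood HMM state sequence.

    log_emissions[t, s] : log output probability of state s at frame t
                          (e.g., derived from DNN outputs)
    log_trans[i, j]     : log transition probability from state i to j
    log_init[s]         : log initial probability of state s
    Returns (best log-likelihood, best state path).
    """
    T, S = log_emissions.shape
    delta = log_init + log_emissions[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        # Products of probabilities become sums of logs.
        scores = delta[:, None] + log_trans  # rows: predecessor states
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_emissions[t]
    # Backtrack from the best final state.
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return float(np.max(delta)), path[::-1]
```

Working in the log domain turns the product of output and transition probabilities into a sum, avoiding numerical underflow over long utterances.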