Techniques for converting speech to text by speech recognition are useful in many fields, such as transcription in the medical and legal fields and the creation of broadcast captions. Speech-to-text conversion can also facilitate database searching.
For example, telephone conversations in a call center are converted to text so that the speech is associated with the text; a string search over the text then easily locates the corresponding speech. This enables customer names, item numbers, and the like included in telephone conversations to be used as keywords for search refinement, so that monitoring can be performed with pinpoint accuracy. However, speech recognition results sometimes include misrecognitions, which reduce search accuracy. Reducing misrecognition is therefore an issue to be addressed.
A typical existing speech recognition technique uses an acoustic model that associates features of speech with phonemes and a language model that represents relations between words in a sequence. Among language models for accurate speech recognition, methods using n-gram models, described in Non Patent Literatures 1 to 3, are attracting attention. An n-gram model is a probabilistic language model generated through learning from learning example sentences and is used for predicting the next word from the previous (n−1) words.
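As an illustration only (not the specific methods of the cited literatures), a bigram model (n = 2) can be sketched as follows: word-pair counts are collected from example sentences, and the probability of the next word given the previous word is estimated by relative frequency. The corpus and function names here are hypothetical stand-ins.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    # Count how often each word follows each preceding word in the
    # learning example sentences, with <s>/</s> as boundary markers.
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1
    return pair_counts

def probability(pair_counts, prev, nxt):
    # Relative-frequency estimate of P(nxt | prev).
    total = sum(pair_counts[prev].values())
    if total == 0:
        return 0.0  # history never observed in the learning data
    return pair_counts[prev][nxt] / total

corpus = ["the call was recorded", "the call was short"]
model = train_bigram(corpus)
print(probability(model, "call", "was"))       # 1.0: "was" always follows "call"
print(probability(model, "was", "recorded"))   # 0.5: "recorded" follows "was" once of twice
```

A real recognizer would combine such probabilities with acoustic-model scores to rank candidate word sequences; this sketch shows only the language-model side.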
A problem with n-gram models is that a sequence of words that does not appear in the learning example sentences is assigned an appearance probability of zero; this is called the sparseness problem. A typical solution to this problem is smoothing (refer to Non Patent Literature 2 below).
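As one example of smoothing, add-one (Laplace) smoothing gives every word pair one pseudo-count, so an unseen pair receives a small nonzero probability instead of zero. This is a simple illustration and is not necessarily the scheme described in Non Patent Literature 2; the counts and vocabulary below are hypothetical.

```python
from collections import Counter, defaultdict

def smoothed_prob(pair_counts, vocab, prev, nxt):
    # Add-one (Laplace) smoothing: add 1 to every pair count and
    # normalize by (observed total + vocabulary size).
    total = sum(pair_counts[prev].values())
    return (pair_counts[prev][nxt] + 1) / (total + len(vocab))

counts = defaultdict(Counter)
counts["call"].update({"was": 2})  # the only observed pair after "call"
vocab = {"the", "call", "was", "recorded", "short"}

print(smoothed_prob(counts, vocab, "call", "was"))       # (2+1)/(2+5), about 0.43
print(smoothed_prob(counts, vocab, "call", "recorded"))  # (0+1)/(2+5), about 0.14, not zero
```

The unseen pair ("call", "recorded") now receives a small but nonzero probability, which avoids the zero-probability failure that would otherwise eliminate valid recognition hypotheses.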