The present invention relates to a technique for improving an unsupervised training method for an N-gram language model.
Nowadays, language models are used in various fields such as speech recognition, machine translation, and information retrieval. A language model is a statistical model that assigns an occurrence probability to a string of words or a string of characters. Language models include the N-gram model, the hidden Markov model, and the maximum entropy model; among them, the N-gram model is the most frequently used.
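As an illustration of how an N-gram model assigns an occurrence probability to a string of words, the following is a minimal sketch for N = 2 (a bigram model) using unsmoothed maximum likelihood estimates. The function names, the sentence boundary markers `<s>` and `</s>`, and the toy corpus are assumptions made for this example only; practical systems also apply smoothing so that unseen word strings do not receive zero probability.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams over tokenized sentences (hypothetical helper)."""
    history_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        # Pad with boundary markers so sentence start and end are modeled too.
        padded = ["<s>"] + list(words) + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts


def sentence_probability(words, history_counts, bigram_counts):
    """Multiply P(cur | prev) = count(prev, cur) / count(prev) over the string."""
    padded = ["<s>"] + list(words) + ["</s>"]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        if history_counts[prev] == 0:
            return 0.0  # history never seen in training; no smoothing in this sketch
        prob *= bigram_counts[(prev, cur)] / history_counts[prev]
    return prob


# Toy training data, invented for illustration.
corpus = [["call", "the", "center"], ["call", "the", "office"]]
history_counts, bigram_counts = train_bigram(corpus)
p_seen = sentence_probability(["call", "the", "center"], history_counts, bigram_counts)
p_unseen = sentence_probability(["send", "mail"], history_counts, bigram_counts)
```

In this sketch a word string containing a word pair absent from the training data receives probability zero, which is one way to see why a model trained from old data does not cover new expressions.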
A language model is built by statistical training from training data. However, expressions used by users change from day to day, and new expressions are unknown to the language model trained from old training data. Therefore, the training data and the language model need to be updated on a regular basis.
Manual updating of the training data and the language model is not realistic. However, large amounts of speech data (hereinafter called field data) have recently become available with the emergence of cloud-type and server-type speech recognition systems that provide voice search or speech recognition services, for example, in call centers. The results of unsupervised automatic recognition of these field data are useful for complementing the training data of the language model. The following describes conventional techniques regarding the building of a language model and the use of automatic speech recognition results to build a language model or an acoustic model.
Norihiro Katsumaru, et al., “Language Model Adaptation for Automatic Speech Recognition to Support Note-Taking of Classroom Lectures,” The Special Interest Group Technical Reports of IPSJ, SLP, Speech language information processing, vol. 2008, no. 68, pp. 25-30, 2008 discloses a technique for language model adaptation using university lecture data, in which a language model is built from utterance units that include a high proportion of content words and whose speech recognition reliability is higher than or equal to a threshold value.
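The selection of utterance units by a reliability threshold and a content word proportion can be sketched as follows. This is only an illustrative sketch and not the implementation of the cited literature: the function name, the threshold values, the content word list, and the sample recognition results are all hypothetical.

```python
def select_utterances(utterances, content_words,
                      min_confidence=0.8, min_content_ratio=0.5):
    """Keep recognized utterances that are reliable and rich in content words.

    utterances: iterable of (recognized_text, confidence) pairs.
    Both thresholds are assumed values chosen only for this example.
    """
    selected = []
    for text, confidence in utterances:
        words = text.split()
        # Discard empty results and results below the reliability threshold.
        if not words or confidence < min_confidence:
            continue
        # Proportion of words in the utterance that are content words.
        ratio = sum(w in content_words for w in words) / len(words)
        if ratio >= min_content_ratio:
            selected.append(text)
    return selected


# Hypothetical recognition results with per-utterance confidence scores.
recognized = [
    ("entropy measures uncertainty", 0.92),
    ("um so well", 0.95),
    ("gradient descent converges", 0.41),
]
content_words = {"entropy", "measures", "uncertainty",
                 "gradient", "descent", "converges"}
kept = select_utterances(recognized, content_words)
```

Here only the first utterance is kept: the second is reliable but contains no content words, and the third falls below the reliability threshold.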
JP2011-75622 discloses a technique for acoustic model adaptation using multiple adaptation data, each composed of speech data and text to which a reliability obtained from speech recognition of the speech data is attached. In this technique, adaptation data with a relatively high reliability are used directly for unsupervised adaptation. Among adaptation data with relatively low reliabilities, the speech recognition text is manually corrected, with priority, for data having a phoneme environment that is not included in the high-reliability adaptation data, and the corrected data are used for supervised adaptation. Data with relatively low reliabilities whose text is not corrected are given a weight lower than that of the other data and are used for unsupervised adaptation.
JP2008-234657 discloses a pruning method capable of pruning a language model to a size suitable for the application, in which all the highest order n-grams and their probabilities are removed from an n-gram language model M0 to generate an initial base model, and some of the most important pruned n-gram probabilities are added back to this initial base model to provide a pruned language model.
Hui Jiang, “Confidence measures for speech recognition: A survey,” Speech Communication, Vol. 45, pp. 455-470, 2005 presents confidence measures, grouped into three categories, that indicate the reliability of recognition results of automatic speech recognition. This literature is cited as a reference showing an example of the reliability usable in the present invention.