The present invention relates to language models used in language processing. In particular, the present invention relates to adapting language models for a desired domain.
Language processing systems such as automatic speech recognition (ASR) systems often suffer performance degradation caused by mismatch between the training and test data on the one hand and the actual domain data on the other. As is well known, speech recognition systems employ an acoustic model and a statistical language model (LM) to perform recognition. Adaptation of the acoustic model to a new domain has been addressed with limited success; however, adaptation of the language model has not achieved satisfactory results.
The statistical language model (LM) provides a prior probability estimate for word sequences. The LM is an important component in ASR and other forms of language processing because it guides the hypothesis search for the most likely word sequence. A good LM is known to be essential for superior language processing performance.
Commonly, the LM uses smoothed n-gram statistics gathered from a large amount of training data expected to be similar to the test data. However, the definition of similarity is loose, and it is usually left to the modeler to decide, most often by trial and error, which data sources should be used for a given domain of interest.
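By way of illustration only, the following minimal sketch shows one way smoothed n-gram statistics might be gathered from training text. It assumes a bigram model (n = 2) with add-one (Laplace) smoothing, which is chosen here only for brevity and is not asserted to be the smoothing used in any particular system. The function names `train_bigram_counts` and `bigram_prob`, the sentence boundary tokens, and the sample sentences are all hypothetical.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # hypothetical sentence boundary markers


def train_bigram_counts(sentences):
    """Gather history counts, bigram counts, and the set of predictable tokens."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = [BOS] + sentence.lower().split() + [EOS]
        vocab.update(tokens[1:])            # every token that can be predicted
        history_counts.update(tokens[:-1])  # every token that serves as a history
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return history_counts, bigram_counts, vocab


def bigram_prob(prev, word, history_counts, bigram_counts, vocab):
    """Add-one smoothed estimate of P(word | prev) for a word in the vocabulary."""
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + len(vocab))


# Illustrative training data only; not taken from any real corpus.
training = ["show flights to boston", "show flights to denver"]
hist, bigr, vocab = train_bigram_counts(training)

p_seen = bigram_prob("to", "boston", hist, bigr, vocab)     # bigram seen in training
p_unseen = bigram_prob("to", "flights", hist, bigr, vocab)  # bigram never seen
total = sum(bigram_prob("to", w, hist, bigr, vocab) for w in vocab)
```

In this sketch a bigram never observed in training still receives a small nonzero probability, and the estimates for a given history sum to one over the vocabulary. A word absent from the vocabulary altogether is not handled here.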
Invariably, mismatch exists between the training or test data and the actual domain or “in-domain” data, which leads to errors. One source of mismatch is out-of-vocabulary words in the test data. For example, an air travel information system originally designed for one airline may not work well for another because of mismatch in the names of the cities, airports, etc. served by the company in question.
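As a rough illustration of the out-of-vocabulary problem, the sketch below measures the fraction of test tokens that never occur in the training text. The function name `oov_rate`, the whitespace tokenization, and the sample sentences are assumptions made for this example only.

```python
def oov_rate(train_sentences, test_sentences):
    """Return the fraction of test tokens that do not appear in the training data."""
    vocab = {w for s in train_sentences for w in s.lower().split()}
    test_tokens = [w for s in test_sentences for w in s.lower().split()]
    if not test_tokens:
        return 0.0  # no test tokens, so nothing can be out of vocabulary
    unseen = [w for w in test_tokens if w not in vocab]
    return len(unseen) / len(test_tokens)


# Hypothetical data: one airline's cities in training, another's in test.
train = ["flights from boston to denver"]
test = ["flights from spokane to boise"]
rate = oov_rate(train, test)  # "spokane" and "boise" are unseen: 2 of 5 tokens
```

A high rate on in-domain text suggests that the training vocabulary does not cover the target domain well.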
Another potential source of mismatch is a difference in language style. For example, the language style of the news domain differs from that of the air travel information domain. A language model trained on newswire or other general text may not perform very well in an air travel information domain.
Although various techniques have been tried for adapting an LM trained on a large amount of background data, none has achieved superior results, and thus improvements in LM adaptation are continually needed. A method that addresses one or more of the problems described above would be helpful.