Natural language processing (NLP) systems, such as speech recognition, machine translation, or other text-to-text applications, typically rely on language models to allow a machine to recognize speech or otherwise process language. The performance of these systems can be improved by customizing the language model for a specific domain and/or application. Such a model is typically built from text resources. For example, a model for a specific domain may be based on text resources that are specific to that domain.
Sometimes, text for a target domain is available from an institution that maintains a repository of texts, such as NIST or LDC. Other times, the data is simply collected manually.
Manual collection of data may be very difficult, and may add to system turnaround time and cost. Moreover, the amount of available data for a specific domain may be quite limited. To limit the effects of scarce domain-specific data, a topic-independent language model is often merged with a topic-specific language model generated from the limited in-domain data. This operation forms a hybrid model, which may then be smoothed to form a final topic-specific language model.
This approach, however, is often less accurate than an approach in which sufficient amounts of in-domain data are available.
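The merge-then-smooth procedure described above can be illustrated with a minimal sketch. The text does not say how the two models are merged or how the hybrid is smoothed, so the choices here are assumptions made only for illustration: unigram models, linear interpolation with a hypothetical weight `lam`, and additive smoothing with a hypothetical constant `eps`. The function names and the toy corpora are likewise invented.

```python
from collections import Counter


def mle_unigram(tokens, vocab):
    """Unsmoothed (maximum-likelihood) unigram probabilities over vocab."""
    counts = Counter(tokens)
    return {w: counts[w] / len(tokens) for w in vocab}


def merge(general, in_domain, lam):
    """Assumed merge step: linear interpolation of the two models."""
    return {w: lam * in_domain[w] + (1.0 - lam) * general[w] for w in general}


def smooth(model, eps):
    """Assumed smoothing step: add eps to every word, then renormalize."""
    norm = 1.0 + eps * len(model)
    return {w: (p + eps) / norm for w, p in model.items()}


# Toy corpora (illustrative only): plentiful general text, scarce in-domain text.
general_text = "the cat sat on the mat and the dog sat on the rug".split()
domain_text = "the patient received the dose".split()

# "nurse" appears in neither corpus; it stands in for an unseen in-domain word.
vocab = sorted(set(general_text) | set(domain_text) | {"nurse"})

general_lm = mle_unigram(general_text, vocab)   # topic-independent model
domain_lm = mle_unigram(domain_text, vocab)     # topic-specific model
hybrid_lm = merge(general_lm, domain_lm, lam=0.7)
final_lm = smooth(hybrid_lm, eps=0.01)
```

In this sketch the hybrid model gives nonzero probability to words seen in either corpus, while the smoothing step reserves a small amount of probability for words seen in neither. With so little in-domain text, the in-domain estimates themselves remain coarse, which is consistent with the accuracy limitation noted above.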