The present invention relates to language models used in language processing. In particular, the present invention relates to creating language models for a desired domain.
Language processing systems such as automatic speech recognition (ASR) often must deal with performance degradation due to errors originating from a mismatch between the training data, the test data, and the actual domain data. As is well known, speech recognition systems employ an acoustic model and a statistical language model (LM) to provide recognition. Adaptation of the acoustic model and the language model for ASR has been extensively investigated and shown to improve ASR performance in some cases.
The statistical language model (LM) provides a prior probability estimate for word sequences. The LM is an important component in ASR and other forms of language processing because it guides the hypothesis search for the most likely word sequence. A good LM is known to be essential for superior language processing performance.
Training a LM, however, requires a large amount of relevant data, which is usually unavailable for task specific speech recognition systems. An alternative is to use a small amount of domain and/or user specific data to adapt a LM trained with a huge amount of task independent data (e.g., Wall Street Journal) that is much easier to obtain. For example, one may harvest emails authored by a specific user to adapt the LM and improve email dictation accuracy.
LM adaptation generally comprises four steps. The first step includes collecting task specific adaptation data, also known as, and referred to herein as, “harvesting”. The second step may include normalization, where adaptation data in written form are transformed into a standard form of words that would be spoken. Normalization is especially important for abbreviations, dates and times, and punctuation. In the third step, the adaptation data are analyzed and a task specific LM is generated. In the last step, the task specific LM is interpolated with the task independent LM. The most frequently used interpolation scheme is linear interpolation:
Pa(w|h) = μ Pt(w|h) + (1 − μ) Pi(w|h),
where w is the word, h is the history, Pa(w|h) is the adapted LM probability, Pt(w|h) is the task specific LM probability, Pi(w|h) is the task independent LM probability, and μ is the interpolation weight.
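The linear interpolation described above can be illustrated with a minimal Python sketch. The function name `interpolate`, the dictionary representation of P(w|h) for a single fixed history, and all probability values are assumptions made purely for illustration; they do not come from any particular LM toolkit.

```python
def interpolate(p_task, p_indep, mu):
    """Return Pa(w|h) = mu * Pt(w|h) + (1 - mu) * Pi(w|h) for one fixed history h.

    p_task and p_indep map each word w to its probability under the task
    specific and task independent LM, respectively (hypothetical layout).
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    vocab = set(p_task) | set(p_indep)
    # A word missing from one model is treated as having probability 0 there.
    return {
        w: mu * p_task.get(w, 0.0) + (1.0 - mu) * p_indep.get(w, 0.0)
        for w in vocab
    }


# Made-up distributions for a single history, e.g. h = "send the".
p_task = {"email": 0.6, "report": 0.3, "invoice": 0.1}
p_indep = {"email": 0.1, "report": 0.2, "troops": 0.7}

p_adapted = interpolate(p_task, p_indep, mu=0.5)
```

Because both inputs sum to one and μ lies between 0 and 1, the adapted distribution also sums to one. A larger μ moves the adapted LM toward the task specific estimates.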
Many have focused on comparing adaptation algorithms and/or finding relevant data automatically; however, the quality of the data is also important. For example, not all of the harvested email data for the user may be useful for adapting the LM, because there are parts that the user will never dictate, such as email headers, long URLs, code fragments, included replies, signatures, foreign language text, etc. Adapting on all of the harvested data may cause significant degradation in the LM. For instance, the following header is automatically generated by the email client application. Adapting the LM with this text may corrupt the LM.
>From: Milind Mahajan
>Sent: Wednesday, Sep. 1, 2004 5:38 PM
>To: Dong Yu
>Subject: LM Adaptation
Filtering out non-dictated text is not an easy job in general. One common way of doing this is to use hand-crafted rules (e.g., a set of regular expressions). This approach has three limitations. First, it does not generalize well to situations that have not been encountered. For example, one may have a rule to filter out Microsoft Outlook's email header, but that rule may not work with Yahoo email headers. Second, rules are usually language dependent. Porting rules from one language to another is nearly equivalent to rewriting the rules. Third, developing and testing rules is very costly.
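The first limitation can be seen in a small Python sketch of a hand-crafted rule. The regular expression `HEADER_RULE`, the helper `filter_lines`, the message body line, and the second header layout are all hypothetical and chosen only for illustration; the second layout does not reproduce the format of any specific email client.

```python
import re

# Hypothetical hand-crafted rule targeting quoted header lines such as
# ">From: ..." or ">Sent: ..." in the example header given earlier.
HEADER_RULE = re.compile(r"^>?\s*(From|Sent|To|Subject):\s")


def filter_lines(lines):
    """Keep only the lines that the header rule does not match."""
    return [line for line in lines if not HEADER_RULE.match(line)]


# The header from the example above, followed by a made-up body line.
message = [
    ">From: Milind Mahajan",
    ">Sent: Wednesday, Sep. 1, 2004 5:38 PM",
    ">To: Dong Yu",
    ">Subject: LM Adaptation",
    "Please review the adaptation results.",
]
kept = filter_lines(message)

# A made-up header layout with different field names: the rule removes nothing.
other_message = [
    "Date: 1 Sep 2004 17:38",
    "Reply-To: someone",
    "Please review the adaptation results.",
]
kept_other = filter_lines(other_message)
```

Here `kept` contains only the body line, while `kept_other` still contains both header lines, because the rule was written for one header layout and knows nothing about the other.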
In view of the foregoing, improvements can be made in processing data for creating a LM. A method that addresses one or more of the problems described above would be helpful.