Many technologies benefit from adaptation to a user's particular linguistic style. For example, spell checkers, spam filters, acoustic and language models for speech recognizers, and the like, utilize adaptation techniques to optimize their efficiency and accuracy. Harvesting pre-existing documents and files provides one potential source of data that can be used to learn about the user's linguistic style.
However, typical adaptation techniques perform well only when the data used is representative of the user's linguistic style. The available documents and files frequently contain repeated content, such as multiple versions of the same document or mail threads with many replies to the same initial email. It is often difficult to keep track of which documents or data the adaptation system has already processed, and therefore to determine the relevance of a new file or document. For example, when the data includes a long mail thread, the multiple replies may repeat the original posting many times. Adapting directly from such data may unduly bias the personalized model toward the repeated data rather than toward a more representative spectrum of data.
Moreover, a document that exists in multiple versions is more likely to be the product of a group of people than of a single user, and is therefore less representative of the user's linguistic style than a document that occurs only once. This leaves an adaptation system vulnerable to two errors. First, the system may learn language patterns from other users with as much weight as those of the targeted user. Second, it may learn biased frequencies because it sees the same data too many times.
Speech recognition systems are one example of linguistic style adaptation. Many current speech recognition systems use language models that are statistical in nature. Such language models are typically generated using known techniques from a large amount of textual training data, which is presented to a language model generator. An N-gram language model may use, for instance, known statistical techniques such as Katz's technique or the binomial posterior distribution back-off technique. Using these techniques, the language model estimates the probability that a word w(n) will follow a sequence of words w(1), w(2), . . . , w(n−1). These probability values collectively form the N-gram language model. There are many known methods for estimating these probability values from a large text corpus presented to the language model generator. When such large text corpora are used, unintentional biasing due to repeated data may skew the adapted language model.
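The effect of repeated data on such probability estimates can be illustrated with a minimal sketch. The sketch below assumes plain maximum-likelihood bigram estimates rather than Katz's technique or any other back-off method, and the function name bigram_probs and the sample sentences are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def bigram_probs(sentences):
    """Maximum-likelihood estimates of P(w(n) | w(n-1)) from whitespace-tokenized text.

    Assumption: no smoothing or back-off is applied; this is not Katz's technique.
    """
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Count each adjacent word pair (previous word, following word).
        for prev, word in zip(tokens, tokens[1:]):
            pair_counts[prev][word] += 1
    probs = {}
    for prev, followers in pair_counts.items():
        total = sum(followers.values())
        probs[prev] = {w: c / total for w, c in followers.items()}
    return probs


# Hypothetical data: one original posting and one unrelated note by the user.
original = "please review the budget"
user_note = "please send the report"

# Corpus in which each text occurs once.
unique_corpus = [original, user_note]
# Corpus from a mail thread in which three replies each quote the original posting.
thread_corpus = [original] * 4 + [user_note]

p_unique = bigram_probs(unique_corpus)
p_thread = bigram_probs(thread_corpus)

print(p_unique["the"])  # {'budget': 0.5, 'report': 0.5}
print(p_thread["the"])  # {'budget': 0.8, 'report': 0.2}
```

In this sketch the quoted copies of the original posting raise the estimated probability of "budget" following "the" from 0.5 to 0.8, although the user wrote that phrase only once.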
It is with respect to these and other considerations that the present invention has been made.