In the past, especially with respect to text editing, user mistakes such as spelling errors were corrected by conventional spell-checking systems based on lookup tables. Subsequently, more sophisticated spell-checking systems were developed that took into account the context in which a word occurred. These systems traditionally relied on so-called training corpora, which contain examples of the correct use of words in the context of the sentences in which they occur.
One of the major problems with such context-sensitive spelling-correction systems is their inability to handle situations in which the corpus on which they were trained is dissimilar to the target text to which they are applied. This is an important problem for text correction because words can be used in a wide variety of contexts; thus there is no guarantee that the particular contextual uses of the words seen in the target text will also have been seen in the training corpus.
Consider, for example, an algorithm whose job is to correct context-sensitive spelling errors; these are spelling errors that happen to result in a valid English word, but not the word that was intended--for example, typing "to" for "too", "casual" for "causal", "desert" for "dessert", and so on. It is very difficult to write an algorithm to do this by hand. For instance, suppose we want to write an algorithm to correct confusions between "desert" and "dessert". We could write rules such as: "If the user types 'desert' or 'dessert', and the previous word is 'for', then the user probably meant 'dessert'". This rule would allow the algorithm to fix the error in: "I would like the chocolate cake for desert". The rule would not work, however, for many other cases in which "desert" and "dessert" were confused, for instance: "He wandered aimlessly through the dessert", where "desert" was probably intended. To fix this particular sentence, a different rule is needed, such as: "If the user types 'desert' or 'dessert', and the previous two words are a preposition followed by the word 'the', then the user probably meant 'desert'". In general, it is extremely difficult to write a set of rules by hand that will cover all cases.
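The two hand-written rules described above can be sketched in a few lines of Python. This is only an illustration: the function name `correct_tokens`, the regular-expression tokenizer, and the short preposition list are assumptions introduced here, not part of any described system.

```python
import re

CONFUSION_SET = {"desert", "dessert"}
# Illustrative, deliberately incomplete preposition list (an assumption).
PREPOSITIONS = {"through", "across", "in", "into", "over", "to", "from"}


def correct_tokens(sentence):
    """Apply the two hand-written rules and return the corrected token list."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    corrected = []
    for i, token in enumerate(tokens):
        if token in CONFUSION_SET:
            prev1 = tokens[i - 1] if i >= 1 else None
            prev2 = tokens[i - 2] if i >= 2 else None
            if prev1 == "for":
                # Rule 1: previous word is "for" -> probably "dessert".
                token = "dessert"
            elif prev1 == "the" and prev2 in PREPOSITIONS:
                # Rule 2: preposition followed by "the" -> probably "desert".
                token = "desert"
            # Otherwise no rule applies and the typed word is kept.
        corrected.append(token)
    return corrected


print(correct_tokens("I would like the chocolate cake for desert"))
print(correct_tokens("He wandered aimlessly through the dessert"))
print(correct_tokens("They left for desert country"))
```

Under these assumptions the first two sentences, taken from the text, are corrected as intended. The third sentence, a made-up example, triggers the first rule and is wrongly changed to "dessert", which illustrates how brittle a small hand-written rule set can be.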
This difficulty of writing a set of rules by hand is the motivation for moving to adaptive algorithms--algorithms that learn to correct mistakes by being trained on examples. Instead of writing rules by hand, it is much easier to provide a set of examples of sentences that use "desert" and "dessert" correctly, and let the algorithm automatically infer the rules behind the examples.
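As a rough sketch of what learning from examples might look like, the following Python code counts which word precedes "desert" and "dessert" in a handful of correctly written sentences and then picks the more likely word for a given preceding word. The training sentences, the function names `train` and `predict`, and the add-one smoothing are all assumptions chosen for illustration; actual adaptive algorithms use much richer context features.

```python
import re
from collections import Counter, defaultdict

CONFUSION_SET = ("desert", "dessert")


def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())


def train(sentences):
    """Count, for each confusable word, the words that immediately precede it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token in CONFUSION_SET and i > 0:
                counts[token][tokens[i - 1]] += 1
    return counts


def predict(counts, prev_word):
    """Return the confusable word that most often followed prev_word in training."""
    vocab = {p for seen in counts.values() for p in seen}

    def score(word):
        seen = counts.get(word, Counter())
        # Add-one smoothing, with one extra bucket for unseen preceding words.
        return (seen[prev_word] + 1) / (sum(seen.values()) + len(vocab) + 1)

    return max(CONFUSION_SET, key=score)


# Hypothetical training examples of correct usage.
examples = [
    "We had ice cream for dessert",
    "She ordered cake for dessert",
    "They got lost in the desert",
    "The camel crossed the desert",
]
model = train(examples)
print(predict(model, "for"))
print(predict(model, "the"))
```

With these four examples the sketch infers, without any hand-written rule, that "for" favors "dessert" and "the" favors "desert".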
A wide variety of techniques have been presented in the machine learning literature for training algorithms from examples. What they all have in common, however, is the assumption of representativeness; that is, they assume that the set of examples that the algorithm is trained on is representative of the set of examples that the algorithm is asked to correct later. Put another way, they assume that the examples in the training and test sets are drawn, in an unbiased way, from the same population. Under this assumption, whatever rules the algorithm learns from the training set will apply correctly to the test set. For example, if the training set contains examples illustrating that the word "for" occurs commonly before "dessert", but rarely before "desert", then, by the assumption of representativeness, the same distributional property of "for" should hold in the test set. If this assumption is violated, the algorithm's performance on the test set will degrade, because the rules it learned from the training set will not necessarily carry over to the test set. Existing machine learning techniques are therefore effective only to the extent that the training set is representative of the test set. This is a serious limitation, since, in general, there is no way to guarantee representativeness.
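The distributional property mentioned above can be checked directly. The sketch below measures how often "for" immediately precedes "dessert" and "desert" in a training set and in a test set. Both small corpora and the helper name `prev_word_rate` are hypothetical, constructed so that the test set violates the assumption of representativeness.

```python
import re


def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())


def prev_word_rate(sentences, target, prev_word):
    """Fraction of occurrences of target immediately preceded by prev_word."""
    hits = total = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token == target:
                total += 1
                if i > 0 and tokens[i - 1] == prev_word:
                    hits += 1
    return hits / total if total else 0.0


# Hypothetical training set: "for" precedes "dessert" but never "desert".
train_set = [
    "We had ice cream for dessert",
    "She ordered cake for dessert",
    "They got lost in the desert",
    "The camel crossed the desert",
]
# Hypothetical test set from a different kind of text, where the pattern reverses.
test_set = [
    "The expedition left for desert terrain at dawn",
    "Supplies were packed for desert travel",
    "The dessert was served cold",
]

for name, corpus in (("train", train_set), ("test", test_set)):
    for word in ("dessert", "desert"):
        print(name, word, prev_word_rate(corpus, word, "for"))
```

In this constructed case, a rule learned from the training set ("for" implies "dessert") would miscorrect both occurrences of "desert" in the test set, which is the kind of degradation described above.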