Context-sensitive text correction is the task of fixing errors that result in valid words, such as "This is not to difficult", where "too" was mistakenly typed as "to". These errors account for anywhere from 25% to over 50% of observed errors (Karen Kukich, Automatic spelling correction: Detection, correction and context-dependent techniques, p. 38. Technical report, Bellcore, Morristown, N.J. 07960, November 1991, Draft.); yet they go undetected by conventional spell checkers, such as Unix spell, which only flag words that are not found in a word list. The challenge of this task (and related natural language tasks) for Machine Learning is to characterize the contexts in which a word can (or cannot) occur in terms of features. The problem is that there is a multitude of features that one might use: features that test for the presence of some particular word within some distance of the target word; features that test the pattern of parts of speech around the target word; and so on.
Context-sensitive text correction is taken here to be a task of word disambiguation. The ambiguity among words is modelled by confusion sets. A confusion set C = {w_1, ..., w_n} means that each word w_i in the set is possibly ambiguous with each other word. Thus, if C = {hear, here}, then when encountering either hear or here in a target document, it is assumed that any word that is a member of the confusion set (in the case under consideration, hear or here) could possibly be the correct word. The task is to predict from the context of the surrounding features, preferably with a high level of certainty, which word is actually correct. Generally speaking, confusion sets may encompass, for example, words which resemble one another phonetically (e.g. cite, site, sight), and words having textual resemblance, e.g. due to an inadvertent typing mistake such as exchange of letters, or omission or addition of a letter (e.g. dessert vs. desert, or to vs. too). Confusion sets may also encompass words which, albeit being visually and phonetically distinguishable, tend to be unduly substituted for one another, e.g. among and between, or amount and number. Accordingly, whenever referring to "spell checking" or "spell correction" the present invention likewise encompasses text correction. If desired, confusion sets may be customized to the specific application.
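By way of illustration only, the confusion-set notion described above may be sketched in Python as follows. The names CONFUSION_SETS, build_index and find_targets are hypothetical and do not appear in the original disclosure; the sketch further assumes that each word belongs to at most one confusion set.

```python
# Hypothetical confusion sets, taken from the examples given in the text.
CONFUSION_SETS = [
    {"hear", "here"},
    {"cite", "site", "sight"},
    {"dessert", "desert"},
    {"to", "too"},
    {"among", "between"},
    {"amount", "number"},
]


def build_index(confusion_sets):
    """Map each word to the confusion set it belongs to.

    Assumption: a word appears in at most one confusion set; if it appeared
    in several, the last set listed would win.
    """
    index = {}
    for cset in confusion_sets:
        for word in cset:
            index[word] = frozenset(cset)
    return index


def find_targets(tokens, index):
    """Return (position, token, candidate words) for every token that is a
    member of some confusion set, i.e. every word that needs disambiguation."""
    return [
        (i, tok, index[tok.lower()])
        for i, tok in enumerate(tokens)
        if tok.lower() in index
    ]
```

Each tuple returned by find_targets identifies one target word together with the set of words that could possibly be the correct word at that position.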
In the invention, for illustrative purposes only, confusion sets are confined to a list of "Words Commonly Confused" in the back of the Random House Unabridged Dictionary (Stuart Berg Flexner, editor. Random House Unabridged Dictionary. Random House, New York, (1983), Second edition).
There are known in the art techniques for context-sensitive text correction, including trigram-based methods [for a detailed discussion see, for example, E. Mays, F. J. Damerau, and R. L. Mercer, "Context based spelling correction", Information Processing and Management, 27(5):517-522, (1991)], Bayesian classifiers [Gale et al., 1993], decision lists [Yarowsky, 1994], and Bayesian hybrids (see Andrew R. Golding, ibid.). The last of these (hereinafter Bayes) has been among the most successful, and is accordingly used herein as the benchmark for comparison with the context-sensitive text correction method and apparatus of the invention. Bayes has been described elsewhere (see Andrew R. Golding, ibid.), and so will only be briefly reviewed here.
To disambiguate words w_1 through w_n, Bayes starts by learning features (learning features is referred to also as "features mapping") that characterize the context in which each w_i tends to occur. It uses two types of features: context words and collocations. Context-word features test for the presence of a particular word within ±k words of the target word; collocations test for a pattern of up to l contiguous words and/or part-of-speech tags around the target word. Each word in the sentence is tagged with its set of possible part-of-speech tags (e.g. verb, noun, etc.), obtained from a dictionary. Thus, for example, the word laugh may stand for a verb or a noun (i.e. two distinct parts of speech), depending on the particular use in the sentence. If there is no a priori knowledge of the part of speech of the word laugh in an examined sentence, the word is inserted into the database, along with its pertinent features, under two separate entries signifying noun and verb respectively.
Thus, for example, for k=10 and l=2, the features for the confusion set {weather, whether} include:
(1) cloudy within ±10 words
(2) ___ to VERB
where (1) is a context-word feature that tends to imply weather, and (2) is a collocation that checks for the pattern "to VERB" immediately after the target word, and tends to imply whether (as in "I don't know whether to laugh or cry").
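For illustrative purposes only, the two feature types may be sketched in Python as shown below. This is a minimal sketch and not the implementation of Bayes itself: the tag dictionary TAG_DICT, its tag names, the function names and the tuple encoding of features are all assumptions introduced here. Each collocation element is taken to be either the literal word or one of its possible tags, and the underscore marks the position of the target word.

```python
from itertools import product

# Hypothetical toy tag dictionary; a real system would consult a full lexicon.
TAG_DICT = {
    "know": {"VERB"},
    "laugh": {"VERB", "NOUN"},
    "cry": {"VERB", "NOUN"},
    "or": {"CONJ"},
}


def context_word_features(tokens, target_idx, k=10):
    """One feature for every distinct word within +/-k words of the target."""
    lo = max(0, target_idx - k)
    hi = min(len(tokens), target_idx + k + 1)
    return {("context", tokens[i].lower())
            for i in range(lo, hi) if i != target_idx}


def element_choices(word, tag_dict):
    """A pattern element is the literal word or any of its possible tags."""
    return [word.lower()] + sorted(tag_dict.get(word.lower(), ()))


def collocation_features(tokens, target_idx, tag_dict, max_len=2):
    """One feature for every way of expressing a pattern of up to max_len
    contiguous elements adjacent to the target word."""
    feats = set()
    for left in range(max_len + 1):
        for right in range(max_len + 1 - left):
            start, end = target_idx - left, target_idx + right
            if left + right == 0 or start < 0 or end >= len(tokens):
                continue
            before = [element_choices(tokens[i], tag_dict)
                      for i in range(start, target_idx)]
            after = [element_choices(tokens[i], tag_dict)
                     for i in range(target_idx + 1, end + 1)]
            for combo in product(*before, *after):
                pattern = combo[:left] + ("_",) + combo[left:]
                feats.add(("colloc", " ".join(pattern)))
    return feats
```

Applied to the sentence "I don't know whether to laugh or cry" with the target word whether, collocation_features proposes, among others, the pattern "_ to VERB" corresponding to feature (2) above.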
Bayes learns these features from a training corpus of correct text. Each time a word in the confusion set occurs, Bayes proposes every feature that matches that context (one context-word feature for every distinct word within ±k words, and one collocation for every way of expressing a pattern of up to l contiguous elements). After working through the whole training corpus, Bayes tallies the number of times each feature was proposed. It then prunes features for two reasons: (1) the feature occurred in practically none or all of the training instances (specifically, it had fewer than 10 occurrences or fewer than 10 non-occurrences); or (2) the presence of the feature is not significantly correlated with the identity of the target word (determined by a chi-squared test at the 0.05 significance level).
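The two pruning criteria may be sketched, for illustration only, as follows. The sketch assumes a confusion set of exactly two words, so that the chi-squared test has one degree of freedom and the 0.05 critical value is 3.841; it uses the plain Pearson statistic without continuity correction, which is an assumption, since the text does not specify the variant. The function names and the dictionaries present and totals are hypothetical.

```python
# Chi-squared critical value for 1 degree of freedom at the 0.05 level.
CHI2_CRIT_DF1 = 3.841


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence of correlation
    return n * (a * d - b * c) ** 2 / denom


def keep_feature(present, totals, min_count=10):
    """Decide whether a feature survives pruning.

    present[w]: training instances of word w in which the feature matched.
    totals[w]:  all training instances of word w.
    Assumption: totals has exactly two keys (a two-word confusion set).
    """
    w1, w2 = totals
    a, b = present.get(w1, 0), present.get(w2, 0)
    c, d = totals[w1] - a, totals[w2] - b
    # Criterion (1): too few occurrences or too few non-occurrences.
    if a + b < min_count or c + d < min_count:
        return False
    # Criterion (2): presence must correlate significantly with word identity.
    return chi_squared_2x2(a, b, c, d) > CHI2_CRIT_DF1
```

A feature that matches 40 of 100 instances of weather but only 5 of 100 instances of whether is retained, whereas a feature that matches both words equally often is pruned under criterion (2).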
The set of learned features is used at run time to classify an occurrence of a word in the confusion set. All the features are compared against the target occurrence.
Whilst Bayes and other statistics-based approaches purport to cope with the problem of context-sensitive text correction, they have not matured into a commercially available product, due to their limited performance in terms of the success rate in revealing context-sensitive mistakes in a document and in offering an appropriate substitute for each revealed mistake.
As will be explained in greater detail below, various embodiments of the present invention exploit some known per se techniques. Thus, apart from the above "features mapping" techniques, there is also used, in various embodiments of the invention, a so-called linear separation technique, which receives as an input a description of a line in the form of: $\sum_{i=1}^{n} w_i x_i = \theta$, where x_i represents a given feature (x_i = 1 signifies that the feature is active; x_i = 0 signifies that the feature is non-active), and w_i represents a positive weight. For given input features (x_1, x_2, ..., x_n) the algorithm predicts a value "1" if and only if (iff): $\sum_{i=1}^{n} w_i x_i > \theta$, and "0" otherwise, where θ indicates a predetermined threshold. The linear separator algorithm thus separates the various input instances into those belonging to a first category, having a value "1", and those belonging to a second category, having a value "0". By one, non-limiting, embodiment of the present invention the known Winnow linear separation algorithm is utilized.
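By way of a non-limiting sketch, the prediction rule of such a linear separator may be written in Python as below. The function name predict and the sparse representation (a collection of the indices i for which the feature is active) are assumptions made here for illustration; the sketch also assumes that the threshold comparison is strict.

```python
def predict(weights, active, theta):
    """Linear threshold prediction.

    weights: list of positive weights, one per feature.
    active:  indices i of the active features (x_i = 1); all others are 0.
    theta:   the predetermined threshold.
    Returns 1 iff the summed weight of the active features exceeds theta
    (strict comparison is an assumption of this sketch), and 0 otherwise.
    """
    total = sum(weights[i] for i in active)
    return 1 if total > theta else 0
```

Because inactive features contribute nothing to the sum, only the active indices need to be visited, which is convenient when the number of features is large and few of them are active in any one instance.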
A detailed discussion of the Winnow algorithm is found in [N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318, 1988], and accordingly there follows only a brief description thereof.
Generally speaking, Winnow has three parameters: a threshold θ, and two update parameters, a promotion factor α > 1 and a demotion parameter 0 < β < 1. For a given instance (x_1, ..., x_n) the algorithm predicts 1 iff $\sum_{i=1}^{n} w_i x_i > \theta$, where w_i is the weight on the edge connecting x_i to the target node. Thus the hypothesis of this algorithm is a linear threshold function on {0,1}^n. The algorithm updates its hypothesis only when a mistake is made. If the algorithm predicts 0 and the received label is 1 (positive example), then for all indices i such that x_i = 1, the weight w_i is replaced by a larger weight α·w_i. If the algorithm predicts 1 and the received label is 0 (negative example), then for all indices i such that x_i = 1, the weight w_i is replaced by a smaller weight β·w_i. In both cases, if x_i = 0, its weight remains unchanged.
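The update rule just described may be sketched in Python as follows, for illustrative purposes only. The class name Winnow, the sparse representation of an instance as a collection of active indices, the strict threshold comparison, and the initialization of every weight to 1.0 are assumptions of this sketch rather than requirements stated in the text.

```python
class Winnow:
    """Minimal sketch of a mistake-driven multiplicative update learner."""

    def __init__(self, n_features, theta, alpha=2.0, beta=0.5):
        if not (alpha > 1 and 0 < beta < 1):
            raise ValueError("requires alpha > 1 and 0 < beta < 1")
        self.theta = theta
        self.alpha = alpha  # promotion factor
        self.beta = beta    # demotion parameter
        # Assumption: all weights start at 1.0 (any positive value would do).
        self.weights = [1.0] * n_features

    def predict(self, active):
        """Predict 1 iff the summed weight of the active features exceeds theta."""
        total = sum(self.weights[i] for i in active)
        return 1 if total > self.theta else 0

    def update(self, active, label):
        """Process one labelled instance and return the prediction made.

        Weights change only on a mistake, and only for active features:
        promoted by alpha on a missed positive, demoted by beta on a
        false positive.
        """
        prediction = self.predict(active)
        if prediction != label:
            factor = self.alpha if label == 1 else self.beta
            for i in active:
                self.weights[i] *= factor
        return prediction
```

For example, with three features, θ = 1.5, α = 2 and β = 0.5, presenting the instance with only feature 0 active and label 1 causes a mistake (the sum 1.0 does not exceed 1.5), so the weight of feature 0 is promoted to 2.0 while the other weights are left unchanged.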
Winnow is a mistake-driven algorithm; that is, it updates its hypothesis only when a mistake is made. Intuitively, this makes the algorithm more sensitive to relationships among the attributes, relationships that may go unnoticed by an algorithm that is based on counts accumulated separately for each attribute.
The system and method of the invention utilize, in various embodiments thereof, a so-called "weighted majority algorithm". A detailed discussion of this algorithm is given below.
It is accordingly the object of the present invention to provide for a novel system and method for evaluating the use of a word in the context of surrounding words within a text.
It is another object of the invention to provide a database stored on a storage medium, for use in applications for evaluating the use of a word in the context of surrounding words within a text.
It is a specific object of the invention to provide a novel method and system for context-sensitive text checking.