Natural language processing systems are computer-implemented software systems that intelligently derive meaning and context from natural language text. “Natural languages” are languages that are spoken by humans (e.g., English, French, Japanese). Computers cannot, without assistance, distinguish linguistic characteristics of natural language text. Natural language processing systems are employed in a wide range of products, including spell checkers, grammar checkers, machine translation systems, and speech synthesis programs.
Oftentimes, natural languages contain ambiguities that are difficult to resolve using computer automated techniques. Ambiguities come in many forms. Confusable words (e.g., then/than, its/it's, weather/whether) are one of the biggest sources of grammar errors by users. Possessive/plural pairs (e.g., kids/kid's) are another source of ambiguity. A third common example is part-of-speech tagging, such as determining whether “produce” is a noun or a verb. A fourth example is word sense disambiguation, such as deciding whether a particular instance of the word “crane” refers to a bird or a machine.
Many natural language processing problems can be viewed as trying to disambiguate a token into one of a small number of possible labels, based upon the string context in which that token appears. For example, a spell checker may try to decide whether the word “then” or “than” is appropriate in the sentence “I am much smarter then/than you are.” A machine translation system may try to determine the sense of the word “line” in the sentence “I am not going to wait in line”, so that it can more accurately determine the proper translation. A speech synthesis program may try to decide whether the word “produce” is a noun or a verb in the sentence “This grocery store has beautiful produce”, in order to determine the proper pronunciation for the word.
To automatically perform disambiguations, the natural language processing system is provided with linguistic knowledge that it applies to the string context in order to disambiguate. Linguistic knowledge can either be entered manually or learned automatically. Typically, the manual approach has the advantage that people can provide linguistically sophisticated knowledge. Automatic approaches are beneficial in that the linguistic knowledge can be derived empirically from essentially unlimited amounts of data, can be rapidly ported to new domains, languages, or problems, and can be constantly and automatically adapted to a particular individual or subpopulation.
To date, automatic approaches have been extremely constrained in the types of linguistic information they can learn. For example, conventional systems automatically learn how to disambiguate words/phrases by learning cues based on whether a specific word appears within a pre-specified window of words from a “disambiguation site” (i.e., the place in the text where the ambiguity to be resolved actually occurs), and what combinations of words and word features (such as part of speech) appear in immediate proximity to the disambiguation site. The contextual words and phrases surrounding the disambiguation site are commonly referred to as the “string context” or simply “string”.
To provide a concrete example of a token disambiguation problem, a spell/grammar checker may wish to check whether the words “then” and “than” are confused anywhere in a document. Suppose the sentence is:
    I am much bigger then you.
The spell/grammar checker will try to determine whether “then” or “than” is the correct word. It does so by analyzing the string context (e.g., the sentence in which the word appears) and applying its linguistic knowledge to this string context to determine which word is more likely. In this particular example, it may make use of its linguistic knowledge that the word “than” immediately follows a comparative adjective much more often than the word “then”.
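The comparative-adjective cue in this example can be illustrated with a minimal sketch. The function names, the short list of irregular comparatives, and the crude "-er" suffix test are all assumptions made for illustration. A practical system would rely on a part-of-speech tagger and on learned statistics rather than a fixed rule.

```python
# Hypothetical, deliberately small list of irregular comparatives.
IRREGULAR_COMPARATIVES = {"better", "worse", "more", "less", "further", "farther"}

def looks_comparative(word):
    """Crude stand-in for a part-of-speech test: irregular form or '-er' suffix."""
    w = word.lower()
    return w in IRREGULAR_COMPARATIVES or (len(w) > 3 and w.endswith("er"))

def suggest_then_than(tokens, i):
    """Suggest 'than' when the token just before position i looks comparative.

    tokens[i] is assumed to be 'then' or 'than'; otherwise it is returned as-is.
    """
    if i > 0 and looks_comparative(tokens[i - 1]):
        return "than"
    return tokens[i].lower()

tokens = "I am much bigger then you".split()
suggestion = suggest_then_than(tokens, tokens.index("then"))
```

Because the suffix test also fires on words such as “over” or “after”, the sketch shows only the shape of the cue, not a reliable detector.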
There are two primary components of a machine learning approach to the problem of token disambiguation based on a string context: (1) the algorithms used for learning and applying the learned knowledge to perform disambiguation and (2) the specification of features the learner is allowed to explore in training. Over the past decade, there have been many different approaches to (1), but very little progress in (2).
For confusable word set disambiguation, an article by Golding and Schabes, entitled “Combining Trigram-based and Feature-based Methods for Context-Sensitive Spelling Correction,” Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 1996, describes training a naïve Bayes classifier using as features the set of words that appear within +/−3 words of the target word and patterns of up to 2 contiguous words and/or part of speech tags around the target word. In an article by Golding and Roth, entitled “A Winnow-Based Approach to Spelling Correction,” Machine Learning, Special issue on Machine Learning and Natural Language Processing, Volume 34, pp. 107-130, 1999, the authors propose using the Winnow machine-learning algorithm with essentially the same features. In an article by Mangu and Brill, entitled “Automatic Rule Acquisition for Spelling Correction,” Proc. of the Fourteenth International Conference on Machine Learning, ICML '97, Nashville, Tenn., 1997, the authors describe use of transformation-based learning, again using the same features. In an article by Jones and Martin, entitled, “Contextual Spelling Correction Using Latent Semantic Analysis,” Proceedings of the Fifth Conference on Applied Natural Language Processing, 1997, the authors propose use of latent semantic analysis as the learning algorithm, and features that include the set of words and contiguous word pairs (bigrams) that appear within a window of +/−7 words of the target word.
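The two feature families described above (words within a fixed window of the target, and short contiguous patterns adjacent to it) can be sketched as follows. This is an illustrative sketch only: the function names and the tuple encoding of features are assumptions, and part-of-speech tags, which the cited systems also use in their patterns, are omitted for brevity.

```python
def context_word_features(tokens, i, window=3):
    """Set of words appearing within +/- `window` positions of the target at i."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return {tokens[j].lower() for j in range(lo, hi) if j != i}

def collocation_features(tokens, i, max_len=2):
    """Contiguous word patterns of length 1..max_len directly left or right of i."""
    feats = set()
    for n in range(1, max_len + 1):
        if i - n >= 0:
            # Pattern of n words ending immediately before the target.
            feats.add(("L", tuple(t.lower() for t in tokens[i - n:i])))
        if i + n < len(tokens):
            # Pattern of n words starting immediately after the target.
            feats.add(("R", tuple(t.lower() for t in tokens[i + 1:i + 1 + n])))
    return feats

tokens = "I am much smarter than you are".split()
target = tokens.index("than")
words = context_word_features(tokens, target, window=3)
patterns = collocation_features(tokens, target, max_len=2)
```

Every feature here is tied either to an unordered bag of nearby words or to a fixed position next to the target, which is the limitation the surrounding discussion points to.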
For word sense disambiguation, an article by Ng, entitled “Exemplar-Based Word Sense Disambiguation: Some Recent Improvements,” Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 1997, describes systems for word sense disambiguation that employ two different machine learning algorithms, naïve Bayes and Nearest-Neighbor. In both systems, the features used were: word before, word after, word two before, word two after, the pair of words before, the pair of words after, and the two surrounding words. In an article by Yarowsky, entitled “One sense per collocation,” Proceedings of the ARPA Human Language Technology Workshop, 1993, the author proposes using a decision list learning algorithm with a very similar set of features.
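The seven positional features listed above can be written out directly. The sketch below is an assumption-laden illustration: the feature keys, the padding symbol, and the function name are hypothetical and are not taken from the cited articles.

```python
def local_features(tokens, i, pad="<pad>"):
    """Seven fixed-position features around the target token at index i."""
    def at(offset):
        j = i + offset
        # Positions that fall outside the sentence are filled with a padding symbol.
        return tokens[j].lower() if 0 <= j < len(tokens) else pad

    return {
        "w-1": at(-1),                  # word before
        "w+1": at(1),                   # word after
        "w-2": at(-2),                  # word two before
        "w+2": at(2),                   # word two after
        "w-2,w-1": (at(-2), at(-1)),    # pair of words before
        "w+1,w+2": (at(1), at(2)),      # pair of words after
        "w-1,w+1": (at(-1), at(1)),     # the two surrounding words
    }

tokens = "I am not going to wait in line".split()
feats = local_features(tokens, tokens.index("line"))
```

All seven features look no further than two positions from the target, so a cue such as “wait” three or more words away would be invisible to them.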
One attempt at a richer feature set was proposed by Christer Samuelsson, Pasi Tapanainen and Atro Voutilainen in “Inducing Constraint Grammars,” published in Grammatical Inference: Learning Syntax from Sentences, Lecture Notes in Artificial Intelligence 1147, Springer (L. Miclet and C. de la Higuera eds), 1996. There they propose a system that can learn barrier rules for part of speech tagging. A barrier rule consists of a pair of symbols X and Y and a set of symbols S, and matches a string if that string contains X and Y, with X preceding Y and no symbols from the set S intervening between X and Y.
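The barrier rule definition above translates into a single left-to-right scan. The following is a minimal sketch of the matching test only, assuming the string is given as a list of symbols; the function name and the example tag names are hypothetical, and the sketch does not attempt to reproduce how the cited system learns such rules.

```python
def barrier_rule_matches(symbols, x, y, barrier):
    """Return True if some X precedes some Y with no symbol from `barrier` between them."""
    seen_x = False  # True while an X has been seen with no barrier symbol since
    for s in symbols:
        if s == y and seen_x:
            return True
        if s in barrier:
            seen_x = False  # a barrier symbol cancels any earlier X
        if s == x:
            seen_x = True
    return False

tags = ["DET", "ADJ", "NOUN", "VERB"]
blocked = barrier_rule_matches(tags, "DET", "VERB", {"NOUN"})
allowed = barrier_rule_matches(tags, "DET", "VERB", {"ADV"})
```

Unlike a fixed-window feature, the distance between X and Y is unbounded here; the only restriction is on which symbols may intervene.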
Despite these efforts, there remains a need for a method for learning much more expressive disambiguation cues. Such a method should be capable of being applied to virtually any problem involving token disambiguation in a string context, and should offer significant performance gains over current state-of-the-art automatic linguistic knowledge acquisition solutions to these problems.