This invention relates to linguistic ambiguity resolution. More particularly, this invention relates to systems and methods for training linguistic disambiguators using string-based patterns.
Natural language processing systems are computer-implemented software systems that intelligently derive meaning and context from natural language text. “Natural languages” are languages that are spoken by humans (e.g., English, French, Japanese). Computers cannot, without assistance, distinguish linguistic characteristics of natural language text. Natural language processing systems are employed in a wide range of products, including spell checkers, grammar checkers, machine translation systems, and speech synthesis programs.
Oftentimes, natural languages contain ambiguities that are difficult to resolve using computer automated techniques. Ambiguities come in many forms. Confusable words (e.g., then/than, its/it's, weather/whether) are one of the biggest sources of grammar errors by users. Possessive/plural types (e.g., kids/kid's) are another source of ambiguity. A third common example is part-of-speech tagging, such as differentiating whether “produce” is a noun or a verb. A fourth example is word sense disambiguation, such as deciding whether a particular instance of the word “crane” is referring to a bird or a machine.
Many natural language processing problems can be viewed as trying to disambiguate a token into one of a small number of possible labels, based upon the string context in which that token appears. For example, a spell checker may try to decide whether the word “then” or “than” is appropriate in the sentence “I am much smarter then/than you are.” A machine translation system may try to determine what the word sense is of the word “line” in the sentence “I am not going to wait in line”, so it can more accurately determine what the proper translation is. A speech synthesis program may try to decide whether the word “produce” is a noun or a verb in the sentence “This grocery store has beautiful produce”, in order to determine the proper pronunciation for the word.
To automatically perform disambiguations, the natural language processing system is provided with linguistic knowledge that it applies to the string context in order to disambiguate. Linguistic knowledge can either be entered manually or learned automatically. Typically, the manual approach has the advantage that people can provide linguistically sophisticated knowledge. Automatic approaches are beneficial in that the linguistic knowledge can be derived empirically from essentially unlimited amounts of data, can be rapidly ported to new domains, languages, or problems, and can be constantly and automatically adapted to a particular individual or subpopulation.
To date, automatic approaches have been extremely constrained in the types of linguistic information they can learn. For example, conventional systems automatically learn how to disambiguate words/phrases by learning cues based on whether a specific word appears within a pre-specified window of words from a “disambiguation site” (i.e., the place in the text where the ambiguity to be resolved actually occurs), and what combinations of words and word features (such as part of speech) appear in immediate proximity to the disambiguation site. The contextual words and phrases surrounding the disambiguation site are commonly referred to as the “string context” or simply “string”.
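The window-based cues described above can be sketched in a few lines of Python. This is only an illustration of the conventional feature style, not the method of any particular prior system; the function name `window_features`, the default window size of 3, and the returned dictionary layout are assumptions made for this example.

```python
def window_features(tokens, site, window=3):
    """Collect conventional cues around the disambiguation site at index `site`.

    Hypothetical helper for illustration: `window` is the pre-specified
    number of words examined on each side of the site.
    """
    lo = max(0, site - window)
    hi = min(len(tokens), site + window + 1)
    # Words that appear anywhere within the window, excluding the site itself.
    nearby = {tokens[i] for i in range(lo, hi) if i != site}
    # Words in immediate proximity to the site.
    before = tokens[site - 1] if site > 0 else None
    after = tokens[site + 1] if site + 1 < len(tokens) else None
    # Contiguous word pairs just before and just after the site.
    before_pair = tuple(tokens[max(0, site - 2):site])
    after_pair = tuple(tokens[site + 1:site + 3])
    return {
        "nearby": nearby,
        "before": before,
        "after": after,
        "before_pair": before_pair,
        "after_pair": after_pair,
    }


tokens = "I am much smarter then you are".split()
features = window_features(tokens, tokens.index("then"))
```

Every feature here is tied to a fixed distance from the site, which is the limitation the passage above describes: a cue that depends on material arbitrarily far away, or on the absence of some word, cannot be expressed.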
To provide a concrete example of a token disambiguation problem, a spell/grammar checker may wish to check whether the words “then” and “than” are confused anywhere in a document. Suppose the sentence is:
I am much bigger then you.
The spell/grammar checker will try to determine whether “then” or “than” is the correct word. It does so by analyzing the string context (e.g., the sentence in which the word appears) and applying its linguistic knowledge to this string context to determine which word is more likely. In this particular example, it may make use of its linguistic knowledge that the word “than” immediately follows a comparative adjective much more often than the word “then”.
There are two primary components of a machine learning approach to the problem of token disambiguation based on a string context: (1) the algorithms used for learning and applying the learned knowledge to perform disambiguation and (2) the specification of features the learner is allowed to explore in training. Over the past decade, there have been many different approaches to (1), but very little progress in (2).
For confusable word set disambiguation, an article by Golding and Schabes, entitled “Combining Trigram-based and Feature-based Methods for Context-Sensitive Spelling Correction,” Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 1996, describes training a naïve Bayes classifier using as features the set of words that appear within +/-3 words of the target word and patterns of up to 2 contiguous words and/or part of speech tags around the target word. In an article by Golding and Roth, entitled “A Winnow-Based Approach to Spelling Correction,” Machine Learning, Special issue on Machine Learning and Natural Language Processing, Volume 34, pp. 107-130, 1999, the authors propose using the Winnow machine-learning algorithm with essentially the same features. In an article by Mangu and Brill, entitled “Automatic Rule Acquisition for Spelling Correction,” Proc. of the Fourteenth International Conference on Machine Learning, ICML'97, Nashville, Tenn., 1997, the authors describe use of transformation-based learning, again using the same features. In an article by Jones and Martin, entitled “Contextual Spelling Correction Using Latent Semantic Analysis,” Proceedings of the Fifth Conference on Applied Natural Language Processing, 1997, the authors propose use of latent semantic analysis as the learning algorithm, and features that include the set of words and contiguous word pairs (bigrams) that appear within a window of +/-7 words of the target word.
For word sense disambiguation, an article by Ng, entitled “Exemplar-Based Word Sense Disambiguation: Some Recent Improvements,” Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 1997, describes systems for word sense disambiguation that employ two different machine learning algorithms, naive Bayes and Nearest-Neighbor. In both systems, the features used were: word before, word after, word two before, word two after, the pair of words before, the pair of words after, and the two surrounding words. In an article by Yarowsky, entitled “One sense per collocation,” Proceedings of the ARPA Human Language Technology Workshop, 1993, the author proposes using a decision list learning algorithm with a very similar set of features.
One attempt at a richer feature set was proposed by Christer Samuelsson, Pasi Tapanainen and Atro Voutilainen in “Inducing Constraint Grammars,” published in Grammatical Inference: Learning Syntax from Sentences, Lecture Notes in Artificial Intelligence 1147, Springer (L. Miclet and C. de la Higuera eds), 1996. There they propose a system that can learn barrier rules for part of speech tagging. A barrier rule consists of a pair of symbols X and Y and a set of symbols S, and matches a string if that string contains X and Y, with X preceding Y and no symbols from the set S intervening between X and Y.
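The matching condition of a barrier rule, as defined above, can be illustrated with a short Python sketch. It covers only the matching step (not the learning of such rules), and the function name `barrier_rule_matches` and the part-of-speech tag names in the example are assumptions chosen for illustration.

```python
def barrier_rule_matches(tokens, x, y, barrier):
    """Return True if some x precedes some y with no symbol from `barrier`
    strictly between them."""
    open_x = False  # True while an x has been seen with no barrier symbol since
    for tok in tokens:
        # Check y first: y itself is an endpoint, not an intervening symbol.
        if tok == y and open_x:
            return True
        if tok in barrier:
            open_x = False
        # Set after the barrier check so that an x starts a fresh span
        # even if x happens to belong to the barrier set.
        if tok == x:
            open_x = True
    return False


tags = ["DET", "ADJ", "NOUN", "VERB"]
blocked = barrier_rule_matches(tags, "DET", "VERB", {"NOUN"})
allowed = barrier_rule_matches(tags, "DET", "VERB", {"ADV"})
```

In the first call the intervening "NOUN" acts as a barrier, so the rule does not match; in the second call no barrier symbol lies between "DET" and "VERB", so it does.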
Despite these efforts, there remains a need for a method for learning much more expressive disambiguation cues. Such a method should be capable of being applied to virtually any problem involving token disambiguation in a string context, and should offer significant performance gains over current state-of-the-art automatic linguistic knowledge acquisition solutions to these problems.
A linguistic disambiguation system and method creates a knowledge base by training on patterns in strings that contain ambiguity sites. The system is trained on a training set, such as a properly labeled corpus. The string patterns are described by a set of reduced regular expressions (RREs) or very reduced regular expressions (VRREs), which specify features that the training system is allowed to explore in training. The resulting knowledge base utilizes the RREs or VRREs to describe strings in which an ambiguity occurs. In this way, the technique can be applied to virtually any problem involving token disambiguation in a string context.
In the described implementation, the set of reduced regular expressions (RREs) over a finite alphabet Σ is defined as:
(1) ∀a∈Σ: “a” is a reduced regular expression and denotes a set {a};
“a+” is a reduced regular expression and denotes a positive closure of the set {a};
“a*” is a reduced regular expression and denotes a Kleene closure of the set {a};
“~a” is a reduced regular expression and denotes a set Σ-a;
“~a+” is a reduced regular expression and denotes the positive closure of the set Σ-a;
“~a*” is a reduced regular expression and denotes the Kleene closure of the set Σ-a;
(2) “.” is a reduced regular expression denoting a set Σ;
(3) “.+” is a reduced regular expression denoting the positive closure of the set Σ;
(4) “.*” is a reduced regular expression denoting the Kleene closure of the set Σ; and
(5) if r and s are reduced regular expressions denoting languages R and S, respectively, then “rs” is a reduced regular expression denoting a set RS.
It is noted, however, that reduced regular expressions may contain variations or extensions of the above definition.
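To make the definition above concrete, the following Python sketch represents a reduced regular expression as a sequence of atoms and tests whether a whole token sequence belongs to the language it denotes. It is a minimal illustration under stated assumptions rather than the described implementation: the `Atom` class, its field names, and the function `rre_matches` are hypothetical, the alphabet is taken to be word tokens, and the combination of negation with "." is not meaningful and is simply ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """One factor of an RRE (hypothetical representation).

    symbol:  an alphabet symbol, or "." for any symbol
    negated: True for "~a" (any symbol except a)
    closure: "" for exactly one symbol, "+" for one or more, "*" for zero or more
    """
    symbol: str
    negated: bool = False
    closure: str = ""

    def accepts(self, token):
        if self.symbol == ".":
            return True
        return (token != self.symbol) if self.negated else (token == self.symbol)


def rre_matches(atoms, tokens):
    """Return True if the entire token sequence is in the language of the RRE."""
    # Positions in `tokens` that can be reached after consuming the atoms so far.
    reachable = {0}
    for atom in atoms:
        nxt = set()
        for start in reachable:
            if atom.closure == "*":
                nxt.add(start)  # Kleene closure may consume nothing
            pos = start
            while pos < len(tokens) and atom.accepts(tokens[pos]):
                pos += 1
                nxt.add(pos)
                if atom.closure == "":
                    break  # a plain atom consumes exactly one symbol
        reachable = nxt
    return len(tokens) in reachable


# The RRE ".* much ~you* you", written atom by atom.
pattern = [
    Atom(".", closure="*"),
    Atom("much"),
    Atom("you", negated=True, closure="*"),
    Atom("you"),
]
sentence = "I am much bigger then you".split()
matched = rre_matches(pattern, sentence)
```

Because concatenation is the only way to combine atoms, the set of reachable positions can be advanced one atom at a time, with no need for general regular-expression machinery such as alternation or nested closure.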
The set of very reduced regular expressions (VRREs) over an alphabet Σ is defined as:
(1) ∀a∈Σ: “a” is a very reduced regular expression and denotes a set {a};
(2) “.” is a very reduced regular expression denoting a set Σ;
(3) “.*” is a very reduced regular expression denoting a Kleene closure of the set Σ; and
(4) if r and s are very reduced regular expressions denoting languages R and S, respectively, then “rs” is a very reduced regular expression denoting a set RS.
The set of RREs is a strict superset of the set of VRREs. In other words, every VRRE is an RRE but not every RRE is a VRRE.
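The containment relationship can be checked mechanically. The Python sketch below classifies each atom of an expression and reports whether the expression as a whole qualifies as a VRRE. It assumes, for illustration only, that an expression is given as a list of atom strings such as "than", "~than*" or ".*", and that alphabet symbols are alphanumeric words; the function names are hypothetical.

```python
def classify_atom(atom):
    """Return "vrre" if the atom is permitted in a VRRE, or "rre" if it is
    permitted only in an RRE. Raises ValueError for malformed atoms."""
    if atom in (".", ".*"):
        return "vrre"
    if atom == ".+":
        return "rre"
    negated = atom.startswith("~")
    body = atom[1:] if negated else atom
    closure = body[-1] if body[-1:] in ("+", "*") else ""
    symbol = body[:-1] if closure else body
    if not symbol.isalnum():
        # Assumption of this sketch: alphabet symbols are alphanumeric words.
        raise ValueError(f"not a valid atom: {atom!r}")
    # Negation and per-symbol closure exist only in the RRE definition.
    return "rre" if (negated or closure) else "vrre"


def is_vrre(atoms):
    """True if every atom of the expression is allowed in a VRRE."""
    return all(classify_atom(a) == "vrre" for a in atoms)


plain = is_vrre([".*", "bigger", "than", "."])
richer = is_vrre([".*", "bigger", "~than+"])
```

The first expression uses only symbols, "." and ".*", so it is both a VRRE and an RRE; the second uses negation and positive closure, so it is an RRE but not a VRRE.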
Once trained, the system may then apply the knowledge base to raw input strings that contain ambiguity sites. The system uses the RRE-based knowledge base to disambiguate the sites.