English words are often ambiguous with respect to their parts of speech. For instance, a given word can function as a noun, a verb in past tense, and a verb in past participle form. The word "left" can be an adjective, as in "I took a left turn"; a noun, as in "He is on my left"; the past tense of the verb "leave", as in "He left yesterday"; and the past participle of the verb "leave", as in "He has left". In context, however, the part of speech of a word is generally not ambiguous. Most applications dealing with English text need to assign the correct part of speech to each word in the context in which it appears. This problem is called part-of-speech tagging.
The ability to detect the sequence of parts of speech in a given sentence is of paramount importance for many applications involving English text, such as grammar checkers, spell checkers, text retrieval, speech recognition, handwriting recognition devices, character recognition devices and text compression devices. The result is a part-of-speech sequence such as "PRONOUN, VERB, DETERMINER, NOUN, VERB" for the input sentence "I heard this band play".
Previous methods for assigning part-of-speech tags to English text are either statistically based or rule-based. Examples of statistically based methods are Kenneth Church's Stochastic Parts Program, published as "A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text" in the Proceedings of the Second Conference on Applied Natural Language Processing, Austin Tex., 1988; the method of Eugene Charniak, Curtis Hendrickson, Neil Jacobson, and Mike Perkowitz, published as "Equations for part-of-speech tagging" in the Proceedings of the AAAI 93, Ninth National Conference on Artificial Intelligence, 1993; the method of Julian Kupiec, published as "Robust part-of-speech tagging using a hidden Markov model" in the journal Computer Speech and Language, volume 6, in 1992; and the method of Ralph Weischedel, Marie Meteer, Richard Schwartz, Lance Ramshaw, and Jeff Palmucci, published as "Coping with ambiguity and unknown words through probabilistic models" in the journal Computational Linguistics, volume 18, number 2, in 1993. An example of a rule-based method is the method of Eric Brill, published as "A simple rule-based part of speech tagger" in the Proceedings of the Third Conference on Applied Natural Language Processing in 1992.
Prior art methods for assigning part of speech tags are very slow since the time required to assign part of speech tags is related to the number of words in the input sentence and also to the number of rules they use. This makes the prior art systems inapplicable to very large English texts such as the contents of a library.
Recently, as indicated above, Brill described a rule-based tagger which performs as well as taggers based upon probabilistic models and which overcomes the limitations common in rule-based approaches to language processing. It is robust and the rules are automatically acquired. In addition, the tagger requires drastically less space than stochastic taggers. However, current implementations of Brill's tagger are considerably slower than the ones based on probabilistic models since it may require RCn elementary steps to tag an input of n words with R rules requiring at most C words of context.
In Brill, as an example, 200 contextual tagging rules are applied one by one to each word to obtain its part-of-speech tag. This is relatively slow because each rule is applied individually to each word and because the approach is non-deterministic: the output of one rule may be changed by the output of a later rule. To increase speed, a deterministic system is desirable, in which only one part-of-speech choice is made after each word is read, without requiring more than one pass over the input sentence.
Note that Brill's tagger is comprised of three parts, each of which is inferred from a training corpus: a lexical tagger, an unknown word tagger and a contextual tagger. For the purpose of exposition, the discussion of the unknown word tagger is postponed and the focus of the following discussion is mainly the contextual rule tagger.
The notation for part-of-speech tags is as follows: "pps" stands for third-person singular nominative pronoun, "vbd" for verb in past tense, "np" for proper noun, "vbn" for verb in past participle form, "by" for the word "by", "at" for determiner, "nn" for singular noun and "bedz" for the word "was".
By way of background, the lexical tagger used by Brill initially tags by assigning each word its most likely tag, estimated by examining a large tagged corpus, without regard to context. For example, assuming that "vbn" is the most likely tag for the word "killed" and "vbd" for "shot", the lexical tagger might assign the following part-of-speech tags:
(1) Chapman/np killed/vbn John/np Lenon/np
(2) John/np Lenon/np was/bedz shot/vbd by/by Chapman/np
(3) He/pps witnessed/vbd Lenon/np killed/vbn by/by Chapman/np

rule 1: vbn vbd PREVTAG np
rule 2: vbd vbn NEXTTAG by

(4) Chapman/np killed/vbd John/np Lenon/np
(5) John/np Lenon/np was/bedz shot/vbd by/by Chapman/np
(6) He/pps witnessed/vbd Lenon/np killed/vbd by/by Chapman/np

(7) Chapman/np killed/vbd John/np Lenon/np
(8) John/np Lenon/np was/bedz shot/vbn by/by Chapman/np
(9) He/pps witnessed/vbd Lenon/np killed/vbn by/by Chapman/np
Since the lexical tagger used by Brill does not use any contextual information, many words can be wrongly tagged. For example, in (1) the word "killed" is erroneously tagged as a verb in past participle form, and in (2) "shot" is incorrectly tagged as a verb in past tense. Given the initial tagging obtained by the lexical tagger, in the Subject System a contextual tagger applies a sequence of rules in order and attempts to remedy the errors made by the initial tagging. For example, rule 1 and rule 2, listed with the examples above, might be found in a contextual tagger.
The first rule says to change tag "vbn" to "vbd" if the previous tag is "np". The second rule says to change tag "vbd" to "vbn" if the next tag is "by". Once the first rule is applied, the tag for "killed" in (1) and (3) is changed from "vbn" to "vbd", and tagged sentences (4) to (6) are obtained.
Once the second rule is applied, the tag for "shot" in (5) is changed from "vbd" to "vbn", resulting in (8), and the tag for "killed" in (6) is changed back from "vbd" to "vbn", resulting in (9).
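The two-stage process described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Brill's implementation: the `LEXICON` dictionary, the tuple encoding of rules, and the function names are assumptions introduced here, and only the PREVTAG and NEXTTAG templates are handled.

```python
# Hypothetical toy lexicon: the most likely tag of each word, following the
# assumption in the text that "killed" is most often vbn and "shot" vbd.
LEXICON = {
    "Chapman": "np", "John": "np", "Lenon": "np", "He": "pps",
    "killed": "vbn", "shot": "vbd", "witnessed": "vbd",
    "was": "bedz", "by": "by",
}

# Rules encoded as (old_tag, new_tag, template, context_tag); assumed encoding.
RULES = [
    ("vbn", "vbd", "PREVTAG", "np"),   # rule 1
    ("vbd", "vbn", "NEXTTAG", "by"),   # rule 2
]


def lexical_tag(words):
    """Initial tagging: each word gets its most likely tag, ignoring context."""
    return [LEXICON[w] for w in words]


def apply_rule(tags, rule):
    """Scan one sentence left to right and apply a single contextual rule."""
    old, new, template, ctx = rule
    out = list(tags)
    for i in range(len(out)):
        if out[i] != old:
            continue
        if template == "PREVTAG" and i > 0 and out[i - 1] == ctx:
            out[i] = new
        elif template == "NEXTTAG" and i + 1 < len(out) and out[i + 1] == ctx:
            out[i] = new
    return out


def contextual_tag(tags, rules):
    """Apply every rule in order; a later rule may undo an earlier change."""
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags


sentence = "He witnessed Lenon killed by Chapman".split()
initial = lexical_tag(sentence)            # tagging (3)
after_rule_1 = apply_rule(initial, RULES[0])   # tagging (6)
final = contextual_tag(initial, RULES)     # tagging (9)
```

In this sketch, `after_rule_1` differs from `initial` only in the tag of "killed", and `final` is identical to `initial`, which mirrors the change from (3) to (6) and back to (9).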
In Brill, the sequence of contextual rules is automatically inferred from a training corpus. A list of tagging errors, with their counts, is compiled by comparing the output of the lexical tagger to the correct part-of-speech assignment. Then, for each error, it is determined which instantiation of a set of rule templates results in the greatest error reduction. The set of new errors caused by applying the rule is then computed, and the process is repeated until the error reduction drops below a given threshold. Table I illustrates a set of contextual rule templates.
TABLE I
______________________________________
A B PREVTAG C            change A to B if previous tag is C
A B PREV1OR2OR3TAG C     change A to B if previous one or two or three tag is C
A B PREV1OR2TAG C        change A to B if previous one or two tag is C
A B NEXT1OR2TAG C        change A to B if next one or two tag is C
A B NEXTTAG C            change A to B if next tag is C
A B SURROUNDTAG C D      change A to B if surrounding tags are C and D
A B NEXTBIGRAM C D       change A to B if next two tags are C and D
A B PREVBIGRAM C D       change A to B if previous two tags are C and D
______________________________________
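One possible way to read the templates of Table I is as predicates over a tag sequence and a position. The Python sketch below is an assumption made for illustration: the names `TEMPLATES`, `tag_at` and `rule_fires` are hypothetical, and the ordering of C and D (C before D for the bigram templates, C to the left and D to the right for SURROUNDTAG) is an interpretation of the table, not something the source states explicitly.

```python
def tag_at(tags, i):
    """Return the tag at position i, or None if i is outside the sentence."""
    return tags[i] if 0 <= i < len(tags) else None


# Each predicate takes the tag list t, the position i of the tag to change,
# and the list c of context tags (one or two entries, depending on template).
TEMPLATES = {
    "PREVTAG": lambda t, i, c: tag_at(t, i - 1) == c[0],
    "PREV1OR2OR3TAG": lambda t, i, c: c[0] in (
        tag_at(t, i - 1), tag_at(t, i - 2), tag_at(t, i - 3)),
    "PREV1OR2TAG": lambda t, i, c: c[0] in (tag_at(t, i - 1), tag_at(t, i - 2)),
    "NEXT1OR2TAG": lambda t, i, c: c[0] in (tag_at(t, i + 1), tag_at(t, i + 2)),
    "NEXTTAG": lambda t, i, c: tag_at(t, i + 1) == c[0],
    "SURROUNDTAG": lambda t, i, c: (
        tag_at(t, i - 1) == c[0] and tag_at(t, i + 1) == c[1]),
    "NEXTBIGRAM": lambda t, i, c: (
        tag_at(t, i + 1) == c[0] and tag_at(t, i + 2) == c[1]),
    "PREVBIGRAM": lambda t, i, c: (
        tag_at(t, i - 2) == c[0] and tag_at(t, i - 1) == c[1]),
}


def rule_fires(tags, i, rule):
    """True if rule (A, B, TEMPLATE, C[, D]) would change the tag at position i."""
    a, _b, template, *context = rule
    return tags[i] == a and TEMPLATES[template](tags, i, context)


# Example: rule 1 from the text fires on "killed" in np vbn np np.
fires = rule_fires(["np", "vbn", "np", "np"], 1, ("vbn", "vbd", "PREVTAG", "np"))
```

Keeping the templates in a table of predicates makes each rule a plain tuple, so a learned rule list can be stored and applied without any per-rule code.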
After training the set of contextual rule templates described in Table I, 280 contextual rules are obtained. The resulting rule-based tagger performs as well as the state of the art taggers based upon probabilistic models and overcomes the limitations common in rule-based approaches to language processing: it is robust and the rules are automatically acquired. In addition, the tagger requires drastically less space than stochastic taggers. However, Brill's tagger is inherently slow.
Once the lexical assignment is performed, Brill's algorithm applies each contextual rule acquired during the training phase, one by one, to each sentence to be tagged. For each individual rule, the algorithm scans the input from left to right while attempting to trigger the rule. This simple algorithm is computationally inefficient for two reasons.
The first reason for inefficiency is that an individual rule is attempted on each token of the input, even though some of the current tokens may already have been examined when the same rule was attempted at a previous position. The algorithm works as if each rule were a template of tags that is slid along the input. Consider, for example, the rule A B PREVBIGRAM C C, which changes tag A to tag B if the previous two tags are C. When applied to the input C D C C A, three alignments are attempted, and at each step no record of previous partial matches or mismatches is kept, as can be seen from the following alignments of the pattern C C A against the input:

input:        C D C C A
alignment 1:  C C A          (mismatch at the second tag)
alignment 2:    C C A        (mismatch at the first tag)
alignment 3:      C C A      (match)
In this example, the second alignment could have been skipped by using the information from the first alignment.
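The sliding-template behavior can be made concrete with a short Python sketch. It is an illustration only: the function name `naive_alignments` and the choice to compare the pattern left to right are assumptions, and an actual implementation may test the tags in a different order.

```python
def naive_alignments(tags, pattern):
    """Slide the pattern along the tags with no memory between alignments.

    Returns one (start, comparisons, matched) tuple per attempted alignment.
    """
    trace = []
    for start in range(len(tags) - len(pattern) + 1):
        comparisons = 0
        matched = True
        for offset, expected in enumerate(pattern):
            comparisons += 1
            if tags[start + offset] != expected:
                matched = False
                break  # give up on this alignment at the first mismatch
        trace.append((start, comparisons, matched))
    return trace


# Rule A B PREVBIGRAM C C corresponds to the tag pattern C C A.
trace = naive_alignments(list("CDCCA"), list("CCA"))
# trace == [(0, 2, False), (1, 1, False), (2, 3, True)]
```

The first alignment already reads the D in the second position, so a procedure that remembered this could rule out the second alignment, which starts at that D, without any further comparison.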
The second reason for inefficiency is the potential interaction between rules. For example, when rule 1 and rule 2 are applied to the sentence
"He/pps witnessed/vbd Lenon/np killed/vbn by/by Chapman/np" the first rule results in the change:
"He/pps witnessed/vbd Lenon/np killed/vbd by/by Chapman/np" which is undone by the second rule, resulting in
"He/pps witnessed/vbd Lenon/np killed/vbn by/by Chapman/np"
The algorithm may therefore perform unnecessary computation. In summary, because each of the R contextual rules is attempted at each of the n positions of the input, and each attempt may examine up to C tokens of context, Brill's algorithm for implementing the contextual tagger may require RCn elementary steps to tag an input of n words.