The invention relates to finite state data structures in which paths represent paired strings of tags and tag combinations.
Techniques are well known for performing part-of-speech tagging of machine-readable text by automatically obtaining tags for words in a sentence or other text segment or corpus. Each tag can identify a word""s part-of-speech (e.g. noun-singular, verb-present-1st person plural, etc.). Such tags typically have a standardized form, such as the form specified under the Text Encoding Initiative (TEI), described at the Web site of the TEI Consortium, http://www.tei-c.org/. The tags for a text segment or corpus can be used, for example, in information retrieval from the text and statistical analysis of the text.
Schabes et al., U.S. Pat. No. 5,610,812, describe a contextual tagger that uses a deterministic finite state transducer (FST) to improve tagging speed. The contextual tagger is constructed by examining a training corpus of tagged text to acquire a set of rules and by then transforming the rules into a deterministic FST using non-deterministic transducers, a composer, and a determinizer. To tag an input sentence, the sentence is initially tagged by assigning to each word its most likely part-of-speech tag regardless of surrounding words, and the deterministic FST is then applied to the resulting sequence of part-of-speech tags using the surrounding words to obtain final part of speech tags. In other words, the deterministic FST is applied to the initial sequence of tags to obtain the final sequence of tags.
Chamiak, E., xe2x80x9cStatistical Techniques for Natural Language Parsingxe2x80x9d, AI Magazine, Winter 1997, pp. 33-43 describes techniques that use probabilistic Hidden Markov Model (HMM) based taggers. Charniak describes an HMM as a finite automaton in which the state transitions have probabilities and whose output is also probabilistic, and shows a fragment of an HMM for tagging in which each state is labelled with a part of speech, in which state-to-state transitions have probabilities, and in which output words for a state also have probabilities. Such taggers select, from among the part-of-speech tags that are found for a word in the dictionary, the most probable tag based on the word""s context, i.e. based on the adjacent words.
Chanod, J.-P., and Tapanainen, P., xe2x80x9cTagging Frenchxe2x80x94Comparing a Statistical and a Constraint Based Methodxe2x80x9d, in Proceedings of the 7th Conference of the EACL, Dublin, Ireland, ACL, 1995, pp. 149-156, compare a statistical approach with a constraint-based approach to part-of-speech tagging. Both approaches tokenize with an FST to obtain words and perform morphological analysis with a transducer lexicon to obtain a set of part-of-speech tags for each word. The statistical approach then disambiguates the sets of tags using an HMM, which can be thought of as encoding a relation between two languages, one including sequences of ambiguity classes and the other including sequences of tags. In contrast, the constraint-based approach disambiguates the sets of tags with FSTs derived from a set of rules written by linguists. It is well known that such an FST may be non-deterministic, i.e. for an input sentence it may provide more than one solution or no solutions at all.
The invention addresses problems that arise with conventional automatic tagging techniques.
Some conventional techniques use each word in a sequence to obtain a most probable tag, as exemplified by the initial tagger described by Schabes et al. Such techniques produce many tagging errors, however, because a word that has several possible tags will often occur with tags other than its most probable tag. Other conventional techniques attempt to correct tagging errors based on context, as exemplified by the contextual tagger of Schabes et al. Still other conventional techniques, exemplified by the HMM-based taggers of Charniak and by the statistical and constraint-based taggers of Chanod and Tapanainen, reduce tagging errors by obtaining one of a word""s possible tags based in part on adjacent words.
In general, conventional techniques are subject to a tension between accuracy and speed that limits performance. HMM-based taggers, for example, can be very accurate but are relatively slow in comparison with FST taggers; for example, an HMM-based tagger implemented as an FST modified to include weights on transitions is slower than an FST without weights because significant additional computation is necessary to use the weights. On the other hand, conventional FST taggers are generally less accurate than HMM taggers. Conventional FST tagging techniques that employ an FST automatically derived from a hand-tagged corpus seem to reach a limit of accuracy beyond which improvement can only be obtained by time-consuming additional measures to correct errors. Conventional FST tagging techniques that employ an FST derived from manually written constraints can achieve greater accuracy but require a great deal of work to write the constraints.
The invention alleviates the tension between speed and accuracy with new techniques that can be used to provide fast, accurate tagging. The techniques can be understood by considering that conventional disambiguators like those described by Chanod and Tapanainen, above, behave like a sequential transducer that deterministically maps a sequence of ambiguity classes (corresponding, for example, to a sentence) into a unique sequence of tags, e.g.:   "AutoLeftMatch"                              [          DET          ]                                      [                      ADJ            ,            NOUN                    ]                                      [                      ADJ            ,            NOUN                    ]                            …                              [          END          ]                                    DET                    ADJ                    NOUN                    …                    END            
Each technique provided by the invention involves a finite state data structure (FSDS) that can be used to map strings of tag combinations to strings of tags for tokens in a language. Each tag combination can be a logical combination of tags, such as an ambiguity class which is a conjunction of a set of tags. The FSDS can include paths representing pairs of strings. A first string in a pair can be a string of tag combinations, while a second can be a string of tags. The second strings of a set of paths with the same first string can include only highly probable strings of tags for the first string. A path can be machine accessible using its first string to obtain its second string.
The new techniques can thus be used to perform finite state tagging in two stages: A first stage uses tokens in a text to obtain a string of tag combinations, such as for a sentence or other string of tokens. A second stage uses the string of tag combinations to obtain a string of tags. If the string of tag combinations is the first string of a set of paths as described above, the string of tags is a highly probable disambiguation of the string of tag combinations.
Because the FSDS can be an FST, it is not surprising that the new techniques permit tagging at speeds comparable to conventional FST tagging. But it is surprising that the new techniques provide tagging with a significantly improved combination of speed and accuracy, approaching the accuracy of HMM taggers.
The new techniques can be implemented in an article of manufacture that includes an FSDS as described above, in a machine that includes such an FSDS, and in methods for using and transferring such an FSDS.
The FSDS can, for example, be an FST or it can be a bimachine that includes two FSTs. In either case, the first and second strings of a pair of strings represented by a path in the FSDS can be of equal length, with each tag combination in the first string having a counterpart tag in the second string. At least one tag combination in the first string can include two or more tags, and the counterpart tag in the second string can be one of the tags in the tag combination. The tags can be part-of-speech tags, so that each tag combination includes one or more parts of speech.
An FST as described above can be produced in several different ways. Some ways obtain a set that includes possible tags for tokens in the language and another set that includes tag combinations, then use the two sets to produce an FST. Other ways obtain a first partial set of tag combination sequences ending with a tag combination with only one tag, then obtain a second partial set of combination-tag pair sequences in which the tag combinations in each sequence in the first partial set are paired with tags, and then use the second partial set to produce an FST.
Some ways produce FSTs that approximate a Hidden Markov Model (HMM) used for part-of-speech tagging. For example, the FST can be (a) an xe2x80x9cn-typexe2x80x9d approximation of an HMM, which is a simple FST that can be applicable in all cases, and can be derived from HMM probability matrices; (b) an xe2x80x9cs-typexe2x80x9d approximation of an HMM, which is a precise HMM-approximating transducer for selected cases taken from a training corpus; (c) a completed precise s-type transducer, completed for example with sequences from a simple n-type transducer, to make the precise transducer applicable in all cases; or (d) a xe2x80x9cb-typexe2x80x9d approximation of an HMM, which is an HMM-approximating transducer that can have variable precision and can be applicable in all cases.
A bimachine as described above can include two FSTs, a first FST for mapping from strings of complete tag combinations to strings of partial tag combinations and a second FST for mapping from strings of partial tag combinations to strings of tags. The second FST, considered separately, is thus an FST as described above, and the bimachine itself is also an FSDS as described above.
The new techniques can be implemented in a method of tagging a text using an FSDS as described above.
The new techniques can also be implemented in a system that includes an FSDS as described above and a processor that accesses the FSDS to perform tagging. The new techniques can also be implemented in a method for transferring an FSDS as described above from one machine to another machine.
Compared with HMM taggers of approximately the same accuracy, an advantage of the new techniques is that they permit tagging speeds up to ten times higher than when using the HMM taggers. An important reason for this is that, in selecting highly probable tags, conventional HMM tagging techniques perform complex computations using probabilities or weights for transitions. In contrast, the new techniques make it possible to obtain a highly probable string of tags directly from a finite state data structure without performing computations using weights.
Another advantage of the new techniques is that, unlike conventional HMM taggers, FSDSs can be integrated with tools that perform further steps of text analysis. An FST, for example, can be handled by the finite state calculus and therefore directly integrated into other finite-state text processing tools, such as those available from Xerox Corporation and elsewhere. Since the tagger is often in the middle of a chain of text analysis tools where all the other tools may be finite-state-based (which can be the case with text processing tools available from Xerox Corporation), performing tagging with an FST or other FSDS rather than with an HMM tagger can make this chain homogeneous, thus making it possible to merge the chain""s components into one single FST, by means of composition.
Implementations of the new techniques that approximate an HMM are further advantageous because, in comparison with tagging techniques that employ an FST automatically derived from a hand-tagged corpus, the new techniques can obtain an FST automatically from an untagged corpus: First an HMM can be trained on the untagged corpus; then, an FST can be automatically obtained from the HMM.
In addition, the new techniques have applications other than tagging, such as in speech recognition.
Bimachine implementations of the new techniques are further advantageous because, in comparison with HMM taggers and with single FSTs, the bimachine data structures require significantly less memory.