The present invention relates to a method and apparatus for processing natural language using operations performed on weighted and non-weighted multi-tape automata.
Finite state automata (FSAs) are mathematically well defined and offer many practical advantages. They allow for fast processing of input data and are easily modified and combined by well-defined operations. Consequently, FSAs are widely used in Natural Language Processing (NLP) as well as many other fields. A general discussion of FSAs is provided in Patent Application Publication US 2003/0004705 A1 and in “Finite State Morphology” by Beesley and Karttunen (CSLI Publications, 2003), which are incorporated herein by reference.
Weighted finite state automata (WFSAs) combine the advantages of ordinary FSAs with the advantages of statistical models, such as Hidden Markov Models (HMMs), and hence have a potentially wider scope of application than FSAs. Weighted multi-tape automata (WMTAs) have yet more advantages. For example, WMTAs permit the separation of different types of information used in NLP (e.g., surface word form, lemma, POS-tag, domain-specific information) over different tapes, and preserve intermediate results of different steps of NLP on different tapes. Operations on WMTAs may be specified to operate on one, several, or all tapes.
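By way of illustration only, the separation of information over tapes can be sketched by listing the paths of a small WMTA as weighted string tuples. The tape names, the example entries, the helper `project`, and the treatment of weights as costs combined by taking the minimum are all assumptions made for this sketch and are not taken from the description above.

```python
# Hypothetical 3-tape relation: each path is a tuple of strings, one per tape,
# mapped to a weight (assumed here to be a cost, lower is better).
TAPES = ("surface", "lemma", "pos")

analyses = {
    ("flies", "fly", "NOUN"): 0.5,
    ("flies", "fly", "VERB"): 1.5,
    ("walks", "walk", "VERB"): 1.0,
}

def project(relation, tapes, keep):
    """Keep only the named tapes; merge paths that become identical (min cost)."""
    indices = [tapes.index(name) for name in keep]
    result = {}
    for strings, weight in relation.items():
        key = tuple(strings[i] for i in indices)
        result[key] = min(weight, result.get(key, float("inf")))
    return result

surface_lemma = project(analyses, TAPES, ("surface", "lemma"))
print(surface_lemma)  # {('flies', 'fly'): 0.5, ('walks', 'walk'): 1.0}
```

An operation that refers to a single tape, such as the projection above, leaves the strings on the remaining tapes untouched.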
While some basic WMTA operations, such as union, concatenation, projection, and complementary projection, have been defined for a sub-class of non-weighted multi-tape automata (see for example the publication by Kaplan and Kay, “Regular models of phonological rule systems”, in Computational Linguistics, 20(3):331-378, 1994) and implemented (see for example the publication by Kiraz and Grimley-Evans, “Multi-tape automata for speech and language systems: A prolog implementation”, in D. Woods and S. Yu, editors, Automata Implementation, number 1436 in Lecture Notes in Computer Science, Springer Verlag, Berlin, Germany, 1998), there continues to be a need for improved, simplified, and more efficient operations for processing WMTAs to make use of these advantages in natural language processing.
In accordance with the invention, there is provided a method and apparatus for using weighted multi-tape automata (WMTAs) in natural language processing (NLP) that includes morphological analysis, part-of-speech (POS) tagging, disambiguation, and entity extraction. In performing NLP, operations are employed that perform cross-product, auto-intersection, and tape-intersection (i.e., single-tape intersection and multi-tape intersection) of automata. Such operations may be performed using transition-wise processing on weighted or non-weighted multi-tape automata.
In accordance with one aspect of the invention (referred to herein as the “tape-intersection” operation, or single-tape intersection for one tape or multi-tape intersection for a plurality of tapes), there is provided in a system for processing natural language, a method for intersecting tapes of a first multi-tape automaton (MTA) and a second MTA, with each MTA having a plurality of tapes and a plurality of paths. The method includes composing the first MTA and the second MTA by intersecting a first tape of the first MTA with a first tape of the second MTA to produce an output MTA. The first tape of the first MTA and the first tape of the second MTA correspond to a first intersected tape and a second intersected tape of the output MTA, respectively. At least one of the first and the second intersected tapes is removed from the output MTA, while all of its other tapes are preserved without modification.
In accordance with another aspect of the invention, there is provided in a system for processing natural language, a method for intersecting tapes of a first multi-tape automaton (MTA) and a second MTA, with each MTA having a plurality of tapes and a plurality of paths. The method includes: (a) computing a cross-product MTA using the first MTA and the second MTA; (b) generating string tuples for paths of the cross-product MTA; (c) for each string tuple generated at (b), evaluating whether the string of a first tape equals the string of a second tape; (d) for each string tuple evaluated at (c) having equal strings at the first and second tapes, retaining the corresponding string tuple in the cross-product MTA; (e) for each string tuple evaluated at (c) having unequal strings at the first and second tapes, restructuring the cross-product MTA to remove the corresponding string tuple; and (f) removing redundant strings in the string tuples retained in the cross-product MTA at (d) to produce an output MTA.
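Steps (a) through (f) can be illustrated on finite relations, where each MTA is stood in for by the explicit set of its string tuples. This is a minimal sketch under stated assumptions rather than an automaton-level implementation: the function names, the dictionary representation, and the use of weights as costs (added along the cross-product, minimized when tuples merge) are hypothetical choices made only for the example.

```python
def cross_product(r1, r2):
    """(a)-(b): pair every tuple of r1 with every tuple of r2; costs are added."""
    return {t1 + t2: w1 + w2 for t1, w1 in r1.items() for t2, w2 in r2.items()}

def auto_intersect(relation, j, k):
    """(c)-(e): retain only tuples whose strings on tapes j and k are equal."""
    return {t: w for t, w in relation.items() if t[j] == t[k]}

def drop_tape(relation, k):
    """(f): remove the now redundant tape k, merging equal tuples by min cost."""
    result = {}
    for t, w in relation.items():
        key = t[:k] + t[k + 1:]
        result[key] = min(w, result.get(key, float("inf")))
    return result

def tape_intersect(r1, r2, j, k):
    """Intersect tape j of r1 with tape k of r2 and keep one copy of that tape."""
    n1 = len(next(iter(r1), ()))  # number of tapes in r1 (0 if r1 is empty)
    product = cross_product(r1, r2)
    kept = auto_intersect(product, j, n1 + k)
    return drop_tape(kept, n1 + k)

# Hypothetical data: (surface, lemma) tuples and (lemma, POS-tag) tuples.
lexicon = {("flies", "fly"): 1.0, ("walks", "walk"): 2.0}
tags = {("fly", "NOUN"): 0.5, ("fly", "VERB"): 1.5, ("run", "VERB"): 0.25}

print(tape_intersect(lexicon, tags, 1, 0))
# {('flies', 'fly', 'NOUN'): 1.5, ('flies', 'fly', 'VERB'): 2.5}
```

Enumerating the full cross-product is only practical for small finite relations; it is shown here because it mirrors the order of the steps, not because it is efficient.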
In accordance with yet another aspect of the invention, there is provided in a system for processing natural language, a method for intersecting a first tape of a first multi-tape automaton (MTA) and a second tape of a second MTA, with each MTA having a plurality of tapes and a plurality of paths. The method includes: defining a simulated filter automaton (SFA) that controls how epsilon-transitions are composed along pairs of paths in the first MTA and the second MTA; building an output MTA by: (a) creating an initial state from the initial states of the first MTA, the second MTA, and the SFA; (b) intersecting a selected outgoing transition of the first MTA with a selected outgoing transition of the second MTA, where each outgoing transition has a source state, a target state, and a label; (c) if the label of the first tape of the selected outgoing transition of the first MTA equals the label of the second tape of the selected outgoing transition of the second MTA, creating (i) a transition in the output MTA whose label results from pairing the labels of the selected outgoing transitions, and (ii) a target state corresponding to the target states of the selected outgoing transitions and the initial state of the SFA; (d) if an epsilon transition is encountered on the first tape, creating a transition in the output MTA with a target state that is a function of (i) the target state of the outgoing transition of the first MTA, (ii) the source state of the outgoing transition of the second MTA, and (iii) a first non-initial state of the SFA; (e) if an epsilon transition is encountered on the second tape, creating a transition in the output MTA with a target state that is a function of (i) the source state of the outgoing transition of the first MTA, (ii) the target state of the outgoing transition of the second MTA, and (iii) a second non-initial state of the SFA; and (f) repeating (b)-(e) for each outgoing transition of the first MTA and the second MTA.
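The transition-wise construction of steps (a) through (f) can be sketched for non-weighted automata as follows. The dictionary format of the automata, the encoding of epsilon as the empty string, the padding of unpaired labels with epsilons, and the helper `paths` are assumptions of this sketch. In particular, the rule deciding in which filter states each kind of move is permitted follows the three-state epsilon filter commonly used in transducer composition; it is assumed here and is not spelled out in the paragraph above.

```python
from collections import deque

EPS = ""  # assumed encoding of the empty (epsilon) symbol

def intersect_tapes(a1, a2, j, k):
    """Intersect tape j of a1 with tape k of a2, transition by transition.

    Hypothetical format: {"tapes": n, "start": q, "finals": {q, ...},
                          "arcs": {q: [(label_tuple, target), ...]}}
    """
    pad1, pad2 = (EPS,) * a1["tapes"], (EPS,) * a2["tapes"]
    start = (a1["start"], a2["start"], 0)          # (a) filter state 0 is initial
    arcs, finals = {}, set()
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        q1, q2, f = state
        if q1 in a1["finals"] and q2 in a2["finals"]:
            finals.add(state)
        out1, out2 = a1["arcs"].get(q1, []), a2["arcs"].get(q2, [])
        moves = []
        for l1, n1 in out1:                        # (b) pair outgoing transitions
            for l2, n2 in out2:
                # (c) equal labels; an epsilon/epsilon pair only in filter state 0
                if l1[j] == l2[k] and (l1[j] != EPS or f == 0):
                    moves.append((l1 + l2, (n1, n2, 0)))
        if f in (0, 1):                            # (d) only the first MTA advances
            moves += [(l1 + pad2, (n1, q2, 1)) for l1, n1 in out1 if l1[j] == EPS]
        if f in (0, 2):                            # (e) only the second MTA advances
            moves += [(pad1 + l2, (q1, n2, 2)) for l2, n2 in out2 if l2[k] == EPS]
        arcs[state] = moves
        for _, target in moves:                    # (f) continue with new states
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return {"tapes": a1["tapes"] + a2["tapes"], "start": start,
            "finals": finals, "arcs": arcs}

def paths(m):
    """List the string tuples of all accepting paths (acyclic automata only)."""
    found = []
    def walk(state, acc):
        if state in m["finals"]:
            found.append(acc)
        for label, target in m["arcs"].get(state, []):
            walk(target, tuple(s + x for s, x in zip(acc, label)))
    walk(m["start"], (EPS,) * m["tapes"])
    return found

# Hypothetical 2-tape automata; tape 0 of each spells "ab".
a1 = {"tapes": 2, "start": 0, "finals": {3},
      "arcs": {0: [(("a", "x"), 1)], 1: [((EPS, "y"), 2)], 2: [(("b", "z"), 3)]}}
a2 = {"tapes": 2, "start": 0, "finals": {2},
      "arcs": {0: [(("a", "1"), 1)], 1: [(("b", "2"), 2)]}}

result = intersect_tapes(a1, a2, 0, 0)
print(paths(result))  # [('ab', 'xyz', 'ab', '12')]
```

The third component of each output state plays the role of the filter: it is never built as a separate automaton, and it prevents the same pair of input paths from being reproduced several times through different interleavings of epsilon transitions. The output still carries both intersected tapes, so one of them can be removed afterwards.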
It will be appreciated that the present invention has advantages over weighted 1-tape or 2-tape processing of automata because it allows for: (a) the separation of different types of information used in NLP (e.g., surface form, lemma, POS-tag, domain-specific information, etc.) over different tapes; (b) the preservation of some or all intermediate results of various NLP steps on different tapes; and (c) the possibility of defining and implementing contextual replace rules referring to different types of information on different tapes.