1. Field of the Invention
The invention relates to finite-state language processing, and more particularly to methods for efficiently processing finite-state networks in language processing and other applications.
2. Description of Related Art
Many basic steps in language processing, ranging from tokenization to phonological and morphological analysis, disambiguation, spelling correction, and shallow parsing can be performed efficiently by means of finite-state transducers. Such transducers are generally compiled from regular expressions, a formal language for representing sets and relations. Although regular expressions and methods for compiling them into automata have been part of elementary computer science for decades, the application of finite-state transducers to natural-language processing has given rise to many extensions to the classical regular-expression calculus.
The term language is used herein in a general sense to refer to a set of strings of any kind. A string is a concatenation of zero or more symbols. In the examples set forth below, the symbols are, in general, single characters such as “a”, but user-defined multicharacter symbols such as “+Noun” are also possible. Multicharacter symbols are considered as atomic entities rather than as concatenations of single-character strings. A string that contains no symbols at all is called the empty string, and the language that contains the empty string but no other strings is known as the empty string language. A language that contains no strings at all, not even the empty string, is called the empty language or null language. The language that contains every possible string of any length is called the universal language.
A set of ordered string pairs such as {<“a”, “bb”>, <“cd”, “”>} is called a relation. The first member of a pair is called the upper string, and the second member is called the lower string. A string-to-string relation is a mapping between two languages: the upper language and the lower language. They correspond to what is usually called the domain and the range of a relation. In this case, the upper language is {“a”, “cd”} and the lower language is {“bb”, “”}. A relation such as {<“a”, “a”>} in which every pair contains the same string twice is called an identity relation. If a relation pairs every string with a string that has the same length, the relation is an equal-length relation. Every identity relation is obviously an equal-length relation.
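By way of illustration only, the definitions above can be sketched with finite relations held as Python sets of (upper, lower) pairs. The function names are hypothetical, and the sketch assumes single-character symbols so that string length equals symbol count; multicharacter symbols would need a tuple-of-symbols representation instead.

```python
# The example relation from the text, as (upper, lower) string pairs.
relation = {("a", "bb"), ("cd", "")}

def upper_language(rel):
    """Set of upper strings (the domain of the relation)."""
    return {upper for upper, _ in rel}

def lower_language(rel):
    """Set of lower strings (the range of the relation)."""
    return {lower for _, lower in rel}

def is_identity(rel):
    """True if every pair contains the same string twice."""
    return all(upper == lower for upper, lower in rel)

def is_equal_length(rel):
    """True if every pair relates strings of the same length.

    Assumes one Python character per symbol.
    """
    return all(len(upper) == len(lower) for upper, lower in rel)
```

Under these definitions any relation for which is_identity holds also satisfies is_equal_length, which mirrors the last sentence of the paragraph above.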
Finite-state automata are considered to be networks, or directed graphs that consist of states and labeled arcs. A network contains a single initial state, also called the start state, and any number of final states. In the figures presented herewith, states are represented as circles and arcs are represented as arrows. In the included diagrams, the start state is always the leftmost state and final states are marked by a double circle. Each state acts as the origin for zero or more arcs leading to some destination state. A sequence of arcs leading from the initial state to a final state is called a path. An arc may be labeled either by a single symbol such as “a” or a symbol pair such as “a:b”, where “a” designates the symbol on the upper side of the arc and “b” the symbol on the lower side. If all the arcs of a network are labeled by a single symbol, the network is called a simple automaton; if at least one label is a symbol pair the network is a transducer. Simple finite-state automata and transducers will not be treated as different types of mathematical objects herein. The framework set forth herein reflects closely the data structures in the Xerox implementation of finite-state networks.
A few simple examples illustrating some linguistic applications of finite-state networks are set forth below. The following sections will describe how such networks can be constructed.
Every path in a finite-state network encodes a string or an ordered pair of strings. The totality of paths in a network encodes a finite-state language or a finite-state relation. For example, the network illustrated in FIG. 1 encodes the language {“clear”, “clever”, “ear”, “ever”, “fat”, “fatter”}.
Each state in FIG. 1 has a number, thereby facilitating references to paths through the network. There is a path for each of the six words in the language. For example, the path <0-e-3-v-9-e-4-r-5> represents the word “ever”. A finite-state network is a very efficient encoding for a word list because all words beginning and ending in the same way can share a part of the network and every path is distinct from every other path.
If the number of words in a language is finite, then the network that encodes it is acyclic; that is, no path in the network loops back onto itself. Such a network also provides a perfect hash function for the language, a function that assigns or maps each word to a unique number in the range from 0 to n−1, where n is the number of paths in the network.
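By way of illustration only, one common way to obtain such a perfect hash is to count the paths leaving each state and to sum the counts of the arcs that are skipped while a word is followed. The sketch below assumes an unminimized, trie-shaped network rather than the shared network of FIG. 1, and all function names are hypothetical.

```python
def build_trie(words):
    """Build an acyclic network as a list of {symbol: next_state} dicts."""
    arcs = [{}]          # state 0 is the start state
    finals = set()
    for word in words:
        state = 0
        for symbol in word:
            if symbol not in arcs[state]:
                arcs.append({})
                arcs[state][symbol] = len(arcs) - 1
            state = arcs[state][symbol]
        finals.add(state)
    return arcs, finals

def count_paths(arcs, finals):
    """Number of paths from each state to a final state.

    Child states always have larger numbers than their parents here,
    so a single backward sweep suffices.
    """
    counts = [0] * len(arcs)
    for state in range(len(arcs) - 1, -1, -1):
        counts[state] = (1 if state in finals else 0) + sum(
            counts[dest] for dest in arcs[state].values())
    return counts

def word_to_index(word, arcs, finals, counts):
    """Map a word to its rank in sorted order, or None if absent."""
    state, index = 0, 0
    for symbol in word:
        if symbol not in arcs[state]:
            return None
        if state in finals:
            index += 1   # a shorter word ending here sorts first
        for other in sorted(arcs[state]):
            if other < symbol:
                index += counts[arcs[state][other]]
        state = arcs[state][symbol]
    return index if state in finals else None

words = ["clear", "clever", "ear", "ever", "fat", "fatter"]
arcs, finals = build_trie(words)
counts = count_paths(arcs, finals)
indices = {w: word_to_index(w, arcs, finals, counts) for w in words}
```

With this numbering the six words of the example language receive the indices 0 through 5 in alphabetical order, and a string that is not in the language receives no index.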
The network illustrated in FIG. 2 is an example of a lexical transducer. It encodes the relation {<“leaf+NN”, “leaf”>, <“leaf+NNS”, “leaves”>, <“left+JJ”, “left”>, <“leave+NN”, “leave”>, <“leave+NNS”, “leaves”>, <“leave+VB”, “leave”>, <“leave+VBZ”, “leaves”>, <“leave+VBD”, “left”>}. The substrings beginning with “+” are multicharacter symbols.
In order to make the diagrams less cluttered, it is traditional to combine several arcs into a single multiply-labeled arc. For example, the arc from state 5 to state 6 abbreviates four arcs that have the same origin and destination but a different label: “+NN:0”, “+NNS:s”, “+VB:0”, “+VBZ:s”. In this example, “0” is the epsilon symbol, standing for the empty string. Another important convention illustrated in FIG. 2 is that identity pairs such as “e:e” are represented as a single symbol “e”. Because of this convention, the network in FIG. 1 could also be interpreted as a transducer for the identity relation on the language.
The lower language of the lexical transducer in FIG. 2 consists of inflected surface forms “leaf”, “leave”, “leaves”, and “left” (i.e., the language to be modeled). The upper language consists of the corresponding lexical forms or lemmas, each containing a citation form of the word followed by a part-of-speech tag.
Lexical transducers can be used for analysis or for generation. For example, to find the analyses for the word “leaves”, one needs to locate the paths that contain the symbols “l”, “e”, “a”, “v”, “e”, and “s” in that order on the lower side of the arc labels. The network in FIG. 2 contains three such paths:
0-l-1-e-2-a-3-v-4-e-5-+NNS:s-6,
0-l-1-e-2-a-3-v-4-e-5-+VBZ:s-6,
0-l-1-e-2-a-3-f:v-8-+NNS:e-9-0:s-6.
The result of the analysis is obtained by concatenating the symbols on the upper side of the paths: “leave+NNS”, “leave+VBZ”, and “leaf+NNS”.
The process of generating a surface form from a lemma, say “leave+VBD”, is the same as for analysis except that the input form is matched against the upper side arc labels and the output is produced from the opposite side of the successful path or paths. In the case at hand, there is only one matching path:
0-l-1-e-2-a:f-12-v:t-13-e:0-14-+VBD:0-6
This path maps “leave+VBD” to “left”, and vice versa.
The term “apply” is used herein to describe the process of finding the path or paths that match a given input and returning the output. As the example above shows, a transducer can be applied downward or upward. There is no privileged input side. In the implementation described here, transducers are inherently bi-directional.
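By way of illustration only, the following sketch applies a transducer upward and downward by a depth-first search over matching arcs. The network contains only the arcs that lie on the four paths cited above, not the whole of FIG. 2; the data layout and the name apply_network are hypothetical, the empty string stands for the epsilon symbol, and no guard against epsilon cycles is included because this fragment is acyclic.

```python
# state -> list of (upper symbol, lower symbol, destination state)
ARCS = {
    0: [("l", "l", 1)],
    1: [("e", "e", 2)],
    2: [("a", "a", 3), ("a", "f", 12)],
    3: [("v", "v", 4), ("f", "v", 8)],
    4: [("e", "e", 5)],
    5: [("+NNS", "s", 6), ("+VBZ", "s", 6)],
    8: [("+NNS", "e", 9)],
    9: [("", "s", 6)],
    12: [("v", "t", 13)],
    13: [("e", "", 14)],
    14: [("+VBD", "", 6)],
}
FINALS = {6}

def apply_network(symbols, direction):
    """Return the outputs of all paths whose input side matches `symbols`.

    direction "up" matches the lower side and outputs the upper side;
    direction "down" does the opposite.
    """
    results = []

    def walk(state, pos, out):
        if pos == len(symbols) and state in FINALS:
            results.append("".join(out))
        for upper, lower, dest in ARCS.get(state, []):
            inp, outp = (lower, upper) if direction == "up" else (upper, lower)
            if inp == "":                      # epsilon on the input side
                walk(dest, pos, out + [outp])
            elif pos < len(symbols) and symbols[pos] == inp:
                walk(dest, pos + 1, out + [outp])

    walk(0, 0, [])
    return results

analyses = apply_network(list("leaves"), "up")
surface = apply_network(["l", "e", "a", "v", "e", "+VBD"], "down")
```

Here analyses holds the three lexical forms “leave+NNS”, “leave+VBZ”, and “leaf+NNS”, and surface holds the single form “left”. The same function serves both directions because only the roles of the two sides of each arc label are exchanged.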
Lexical transducers provide a very efficient method for morphological analysis and generation. A comprehensive analyzer for a language such as English, French, or German contains tens of thousands of states and hundreds of thousands of arcs, but it can be compressed to a relatively small size in the range of approximately 500 KB to 2 MB.
A relation may contain an infinite number of ordered pairs. One example of such a relation is the mapping from all lowercase strings to the corresponding uppercase strings. This relation contains an infinite number of pairs such as <“abc”, “ABC”>, <“xyzzy”, “XYZZY”>, and so on. FIG. 3 sketches the corresponding lower/upper case transducer. The path that relates “xyzzy” to “XYZZY” cycles many times through the single state of the transducer. FIG. 4 shows that path in linearized form.
The lower/upper case relation may be thought of as the representation of a simple orthographic rule. In fact, all kinds of string-changing rules may be viewed in this way, that is, as infinite string-to-string relations. The networks that represent phonological rewrite rules, two-level rules, or the GEN relation in Optimality Theory are of course in general more complex than the simple transducer illustrated in FIG. 3.
FIG. 4 may also be interpreted in another way, that is, as representing the application of the upper/lower case rule to the string “xyzzy”. In fact, rule application is formally a composition of two relations; in this case, the identity relation on the string “xyzzy” and the upper/lower case relation in FIG. 3.
A composition is an operation on two relations. If one relation contains the pair <x, y> and the other relation contains the pair <y, z>, the relation resulting from composing the two will contain the pair <x, z>. Composition brings together the “outside” components of the two pairs and eliminates the common one in the middle. For example, the composition of {<“leave+VBD”, “left”>} with the lower/upper case relation yields the relation {<“leave+VBD”, “LEFT”>}.
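By way of illustration only, composition can be sketched on finite relations held as sets of pairs. Because the lower/upper case relation is infinite, the sketch assumes a small finite sample of it; the names compose, lexical, and case_sample are hypothetical.

```python
def compose(first, second):
    """Pair x with z whenever first has <x, y> and second has <y, z>."""
    return {(x, z)
            for x, y in first
            for y2, z in second
            if y == y2}

# The one-pair lexical relation from the text.
lexical = {("leave+VBD", "left")}

# A finite sample of the (infinite) lower/upper case relation.
case_sample = {(w, w.upper()) for w in ("left", "leaf", "leaves")}

composed = compose(lexical, case_sample)
```

The resulting set contains only the pair <“leave+VBD”, “LEFT”>; the shared middle string “left” no longer appears, and the sample pairs with no match in the first relation contribute nothing.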
It is useful to have a general idea of how composition is carried out when string-to-string relations are represented by finite-state networks. Composition is advantageously thought of as a two-step procedure. First, the paths of the two networks that have a matching string in the middle are lined up and merged, as shown in FIG. 5. For the sake of perspicuity, the upper and lower symbols are shown explicitly on different sides of the arc except that zero (i.e., epsilon) is represented by a blank. The string “left” is then eliminated in the middle, yielding the transducer in FIG. 6 that directly maps “leave+VBD” to “LEFT”.
Once rule application is thought of as composition, it immediately can be seen that a rule can be applied to several words, or even infinitely many words at the same time if the words are represented by a finite-state network. Lexical transducers are typically created by composing a set of transducers for orthographic rules with a transducer encoding the source lexicon. Two rule transducers can also be composed with one another to yield a single transducer that gives the same result as the successive application of the original rules. This is a well-known fundamental insight in computational phonology.
The formal properties of finite-state automata are considered briefly below. All the networks presented in this background have the three important properties defined in Table 1.
If a network encodes a regular language and if it is epsilon-free, deterministic and minimal, the network is guaranteed to be the best encoding for that language in the sense that any other network for the same language has the same number of states and arcs and differs only with respect to the order of the arcs, which generally is irrelevant.
The situation is more complex in the case of regular relations. Even if a transducer is epsilon-free, deterministic, and minimal in the sense of Table 1, there may still be another network with fewer states and arcs for the same relation. If the network has arcs labeled with a symbol pair that contains an epsilon on one side, these one-sided epsilons could be distributed differently, or perhaps even eliminated, and this might reduce the size of the network. For example, the two networks in FIGS. 7 and 8 encode the same relation, {<“aa”, “a”>, <“ab”, “ab”>}. They are both deterministic and minimal but one is smaller than the other due to a more optimal placement of the one-sided epsilon transition. In the general case there is no way to determine whether a given transducer is the best encoding for an arbitrary relation.
For transducers, the intuitive notion of determinism makes sense only with respect to a given direction of application. But there are still two ways to think about determinism, as shown in Table 2.
Although the transducers in FIGS. 7 and 8 are functional (i.e., unambiguous) in both directions, the one in FIG. 7 is not sequential in either direction. When it is applied downward, to the string “aa”, there are two paths that have to be pursued initially, even though only one will succeed. The same is true in the other direction as well. In other words, there is local ambiguity at the start state because “a” may have to be deleted or retained. In this case, the ambiguity is resolved by the next input symbol one step later.
If the relation itself is unambiguous in the relevant direction and if all the ambiguities in the transducer resolve themselves within some fixed number of steps, the transducer is called sequentiable. That is, an equivalent sequential transducer in the same direction can be constructed. FIG. 9 shows the downward sequentialized version of the transducer in FIG. 7.
The sequentialization process combines the locally ambiguous paths into a single path that does not produce any output until the ambiguity has been resolved. In the case at hand, the ambiguous path contains just one arc. When a “b” is seen, the delayed “a” is produced as output and then the “b” itself in a one-sided epsilon transition. Otherwise, an “a” must follow, and in this case there is no delayed output. In effect, the local ambiguity is resolved with one symbol lookahead.
The network in FIG. 9 is sequential but only in the downward direction. Upward sequentialization produces the network shown in FIG. 8, which clearly is the best encoding for this simple relation.
Even if a transducer is functional, it may well be unsequentiable if the resolution of a local ambiguity requires an unbounded amount of lookahead. For example, the simple transducer illustrated in FIG. 10 cannot be sequentialized in either direction.
This transducer reduces any sequence of “a”s that is preceded by a “b” to an epsilon or copies it to the output unchanged depending on whether the sequence of “a”s is followed by a “c”. A sequential transducer would have to delay the decision until it reached the end of an arbitrarily long sequence of “a”s. It is clearly impossible for any finite-state device to accumulate an unbounded amount of delayed output.
However, in such cases it is always possible to split the functional but unsequentiable transducer into a bimachine, as will be described in further detail below. A bimachine for an unambiguous relation consists of two sequential transducers that are applied in a sequence. The first half of the bimachine processes the input from left-to-right; the second half of the bimachine processes the output of the first half from right-to-left. Although the application of a bimachine requires two passes, a bimachine is in general more efficient to apply than the original transducer because the two components of the bimachine are both sequential. There is no local ambiguity in either the left-to-right or the right-to-left half of the bimachine if the original transducer is unambiguous in the given direction of application. FIGS. 11 and 12 together show a bimachine derived from the transducer in FIG. 10.
The left-to-right half of the bimachine (FIG. 11) is only concerned with the left context of the replacement. A string of “a”s that is preceded by “b” is mapped to a string of “a1”s, where “a1” is an auxiliary symbol (or diacritic) indicating that the left context has been matched. The right-to-left half of the bimachine (FIG. 12) maps each instance of the auxiliary symbol “a1” either to “a” or to an epsilon depending on whether it is preceded by “c” when the intermediate output is processed from right-to-left.
The bimachine in FIGS. 11 and 12 encodes exactly the same relation as the transducer in FIG. 10. The composition of the left-to-right half (FIG. 11) of the bimachine with the reverse of the right-to-left half (FIG. 12) yields the original single transducer (FIG. 10).
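By way of illustration only, the two passes of such a bimachine can be sketched as two deterministic scans, each carrying a single bit of state. The sketch follows the prose description rather than the exact networks of FIGS. 11 and 12, assumes the alphabet {“a”, “b”, “c”}, and uses hypothetical function names.

```python
A1 = "a1"  # auxiliary symbol (diacritic) marking a matched left context

def left_to_right(symbols):
    """First half: rewrite each "a" in a run preceded by "b" as "a1"."""
    out = []
    after_b = False
    for sym in symbols:
        if sym == "a" and after_b:
            out.append(A1)          # stay in the matched-context state
        else:
            out.append(sym)
            after_b = (sym == "b")
    return out

def right_to_left(symbols):
    """Second half: scanning from the right, delete "a1" in a run that is
    reached directly after "c"; otherwise restore it to "a"."""
    out = []
    after_c = False
    for sym in reversed(symbols):
        if sym == A1:
            if not after_c:
                out.append("a")     # no "c" to the right: copy unchanged
            # with "c" to the right the symbol maps to epsilon
        else:
            out.append(sym)
            after_c = (sym == "c")
    out.reverse()
    return out

def apply_bimachine(text):
    """Apply both halves in sequence to a string over {a, b, c}."""
    return "".join(right_to_left(left_to_right(list(text))))
```

For example, apply_bimachine("baac") yields “bc”, whereas apply_bimachine("baab") and apply_bimachine("aac") return their inputs unchanged. Neither scan ever has to pursue two alternatives at once, which is the sense in which both halves are sequential.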
In accordance with the invention, there is provided a method, and apparatus therefor, for extracting “short” ambiguity from an arbitrary finite-state transducer (FST). Generally, the method factorizes an original FST into a first factor and a second factor. The first factor, T1, contains most of the original FST, and the second factor, T2, contains those parts of the ambiguity of the original FST that are one arc long, regardless of finite or infinite ambiguity.
In accordance with one aspect of the invention, a method extracts short runs of ambiguity from an input finite-state transducer (FST) having a plurality of states and arcs, an input side, and an output side. Initially, at least one set of arcs is identified in the input FST. Each set of arcs has a plurality of arcs that identify a single-arc ambiguity field with a common input symbol. A first factor is generated by assigning a diacritic to the output side of each arc within a set of arcs. A second factor is generated by having a single state and a set of ambiguous arcs. At least one of the ambiguous arcs in the set maps a diacritic to an output symbol.
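By way of illustration only, the factorization outlined above might be sketched as follows. The sketch rests on assumptions that are not stated in this summary: a set of ambiguous arcs is taken to be a group of arcs sharing source state, destination state, and input symbol but differing in output symbol; the diacritic names of the form “@0@” are invented; and the second factor is given identity arcs for ordinary symbols. The function names factorize and pairs are hypothetical, and pairs only handles acyclic networks.

```python
from collections import defaultdict

def factorize(arcs):
    """Split arcs (src, inp, out, dst) into a first factor T1 and the
    arc labels (middle, output) of a single-state second factor T2."""
    groups = defaultdict(set)
    for src, inp, out, dst in arcs:
        groups[(src, inp, dst)].add(out)
    t1, t2, diacritics = [], set(), {}
    for (src, inp, dst), outs in groups.items():
        if len(outs) > 1:                       # single-arc ambiguity
            key = frozenset(outs)
            if key not in diacritics:
                diacritics[key] = "@%d@" % len(diacritics)
            mark = diacritics[key]
            t1.append((src, inp, mark, dst))    # diacritic on output side
            t2 |= {(mark, out) for out in outs}
        else:
            (out,) = outs
            t1.append((src, inp, out, dst))
            if out != "":
                t2.add((out, out))              # identity arc in T2
    return t1, t2

def pairs(arcs, start, finals, t2=None):
    """Enumerate the string pairs of an acyclic network; if t2 is given,
    pass every output symbol through the single state of T2."""
    by_src = defaultdict(list)
    for arc in arcs:
        by_src[arc[0]].append(arc)
    found = set()

    def walk(state, inp, out):
        if state in finals:
            found.add((inp, out))
        for _, i, o, dst in by_src[state]:
            if t2 is None or o == "":
                choices = [o]
            else:
                choices = [low for mid, low in t2 if mid == o]
            for choice in choices:
                walk(dst, inp + i, out + choice)

    walk(start, "", "")
    return found

# Hypothetical FST in which "a" is ambiguously mapped to "x" or "y".
fst = [(0, "b", "b", 1), (1, "a", "x", 2), (1, "a", "y", 2), (2, "c", "c", 3)]
t1, t2 = factorize(fst)
```

In this example the first factor has three arcs and is unambiguous, the second factor carries the two arcs that map the diacritic to “x” and to “y”, and applying the two factors in sequence reproduces the relation of the original network.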