1. Field of the Invention
The present invention relates generally to finite-state language processing, and more particularly, to a method and apparatus for constructing finite-state networks. In one of many applications, this method and apparatus have proved useful in modeling natural languages that have non-concatenative processes.
2. Description of Related Art
Many basic steps in language processing, ranging from tokenization to phonological and morphological analysis, disambiguation, spelling correction, and shallow parsing can be performed efficiently by means of finite-state transducers. Such transducers are generally compiled from regular expressions. Regular expressions are a formal language (i.e., metalanguage) that can be used to model a natural language (e.g., French, English, etc.) or a relation. Although regular expressions and methods for compiling them into finite-state automata have been part of elementary computer science for decades, the application of finite-state transducers to natural language processing has given rise to many extensions to the classical regular expression calculus.
The term “formal language” or simply “language” is used herein to refer to sets of strings of any kind. The terms “string” and “word” are used interchangeably herein. A string is a concatenation of zero or more symbols. In the examples set forth below, the symbols are, in general, single alphabetic characters such as “a”, but user-defined multicharacter symbols such as “+Noun” are also possible. Multicharacter symbols are considered as atomic entities rather than as concatenations of single character strings. A string that contains no symbols at all is called the empty string and the language that contains the empty string but no other strings is known as the empty string language. A language that contains no strings at all, not even the empty string, is called the empty language or null language. The language that contains every possible string of any length is called the universal language.
The term “relation” is used herein to describe a set of ordered string pairs such as {<“a”, “bb”>, <“cd”, “ ”>}. The first member of a pair is called the upper string, and the second member is called the lower string. A string-to-string relation is a mapping between two languages: the upper language and the lower language. They correspond to what is usually called the domain and the range of a relation. In this example, the upper language is {“a”, “cd”} and the lower language is {“bb”, “ ”}. A relation such as {<“a”, “a”>} in which every pair contains the same string twice is called an identity relation. If a relation pairs every string with a string that has the same length, the relation is an equal length relation. Every identity relation is obviously an equal length relation.
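By way of illustration only, the definitions above can be sketched with ordinary Python sets. The sketch assumes that every symbol is a single character (so that string length equals symbol count) and that the second member of the pair containing “cd” is the empty string; the function names are hypothetical and are not part of any Xerox tool.

```python
# A relation modeled as a set of (upper, lower) string pairs.
# Assumption: the lower string paired with "cd" is the empty string.
relation = {("a", "bb"), ("cd", "")}

def upper_language(rel):
    """Set of all upper strings (the domain of the relation)."""
    return {upper for upper, _ in rel}

def lower_language(rel):
    """Set of all lower strings (the range of the relation)."""
    return {lower for _, lower in rel}

def is_identity(rel):
    """True if every pair contains the same string twice."""
    return all(upper == lower for upper, lower in rel)

def is_equal_length(rel):
    """True if every pair relates strings of the same length."""
    return all(len(upper) == len(lower) for upper, lower in rel)

print(upper_language(relation))
print(lower_language(relation))
print(is_identity(relation), is_equal_length(relation))
print(is_identity({("a", "a")}), is_equal_length({("a", "a")}))
```

Because an identity relation pairs each string with itself, any relation that passes the identity check necessarily passes the equal length check as well.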
Regular expressions that denote a language compile into a “simple finite-state automaton”, whereas regular expressions that denote a relation compile into a “finite-state transducer”. The term “finite-state network” (FSN) or “network” as used herein covers both simple finite-state automata and finite-state transducers. A simple finite-state automaton, for example, is an FSN that can be used for recognizing word forms. In contrast, a finite-state transducer is an FSN that can be used for the generation or analysis of word forms. Simple finite-state automata and transducers will not be treated as different types of mathematical objects herein and will be described generally as finite-state networks (FSNs).
More specifically, an FSN is a directed graph that consists of states and labeled arcs. A directed graph is a computer data structure that can be used for computation in fields such as computational linguistics. An FSN contains a single initial state, also called the start state, any number of final states, and any number of labeled arcs leading from state to state. In the figures presented herewith, states are represented as circles and arcs (i.e., transitions) are represented as arrows. Each state acts as the origin for zero or more arcs leading to some destination state. A sequence of arcs leading from the initial state to a final state is called a “path”. A sequence of arcs leading from one state to any other state is a “subpath”. The set of subpaths of a given path includes the path. In a simple finite-state automaton, each path represents a string (e.g., a word) and each subpath represents a substring. In a transducer, each path represents an ordered pair of strings (e.g., words) and each subpath represents an ordered pair of substrings.
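The graph structure just described may be sketched, under simplifying assumptions, as follows. The class name FSN, its fields, and the example network are hypothetical and serve only to illustrate states, labeled arcs, and paths; the path enumeration assumes an acyclic network, since a network with cycles has infinitely many paths.

```python
from dataclasses import dataclass, field

@dataclass
class FSN:
    start: int                                 # the single initial state
    finals: set                                # any number of final states
    arcs: dict = field(default_factory=dict)   # state -> [(label, destination)]

    def add_arc(self, src, label, dst):
        self.arcs.setdefault(src, []).append((label, dst))

    def paths(self):
        """Yield the string encoded by each path from the start state
        to a final state. Assumes the network contains no cycles."""
        stack = [(self.start, "")]
        while stack:
            state, prefix = stack.pop()
            if state in self.finals:
                yield prefix
            for label, dst in self.arcs.get(state, []):
                stack.append((dst, prefix + label))

# A simple automaton with two paths, encoding the words "try" and "dry".
net = FSN(start=0, finals={3})
net.add_arc(0, "t", 1)
net.add_arc(0, "d", 1)
net.add_arc(1, "r", 2)
net.add_arc(2, "y", 3)

print(sorted(net.paths()))
```

In this sketch the arcs 1-r-2-y-3 form a subpath shared by both paths, which is how a network stores common substrings only once.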
An FSN that encodes a simple finite-state automaton encodes transitions such that each transition has associated values on a single level, whereas an FSN that encodes a finite-state transducer encodes transitions such that each transition has associated values on more than one level. As a result, an FSN that encodes a finite-state transducer can respond to an input signal indicating a value on one of the levels by following a transition with a matching value on the level and by providing as output the transition's associated value at another level. A two-level transducer, for example, can be used to map between input and output strings.
At Xerox Corporation, a Xerox regular expression language that follows certain conventions has been defined for describing languages and relations. These conventions have been adopted herein to describe and illustrate the present invention. A feature of the Xerox convention is that a simple automaton and a transducer that encodes an identity relation are represented by the same FSN (i.e., an arc labeled with a single symbol is interpreted as that symbol when the FSN is treated as a simple automaton and as an identity symbol pair when it is treated as a transducer). Following the Xerox convention, an arc of an FSN may be labeled either by a single symbol such as “a” or by a symbol pair such as “a:b”, where “a” designates the symbol on the upper side of the arc and “b” the symbol on the lower side. If all the arcs of an FSN are labeled by a single symbol (e.g., “a”), the FSN is defined by the Xerox convention as a simple automaton. However, if at least one label of an arc in an FSN is a symbol pair, the FSN is defined by the Xerox convention as a transducer.
Also by Xerox convention, in the diagrams presented herein the start state of an FSN is always the leftmost state and final states are marked by a double circle. Further background relating to the use of finite-state networks in natural language processing at Xerox Corporation is disclosed in “Syntax and Semantics of Regular Expressions”; “Finite-State Networks”; “Application of Finite-State Networks”; and “Examples of Networks and Regular Expressions”, which are published on the Internet at http://www.xrce.xerox.com/research/mltt/fst/, and which are each incorporated herein by reference.
Unlike an FSN that is a simple finite-state automaton, an FSN that is a finite-state transducer is inherently bidirectional. Either side of a transducer can be used as the input side, with the other side being the output side. For example, a finite-state transducer can be used in mapping between different forms of words, such as between the surface forms (e.g., “try”, “tries”, “tried”) that occur in ordinary usage of a language and their related citation forms (e.g., “try”). It is conventional to augment a surface form's related citation form with other information about the surface form (“try+Inf”, “try+PresSg3”, “try+Past”) such that these forms can be read as analyses.
By arbitrary Xerox convention to be followed herein, surface forms (e.g., “try”, “tries”, “tried”) are encoded by the lower or surface side of a finite-state transducer; citation or analysis forms (e.g., “try+Inf”, “try+PresSg3”, “try+Past”) are encoded by the upper or lexical side of a finite-state transducer. A typical Xerox finite-state transducer therefore encodes a relation between a language of analysis strings, on the upper side, and a language of surface strings on the lower side.
It has long been known that an FSN can encode the mathematical entities referred to as languages and relations. An FSN is conventionally produced by an operation referred to as “compilation”. The compiler takes as input a description of the language or relation to be encoded. Simple languages and relations are commonly described using a metalanguage called regular expressions. A “regular expression” belongs to a formal language in which some elements are operands while others refer to operations that are performed on the operands. For more background on regular expressions refer to a publication by Hopcroft and Ullman entitled “Introduction to Automata Theory, Languages, and Computation”, Addison-Wesley, Reading Mass., 1979, and to a publication edited by J. van Leeuwen, entitled “Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics,” Elsevier Science Publishers B. V. 1990. (Note that the term “rational expression” as used by van Leeuwen is synonymous with the term regular expression as used herein.)
To “compile” a regular expression is to perform an operation that begins with the text of the regular expression and that produces an FSN that encodes the language or relation denoted by the regular expression. The FSN is a “compiled version” of the regular expression. A compiler that takes as input regular expressions and compiles an FSN is defined herein as a “regular expression compiler”. Other source notations (e.g., the Xerox language called lexc) are not technically regular expressions but have the same formal descriptive power and also compile into FSNs. Thus, regular expressions as referred to herein include other formalisms like lexc. Similarly, when regular expression compilers are referred to herein they include lexc compilers and any other compiler that compiles languages with the formal descriptive power of regular expressions.
A simple example of a regular expression using the Xerox regular-expression formalism is: [t r y], which denotes the single-word language {“try”} and is formed by an operation that concatenates the symbols t, r, and y. An FSN that encodes this language can thus be obtained from the regular expression by compilation. Another example of a regular expression is: [[t|d] r y], which denotes the language consisting of the two words {“try”, “dry”} and specifies that the language is formed by an operation that obtains the union of t and d, [t|d], and then concatenates it with [r y]. The union operation can thus be represented in regular expressions by the union operator “|”.
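As a minimal sketch of what these two expressions denote, concatenation and union can be computed directly on finite languages represented as Python sets. The helper names concat and union are hypothetical; a regular expression compiler would instead build an FSN rather than enumerate strings.

```python
def concat(*languages):
    """Concatenate languages: every string of the first followed by
    every string of the second, and so on."""
    result = {""}
    for lang in languages:
        result = {x + y for x in result for y in lang}
    return result

def union(*languages):
    """Union of languages: every string that is in any of them."""
    return set().union(*languages)

t, d, r, y = {"t"}, {"d"}, {"r"}, {"y"}

try_lang = concat(t, r, y)            # [t r y]
try_dry = concat(union(t, d), r, y)   # [[t|d] r y]

print(try_lang)
print(sorted(try_dry))
```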
The structure of words in a natural language like English, referred to as “morphotactics”, may often be described exhaustively in terms of concatenation and union. Most natural languages construct words by concatenating morphemes together one after another in strict orders. A word constructed in this way can typically be analyzed as a basic stem, possibly preceded by one or more prefixes and followed by one or more suffixes. The English word “nonconfrontationally”, for example, can be analyzed as the stem “confront” preceded by the prefix “non” and followed by the suffixes “ation”, “al”, and “ly”. Prefixes, stems and suffixes are morphemes. A morpheme is the minimal meaning-bearing component of a word.
Morphological alternations such as the y/ie alternation seen in the various forms of the word “try” above are also conventionally represented using the Xerox regular-expression formalism by regular expressions containing the replace operator “->”, which represents a replace operation, the context separator “||”, and the indicator “_”, which indicates the site of the replacement between two contexts. For example, the regular expression: y -> i e || Cons _ [s|d] .#. denotes a relation between pairs of strings that are the same except that instances of “y” in strings of the upper side are replaced by instances of “ie” in the related strings of the lower side, but only if “y” occurs after a consonant and before “s” or “d” at the end of the word. An FSN obtained by compiling this regular expression can be applied to forms such as “trys” and “tryd” to produce the correct inflected forms “tries” and “tried”.
The application of an FSN to a language in this manner is referred to as “composition”, conventionally represented in regular expressions of the type described herein by the composition operator “.o.”. For example, the relation between the hypothetical forms {“try”, “trys”, “tryd”} and the corresponding correct forms {“try”, “tries”, “tried”} can be denoted by the regular expression: [t r y [0|s|d]] .o. y -> i e || Cons _ [s|d] .#. with the number “0” representing the empty string, also referred to as epsilon.
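The effect of this composition can be approximated, for illustration only, with the standard Python re module standing in for a compiled FSN. The consonant set CONS is an assumption, since Cons is not defined in the text above, and the names RULE and apply_rule are hypothetical. Unlike a transducer, this sketch works in the downward direction only.

```python
import re

# Assumed consonant inventory; Cons is not defined in the source text.
CONS = "bcdfghjklmnpqrstvwxz"

# Stand-in for: y -> i e || Cons _ [s|d] .#.
# Lookbehind = left context, lookahead = right context up to end of word.
RULE = re.compile(rf"(?<=[{CONS}])y(?=[sd]\Z)")

def apply_rule(upper):
    """Map an upper-side string to its lower-side string."""
    return RULE.sub("ie", upper)

# Stand-in for: [t r y [0|s|d]]
upper_forms = {"try" + suffix for suffix in ("", "s", "d")}

# Stand-in for the composition: pair each upper form with its rule output.
relation = {(upper, apply_rule(upper)) for upper in upper_forms}

for upper, lower in sorted(relation):
    print(upper, "->", lower)
```

The form “try” is left unchanged because its “y” is not followed by “s” or “d”, so the right context of the rule is not satisfied.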
Known regular expression compilers can produce an FSN from such a regular expression. Such compilers must appropriately interpret regular expression operators which include: concatenation, union, replacement, and composition. FIG. 1 illustrates one conventional way to represent the resulting FSN in the form of a graph 10, with circles 20, 22, 24, 26, 28, and 30 representing states of the FSN and with arrows 40, 42, 44, 46, 48, 50, and 52 representing transitions from state to state.
In graph 10, each circle contains a number that identifies the state it represents, and each transition has a label that represents a constraint on the transition. Here finite-state networks are arbitrarily represented as Mealy machines, with labeled transitions, rather than as Moore machines, in which the labels are stored on states; because the two representations are equivalent and interchangeable, nothing substantial hinges on this choice. Circle 20, numbered 0, represents the start state of the FSN, while circle 26 is doubled to indicate a final state that can terminate an acceptable sequence of states and transitions. In graph 10 in FIG. 1, the labels represent constraints that include pairs of symbols, only one of which is shown if both are the same, as with transitions 40, 42, 44, 50, and 52.
Every path of a finite-state transducer represents a string or an ordered pair of strings. Each path 54, 56, and 58 shown in FIG. 2 therefore represents a pair of strings 60, 62, and 64, respectively, as shown in FIG. 3. For example, path 58 represents the pair of strings 64 “trys” and “tries”. In accordance with Xerox conventional techniques, an FSN represented by graph 10 can be applied “in a downward direction” to an input string while treating the upper-side symbol in each pair as a symbol that must be matched by an input symbol to make the transition and the lower-side symbol in each pair as an output symbol that is provided whenever the transition is made. Conversely, the same network can be applied “in an upward direction”, with the lower-side symbols matching input and the upper-side symbols being output. (It will be appreciated by those skilled in the art that many alternate paths and strings exist for the FSN 10 besides those shown in FIGS. 2 and 3.)
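Downward and upward application can be sketched as follows. Because FIG. 1 is not reproduced here, the arc layout in ARCS is an assumption chosen to be consistent with the six states, the seven transitions, and the string pairs described above; the function name apply is hypothetical. The sketch also assumes the network has no cycle of epsilon arcs on the input side.

```python
EPS = ""  # epsilon, written 0 in the regular expressions above

# (source, upper symbol, lower symbol, destination).
# Assumed layout; the actual arcs of FIG. 1 may differ.
ARCS = [
    (0, "t", "t", 1),
    (1, "r", "r", 2),
    (2, "y", "y", 3),
    (2, "y", "i", 4),
    (4, EPS, "e", 5),
    (5, "s", "s", 3),
    (5, "d", "d", 3),
]
START, FINALS = 0, {3}

def apply(word, direction="down"):
    """Return every output string that the network pairs with `word`.
    "down" matches upper symbols and emits lower symbols; "up" does
    the reverse."""
    results = set()
    stack = [(START, 0, "")]  # (state, input position, output so far)
    while stack:
        state, pos, out = stack.pop()
        if state in FINALS and pos == len(word):
            results.add(out)
        for src, upper, lower, dst in ARCS:
            if src != state:
                continue
            inp, outp = (upper, lower) if direction == "down" else (lower, upper)
            if inp == EPS:
                # Epsilon on the input side consumes no input symbol.
                stack.append((dst, pos, out + outp))
            elif word.startswith(inp, pos):
                stack.append((dst, pos + len(inp), out + outp))
    return results

print(apply("trys", "down"))
print(apply("tries", "up"))
```

The same arc table serves both directions; only the roles of the upper and lower symbols are exchanged, which is the sense in which a transducer is bidirectional.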
In addition to the industry standard concept of path through an FSN, which extends from the start state to a final state and encodes a string or ordered pair of strings, the notion of “subpath” is added herein. A subpath in an FSN extends from one state, not necessarily a start state, to another state, not necessarily a final state, via a sequence of arc transitions. Thus, FSN 10 in FIG. 1 also includes the subpaths 0-t-1-r-2, 0-t-1, 1-r-2-y-3, etc. The set of subpaths includes the set of paths (e.g., 0-t-1-r-2-y-3 is both a path and a subpath), but not all subpaths are paths. Similarly, a subpath encodes a substring or pair of substrings. All strings are substrings but not all substrings are strings.
A “delimited subpath” refers herein to a subpath that encodes a substring, wherein the first symbol of the substring is preceded in an FSN by a predefined starting delimiter, and the last symbol is followed by a predefined ending delimiter. In one embodiment, the predefined starting delimiter and the predefined ending delimiter are arbitrarily selected as “^[” and “^]”, respectively. A “delimited substring” is a string of symbols on a subpath bounded by the predefined starting delimiter and the predefined ending delimiter. When the FSN is a transducer, a subpath may be a delimited subpath on the upper side, the lower side, or simultaneously on both sides.
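The notion of a delimited substring can be illustrated on one side of a single path, represented here as a list of symbols in which “^[” and “^]” are multicharacter symbols. The function name and the sample symbol sequence are hypothetical, and the sketch assumes that delimiters are not nested.

```python
START_DELIM, END_DELIM = "^[", "^]"

def delimited_substrings(symbols):
    """Return the substrings bounded by the starting and ending
    delimiters in a sequence of symbols. Assumes no nesting."""
    found, current = [], None
    for sym in symbols:
        if sym == START_DELIM:
            current = []                     # open a delimited region
        elif sym == END_DELIM and current is not None:
            found.append("".join(current))   # close the region
            current = None
        elif current is not None:
            current.append(sym)              # symbol inside the region
    return found

# Hypothetical symbol sequence read off one side of a path.
path_symbols = ["x", "^[", "a", "b", "^]", "y"]
print(delimited_substrings(path_symbols))
```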
Although most natural languages construct words by concatenating morphemes together one after another in strict orders, many natural languages exhibit morphotactic processes that cannot be straightforwardly modeled by concatenation. Such processes are called “nonconcatenative morphotactic processes” or simply “nonconcatenative processes”. In Arabic, for example, stems are formed by a process known as “interdigitation”, while in Malay, plurals are formed by a process known as “full stem reduplication”. Although both Arabic and Malay also include prefixation and suffixation that can be modeled by concatenation in the usual way, a complete lexicon cannot be obtained without nonconcatenative processes.
More specifically, interdigitation and other processes that result in discontinuous morphemes cannot be modeled solely by concatenation of constituent morphemes. An example of interdigitation occurs with the Arabic stem “katab”, which means “wrote”. As analyzed by McCarthy, J. J., “A prosodic theory of nonconcatenative morphology”, Linguistic Inquiry, Vol. 12, No. 3, 1981, pp. 373–418, this stem consists of an all-consonant root “ktb” whose general meaning has to do with writing, an abstract consonant-vowel template CVCVC, and a voweling or vocalization symbolized simply as “a”, signifying perfect aspect and active voice. The root consonants are associated with the C slots of the template and the vowel or vowels with the V slots, producing a complete stem “katab”. If the root and the vocalization are thought of as morphemes, neither morpheme occurs continuously in the stem. The same root “ktb” can combine with the template CVCVC and a different vocalization “ui”, signifying perfect aspect and passive voice, producing the stem “kutib”, which means “was written”. Similarly, the root “ktb” can combine with CVVCVC and “ui” to produce “kuutib”, the root “drs” can combine with CVCVC and “ui” to form “duris”, and so forth.
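The association of root consonants and vowels with template slots can be sketched as follows. This is an illustrative assumption that reproduces only the four examples given above; it is neither the method of the present invention nor a complete account of Arabic morphology. The sketch assumes that each maximal run of V slots receives one vowel of the vocalization, that the last vowel is reused when the vocalization has fewer vowels than the template has V runs, and that the root supplies enough consonants. The function name interdigitate is hypothetical.

```python
from itertools import groupby

def interdigitate(root, template, vocalization):
    """Fill the C slots of the template with root consonants and the
    V slots with vowels of the vocalization."""
    consonants = iter(root)
    vowels = list(vocalization)
    stem = []
    v_run_index = 0
    for slot, run in groupby(template):
        length = len(list(run))
        if slot == "C":
            # One root consonant per C slot, in order.
            stem.extend([next(consonants) for _ in range(length)])
        else:
            # One vowel per run of V slots; the last vowel spreads.
            vowel = vowels[min(v_run_index, len(vowels) - 1)]
            stem.append(vowel * length)
            v_run_index += 1
    return "".join(stem)

print(interdigitate("ktb", "CVCVC", "a"))
print(interdigitate("ktb", "CVCVC", "ui"))
print(interdigitate("ktb", "CVVCVC", "ui"))
print(interdigitate("drs", "CVCVC", "ui"))
```

In each result the symbols of the root and of the vocalization are interleaved, so that neither morpheme appears as a contiguous substring of the stem.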
Like interdigitation, full stem reduplication cannot be modeled solely by concatenation of sublexicons. An example of full stem reduplication occurs with the Malay stem “bagi”, which means “bag” or “suitcase”. Its plural is “bagibagi”, formed by repeating the stem twice in a row. Although this pluralization process may appear concatenative, it does not involve concatenating a predictable pluralizing morpheme, but rather copying the preceding stem, whatever it may be and however long it may be.
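The difference between copying a stem and concatenating two sublexicons can be seen in a small sketch. The second stem “buku” (“book”) is added here purely for illustration and does not appear in the text above; the variable names are hypothetical.

```python
# "buku" is an added illustrative stem, not taken from the text above.
stems = {"bagi", "buku"}

# Concatenating the stem lexicon with itself pairs every stem with
# every stem, including mismatched ones.
concatenated = {a + b for a in stems for b in stems}

# Reduplication requires the second half to be a copy of the first.
reduplicated = {s + s for s in stems}

print(sorted(concatenated))
print(sorted(reduplicated))
print(sorted(concatenated - reduplicated))  # overgenerated forms
```

Plain concatenation overgenerates because it cannot require that the second stem equal the first; the reduplicated forms are only a proper subset of the concatenated ones.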