Many applications within the field of text processing involve replacing a string in one language with one or more elements from a second language. Typically, the language of the input string is denoted "UPPER," while the language of the output string is denoted "LOWER." (Note that UPPER and LOWER are not necessarily two different natural languages.) For example, one application reduces the amount of memory required to store a text by inserting tags within the text so that all the text appearing between the tags appears or disappears without changing the text in any other way. Generally, finite state transducers are used to replace a string in UPPER with a string in LOWER. Finite state transducers replace a string of a regular language with regular expressions. As used herein, a language is regular if it can be parsed by a finite state machine into a string of regular expressions. For a more technical definition of "regular language" see J. Hopcroft & J. Ullman, Introduction to Automata Theory, Languages, and Computation, 1979.
Unfortunately, finite state transducers compiled from simple replace expressions are generally nondeterministic. The illustration of FIG. 1 aids the explanation of why nondeterministic finite state transducers pose a problem. Discussion of FIG. 1 is, in turn, aided by a brief review of regular expression formalisms and notational conventions.
The formalisms and notational conventions used herein are essentially those described in R. Kaplan & M. Kay, Regular Models of Phonological Rule Systems, Computational Linguistics, 20:3, pp. 331-378 (1994) and L. Karttunen, The Replace Operator, Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, ACL-94, pp. 16-23. Upper-case strings, like UPPER, represent regular languages, while lower-case strings, like x and ab, represent symbols. Two types of symbols are recognized: unary symbols and symbol pairs.
Unary symbols are represented as a, b, c, etc., while symbol pairs are represented, for example, as a:x and b:0. A symbol pair of the form a:x may be thought of as the cross product of the upper symbol, a, and the lower symbol, x. To make this notation less cumbersome, the distinction between a language A and the identity relation that maps every string of A into itself is systematically ignored. Thus, a:a is also written as just a.
Regular expressions may use a number of special symbols: 0, ?, and %. Zero, 0, represents the empty string, which is also often denoted by ε. A question mark, ?, stands for any symbol in the known alphabet and its known extensions. The percent sign, %, functions as an escape character, which allows letters that have a special meaning in the calculus to be used as ordinary symbols. Thus, the square bracket, [, which has special meaning as a grouping symbol, becomes simply a square bracket using the notation %[. Similarly, 0, the empty string symbol, becomes merely zero given the notation %0.
Two simple expressions are used frequently. The first is [], which denotes the empty string language. The second is ?*, which denotes the universal language, also called sigma star. A number of regular expression operators are used herein. The Kleene star, *, denotes zero or more repetitions, and the Kleene plus, +, denotes one or more. The complement operation, also called not, is represented by ~. The contains operation is represented by $, while the ignore operation is represented by /. Union, also called or, is denoted by |. Intersection, also called and, is denoted by &. The relative complement operation is represented by the minus symbol, -. The crossproduct operation is denoted by .x., the composition operation by .o., and the simple replace operation by →.
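By way of illustration only, several of the operators just listed can be sketched over small finite languages. The sketch assumes that a language is modeled as a Python set of strings and a relation as a set of (upper, lower) pairs; the helper names crossproduct, compose, and identity are hypothetical and are not part of the calculus. Operators such as contains and complement, which generally yield infinite languages, are not modeled.

```python
from itertools import product

def crossproduct(a, b):
    """A .x. B: relate every string of A to every string of B."""
    return set(product(a, b))

def compose(r, s):
    """R .o. S: chain pairs whose intermediate strings agree."""
    return {(u, w) for (u, v1) in r for (v2, w) in s if v1 == v2}

def identity(a):
    """The identity relation on A, which the notation writes simply as A."""
    return {(w, w) for w in a}

# Two small finite languages (illustrative values only).
A = {"ab", "b", "ba", "aba"}
B = {"b", "c"}

union = A | B            # A | B in the calculus
intersection = A & B     # A & B
relative_comp = A - B    # A - B

# Every string of A is mapped to the single lower string x.
to_x = crossproduct(A, {"x"})

# Restrict the relation to upper strings in {"ab", "zz"} by composing
# with an identity relation on that language.
restricted = compose(identity({"ab", "zz"}), to_x)

print(sorted(union))
print(sorted(intersection))
print(sorted(relative_comp))
print(sorted(restricted))
```

A finite state implementation would operate on networks rather than on enumerated sets; the set model is used here only because it makes the meaning of each operator directly visible.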
Given this explanation of notation, consider now the transducer diagram of FIG. 1. Finite state transducer 30 includes three states 32, 34, and 36 and several transitions 40, 42, 44, 46, 48, 50 and 52. States are represented by circles: nonfinal states, like state 36, by a single circle, and final states, such as states 32 and 34, by a circle within a circle. Initial transducer states, like state 32, are indicated by the number 0 within the circle. For this reason the initial state is often called the 0 state. Each transition between states is labeled with symbols, with the ? symbol used to indicate symbols that are not explicitly present in the network. Transitions that differ only with respect to label are collapsed into a single, multiply labeled arc, such as transition 50, for example.
Finite state transducer 30 represents a simple replace of the union of ab, b, ba, and aba with x; i.e., ab | b | ba | aba → x. Applying the input string aba and analyzing the possible output strings illustrates the nondeterministic behavior of finite state transducer 30. As shown by FIG. 2, application of the input string aba can produce four different output strings, axa, ax, xa, and x, because there are four paths in transducer 30 that contain aba on the upper side of the transitions with different strings on the lower side of the transitions. Stated another way, transducer 30 produces four alternate ways to partition the upper input string aba. The output string axa results from starting at state 32, taking transition 40 back to state 32, taking transition 42 to state 34, and taking transition 50 back to state 32. This route through transducer 30 can be notated as <0 a 0 b:x 2 a 0>, where in general numbers indicate states and symbols indicate labels on transitions, with the exception that 0 is used to indicate both a state and parts of transition labels. Similarly, the output string ax results from <0 a 0 b:x 2 a:0 0>, <0 a:x 1 b:0 2 a 0> yields the output string xa, and <0 a:x 1 b:0 2 a:0 0> gives rise to the output string x. Thus, transducer 30 yields multiple results even though the lower language consists of a single string. This is called nondeterminism.
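The four routes just described can be reproduced with a short sketch. The arc table below is an assumption: it contains only the six arcs that appear in the four path notations above and does not reproduce the remaining arcs of FIG. 1 (for example, those labeled with ?), nor does it map arcs to the reference numerals of the figure. The names ARCS, FINAL, and apply_down are hypothetical, and the empty Python string stands in for the 0 symbol.

```python
# Each arc is (source state, upper symbol, lower string, target state).
# Only arcs named in the four paths are included (partial reconstruction).
ARCS = [
    (0, "a", "a", 0),   # a     : identity pair, loop on state 0
    (0, "b", "x", 2),   # b:x
    (0, "a", "x", 1),   # a:x
    (1, "b", "", 2),    # b:0   : lower side is the empty string
    (2, "a", "a", 0),   # a     : identity pair
    (2, "a", "", 0),    # a:0
]
FINAL = {0, 2}          # states drawn as a circle within a circle

def apply_down(text, arcs=ARCS, final=FINAL, start=0):
    """Return the lower string of every accepting path whose upper side is text."""
    outputs = []

    def walk(state, pos, out):
        if pos == len(text):
            if state in final:
                outputs.append(out)
            return
        # Try every arc leaving this state that matches the next upper symbol.
        for src, upper, lower, dst in arcs:
            if src == state and upper == text[pos]:
                walk(dst, pos + 1, out + lower)

    walk(start, 0, "")
    return outputs

print(apply_down("aba"))  # prints ['axa', 'ax', 'xa', 'x']
```

Because two arcs leaving state 0 share the upper symbol a, and two arcs leaving state 2 do likewise, the walk branches twice, which accounts for the four results.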
Nondeterminism is frequently associated with transducers compiled from simple replace expressions, like UPPER → LOWER. Nondeterminism arises in two different ways, as discussed in L. Karttunen, Constructing Lexical Transducers, Proceedings of the Fifteenth International Conference on Computational Linguistics, Coling 94, I, pp. 406-411 (1994); and in A. Kempe & L. Karttunen, Parallel Replacement in the Finite-State Calculus, Proceedings of the Sixteenth International Conference on Computational Linguistics, Coling 96 (1996). One way nondeterminism arises is from allowing a replacement to begin at any point within the input string. Thus, different replacement strings result for the input string aba depending on whether replacement begins at the beginning of the string or with b. Nondeterminism also arises because there may be multiple, alternate replacements given the same starting point. For example, given input string aba and choosing to begin at the beginning of the string, either ab or aba may be replaced via transition 40. Thus, nondeterministic transducers yield multiple results even if the lower language consists of a single string.
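The two sources of nondeterminism can be made concrete with a sketch that works directly on strings rather than on a network. It assumes, as an interpretation not spelled out above, that replacement is obligatory in the sense that no stretch of input left unreplaced may itself contain a member of UPPER. The function name factorizations and the representation of a choice as a (start position, replaced string) pair are hypothetical.

```python
UPPER = {"ab", "b", "ba", "aba"}   # the upper language of the example
LOWER = "x"                        # the single lower string

def factorizations(text, upper=UPPER, lower=LOWER):
    """Map each output string to the (start, replaced) choices that produce it.

    Assumes every string in upper is nonempty.
    """
    results = {}

    def walk(pos, out, choices):
        # Extend the unreplaced gap one symbol at a time.
        for gap_end in range(pos, len(text) + 1):
            gap = text[pos:gap_end]
            if any(u in gap for u in upper):
                break              # longer gaps would also contain a member of UPPER
            if gap_end == len(text):
                results[out + gap] = choices
            # Source 1: a replacement may start at any gap_end.
            # Source 2: several members of UPPER may match at that start.
            for u in sorted(upper):
                if text.startswith(u, gap_end):
                    walk(gap_end + len(u), out + gap + lower,
                         choices + ((gap_end, u),))

    walk(0, "", ())
    return results

for output, choices in sorted(factorizations("aba").items()):
    print(output, choices)
```

For the input aba, starting at position 0 permits replacing either ab or aba, while starting at position 1 permits replacing either b or ba, giving the four outputs discussed in connection with FIG. 2.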