1. Field of Invention
This invention is directed to a compiler system and a method for compiling context-dependent rewrite rules with input strings to obtain rewritten output strings. More particularly, this invention is directed to weighted rewrite rules. Specifically, this invention relates to context-dependent weighted rewrite rules for natural language and speech processing.
2. Description of Related Art
Rewrite rules correspond to converting one regular expression into another regular expression. Context-dependent rewrite rules limit the application of the rewrite portion of the rule based on a context portion that defines the regular expressions which precede and/or follow the regular expression to be rewritten. Thus, context-dependent rewrite rules are written as: EQU .phi..fwdarw..psi./.lambda..sub.-- .rho.
where:
.phi. is the regular expression to be replaced,
.psi. is the regular expression to be inserted in place of .phi.,
.lambda. (for "left") is the regular expression which appears to the left of .phi., and
.rho. (for "right") is the regular expression which appears to the right of .phi..
For the left and right marker symbols used by the Prologue finite-state transducer of Kaplan's method, which is described below:
<.sub.a and >.sub.a define the start and end of a rule application, provided they appear in front of the proper .phi., and between the proper .phi. and .rho., respectively;
<.sub.i and >.sub.i define the identity portions of the strings, the regions between the changes by a replacement relation; and
<.sub.c and >.sub.c identify strings which come in the middle or center in another rule application that starts to the left of the <.sub.c and extends to the right of the >.sub.c.
The Prologue.sup.-1 finite-state transducer deletes any of the < or > markers which remain after application of the other finite-state transducers.
It should be appreciated that either .lambda. or .rho. can be omitted. When either .lambda. or .rho. is omitted, the rewrite rule depends only on the remaining context, and is independent of the omitted context.
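The rule form defined above can be illustrated with a minimal sketch. It assumes that .phi., .psi., .lambda. and .rho. are plain literal strings rather than general regular expressions, that .phi. is non-empty, and that contexts are checked against the input string. The function name `apply_rule` and its parameters are hypothetical and do not come from the described system.

```python
import re


def apply_rule(text, phi, psi, lam="", rho=""):
    """Rewrite each occurrence of phi that has lam on its left and rho on its right.

    Assumption: all four arguments are literal strings, and phi is non-empty.
    An omitted (empty) context places no restriction on that side.
    """
    pattern = ""
    if lam:
        # Lookbehind: lam must precede phi but is not itself rewritten.
        pattern += "(?<=" + re.escape(lam) + ")"
    pattern += re.escape(phi)
    if rho:
        # Lookahead: rho must follow phi but is not itself rewritten.
        pattern += "(?=" + re.escape(rho) + ")"
    # A function is used as the replacement so psi is inserted verbatim.
    return re.sub(pattern, lambda match: psi, text)


print(apply_rule("cab", "a", "b", lam="c", rho="b"))  # cbb
print(apply_rule("cad", "a", "b", lam="c"))           # cbd (right context omitted)
print(apply_rule("dab", "a", "b", lam="c", rho="b"))  # dab (left context absent)
```

Because the contexts are expressed as lookarounds, they restrict where .phi. is rewritten without being consumed or replaced themselves.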
Context-dependent rules have additional dependencies beyond those defining the right and left adjacent regular expressions. These include application dependencies and traversal dependencies. The application dependencies include whether the rewrite rule is an obligatory rule or an optional rule. An obligatory rewrite rule is one that must be applied to rewrite .phi. as .psi. in the string of regular expressions whenever the left and right contexts .lambda. and .rho. are found. In contrast, in an optional rewrite rule, application of the rewrite rule to the input string is optional.
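The difference between obligatory and optional application can be sketched as follows. This is only an illustration under stated assumptions: the rule parts are literal strings, .phi. is non-empty, and matching sites do not overlap. The helper names `rule_sites`, `obligatory_output` and `optional_outputs` are hypothetical.

```python
import re
from itertools import product


def rule_sites(text, phi, lam="", rho=""):
    """Return (start, end) spans of phi occurrences with the required contexts."""
    pattern = ""
    if lam:
        pattern += "(?<=" + re.escape(lam) + ")"
    pattern += re.escape(phi)
    if rho:
        pattern += "(?=" + re.escape(rho) + ")"
    return [m.span() for m in re.finditer(pattern, text)]


def rewrite_at(text, sites, choices, psi):
    """Rewrite the chosen sites with psi and copy everything else unchanged."""
    pieces, pos = [], 0
    for (start, end), rewrite in zip(sites, choices):
        pieces.append(text[pos:start])
        pieces.append(psi if rewrite else text[start:end])
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces)


def obligatory_output(text, phi, psi, lam="", rho=""):
    """Obligatory rule: every matching site must be rewritten."""
    sites = rule_sites(text, phi, lam, rho)
    return rewrite_at(text, sites, [True] * len(sites), psi)


def optional_outputs(text, phi, psi, lam="", rho=""):
    """Optional rule: each matching site may or may not be rewritten."""
    sites = rule_sites(text, phi, lam, rho)
    return {
        rewrite_at(text, sites, choices, psi)
        for choices in product([False, True], repeat=len(sites))
    }


print(obligatory_output("cab", "a", "b", "c", "b"))         # cbb
print(sorted(optional_outputs("cab", "a", "b", "c", "b")))  # ['cab', 'cbb']
```

An obligatory rule therefore maps an input to a single output in this sketch, while an optional rule maps it to a set of outputs, one for each way of choosing which sites to rewrite.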
The traversal dependencies include left-right, right-left, and simultaneous. In a left-right context-dependent rewrite rule, the left context .lambda. precedes .phi., the regular expression to be replaced, and the right context .rho. follows it. Furthermore, as soon as the proper contexts .lambda. and .rho. are identified for the regular expression to be rewritten, .phi. can be immediately replaced with .psi.. In this case, if a second context-dependent rewrite rule uses .phi. as part or all of its left context .lambda.', then, in a left-right traversal, because .phi. is rewritten to .psi. by the first rewrite rule, the second rewrite rule would not be applied, as the context .lambda.' of that second rule, which includes .phi., no longer exists.
In contrast, in a right-left traversal context-dependent rewrite rule, the right context .rho. precedes the regular expression to be rewritten, .phi., while the left context .lambda. follows it. The right-left traversal context-dependent rewrite rule is nonetheless similar to the left-right context-dependent rewrite rule, in that, if .phi. forms part or all of the context of another rewrite rule to be applied to the input string of regular expressions, and is rewritten to .psi., then the preceding context of that second rewrite rule will not be found.
In contrast to both left-right and right-left context-dependent rewrite rules, simultaneous context-dependent rewrite rules have no "direction" of application. That is, they are applied to all ordered subsets of the input string of regular expressions at the same time. Thus, in the examples used above, if two rewrite rules are to be applied simultaneously to an input string, where one of the rewrite rules rewrites the string ".lambda..phi..rho." as ".lambda..psi..rho.", and the other rewrite rule rewrites the string ".phi..rho..rho.'" as ".phi..psi.'.rho.'", both of these simultaneous rewrite rules will be applied to the input string ".lambda..phi..rho..rho.'" to rewrite it to the output string ".lambda..psi..psi.'.rho.'".
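The contrast between left-right and simultaneous traversal can be made concrete with a small sketch. The single-letter symbols are assumptions chosen for illustration: .lambda. is "c", .phi. is "a", .rho. is "b", .rho.' is "d", .psi. is "x" and .psi.' is "y". Rules are hypothetical tuples of literal strings (phi, psi, lam, rho) with non-empty phi. In the left-right function, the left context is checked against the already rewritten output, which follows the behavior described above and is otherwise an assumption of this sketch.

```python
def apply_simultaneously(text, rules):
    """Check every rule's contexts against the original input, then rewrite."""
    edits = []
    for phi, psi, lam, rho in rules:
        start = text.find(phi)
        while start != -1:
            end = start + len(phi)
            # Both contexts are tested on the unmodified input string.
            if text.endswith(lam, 0, start) and text.startswith(rho, end):
                edits.append((start, end, psi))
            start = text.find(phi, start + 1)
    pieces, pos = [], 0
    for start, end, psi in sorted(edits):
        if start < pos:
            continue  # skip a site that overlaps one already rewritten
        pieces.extend([text[pos:start], psi])
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces)


def apply_left_to_right(text, rules):
    """Scan left to right; the left context must still exist in the output so far."""
    out, i = "", 0
    while i < len(text):
        for phi, psi, lam, rho in rules:
            if (text.startswith(phi, i) and out.endswith(lam)
                    and text.startswith(rho, i + len(phi))):
                out += psi  # phi is replaced immediately
                i += len(phi)
                break
        else:
            out += text[i]  # no rule applies here; copy the symbol
            i += 1
    return out


rules = [
    ("a", "x", "c", "b"),  # first rule: a -> x / c _ b
    ("b", "y", "a", "d"),  # second rule: b -> y / a _ d (left context is the first rule's phi)
]
print(apply_simultaneously("cabd", rules))  # cxyd
print(apply_left_to_right("cabd", rules))   # cxbd
```

In the left-right case the second rule does not fire, because its left context "a" has already been rewritten to "x" by the first rule.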
Context-dependent rewrite rules are often used in natural language and speech processing areas, including morphology, phonology, syntax, text-to-speech processing, and speech recognition. While context-dependent rewrite rules have been most commonly encountered in natural language and speech processing, context-dependent rewrite rules are not limited to natural language and speech processing.
Context-dependent rewrite rules can be represented by finite-state transducers under the condition that no such rule rewrites its non-contextual part. In other words, when an occurrence of the regular expression .phi. is found in a string and replaced with .psi., .psi. can be used as a left or right context for further replacements, but it cannot be used as part of an occurrence of .phi. to be replaced.
This condition is equivalent to the one described in Formal Aspects of Phonological Description, by C. Douglas Johnson (1972), herein incorporated by reference, that no rule be allowed to apply any more than a finite number of times to its own output: both conditions limit the representational power of context-dependent rewrite rules exactly to the set of all finite-state transducers.
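The need for this condition can be illustrated with a hypothetical rule that is not taken from this description: a rule that inserts "ab" between an "a" and an immediately following "b". If such a rule were allowed to reapply to its own output without bound, it would build strings of n copies of "a" followed by n copies of "b" for every n, and the set of all such strings is a standard example of a language that no finite-state machine accepts. A minimal sketch, with the function name `insert_ab` assumed for illustration:

```python
def insert_ab(text):
    """One application of the hypothetical rule: insert "ab" between "a" and "b"."""
    # Replacing "ab" with "aabb" is the same as inserting "ab" in the middle.
    return text.replace("ab", "aabb")


s = "ab"
for step in range(3):
    s = insert_ab(s)  # the rule is reapplied to its own output
    print(s)
# aabb
# aaabbb
# aaaabbbb
```

Bounding the number of times a rule may apply to its own output rules out this kind of unbounded nesting.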
Finite-state transducers allow convenient algebraic operations such as union, composition, and projection. Due to their increased computational power, finite-state transducers enable very large machines to be built. These machines can be used to model interestingly complex linguistic phenomena.
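The three operations named above can be sketched on the relations that transducers define. As a simplifying assumption, each relation here is a finite Python set of (input, output) string pairs, whereas an actual finite-state transducer can represent an infinite relation. The function names and the sample pairs are hypothetical.

```python
def union(r1, r2):
    """Pairs accepted by either relation."""
    return r1 | r2


def compose(r1, r2):
    """Feed each output of r1 into r2: (x, z) whenever (x, y) in r1 and (y, z) in r2."""
    return {(x, z) for (x, y) in r1 for (y2, z) in r2 if y == y2}


def project_input(r):
    """Input side of the relation."""
    return {x for x, _ in r}


def project_output(r):
    """Output side of the relation."""
    return {y for _, y in r}


first = {("cab", "cbb"), ("dab", "dab")}   # e.g. the effect of one rule
second = {("cbb", "cbd"), ("dab", "dab")}  # e.g. the effect of a later rule

print(sorted(compose(first, second)))      # [('cab', 'cbd'), ('dab', 'dab')]
print(sorted(project_output(first)))       # ['cbb', 'dab']
print(sorted(union(first, second)))
```

Composition is the operation that lets an ordered sequence of rules be collapsed into a single relation.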
The use of context-dependent rewrite rules to represent linguistic phenomena is also described by Johnson. As disclosed in "Regular Models of Phonological Rule Systems", Ronald M. Kaplan et al., Computational Linguistics, 20:331-378 (1994), herein incorporated by reference, each such rule can be modeled by a finite-state transducer. Furthermore, since regular relations are closed under serial composition, a finite set of rules applying to each other's output in an ordered sequence also defines a regular relation. Therefore, a single finite-state transducer, whose behavior simulates the whole set, can be constructed by composing the individual finite-state transducers corresponding to the individual rules. Kaplan then describes a method for converting a context-dependent rewrite rule into a finite-state transducer. In particular, Kaplan's method uses six finite-state transducers for each rule. For each rule, these six finite-state transducers are composed together to create the final finite-state transducer for that rule. For a system of rules, a union of the finite-state transducer for each rule defines the finite-state transducer for the set of rules. As shown in FIG. 1, Kaplan's system includes, for an obligatory left-right context-dependent rewrite rule, a Prologue finite-state transducer, an Id(Obligatory) finite-state transducer, an Id(Rightcontext) finite-state transducer, a Replace finite-state transducer, an Id(Leftcontext) finite-state transducer and a Prologue.sup.-1 finite-state transducer. The Prologue finite-state transducer adds three distinct left marker symbols and three distinct right marker symbols: EQU <.sub.a, <.sub.i, <.sub.c, and >.sub.a, >.sub.i, >.sub.c,
whose roles are set out with the symbol definitions above.
FIG. 2A shows the basic finite-state transducer for the obligatory left-right rewrite rule "a.fwdarw.b/c.sub.-- b." FIG. 2A also indicates how each transition corresponds to the .phi., .psi., .lambda. and .rho. regular expressions. In general, the accepted notation for transitions of a finite-state transducer is of the form .alpha.:.beta., where .alpha. is the input string and .beta. is the output string. To simplify the figures described herein, those transitions that output their input string, i.e., transitions that are normally labeled ".alpha.:.alpha.", will be labeled only with the input string, i.e., only as ".alpha.". In addition, those transitions that accept and output one of a number of input strings, and thus would normally be labeled ".alpha..sub.1 :.alpha..sub.1 ; .alpha..sub.2 :.alpha..sub.2 ; . . . ", will be labeled ".alpha..sub.1, .alpha..sub.2, . . . ". Furthermore, it should be appreciated that "a", "b" and "c" can be single symbols of an alphabet .SIGMA. or strings of single symbols. It should also be appreciated that "d" represents all symbols and/or strings of symbols of .SIGMA. other than "a", "b" and "c".
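The input:output transition notation, and the shorthand for transitions that output their input, can be sketched with a small nondeterministic transducer for the obligatory rule that rewrites "a" as "b" between "c" and "b". This is a hand-built illustration and is not a reproduction of FIG. 2A: the state numbering, the helper names `identity` and `run`, and the restriction to the single symbols "a", "b", "c" and "d" are all assumptions.

```python
def identity(src, symbols, dst):
    """Shorthand from the figures: a label "x" stands for the transition x:x."""
    return [(src, sym, sym, dst) for sym in symbols]


# Each transition is (source state, input symbol, output symbol, target state).
TRANSITIONS = (
    identity(0, "abd", 0) + identity(0, "c", 1)      # state 0: previous input symbol is not "c"
    + identity(1, "c", 1) + identity(1, "bd", 0)     # state 1: previous input symbol is "c"
    + [(1, "a", "b", 2)]                             # a:b  rewrite, guessing that "b" follows
    + [(1, "a", "a", 3)]                             # a:a  keep, guessing that "b" does not follow
    + identity(2, "b", 0)                            # state 2: the guess "b follows" must hold
    + identity(3, "ad", 0) + identity(3, "c", 1)     # state 3: anything but "b" may follow
)
FINAL_STATES = {0, 1, 3}


def run(text, transitions=TRANSITIONS, finals=FINAL_STATES, start=0):
    """Return every output the transducer can produce for the input text."""
    configs = {(start, "")}
    for symbol in text:
        configs = {
            (dst, out + o)
            for (state, out) in configs
            for (src, i, o, dst) in transitions
            if src == state and i == symbol
        }
    return {out for (state, out) in configs if state in finals}


print(run("cab"))  # {'cbb'}
print(run("cad"))  # {'cad'}
print(run("dab"))  # {'dab'}
```

The two transitions leaving state 1 on input "a" make the machine nondeterministic; a wrong guess about the right context leads to a dead end, so only the obligatory rewriting survives.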
FIG. 2B shows a finite-state transducer corresponding to the input string "cab", to which is to be applied the obligatory left-right rewrite rule "a.fwdarw.b/c.sub.-- b." FIG. 2C shows the composition of "cab" with Prologue. FIG. 2G shows the finite-state transducer remaining after the other finite-state transducers are composed with the composition of "cab" and Prologue. FIG. 2H shows the composition of the finite-state transducer shown in FIG. 2G with Prologue.sup.-1.
FIG. 2D shows the composition of Id(Obligatory) with the finite-state transducer shown in FIG. 2C. In particular, as shown in FIG. 2D, the Id(Obligatory) finite-state transducer splits the finite-state transducer shown in FIG. 2C into two paths between the states 1 and 3, such that two paths, "a" and "<.sub.i ", extend from state 1. Because only the path through state 5 has both the proper right context "b" and the <.sub.i marker, the right markers are removed only from state 5.
The Id(Rightcontext) finite-state transducer is then composed with the finite-state transducer shown in FIG. 2D, resulting in the finite-state transducer shown in FIG. 2E. As shown in FIG. 2E, the Id(Rightcontext) finite-state transducer removes the right markers from the remaining states and collapses states 2, 4 and 5 of FIG. 2C into states 2 and 3 of FIG. 2E. Furthermore, states 2 and 3 of FIG. 2E are connected by three paths marked, respectively, >.sub.a, >.sub.i, and >.sub.c.
The Replace finite-state transducer is then composed with the finite-state transducer of FIG. 2E, resulting in the finite-state transducer of FIG. 2F. As shown in FIG. 2F, the Replace finite-state transducer generates two parallel paths between states 1 and 3. The first path contains the original input string path, while the second path contains the rewritten input string. Furthermore, the left and right markers <.sub.a, <.sub.c, >.sub.a and >.sub.c are removed from states 0-4 while <.sub.a identifies the transition from state 1 to state 5, <.sub.c defines the transitions looping at states 5 and 6, and >.sub.a defines the transition from state 6 to state 3.
The Id(Leftcontext) finite-state transducer is then composed with the finite-state transducer shown in FIG. 2F, resulting in the finite-state transducer of FIG. 2G. Because the left context "c" of the rewrite rule is present, the Id(Leftcontext) finite-state transducer deletes the path which does not contain the replacement, the path from state 1 to state 3 through state 2, as well as deleting the <.sub.i and <.sub.c markers. Then, as noted above, the Prologue.sup.-1 finite-state transducer is composed with the finite-state transducer shown in FIG. 2G, resulting in the finite-state transducer shown in FIG. 2H.
As is apparent from the above-outlined description of Kaplan's method, Kaplan introduces the sets of left and right markers in the Prologue finite-state transducer, only to delete them in the following finite-state transducers depending on whether the proper context is present. Furthermore, the construction of the Id(Obligatory), Id(Rightcontext), and Id(Leftcontext) finite-state transducers involves many operations, including two intersections of automata, two distinct subtractions, and nine complementations. Each subtraction itself involves an intersection and a complementation. Thus, in total, four intersections and eleven complementations must be performed for each rule when composing the six finite-state transducers of Kaplan's method with any arbitrary input string.
While intersection and complementation are classical automata algorithms, applying intersection and complementation is very costly. For example, the complexity of intersection is quadratic. Moreover, complementation requires that the input automaton be determinized. In this context, determinization can be very time consuming and lead to automata of very large size. This occurs because the complexity of determinization is exponential.
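The cost claims above can be illustrated with textbook constructions, which are not taken from Kaplan's method or from this description. The sketch below assumes automata encoded as Python dictionaries; the function names and the two example families are hypothetical. Intersection is done with the product construction, whose state count is bounded by the product of the two state counts, and determinization is done with the subset construction, which can reach a number of states exponential in the size of the input automaton.

```python
def product_states(start_a, delta_a, start_b, delta_b, alphabet):
    """Reachable states of the product (intersection) of two deterministic automata."""
    start = (start_a, start_b)
    seen, stack = {start}, [start]
    while stack:
        p, q = stack.pop()
        for sym in alphabet:
            if (p, sym) in delta_a and (q, sym) in delta_b:
                nxt = (delta_a[(p, sym)], delta_b[(q, sym)])
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return seen


def determinize(start, delta, alphabet):
    """Subset construction; returns the reachable subsets of NFA states."""
    first = frozenset([start])
    seen, stack = {first}, [first]
    while stack:
        subset = stack.pop()
        for sym in alphabet:
            target = frozenset(q for p in subset for q in delta.get((p, sym), ()))
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def cycle_dfa(k):
    """DFA over {"a"} with k states that counts input length modulo k."""
    return 0, {(i, "a"): (i + 1) % k for i in range(k)}


def nth_from_end_nfa(n):
    """NFA with n + 1 states for: the n-th symbol from the end is "a"."""
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for k in range(1, n):
        delta[(k, "a")] = {k + 1}
        delta[(k, "b")] = {k + 1}
    return 0, delta


start_a, delta_a = cycle_dfa(5)
start_b, delta_b = cycle_dfa(7)
print(len(product_states(start_a, delta_a, start_b, delta_b, "a")))  # 35

for n in (2, 4, 8):
    start, delta = nth_from_end_nfa(n)
    print(n + 1, "NFA states ->", len(determinize(start, delta, "ab")), "DFA states")
```

In the second family the deterministic automaton must remember the last n input symbols, so the number of subsets doubles each time n grows by one.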