The present invention relates to language processing. In particular, the present invention relates to converting a surface text into a lexical representation.
In language processing, it is common to convert a surface form of a word into a lexical form to remove variations in the spelling of the word caused by the morphology associated with different parts of speech. For example, the surface form “happiness” would be converted into the lexical form “happy+ness”, and the surface form “found” would be converted into “find” with a marker for past tense added to the lexical form. Such conversions simplify later processing of the words because fewer variations of the words need to be supported.
A common method of performing such conversions involves the use of Finite State Transducers. In a Finite State Transducer, two states are connected by a transition that maps a character in the surface form of the word to a character or marking in the lexical form. Under many systems, the Finite State Transducers are generated based on a set of rules that describe the mapping from a character in the surface form to a character in the lexical form. Some of these rules include a left context, a right context, or both, and such rules require more than two states in the Finite State Transducer. For example, consider a rule for a conversion from i to y that includes a left context of “p:p”, which requires a “p” in the surface form before the letter i, and a right context of “n:n”, which requires an “n” after the letter i. A complete Finite State Transducer for this rule would include a beginning state, a transition for the letter “p” from the beginning state to a second state, a transition for the conversion i:y from the second state to a third state, and a transition for the letter “n” from the third state.
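By way of illustration only, the example rule above can be written out as a small transition table. The following is a minimal sketch in Python. The state names, the dictionary encoding, and the helper `describe` are assumptions made for this illustration and are not taken from any particular system. The sketch also assumes that the transition for “n” ends in a fourth, accepting state.

```python
# Hypothetical encoding: (state, surface_char) -> (lexical_char, next_state).
# State names are illustrative; ACCEPT is an assumed fourth state.
START, AFTER_P, AFTER_CORE, ACCEPT = 0, 1, 2, 3

TRANSITIONS = {
    (START, "p"): ("p", AFTER_P),        # left context p:p
    (AFTER_P, "i"): ("y", AFTER_CORE),   # core of the rule, i:y
    (AFTER_CORE, "n"): ("n", ACCEPT),    # right context n:n
}


def describe(transitions):
    """Return one readable line per transition, ordered by source state."""
    return [
        f"{src} --{surface}:{lexical}--> {dst}"
        for (src, surface), (lexical, dst) in sorted(transitions.items())
    ]


if __name__ == "__main__":
    for line in describe(TRANSITIONS):
        print(line)
```

Each entry pairs a surface character with the lexical character it maps to, so the left context, the core, and the right context each contribute exactly one transition to the chain.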
Two-level morphology Finite State Transducers are used to create a lexical form of a word by applying the surface form as input to the Finite State Transducers. At each state, a Finite State Transducer determines whether the current character in the input can be used to take a transition from the current state to a next state. If so, the Finite State Transducer moves along the transition to the next state and selects the next character in the input. If the current character does not match any of the transitions out of the current state, the Finite State Transducer fails and returns to its beginning state.
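The traversal described above may be sketched as a short loop. The following Python sketch is illustrative only. The function name `transduce`, the table `EXAMPLE_FST`, and the dictionary encoding are assumptions for this illustration. Failure is modeled by returning `None`; because the current state is local to the call, the next call starts again from the beginning state.

```python
# Hypothetical table: (state, surface_char) -> (lexical_char, next_state).
# It encodes a rule i:y with left context p:p and right context n:n.
EXAMPLE_FST = {
    (0, "p"): ("p", 1),
    (1, "i"): ("y", 2),
    (2, "n"): ("n", 3),
}


def transduce(transitions, start, accepting, surface):
    """Apply a surface string to the transducer, one character at a time.

    Returns the lexical string, or None if some character matches no
    transition out of the current state or the input ends in a
    non-accepting state.
    """
    state = start
    lexical = []
    for ch in surface:
        step = transitions.get((state, ch))
        if step is None:
            return None  # fail; a later call begins at `start` again
        out_char, state = step  # take the transition, advance the input
        lexical.append(out_char)
    return "".join(lexical) if state in accepting else None


if __name__ == "__main__":
    print(transduce(EXAMPLE_FST, 0, {3}, "pin"))
    print(transduce(EXAMPLE_FST, 0, {3}, "pan"))
```

With this table, the input “pin” follows all three transitions, while “pan” fails at the second character because no transition out of the second state accepts “a”.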
Under the prior art, each portion of a rule (the left context, the core, and the right context) was defined as a separate Finite State Transducer. Each of these Finite State Transducers was separately converted into a binary representation that could be used during morphology processing, also known as runtime.
At runtime, the various Finite State Transducers were combined dynamically based on the user input, thereby generating a single virtual Finite State Transducer that was tailored to the input.
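The manner of combination is not detailed here. Purely as an illustration, and assuming that the combination amounts to concatenating the left context, the core, and the right context in order, the runtime step might resemble the following Python sketch. The helper names `linear_fst` and `concatenate` and the dictionary encoding are hypothetical.

```python
def linear_fst(pairs):
    """Build a chain transducer from (surface, lexical) character pairs.

    Returns (transitions, final_state), where state k has one transition
    to state k + 1 and transitions maps
    (state, surface_char) -> (lexical_char, next_state).
    """
    transitions = {
        (k, surface): (lexical, k + 1)
        for k, (surface, lexical) in enumerate(pairs)
    }
    return transitions, len(pairs)


def concatenate(parts):
    """Join chain transducers end to end by renumbering their states.

    Assumption for this sketch: combination means concatenation, so the
    final state of one part becomes the start state of the next.
    """
    combined = {}
    offset = 0
    for transitions, final_state in parts:
        for (src, surface), (lexical, dst) in transitions.items():
            combined[(src + offset, surface)] = (lexical, dst + offset)
        offset += final_state
    return combined, offset


# Three separately defined pieces of the rule i:y between p and n.
left = linear_fst([("p", "p")])
core = linear_fst([("i", "y")])
right = linear_fst([("n", "n")])

# Work like this, repeated for each input at runtime, is the source of
# the overhead attributed to the prior approach.
virtual_fst, final_state = concatenate([left, core, right])
```

Even in this small case, every state of every piece is visited and renumbered before any input character is processed.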
Although combining Finite State Transducers at runtime produces a working morphology system, it greatly slows morphology processing.