The present invention relates to information processing, and more particularly to a method and apparatus for mapping multiword expressions to identifiers using finite-state networks.
Information processing ranges from tokenization to morphological analysis, disambiguation, and parsing. These and other aspects of language processing can be performed efficiently using finite-state networks. Such networks are compiled from regular expressions, a formal language for representing sets and relations. A relation is a set of ordered string pairs, where a string is a concatenation of zero or more symbols.
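By way of illustration only, the notion of a relation as a set of ordered string pairs can be sketched in a few lines of Python. The pairs, the tag notation (e.g., "+Noun+Pl"), and the function names below are assumptions chosen for the example and are not taken from any particular system; a finite-state network would encode the same pairs compactly as paths rather than as an enumerated set.

```python
# Illustrative sketch: a relation modeled as a plain set of ordered string pairs.
# Each pair relates an (assumed) analysis string to a surface string.
RELATION = {
    ("leaf+Noun+Pl", "leaves"),
    ("leave+Verb+3sg", "leaves"),
    ("bird+Noun+Pl", "birds"),
}


def upper_language(relation):
    """Return the set of strings appearing as the first member of a pair."""
    return {upper for upper, _ in relation}


def lower_language(relation):
    """Return the set of strings appearing as the second member of a pair."""
    return {lower for _, lower in relation}


def analyses_of(relation, surface):
    """Return, in sorted order, every first member paired with `surface`."""
    return sorted(upper for upper, lower in relation if lower == surface)


if __name__ == "__main__":
    print(analyses_of(RELATION, "leaves"))
```

The example also shows why disambiguation is needed: a single surface string such as "leaves" may be paired with more than one analysis.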
Finite-state networks have been used to develop a contextual dictionary lookup system for multiword expressions. To correctly interpret their meaning, multiword expressions need to be recognized as complex lexical units, because one multiword expression may take on many variations. Multiword expressions include, for example, idiomatic expressions (e.g., “to rack one's brains over”), proverbial sayings (e.g., “birds of a feather flock together”), phrasal verbs (e.g., “to come up with”), lexical and grammatical collocations (e.g., “with regard to”), and compound terms (e.g., “online dictionary”).
Examples of a system for processing multiword expressions are disclosed by Silberztein in “INTEX: a corpus processing system”, published in Proceedings of COLING-94, Vol. 1, Kyoto, Japan, 1994. Other examples of systems for processing multiword expressions are disclosed in U.S. Pat. Nos. 5,644,774 and 5,845,306.
Another example system for processing multiword expressions is the Xerox Linguistic Development Architecture (XeLDA®) that provides as part of its linguistic services idiom recognition and contextual bi-lingual dictionary lookup. XeLDA uses an idiomatic regular expression language (IDAREX) for describing idiomatic expressions and an idiomatic expression compiler for incorporating regular expressions defined using IDAREX into finite-state networks. Contextual bi-lingual dictionary lookup in XeLDA is performed by retrieving a word's context and using that context to find its translation.
Further aspects of XeLDA are published in “XeLDA Overview” and “XeLDA C++ API Programmer's Guide”, Xerox XeLDA® the linguistic engine, June, 2002 and U.S. Pat. No. 6,321,372. In addition, further background concerning XeLDA's recognition of multiword expressions is described in U.S. Pat. Nos. 5,642,522 and 6,393,389, which are incorporated herein by reference, and the disclosure by Bauer et al., “LOCOLEX: the translation rolls off your tongue”, published in Proceedings of ACH-ALLC, Santa-Barbara, USA, 1995.
More specifically, contextual bi-lingual dictionary lookup is performed in XeLDA by first segmenting input text into sentences. Each sentence is segmented into words, morphologically analyzed, and disambiguated before being compiled into a sentence finite-state network. Each word of the sentence is then looked up in a language dictionary. For each entry in the language dictionary that has an associated finite-state network of idioms, the sentence finite-state network is matched against complete paths in the associated finite-state network of idioms. The collection of complete paths matched in the associated finite-state networks of idioms identifies the idioms (or multiword expressions) for the input sentence.
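The word-by-word lookup described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions and is not XeLDA's implementation or API: the dictionary contents and all function names are hypothetical, morphological analysis and disambiguation are replaced by lowercasing and whitespace splitting, and each "network of idioms" is reduced to a list of word tuples matched as contiguous runs rather than as paths in a finite-state network.

```python
from typing import Dict, List, Tuple

Idiom = Tuple[str, ...]

# Hypothetical dictionary: entry -> idioms associated with that entry.
# In the described system each entry would point to a finite-state network.
IDIOMS_BY_ENTRY: Dict[str, List[Idiom]] = {
    "rack": [("rack", "one's", "brain")],
    "brain": [("rack", "one's", "brain")],
    "regard": [("with", "regard", "to")],
}


def segment_sentences(text: str) -> List[str]:
    """Naive sentence segmentation on periods (assumption for the sketch)."""
    return [s.strip() for s in text.split(".") if s.strip()]


def analyze(sentence: str) -> List[str]:
    """Stand-in for word segmentation, morphological analysis, disambiguation."""
    return sentence.lower().split()


def contains(words: List[str], idiom: Idiom) -> bool:
    """True if `idiom` occurs as a contiguous run of `words`."""
    n = len(idiom)
    return any(tuple(words[i:i + n]) == idiom for i in range(len(words) - n + 1))


def lookup_idioms(text: str) -> List[Tuple[str, Idiom]]:
    """Look up each word and match the sentence against that entry's idioms."""
    hits = []
    for sentence in segment_sentences(text):
        words = analyze(sentence)
        for word in words:
            for idiom in IDIOMS_BY_ENTRY.get(word, []):
                if contains(words, idiom):
                    hits.append((word, idiom))
    return hits


if __name__ == "__main__":
    for entry, idiom in lookup_idioms("He will rack one's brain with regard to it."):
        print(entry, "->", " ".join(idiom))
```

In this sketch the idiom "rack one's brain" is reported twice, once through the entry "rack" and once through the entry "brain", because matching is driven by individual dictionary entries rather than by the sentence as a whole.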
Even though not every word in XeLDA's language dictionaries is associated with a finite-state network of idioms, loading and unloading these networks at runtime places significant demands on memory, even if caching is used. In addition, because an input sentence is processed word-by-word in XeLDA, the same finite-state network of idioms may be referenced by more than one entry in a dictionary, possibly leading to efficiency losses in which the network is loaded and unloaded from memory multiple times and the same idiom is matched multiple times. Also, idioms made up of short words may be missed because efficient word-by-word processing requires skipping words shorter than a predefined number of characters (e.g., 3).
Accordingly, it would be desirable to provide an improved system for recognizing multiword expressions that overcomes these and other limitations of existing systems and methods for identifying multiword expressions.