1. Field of the Invention
The present invention relates generally to the field of formal language theory and more particularly to the field of computerized string-number mapping using finite-state machines.
2. Background of the Invention
Word-to-number and number-to-word mapping have been used for years and the techniques have been published in various places (see for example the publications entitled: “The World's Fastest Scrabble Program”, by A. Appel and G. Jacobson, published in Communications of the ACM, 31(5):572–578, 1988; “Applications Of Finite Automata Representing Large Vocabularies” by C. L. Lucchesi and T. Kowaltowski, published in Software-Practice and Experience, 23(1):15–30, 1993; and “Finite-State Tools For Language Processing”, by Emmanuel Roche, published in ACL'95 (Association for Computational Linguistics), 1995).
Generally, word-to-number mapping relates each whole string (i.e. “word”) in a finite language to a unique integer in a dense range. The published technique is known to work only with finite-state networks that encode a finite language; where the language contains n strings (words), the technique relates each string to a unique integer in the range 0 to (n−1) or, in the trivial Lucchesi and Kowaltowski variant cited above, to an integer in the range 1 to n. A finite-state network that accepts a language is referred to herein as “an acceptor” or “an acceptor network”.
The principal use of word-to-number mapping (and the inverse, number-to-word mapping) is as a perfect hashing function, allowing efficient integer-indexed mapping from each whole string to “related information” that includes: definitions, translations, glosses, thesaurus sense groups, or other arbitrary data associated with that whole string.
For example, the following U.S. patents relate to the use of word-to-number mapping, and the inverse number-to-word mapping: U.S. Pat. No. 5,325,091, entitled “Text-Compression Technique Using Frequency-Ordered Array of Word-Number Mappers”; U.S. Pat. No. 5,523,946, entitled “Compact Encoding of Multi-Lingual Translation Dictionaries”; U.S. Pat. No. 5,787,386, entitled “Compact Encoding of Multi-Lingual Translation Dictionaries”; and U.S. Pat. No. 5,754,847, entitled “Word/Number and Number/Word Mapping”.
Word-number mapping has also been extended to finite-state transducers; see for example U.S. Pat. No. 5,950,184, entitled “Indexing a Database by Finite-State Transducer” (hereinafter referred to as “the '184 patent”). In a typical scenario using the technique disclosed in the '184 patent, a transducer is applied to an input word that is ambiguous, yielding multiple output strings. Word-to-number mapping is then performed on the output strings, returning multiple indices. Thus the whole input word is related to a set of numbers, which can be used as indices to retrieve multiple glosses. The English word “fly”, for example, is ambiguous and might, via the mapping of a finite-state transducer, be analyzed as “fly [Verb]” and as “fly [Noun]”; a straightforward English-to-Spanish glossing application would need to retrieve the gloss “volar” for the verb and “mosca” for the noun by using the unique index assigned to each output string by word-to-number mapping.
2.1 Classic Word-Number Mapping
Word-to-number mapping and number-to-word mapping (referred to herein together as “word-number mapping”) are described here as background while referring to FIGS. 1–5.
2.2 Preparation for Word-Number Mapping
Before classic word-to-number mapping or number-to-word mapping can be performed using an acceptor network, the acceptor must be pre-processed to add integer counts on the nodes. As an example, a five-word acceptor is shown in FIG. 1.
More specifically, FIG. 1 shows an acceptor for the language consisting of the five words “clear”, “clever”, “ear”, “ever”, and “other”. The acceptor will accept these five words and will reject all other words. Each word corresponds to a path of labels on the arcs leading from the start state (i.e., the leftmost state shown in FIG. 1) to a final state, which is conventionally represented as a double circle.
The preprocessing performed for word-number mapping may be summarized as follows:
Begin by marking each non-final state with 0 and each final state with 1 as shown in FIG. 2. FIG. 2 shows an acceptor for the language consisting of the five words “clear”, “clever”, “ear”, “ever”, and “other”, initialized with a count of zero on each non-final node and a count of one on each final node.
Subsequently, consider each state in turn, adding one to the count for each path leading from that state to a final state. The result is shown in FIG. 3. FIG. 3 shows an acceptor for the language consisting of the five words “clear”, “clever”, “ear”, “ever”, and “other”, completely initialized for word-number mapping with counts on the nodes. Note that 5 strings can be completed from the start state, being the five strings of the language encoded by the acceptor.
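The counting step described above may be sketched as follows. This is only an illustrative sketch: FIGS. 1 to 3 are not reproduced here, so the state numbering, the dictionary encoding `ARCS`, the set `FINALS`, and the function name `path_counts` are hypothetical, and the network shown is merely one acyclic acceptor for the same five words.

```python
# Hypothetical encoding of an acyclic acceptor for the five words
# "clear", "clever", "ear", "ever" and "other".
# Each state maps to an ordered list of (label, destination) arcs.
ARCS = {
    0: [("c", 1), ("e", 3), ("o", 6)],   # state 0 is the start state
    1: [("l", 2)],
    2: [("e", 3)],
    3: [("a", 4), ("v", 5)],
    4: [("r", 8)],
    5: [("e", 4)],
    6: [("t", 7)],
    7: [("h", 5)],
    8: [],
}
FINALS = {8}  # the single final state in this sketch


def path_counts(arcs, finals):
    """Return, for every state, the number of paths from it to a final state."""
    counts = {}

    def visit(state):
        if state not in counts:
            # A final state contributes 1 for the string that ends there;
            # each outgoing arc contributes the count of its destination.
            counts[state] = int(state in finals) + sum(
                visit(dst) for _, dst in arcs[state]
            )
        return counts[state]

    for state in arcs:
        visit(state)
    return counts


COUNTS = path_counts(ARCS, FINALS)
print(COUNTS[0])  # count on the start state = number of words in the language
```

The memoized depth-first traversal visits each state once, which is sufficient because the network is assumed to contain no loops.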
2.3 Word-to-Number Mapping
Classic word-to-number mapping and number-to-word mapping work only for finite acceptors, i.e. for networks encoding finite languages. In other words, these classic techniques do not work for networks encoding infinite languages or for transducers.
Word-to-number mapping takes as input a word from the language of the acceptor and maps the word to an integer in a dense range from 0 to (n−1), where n is the finite number of strings in the language. (The “dense range” means that there are no gaps in the range; each word corresponds to a unique integer in the dense range 0 to (n−1), and each number in the range corresponds to a unique word.) An example of program instructions for performing word-to-number mapping is shown in FIG. 4.
Using the instructions shown in FIG. 4, the words of the language defined by the acceptor in FIG. 3 are mapped to the following integers: clear: 0; clever: 1; ear: 2; ever: 3; other: 4. That is, the five words of the language are mapped to unique integers in the dense range of 0 to (5−1).
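The mapping listed above can be reproduced with a short sketch. It is not the instruction listing of FIG. 4, which is not reproduced here; the network encoding `ARCS`, the helper `path_counts`, and the function `word_to_number` are hypothetical names, arcs are assumed to be ordered alphabetically at each state, and the virtual exit arc of a final state is assumed to be ordered before the labeled arcs.

```python
# Hypothetical acyclic acceptor for "clear", "clever", "ear", "ever", "other";
# arcs at each state are listed in the (assumed) alphabetical order.
ARCS = {
    0: [("c", 1), ("e", 3), ("o", 6)],
    1: [("l", 2)],
    2: [("e", 3)],
    3: [("a", 4), ("v", 5)],
    4: [("r", 8)],
    5: [("e", 4)],
    6: [("t", 7)],
    7: [("h", 5)],
    8: [],
}
FINALS = {8}


def path_counts(arcs, finals):
    """Number of paths from each state to a final state (the node counts)."""
    counts = {}

    def visit(state):
        if state not in counts:
            counts[state] = int(state in finals) + sum(
                visit(dst) for _, dst in arcs[state]
            )
        return counts[state]

    for state in arcs:
        visit(state)
    return counts


def word_to_number(word, arcs, finals, counts, start=0):
    """Map a word of the language to its index in the range 0..n-1."""
    state, index = start, 0
    for symbol in word:
        if state in finals:
            index += 1  # bypass the virtual exit arc (assumed ordered first)
        for label, dst in arcs[state]:
            if label == symbol:
                state = dst          # follow the matching arc
                break
            index += counts[dst]     # bypassed arc: add its destination count
        else:
            raise KeyError(f"{word!r} is not in the language")
    if state not in finals:
        raise KeyError(f"{word!r} is not in the language")
    return index


COUNTS = path_counts(ARCS, FINALS)
for w in ["clear", "clever", "ear", "ever", "other"]:
    print(w, word_to_number(w, ARCS, FINALS, COUNTS))
```

Under these assumptions the sketch assigns 0 to “clear”, 1 to “clever”, 2 to “ear”, 3 to “ever”, and 4 to “other”.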
2.4 Number-to-Word Mapping
Number-to-word mapping is the inverse operation of word-to-number mapping. For a language of n words, number-to-word mapping maps each integer in the dense range 0 to (n−1) to a unique word in the language. An example of program instructions for performing number-to-word mapping is shown in FIG. 5. In considering the “arcs leading out of the current state”, this includes the virtual “exit arc” in the case of final states.
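The inverse operation may be sketched in the same hedged manner. This is not the listing of FIG. 5; the encoding `ARCS`, the helper `path_counts`, and the function `number_to_word` are hypothetical, and the virtual exit arc is again assumed to be ordered before the labeled arcs of a final state.

```python
# Hypothetical acyclic acceptor for "clear", "clever", "ear", "ever", "other".
ARCS = {
    0: [("c", 1), ("e", 3), ("o", 6)],
    1: [("l", 2)],
    2: [("e", 3)],
    3: [("a", 4), ("v", 5)],
    4: [("r", 8)],
    5: [("e", 4)],
    6: [("t", 7)],
    7: [("h", 5)],
    8: [],
}
FINALS = {8}


def path_counts(arcs, finals):
    """Number of paths from each state to a final state (the node counts)."""
    counts = {}

    def visit(state):
        if state not in counts:
            counts[state] = int(state in finals) + sum(
                visit(dst) for _, dst in arcs[state]
            )
        return counts[state]

    for state in arcs:
        visit(state)
    return counts


def number_to_word(index, arcs, finals, counts, start=0):
    """Map an index in the range 0..n-1 back to the corresponding word."""
    if not 0 <= index < counts[start]:
        raise IndexError("index outside the dense range")
    state, symbols = start, []
    while True:
        if state in finals:
            if index == 0:
                return "".join(symbols)  # take the virtual exit arc
            index -= 1                   # bypass the virtual exit arc
        for label, dst in arcs[state]:
            if index < counts[dst]:
                symbols.append(label)    # the target word passes through dst
                state = dst
                break
            index -= counts[dst]         # bypass this arc and its sub-language


COUNTS = path_counts(ARCS, FINALS)
print([number_to_word(i, ARCS, FINALS, COUNTS) for i in range(COUNTS[0])])
```

Because the index always stays below the count on the current state, one of the outgoing arcs (or the exit arc) is guaranteed to be taken at every step.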
2.5 Summary of Word-Number Mapping
A finite-state network encoding a regular language generally consists of a set of states, one designated as the start state, zero or more designated as final states, and labeled and directed arcs representing transitions from one state to another state. Each path from the start state to a final state corresponds to a word in the language encoded by the network. If the network is non-cyclic, i.e. if it contains no loops and therefore denotes a finite language, the language will have a finite cardinality n.
Word-to-number mapping uses the finite-state network to relate each of the n strings in the language to a unique integer in the dense range 0 to (n−1), and number-to-word mapping is the inverse operation, providing a perfect hash function. The techniques and applications are well described in the literature, especially in the Lucchesi and Kowaltowski paper cited above.
Exit arcs from a state are ordered. In word-to-number mapping, calculation of the unique index number for each whole string in the language of the network involves initializing an index-count to zero, “looking up” the string in the network, i.e. following the path corresponding to the symbols of the string from the start state to a final state, and adding to the index-count the counts on the destination states of arcs that are bypassed, in a lexicographic order, during the process of lookup.
The “lexicographic order” concerns the sorting priority from the beginning to end of the strings in the network (i.e., the primary sort is performed on the first character, within that the next sort is performed on the second character, etc.). While a lexicographic order may suggest that the labeled arcs leaving each state are sorted in alphabetical order, they may alternatively be ordered arbitrarily at each state. In addition, while lexicographic order may suggest that the same ordering of labeled arcs leaving each state is required at each state, it may alternatively be possible for each state to have a unique order arbitrarily different from the ordering of labeled arcs leaving any other state.
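The observation that the arcs may be ordered arbitrarily, and differently at each state, can be checked with a small sketch. The network encoding, the function names, and the use of a seeded shuffle are all hypothetical choices made for illustration; the only property being demonstrated is that the resulting indices still form the dense range 0 to (n−1), although the particular index given to each word depends on the ordering chosen.

```python
import random

# Hypothetical acyclic acceptor for "clear", "clever", "ear", "ever", "other".
ARCS = {
    0: [("c", 1), ("e", 3), ("o", 6)],
    1: [("l", 2)],
    2: [("e", 3)],
    3: [("a", 4), ("v", 5)],
    4: [("r", 8)],
    5: [("e", 4)],
    6: [("t", 7)],
    7: [("h", 5)],
    8: [],
}
FINALS = {8}
WORDS = ["clear", "clever", "ear", "ever", "other"]


def path_counts(arcs, finals):
    """Number of paths from each state to a final state (the node counts)."""
    counts = {}

    def visit(state):
        if state not in counts:
            counts[state] = int(state in finals) + sum(
                visit(dst) for _, dst in arcs[state]
            )
        return counts[state]

    for state in arcs:
        visit(state)
    return counts


def word_to_number(word, arcs, finals, counts, start=0):
    """Sum the counts of arcs bypassed, in each state's own arc order."""
    state, index = start, 0
    for symbol in word:
        if state in finals:
            index += 1  # virtual exit arc, assumed ordered first
        for label, dst in arcs[state]:
            if label == symbol:
                state = dst
                break
            index += counts[dst]
        else:
            raise KeyError(word)
    if state not in finals:
        raise KeyError(word)
    return index


# Give every state its own arbitrary arc order.
rng = random.Random(0)
SHUFFLED = {state: rng.sample(out, len(out)) for state, out in ARCS.items()}
COUNTS = path_counts(SHUFFLED, FINALS)
INDICES = {w: word_to_number(w, SHUFFLED, FINALS, COUNTS) for w in WORDS}
print(sorted(INDICES.values()))  # the indices still cover 0..n-1 without gaps
```

The counts on the states do not depend on the arc order, so only the assignment of indices to words changes when the order is changed.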
Number-to-word mapping is the straightforward inverse of word-to-number mapping. To retrieve a string given its index x, the technique initializes an index-count to x and sets the start state of the network as the current state. The counts on the states that can be reached from the current state are then examined; working in lexicographic order, the maximum number of states whose collective count value does not exceed the index-count is bypassed, the index-count is decremented by that collective count, and the next transition is followed to its destination state. That state becomes the new current state, and the technique is re-applied until the index-count is zero and a final state has been reached. The string of symbols on the path followed is returned as the word corresponding to the original index x.
Established techniques of word-number mapping are sensitive only to the real start state and to the real final states of the network. The technique applies only to networks denoting a finite language, indexing whole strings as units. In word-number mapping, there is only one indexing domain for the entire language. The number derived from word-to-number mapping is typically used as an index into an array of pointers (or offsets) into a database containing related information.
Accordingly, it would be advantageous to provide a word-number mapping technique that applies to networks encoding infinite languages with sufficient granularity to operate on substrings of the strings.