The present invention relates to data processing, and more particularly to a technique of indexing a database using the path numbers of an acyclic finite-state transducer (FST).
In the following, an index refers to a device that associates alphanumeric keys with one or more addresses in a database. If the database is a physical object, such as a book, the index might be a list of pairs consisting of a keyword or a phrase and a list of numbers identifying the pages where the word or phrase appears in the book. If the database is an electronic document, such as a machine-readable dictionary, the index might be a list of pairs that consist of a headword of a dictionary entry as a key, and a list of locations in a computer file or in the memory of the computer where the entry is stored.
For a computerised index, it is important to have an efficient method of computing the address corresponding to a key. The construction of a hash table is a well-known way to achieve this purpose. It involves (a) selecting a function that assigns to each key some random numerical value in a chosen range and (b) storing the address associated with the key in the corresponding location in the hash table. The standard hashing method can be suboptimal in two ways: (1) the hash function may assign the same value to more than one key, and (2) some values may not be assigned to any key at all. Because of (1), each address must be marked in some way to make it possible to determine which key it belongs to; and (2) means that the hash table may be partially empty. If (1) or (2) holds, some space, and possibly search time, is wasted.
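The two shortcomings of standard hashing can be seen in a minimal Python sketch. The hash function `toy_hash`, the table size, the keys and the addresses are all assumptions chosen for illustration; they are not taken from any particular system.

```python
def toy_hash(key, size):
    # Hypothetical hash function: sum of character codes modulo the table size.
    return sum(ord(c) for c in key) % size


def build_table(index, size):
    # Each slot holds (key, address) pairs; storing the key is the "marking"
    # needed to tell colliding addresses apart.
    table = [[] for _ in range(size)]
    for key, address in index.items():
        table[toy_hash(key, size)].append((key, address))
    return table


def lookup(table, key):
    # Scan the slot and compare stored keys to resolve collisions.
    for stored_key, address in table[toy_hash(key, len(table))]:
        if stored_key == key:
            return address
    return None


index = {"ab": 100, "ba": 200, "d": 300}
table = build_table(index, 4)
empty_slots = sum(1 for slot in table if not slot)
```

With these assumed inputs, "ab" and "ba" fall into the same slot, so the keys must be stored alongside the addresses, and two of the four slots stay empty.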
Word/number mapping
Both of the aforementioned problems can be avoided by finding a hash function that assigns to each key a unique value in a range that exactly matches the size of the hash table. This guarantees that every position in the table is filled with one and only one address; thus, it is not necessary to mark which key each address belongs to.
A perfect hash function of this type can be obtained by constructing a deterministic finite-state automaton that enumerates the keys. C. L. Lucchesi and T. Kowaltowski ("Applications of Finite Automata Representing Large Vocabularies", Software Practice and Experience, Vol. 23(1), pp. 15-30, January 1993) describe an algorithm that associates a unique number with every word accepted by a deterministic finite-state automaton. The numbers range from 0 to n-1, where n is the number of words accepted by the automaton. Because the size of the hash table can be the same as the number of keys, no space is wasted.
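The idea of numbering the words accepted by an automaton can be sketched as follows. This is a simplified illustration and not the published algorithm itself: it uses a plain trie as the deterministic acyclic automaton (a minimised automaton would share suffixes), and the names `State`, `build_trie`, `annotate` and `word_to_number` are hypothetical. Each state stores the number of words accepted from it, and a word's number is the count of words that precede it in lexicographic order.

```python
class State:
    def __init__(self):
        self.final = False  # True if a word ends at this state
        self.arcs = {}      # symbol -> next State
        self.count = 0      # number of words accepted starting from this state


def annotate(state):
    # Post-order pass: count the words reachable from each state.
    state.count = int(state.final) + sum(annotate(t) for t in state.arcs.values())
    return state.count


def build_trie(words):
    root = State()
    for word in words:
        state = root
        for symbol in word:
            state = state.arcs.setdefault(symbol, State())
        state.final = True
    annotate(root)
    return root


def word_to_number(root, word):
    number = 0
    state = root
    for symbol in word:
        if symbol not in state.arcs:
            return None  # the word is not accepted
        if state.final:
            number += 1  # a shorter word ending here precedes this one
        for other in sorted(state.arcs):
            if other == symbol:
                break
            number += state.arcs[other].count  # skip an earlier branch
        state = state.arcs[symbol]
    return number if state.final else None


root = build_trie(["do", "dog", "done", "to"])
numbers = [word_to_number(root, w) for w in ["do", "dog", "done", "to"]]
```

For the four assumed words, the numbers run from 0 to 3 with no gaps and no repeats, so a table of exactly four addresses suffices.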
However, in some cases of database access even such a perfect hash function is not appropriate because (1) it assigns, by definition, to each key exactly one value and because (2) every value is associated with only one key. There are databases and applications for which neither (1) nor (2) is desirable. An example of that kind is an online dictionary. To illustrate a case in which it is desirable to provide multiple values for some keys, in the following description we will consider a simple example: searching for the location of the entry for the word "do" in an English dictionary.
Published European patent application EP-A-649,105 discloses a technique in which a stored word list can be used for word-to-number (W/N) and number-to-word (N/W) mapping. Each word in the list can be mapped to a unique number within a dense set of numbers ranging from zero to one less than the total number of words in the list. Some branches of the data structure can be skipped during mapping because of branching information associated with branch points. The branching information permits mapping to continue with a next branch or with an alternative branch. The branching information indicates the number of suffix endings in the next branch; this number is used to keep a count of the word endings during W/N mapping; it is also used both to determine whether to continue with the next branch and to reduce the number being mapped during N/W mapping. The branching information can include a full-length pointer to the next branch or a shorter pointer index to a table in which the full-length pointer is stored. In either case, the number of suffix endings in the next branch can be annexed to the pointer. Where sublists of words have identical suffixes, the suffixes can be collapsed into shared branches.
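The role of the per-branch ending counts in N/W mapping can be illustrated with a small Python sketch. It does not reproduce the pointer-based data structure of EP-A-649,105; it assumes a simple dictionary-based trie, and the names `make_node`, `build` and `number_to_word` are hypothetical. The count stored with each branch decides whether to enter the branch or to skip it and reduce the number being mapped.

```python
def make_node():
    return {"final": False, "arcs": {}, "count": 0}


def _count(node):
    # Number of word endings in this node and everything below it.
    node["count"] = int(node["final"]) + sum(_count(n) for n in node["arcs"].values())
    return node["count"]


def build(words):
    root = make_node()
    for word in words:
        node = root
        for symbol in word:
            node = node["arcs"].setdefault(symbol, make_node())
        node["final"] = True
    _count(root)
    return root


def number_to_word(root, number):
    if not 0 <= number < root["count"]:
        return None
    node, symbols = root, []
    while True:
        if node["final"]:
            if number == 0:
                return "".join(symbols)
            number -= 1  # account for the word ending at this node
        for symbol in sorted(node["arcs"]):
            branch = node["arcs"][symbol]
            if number < branch["count"]:
                symbols.append(symbol)  # the target word lies in this branch
                node = branch
                break
            number -= branch["count"]  # skip the whole branch
        else:
            return None  # unreachable when the counts are consistent


root = build(["do", "dog", "done", "to"])
words = [number_to_word(root, n) for n in range(4)]
```

In this sketch, skipping a branch costs one subtraction regardless of how many words the branch contains, which is the point of storing the counts at the branch points.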
A typical dictionary does not provide a unique headword for each entry. Many words, including "do", appear as a headword in several entries. There is one entry for "do" in the sense of an activity, and another entry for "do" as a note on a musical scale. Consequently, the query for the address of "do" should yield multiple answers: one for each entry with "do" as the headword.
The previous example assumes that the user does not know in advance that "do" can be both a verb and a noun. But this is not necessarily true. In a sentence like "I want to do something," the occurrence of "do" can be easily identified as a verb. In this context, only the entry or entries concerned with the verb sense of "do" are relevant. The user should be able to narrow the query, say, to "do+V", thereby receiving only a pointer to the verb entry of "do" as the answer. (We use the symbol "+V" to refer to verbs, "+N" for nouns.)
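The desired query behaviour can be made concrete with a short Python sketch. It uses an ordinary dictionary rather than a finite-state transducer, and the entry list, the addresses and the names `build_index` and `query` are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical dictionary entries: (headword, part-of-speech tag, address).
ENTRIES = [
    ("do", "+V", 1200),
    ("do", "+N", 1250),
    ("dog", "+N", 1300),
]


def build_index(entries):
    index = defaultdict(list)
    for headword, tag, address in entries:
        index[headword].append(address)        # broad key, e.g. "do"
        index[headword + tag].append(address)  # narrowed key, e.g. "do+V"
    return index


def query(index, key):
    # Return every address recorded for the key, or an empty list.
    return index.get(key, [])


index = build_index(ENTRIES)
broad = query(index, "do")
narrow = query(index, "do+V")
```

Here the broad query "do" yields both assumed addresses, while the narrowed query "do+V" yields only the address of the verb entry.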