This invention relates generally to the fields of computer systems and data processing. More particularly, given a first data sequence, an apparatus and method are provided for identifying and retrieving a second data sequence associated with or related to the first.
In many data processing environments, it is necessary to map a given sequence of data (e.g., text, numbers) to a related sequence or set of data. Thus, a database system may receive a sequence of characters that a user wishes to search for in a help file, within titles or subjects of books, in a glossary, etc. The user's sequence of characters may reflect just one of multiple different sequences that are equivalent.
For example, in the Unicode character set, a composite character may be received as a sequence of base or other characters. Thus, the composite character ǘ may be received as any of the following sequences: (ǘ), (u, ¨, ´), (u, ´, ¨), (ü, ´) or (ú, ¨). Upon receipt of a user's sequence, the database system may need to look up and retrieve another equivalent sequence or, in this example, a composite character that is equivalent to the sequence. The database may be configured to make such translations in order to do searches with a particular form of a sequence of characters. By changing all equivalent sequences to one form, the database can normalize its search procedure. Thus, a Unicode sequence that can be composed to one normalized or composite character may be replaced, for purposes of the user's search, with the composite character.
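As a rough illustration of replacing equivalent sequences with one composite character, the sketch below uses Python's standard unicodedata module. Unicode NFC composition is used here only as a stand-in for the translation step; it is not the mapping structure this document describes, and the names COMPOSITE, candidates and to_search_form are hypothetical. Only two of the sequences listed above are exercised, written with combining marks.

```python
import unicodedata

COMPOSITE = "\u01d8"  # LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE

# Two of the equivalent input sequences, written with combining marks.
candidates = [
    "u\u0308\u0301",  # u, combining diaeresis, combining acute
    "\u00fc\u0301",   # u-with-diaeresis, combining acute
]

def to_search_form(text: str) -> str:
    """Compose a character sequence into a single normalized form (NFC)."""
    return unicodedata.normalize("NFC", text)

# Every candidate is reduced to the same composite character, so a search
# only has to deal with one form of the sequence.
normalized = [to_search_form(s) for s in candidates]
```

Orderings that a standard normalizer does not merge would need explicit entries in a mapping structure such as those discussed in the following paragraphs.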
A sequence given by the user that will be mapped or translated into another (e.g., equivalent) sequence or character may be virtually any length. And, since any permutation of the equivalent sequence may be received, the system must have mappings for each one. Thus, storing mappings for all possible sequences of a given set of data, and all permutations of the sequences, may require a large amount of storage space.
One manner in which such mappings have been stored is as a multilevel index tree, wherein each node of the tree comprises a fixed-size array. Each array element represents one character that may be part of the given sequence, with a pointer to the node that stores the next member(s) of the sequence. The node and element corresponding to the last datum or item of a user's sequence may store or identify the equivalent sequence. Index trees may require an enormous amount of storage space in order to accommodate all possible lengths of given data sequences. For example, the number of array elements in each node may be equal to the number of possible values of each character or item in the given sequence. And, because many of the tree nodes may be less than fully populated, much of the storage space may be wasted.
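The storage cost of such an index tree can be seen in a minimal sketch. It assumes byte-valued units, so each node holds a 256-element array; IndexNode, insert, lookup and slot_usage are hypothetical names chosen for illustration, not taken from any particular system.

```python
ALPHABET_SIZE = 256  # assumption: one array slot per possible byte value

class IndexNode:
    def __init__(self):
        # Fixed-size array: one child pointer per possible unit value.
        self.children = [None] * ALPHABET_SIZE
        self.target = None  # equivalent sequence, if a key ends at this node

def insert(root, key: bytes, target: str) -> None:
    node = root
    for unit in key:
        if node.children[unit] is None:
            node.children[unit] = IndexNode()
        node = node.children[unit]
    node.target = target

def lookup(root, key: bytes):
    node = root
    for unit in key:
        node = node.children[unit]
        if node is None:
            return None
    return node.target

def slot_usage(root):
    """Return (used_slots, total_slots) counted over every node in the tree."""
    used = total = 0
    stack = [root]
    while stack:
        node = stack.pop()
        total += ALPHABET_SIZE
        for child in node.children:
            if child is not None:
                used += 1
                stack.append(child)
    return used, total

root = IndexNode()
insert(root, b"ab", "X")
insert(root, b"ac", "Y")
used, total = slot_usage(root)  # two short keys already allocate four arrays
```

With only two keys stored, four nodes are allocated and all but three of their array slots stay empty, which is the waste described above.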
In another method of storing data mappings and sequences, a hash table is employed. A hash function is applied to a user's given sequence to generate a key. The key serves as an index to a corresponding element of a hash table or array. The array element identifies or lists (e.g., via a linked list or other structure) data sequences that hash to the same key value and, for each such sequence, a corresponding equivalent sequence. To find the appropriate target data sequence in the linked list, each member of the list must be examined. One of them will store the given data sequence and the destination sequence. Although a hash table may be more efficient in terms of the amount of storage space that is required, finding and retrieving a target sequence for a particular given sequence may be relatively slow because each member of a linked list associated with a key value must be examined.
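The lookup cost of the hash table approach can be sketched as follows. The table size, the toy hash function and the names bucket_index, put and get are assumptions made for illustration; a Python list stands in for each linked list of colliding entries.

```python
NUM_BUCKETS = 8  # assumption: a small table, so that collisions are easy to show

def bucket_index(key: str) -> int:
    # Toy hash function: sum of code points modulo the table size.
    return sum(ord(ch) for ch in key) % NUM_BUCKETS

# Each table element lists (given sequence, target sequence) pairs.
table = [[] for _ in range(NUM_BUCKETS)]

def put(key: str, target: str) -> None:
    bucket = table[bucket_index(key)]
    for i, (stored_key, _) in enumerate(bucket):
        if stored_key == key:
            bucket[i] = (key, target)
            return
    bucket.append((key, target))

def get(key: str):
    """Return (target, comparisons); target is None if the key is not stored."""
    comparisons = 0
    for stored_key, target in table[bucket_index(key)]:
        comparisons += 1  # every member of the list may have to be examined
        if stored_key == key:
            return target, comparisons
    return None, comparisons

put("ab", "X")
put("ba", "Y")  # same code point sum as "ab", so it lands in the same bucket
```

Because "ab" and "ba" hash to the same element, retrieving the second one requires comparing against the first as well; longer collision lists make each retrieval correspondingly slower.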
Thus, in one embodiment of the invention an apparatus and methods are provided for storing and/or retrieving a target data sequence in response to a given data sequence. This embodiment requires a relatively small amount of storage space while still providing a high speed of operation (e.g., retrieval of the target sequence).
In this embodiment, a key data sequence is received, which may be composed of letters, numbers or other symbols, and an equivalent, translated, normalized or other related target data sequence is retrieved if it exists. For example, in one implementation of this embodiment, a key sequence of multiple Unicode characters is received and a single normalized or composite equivalent character is retrieved. A single target data sequence may be identified and retrieved for any number of different given data sequences.
In an embodiment of the invention, the data structure used to store and retrieve target sequences may be considered a virtual tree. Illustratively, the virtual tree starts at a root, the size of which (e.g., number of cells) may be equivalent to the possible values of the first datum, item or other unit of the given data sequence (e.g., character, byte, word, symbol). The virtual tree also includes virtual blocks of variable sizes (i.e., comprising a variable number of nodes), and leaves that are also of variable sizes and which contain target data sequences of variable lengths. The virtual tree is traversed for a given or key data sequence by first locating a root cell that corresponds to the first unit within the key sequence. That cell will identify (e.g., by memory address or offset) the virtual block that contains a node corresponding to the next unit. That node will also store a memory offset or pointer to the next virtual block having a node corresponding to the next item, and so on. The node corresponding to the final item of the key sequence identifies the leaf node that contains the target data sequence.
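The traversal just described can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the claimed structure: units are taken to be bytes, Python object references stand in for memory addresses or offsets, and Leaf, Block, root and lookup are hypothetical names. The stored keys are chosen so that no key is a prefix of another.

```python
from dataclasses import dataclass, field

@dataclass
class Leaf:
    target: str  # the target data sequence

@dataclass
class Block:
    # Variable size: only nodes that are actually used are stored,
    # keyed by the unit value they correspond to.
    nodes: dict = field(default_factory=dict)

ROOT_SIZE = 256  # assumption: one root cell per possible first byte
root = [None] * ROOT_SIZE

# Keys b"ab" -> "X", b"ac" -> "Y", b"d" -> "Z"
root[ord("a")] = Block({ord("b"): Leaf("X"), ord("c"): Leaf("Y")})
root[ord("d")] = Leaf("Z")

def lookup(key: bytes):
    if not key:
        return None
    ref = root[key[0]]            # root cell for the first unit
    for unit in key[1:]:
        if not isinstance(ref, Block):
            return None           # nothing further to traverse
        ref = ref.nodes.get(unit) # node for the next unit, if any
    # The node for the final unit must identify a leaf.
    return ref.target if isinstance(ref, Leaf) else None
```

For example, lookup(b"ac") follows the root cell for "a" to a two-node block and then to the leaf holding "Y", while lookup(b"a") finds a block rather than a leaf and returns None.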
In one embodiment of the invention the virtual blocks of a virtual tree are compressed, in that they contain few, if any, empty nodes, and may overlap. Thus, the nodes of one virtual block may be interleaved (i.e., in memory or storage) with nodes of another virtual block. Therefore, in this embodiment, each node of a virtual block includes a home block field that identifies the address or offset of the node's virtual block. If, while traversing the tree, a node is reached that has a home block field that does not match the address or offset through which the node was accessed (i.e., the node's home block), then the traversal may be terminated because this condition may be interpreted as indicating that the key data sequence has no corresponding target data sequence. Nodes may also include the address or offset of the next virtual block (or leaf) to be visited in the traversal or lookup process. In one embodiment, an address or offset of zero indicates that tree traversal should be terminated because the given key sequence does not map to a target sequence.
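One way to picture interleaved blocks and the home block check is the hand-built layout below. It is a sketch under several assumptions that do not come from this document: units take values 0 to 3, the node for a unit sits at the block offset plus the unit value, slot 0 is reserved so that offset 0 can mean "stop", and the target is stored directly in the node as a stand-in for a leaf reference. Node, memory, root and lookup are hypothetical names.

```python
from typing import NamedTuple, Optional

class Node(NamedTuple):
    home: int                     # offset of the virtual block owning this node
    next_block: int               # offset of the next block; 0 means "stop"
    target: Optional[str] = None  # stand-in for a reference to a leaf

EMPTY = Node(home=-1, next_block=0)

# Blocks 1 and 2 overlap: their nodes alternate in slots 1 to 4.
memory = [
    EMPTY,             # slot 0: reserved
    Node(1, 0, "P"),   # block 1, unit 0
    Node(2, 0, "R"),   # block 2, unit 0
    Node(1, 0, "Q"),   # block 1, unit 2
    Node(2, 5),        # block 2, unit 2: continues in block 5
    EMPTY,             # slot 5: unused
    Node(5, 0, "S"),   # block 5, unit 1
]
root = [1, 2, 0, 0]    # first unit -> block offset (0 = no mapping)

def lookup(key):
    if not key or key[0] >= len(root):
        return None
    base = root[key[0]]
    node = None
    for unit in key[1:]:
        if base == 0:
            return None            # offset zero terminates the traversal
        slot = base + unit
        if slot >= len(memory):
            return None
        node = memory[slot]
        if node.home != base:
            return None            # node belongs to a different block
        base = node.next_block
    return node.target if node else None
```

Here lookup((0, 1)) lands on slot 2, whose home block field is 2 rather than 1, so the traversal stops with no result; lookup((1, 2, 1)) passes through block 2 into block 5 and returns "S".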
In another embodiment of the invention, a leaf contains a target data sequence and, possibly, the length of the contained sequence (e.g., especially if target sequences are variable in length). Also, a leaf may contain a pointer or other reference (e.g., an address or offset) to another virtual block if there is another key data sequence that is longer than, and starts with, the present key data sequence.
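A leaf that both holds a target and points onward for longer keys might be sketched as follows. Object references again stand in for addresses or offsets, and Leaf, Block, lookup and the field names are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Leaf:
    target: str                           # the target data sequence
    length: int                           # length of the stored target
    next_block: Optional["Block"] = None  # set when a longer key shares this prefix

@dataclass
class Block:
    nodes: dict = field(default_factory=dict)  # unit -> Leaf or Block

# Keys "ab" -> "X" and "abc" -> "Y": the leaf for "ab" points to the block
# that holds the node for the extra unit "c".
leaf_ab = Leaf("X", 1, next_block=Block({"c": Leaf("Y", 1)}))
tree = Block({"a": Block({"b": leaf_ab})})

def lookup(block: Block, key: str):
    ref = block
    for unit in key:
        if isinstance(ref, Leaf):
            ref = ref.next_block   # step past the leaf toward longer keys
        if ref is None:
            return None
        ref = ref.nodes.get(unit)
        if ref is None:
            return None
    return ref.target if isinstance(ref, Leaf) else None
```

With this layout, lookup(tree, "ab") stops at the leaf and returns "X", while lookup(tree, "abc") passes through the same leaf into the next block and returns "Y".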