The present invention relates to computer systems and more particularly to a method and system for optimizing the combination of a direct table and trees for storing data.
Currently, direct tables and decision trees are utilized for searching for and accessing information, or items, in computer systems. For example, a large number of items may be desired to be stored. The items may be identified by a key, which is typically a binary number string of a particular length. Based on the key, information is desired to be obtained from the table. For example, the key might indicate a particular piece of data or that a specific action is to be taken.
If a direct table is used alone, without a tree, the direct table would typically have an entry for each unique value of the key. However, the key is typically several digits long. For example, one conventional key might be one hundred and ninety-two bits long. In order to ensure that a unique entry exists for every possible value of the key, the direct table would need to contain 2192 entries. Such a direct table is too large to be practical.
Thus, a smaller direct table utilizing a shorter key, which will be termed a hash value, is typically used. The smaller key is termed a hash value because the key is typically hashed in order to obtain the shorter key. Thus, in order to obtain the hash value, the key is operated on by a hash function. The hash function combines or otherwise operates on the digits of a particular value of the key in order to obtain a hash value which may be unique to that value of the key. In general, it is desired that the hash value be a random hash of the key in order to avoid tying the hash value to closely to specific information related to the key. However, no specific hash function is required.
The direct table typically has an entry for each unique hash value. Because the hash value can be significantly shorter than the key, the direct table can be significantly smaller than a direct table for the key. For example, a key which is ninety-two bits long may be hashed to hash values which are on the order of ten bits long. A direct table for a hash value having a ten-bit length need only have 210, or 1024, entries. The width of such a direct table is defined by the length of the hash value. Thus, a direct table using a ten-bit hash value is ten bits wide.
There can be more items to be stored in the direct table using the hash value or key than there are entries in the direct table. This is true for a direct table having a width, W, for which 2W is less than the number of items to be stored or if the same hash value is selected for different keys. Consequently, a plurality of items may reside in the same entry and be described by the same hash value. If a search is carried out based on the hash value only, the computer system may be unable to provide a unique decision or item for a hash. value. Such a situation is known as a collision. A double collision in which two items correspond to the same hash value, a triple collision in which three items correspond to the same hash value, or a higher order collision may occur. In order to resolve these collisions, a tree is utilized in conjunction with the direct table. Note that the same is true when a key is used in lieu of a hash value and 2W is less than the number of items to be stored or if more than one item can correspond to the same key.
A tree provides a mechanism for obtaining additional information and making decisions based on the information in order to uniquely identify the item desired to be found. For example, the tree might utilize additional bits from the key. Based on the additional information, the tree makes a decision at a node between two paths. Each piece of additional information utilized corresponds to a node. The items which collide are stored at the leaves of the tree at the end of a particular decision path through the tree. Thus, the combination of the direct table and the tree can be used to search for each of the items.
FIG. 1, for example depicts a direct table 10 and corresponding decision trees. However, for the purposes of clarity, only decision trees 50 and 60 are labeled and discussed. In addition, the number and size of the corresponding trees for the direct table 10 is for explanation only and not meant to represent a particular preferred distribution. The direct table 10 is four bits wide, corresponding to a hash value or key of 4 bits. The direct table 10 thus has sixteen entries 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40 and 42. Suppose, however, that up to twenty-four items are desired to be found using the direct table 10 and trees 50 and 60. Because there are more items than entries 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40 and 42, at least one of the entries 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40 and 42 corresponds to more than one item. For example, entries 16, 20, 26, 36 and 40 correspond to no items. Entries 12, 22, 32 and 34 correspond to a single item. Entries 14 and 28 correspond to two items. Entries 24, 30 and 42 correspond to three items. Entry 18 and entry 38 correspond to four items. Thus, there is a quadruple collision for the entry 18.
Where an entry corresponds to more than one item, a decision tree is used, for example, in order to uniquely identify an item corresponding to the entry 18 and the entry 38, a tree 50 and 60, respectively, will be used. Note that only a single tree corresponds to a particular collision. Because four items correspond to the entry 18 and the entry 38, the trees 50 and 50 each have four leaves 54, 55, 56, 57 and 64, 65, 66, and 67, respectively, which correspond to the four items. Depending upon the additional information for the four items used by the trees 50 and 60, the items may be more difficult to identify. For example, the tree 50 makes a decision at node 51 corresponding to a first piece of information. The decision at node 51 leads to node 52 or 53. Another piece of information obtained for either node 52 or node 53 leads to one of the leaves 54, 55, 56, or 57 and, therefore, to the item desired. Thus, two decisions, one at node 51 and another at either node 52 or node 53, lead to the item. In contrast, for the tree 60 that corresponds to the four items stored in the entry 38, up to three pieces of information may be required to uniquely identify the item corresponding to the leaf 67. Thus, a first piece of information will result either in the item corresponding to node 64 being found or in a decision requiring a second piece of information at node 62. The second piece of information will lead either to the item corresponding to node 65 being found or to a decision requiring a third piece of information at the node 63. The third piece of information will lead either to the item corresponding to the leaf 66 or to the item corresponding to the leaf 67. Thus, the tree 50 and the tree 60 can be used to resolve the collision at the entry 18 and the entry 38, respectively.
Thus, using the direction table 10 and the corresponding trees, such as the trees 50 or 60, items can be uniquely identified. A wider direct table contains more entries and, therefore, can be used to uniquely identify a larger number of items. In other words, for a given number of items, a wider direct table will result in fewer collisions. Where more than one item can correspond to the same entry in a direct table, however, trees are utilized to resolve collisions.
Direct tables and trees each have different costs associated with them. For example, a direct table may require a first amount of memory for an entry. A tree may require a different amount of memory for a node. Similarly, it may take a different amount of time to find an item stored in an entry of a direct table than to find an item corresponding to a particular leaf on a particular tree. Thus, different combinations of the direct table and trees may be used to meet system requirements for space and speed. A designer must thus be able to select the direct table and trees which will meet the desired requirements.
FIG. 2 depicts a conventional method 70 for selecting the size of the direct table and the trees. The width of the direct table is selected, via step 72. The corresponding trees are determined, via step 74. It is then determined whether the direct table and corresponding trees perform within the desired parameters. If so, then the designers work is complete. If not, then step 72 is again performed and the method 70 repeats.
Although the method 70 functions, one of ordinary skill in the art will readily realize that the method 70 essentially relies on trial and error. Consequently, the method 70 may take a relatively long time and may require a highly skilled designer. Furthermore, the method 70 may result in a combination of a direct table and trees which, although meeting system requirements, is far from optimal.
Accordingly, what is needed is a system and method for providing a combination of a direct table and a tree which is better able to optimize the desired properties of the direct table and tree. The present invention addresses such a need.
A method and system for selecting a direct table and a plurality of corresponding trees in a computer system is disclosed. The method and system comprise selecting a plurality of widths for the direct table and probabilistically determining at least one desired property of the direct table and the plurality of corresponding trees for the plurality of widths. The method and system also comprise selecting a width of the plurality of widths for the direct table and the plurality of corresponding trees based on the at least one desired property of the direct table and the plurality of corresponding trees.
According to the system and method disclosed herein, the present invention provides a relatively simple mechanism for selecting the optimal width of the direct table and the corresponding plurality of trees.