This invention generally relates to a database management system performed by computers, and more specifically relates to the storage format of an encoded vector index (EVI).
An index in a book facilitates locating information on a specific topic quickly and without blindly paging through the book. Database indexes provide similar benefits by providing a method to quickly locate data of interest. Without an index, a database performs a full table scan, blindly searching through every row in a database table until the target data is located. Thus, depending upon where data resides in a database table, such a scan can be a lengthy and inefficient process.
Indexed scans of database tables are more efficient than full table scans since the length of database index entries are in most cases shorter than the database table entries. Shorter entries mean that more index entries can be stored in a single computer page. Indexed scans can therefore result in a considerable reduction in the total number of computer pages that must be processed in order to locate the requested data.
While indexed scans of database tables can improve performance, the complexity of the data being scanned and of the nature of the database query still determine how effectively a query can be implemented. Different queries place differing levels of processing demands on the database in unique ways. As a result, different index types are needed to cope with a users"" ever-changing workloads. One type of index is the encoded vector index (EVI), disclosed U.S. Pat. No. 5,706,495, Chadha et al., Jan. 6, 1998, Encoded-Vector Indices For Decision Support and Warehousing (hereinafter xe2x80x9cChadhaxe2x80x9d), which is incorporated by reference.
An encoded vector index (EVI) is a variation of the bitmap index concept. A bitmap index indicates whether a specific value exists for each row in a particular column. One bit represents each row. Thus, in the bitmap index for the value xe2x80x9cMNxe2x80x9d in the column xe2x80x9cLOCATION,xe2x80x9d the nth bit equals 1 if the nth row of the data table contains xe2x80x9cLOCATIONxe2x80x9d=xe2x80x9cMN,xe2x80x9d or 0 if that row holds a value other than xe2x80x9cMN.xe2x80x9d An EVI serves a similar purpose, but only one index is necessary to account for all the values occurring in the column (whether they be xe2x80x9cNY,xe2x80x9d xe2x80x9cMN,xe2x80x9d or any other). So in an EVI on the xe2x80x9cLOCATIONxe2x80x9d column, the nth position of the EVI contains a bit code that identifies the value of xe2x80x9cLOCATIONxe2x80x9d in the nth row of the table. Thus, whereas a separate bitmap index is required to map each particular key value in a database field, only one EVI is required to represent the same information. Thus, an EVI saves computer memory by including all possible key values for a given field in one database index. Chadha discloses a method to efficiently scan relational database information by performing bit-vector operations on EVI""s, instead of performing analogous operations on the relational database table itself.
Referring now to FIG. 1, a diagram explaining the components of a typical encoded vector index is illustrated. In this example, the EVI indicates which key value exists in the encoded database field for each relative database record number in a database table 40. Database table 40 is an exemplary database identifying locations and departments, for example, of a corporate organization. The EVI is formed over the location field 42 of the database 40, which may include a large number of other fields (not shown).
The EVI 45 is made up of two tables: EVI symbol table 50 and EVI vector 60. EVI symbol table 50 has an entry for each particular key value that can be found in the database field (in this case, the LOCATION field) of the particular database or database subset 40 for which the EVI is an index. FIG. 1 illustrates an EVI for a subset of the database 40, namely, the LOCATION field. Since only three different values appear in the LOCATION field of the database, EVI symbol table 50 contains three rows, one for each particular key value: xe2x80x9cMN,xe2x80x9d xe2x80x9cND,xe2x80x9d and xe2x80x9cWY.xe2x80x9d Notably, the key values in the EVI symbol table 50 are stored in a sorted order, so that a given key value can be found in the table using an alphabetic binary search through the table. The EVI symbol table 50 provides a code for each of these key values, and further provides a count for each of these key values, indicating how many records in the database table contain the key value. The codes in the EVI symbol table 50 are used to decode EVI vector 60, as described below. Notably, the codes are assigned to keys so that the codes are also in sorted order. The purpose of this will be explained below.
EVI vector 60 contains a row for every record in the database 40 for which the EVI is an index. Each vector row corresponds to a database record, and contains a code for the key value contained in the EVI indexed field. EVI vector 60 contains 20 rows, because there are 20 records in the database for which EVI is an index. Each code stored in EVI vector 60 corresponds to the value that exists in the EVI field in the corresponding database record.
The translation of the code is made possible by EVI symbol table 50. For example, for the first record in the database 40, the LOCATION field 42 in database 40 has a value of xe2x80x9cMN,xe2x80x9d which corresponds to a code of 1 in EVI symbol table 50. The first relative record in EVI vector 60. By looking at EVI symbol table 50, it can be seen that code 0 equates to a key value in the EVI field of xe2x80x9cMN.xe2x80x9d
Use of the EVI proceeds as outlined above. To search for a particular key value, that key value is converted to a code, and then the EVI vector is scanned, identifying each row in which there is a match to the desired code. Often the results are represented as a bit vector index, which can then be combined with other bit vector indexes to identify the results of a complex query. To search for records using a key range, the EVI symbol table is used to convert keys at the beginning and end of the range into codes. Then, the EVI vector is scanned, identifying each row having a code that falls between the codes for the range endpoints. As noted above, the codes are assigned to the keys in a sorted order, that is, so that the codes sort into the same order as the keys; thus, codes may be numerically compared to the range endpoint codes as the EVI vector is scanned to identify rows that meet a key range.
As noted above, EVI""s are built to reflect the counts and key values in one or more particular database fields, as those values exist in a database at the time that the EVI is built. Unfortunately, databases are frequently updated. In order for an EVI to stay current and accurately reflect a database, the EVI must he updated whenever the value of the one or more field(s) over which the EVI is built changes. This also applies when new records are added to the database and when new records are deleted.
Changes to the relational database table can affect an EVI symbol table in two ways. First, a change to the database may only require a change in key count for one or more key values. An example of this first type of change is deleting a database record, adding a record having a key value that already exists in the EVI symbol table 50, or changing a database field from one key value to a second key value that also exists in the EVI symbol table 50. In this first type of change, the EVI symbol table 50 is updated by updating the key count(s) to reflect the changes made to the database fields itself, without requiring a new EVI symbol table entry. At the same time, the EVI vector 60 is updated by deleting, adding or modifying the record corresponding to the changed record.
The second type of change to a database requires a new EVI symbol table entry. An example of this type of change is when a new key value is added to a database field, that is, a record that is modified or created is given a key value that does not exist in that particular database field in any other record in the database. A new key value might replace an old key value, as part of a change to an existing database record. Alternatively, a new key value might be assigned to database field when adding a new database record. Either way, the new key value will not exist in an existing EVI symbol table.
As noted above, the EVI symbol table 50 is stored in a sorted order by key value. Now, in the event of a new key value in the EVI field, a new EVI symbol table entry must be created. Typically, the new entry is added to the end of the existing EVI symbol table. The EVI symbol table thus includes a sorted area, followed by an unsorted area at the end of the sorted area. When there is an unsorted area, searches of the EVI symbol table require both an alphabetical search of the sorted area and a linear scan of the unsorted area. Furthermore, when responding to a key range query, the scan of the EVI vector must include both evaluating codes against the codes of the range endpoints, and comparison to codes from the unsorted area of the EVI symbol table for keys in the specified range. In both cases, the efficiency of processing the EVI is reduced.
To mitigate these inefficiencies, when the unsorted area of the EVI symbol table becomes excessively large, the EVI symbol table is rebuilt in order for all entries to be properly sorted. Unfortunately, rebuilding an EVI can itself be costly and create delays in processing. Furthermore, in highly active databases, even with frequent rebuilding, the unsorted areas of the symbol table can become unacceptably large due to frequent additions of key values to the table, most particularly due to additions made while the symbol table is being rebuilt.
Accordingly, new ways to store and manage indexes, such as EVI""s, are needed in order to continue to provide significant improvements in query performance; otherwise, database users will be hampered in their ability to maximize intelligent information retrieval.
In accordance with principles of the present invention, these needs are met through the use of a new storage format for the symbol table of an encoded vector index. Specifically, the symbol table comprises a hash table, entries of the hash table storing associated key values and codes for the encoded vector index. A hash table entry is assigned to a key value by performing a hash function upon a binary representation of the key value, to produce an index into the hash table. This storage format has the advantage that the hash table is not ordered and so need not be reorganized upon insertion or deletion of key values.
In a related aspect, a binary radix tree is used to locate data structures (such as entries in the hash table) that store information for key values in an encoded vector index. The binary radix tree comprises a plurality of nodes, nodes of the tree corresponding to binary digits of a binary representation of a key value, and branches in the tree corresponding to values of binary digits of said binary representation. A given binary representation for a key value thus corresponds to a path through the tree, ending in a terminal node. The terminal node contains a pointer to a data structure storing information for the key value. The binary radix tree thus permits the binary representation of a key value to be rapidly converted to a pointer to a corresponding data structure, so that the data structures need not be stored in an ordered manner, as is the case with entries in a hash table.
To further reduce the need to rebuild an index due to updating, in an additional aspect, the invention features methods for assigning codes to key values used in a encoded vector index. As noted above, codes are assigned to key values for an encoded vector index such that when the are sorted according to a code ordering, the key values corresponding to the codes are also sorted according to a key value ordering. However, in contrast to known methods for assigning codes for use in encoded vector indexes, in accordance with principles of the present invention, the codes are distributed such that at least one pair of adjacent codes in the code ordering have non-sequential values. By distributing the codes in this manner, upon insertion of a new key value into the index, there is more likely to be an available code value, between existing code values in the code ordering, that can be assigned to the new key value, alleviating the need to reorganize the code values upon such an insertion.
This aspect of the invention further features novel methods for selecting a code value corresponding to an inserted key value, by comparing the inserted key value to predecessor and successor key values in the key value ordering to determine the relative position of the inserted key value between the predecessor and successor key values. (This may be based, e.g., on the number of common bits in binary representations of the key values.) Then a code is selected for the inserted key value that is analogously positioned within the range of codes between the codes of the successor and predecessor key values.
A further aspect of the invention relates to embellishment of the information in a symbol table for an encoded vector index, to facilitate the generation of key range count. Respective storage locations of the symbol table store, in addition to key values and code values assigned to those key values an accumulated count of (a.) occurrences of the key value in the storage location, and (b.) occurrences of key values that precede, in a key value ordering, the key value in the storage location. This information may be used when responding to a request for a key range count by retrieving the accumulated count for the beginning and ending key values in the key range, and then subtracting the accumulated count for the beginning key value from the accumulated count for the ending key value. The accumulated key values are updated when key values are inserted or deleted so that they always accurately reflect the desired accumulated count.
The above and other objects and advantages of the present invention shall be made apparent from the accompanying drawings and the description thereof.