The present invention relates generally to technology used to access and manipulate data in large data sets. More particularly, the present invention is directed to a system and method for automatic and optimal access of data using auxiliary data structures known as indexes. Such indexes permit data queries of large data sets to run faster and require less memory to process.
Indexes are widely used in database tables. Typically, an index is created for one or more columns in a database table. One type of index is the xe2x80x9ctarget index.xe2x80x9d An entry in a target index logically consists of a  less than key:value greater than  pair, where the xe2x80x9ckeyxe2x80x9d corresponds to a datum in the data set that is to be indexed and the xe2x80x9cvaluexe2x80x9d indicates which record(s) in the data set have this particular key. For example, an index built on an xe2x80x9cAgexe2x80x9d column might have an entry for the age value 35 in the form:  less than 35:1,4,7 greater than , meaning that records 1, 4, and 7 in the indexed data set have an age value 35.
The above example represents the value as a list of record numbers. However, multiple possible representations for the value list exist. One possible representation is a bit vector or xe2x80x9cbitmap,xe2x80x9d in which a bit exists for each row in the data set and the nth bit being set to one indicates that the nth record in the data set has this key value. Depending on the number of distinct keys and the number of records in the data set, the optimal representation may vary. For example, for a small data set with many distinct keys, the representation of a list of row pointers is quite efficient. However, in a large data set with few distinct keys, the bit vector representation is typically more efficient.
In general, the representation chosen for an index strongly depends upon the selectivity that a given key imparts to a query constraint. That is, it depends on the likelihood that use of the key in a query will very specifically select a given record. The term weakly selective describes a constraint that retrieves many records from a table. Weak selectivity typically occurs when a column in a very large table has a small domain (set of possible values). For example, the domain of a Gender column in an xe2x80x9cEmployeesxe2x80x9d table consists of only two possible values for every rowxe2x80x94Male or Female. Constraints on that column will be weakly selective; they will usually retrieve a very large list of rows.
Larger domains might also give rise to weak selectivity. For example, an Age column in the same table would have a much larger domain than a Gender column, but constraints on age might still be weakly selective, especially if the data is not uniformly spread across the domain or if the constraints specify values that dominate the domain.
A strongly selective constraint will retrieve only relatively few records from a large number of records. For example, a query constraint on a social security number column of the Employees table will generally be strongly selective.
The term xe2x80x9ccardinalityxe2x80x9d is sometimes used to refer to domain size. A column with high cardinality has many possible keys. For example, the cardinality of a social security number is about one billion. In contrast, the cardinality of gender is two (male and female). Generally, though not necessarily, constraints on columns having high cardinality are strongly selective, and constraints on columns having low cardinality are weakly selective.
Conventional B-tree indexes are well suited for columns of high cardinality, while target indexes are well suited for columns having low or intermediate cardinality. As noted, a target index consists of a  less than key:value greater than  pair.
Some database management systems allow flexibility in representing a target index; that is, various representations are available for the indexes that the database administrator creates. For example, an index on a xe2x80x9cGenderxe2x80x9d column of a data set may be represented as a bitmap, while the index on an xe2x80x9cAgexe2x80x9d column may be represented as a compressed list of record IDs. Still further, a column of xe2x80x9czip codexe2x80x9d might be stored as an uncompressed list of record IDs. The choice should be driven, at least in part, by the storage space required for the index. A small storage requirement requires less memory to process and generally allows queries run faster (although a compressed list must allow additional processing time for decompression and recompression).
In the Red Brick(trademark) Warehouse relational database management system available from Red Brick Systems, Inc. of Los Gatos, Calif., a user can choose the representation of a target index by specifying the relative domain size for the column being indexed. For example, if the user specifies that the domain is small (e.g., gender), the system will generate a target index in which the list of records is represented as a bitmap. Alternatively, if the user specifies that the domain size is medium (e.g., the attribute is states), the system will build an index in which the value is provided as a compressed row list. Still further, if the user specifies that the domain size is large (e.g., zip code), the system will build an index in which the value is represented as an uncompressed row list. For very large domains, such as social security number, the system should be instructed to build a B-tree index.
While this approach allows the system to intelligently build an index that will usually be optimal for a given attribute, in some cases xe2x80x9cdata skewxe2x80x9d within a given column can degrade the performance of the index. Data skew results when the number of records per key in the index varies widely. For example, in a table of customers, the number of records associated with the state of California can be expected to be much larger than the number of records associated with the state of Delaware (due to the population difference between these two states). Thus, a query constrained to California customers will return a much larger number of records than a query constrained to Delaware records.
The effect of data skew on optimal index representations is illustrated by the following example. Assume that a company has a customer list containing one million records and that the company does business in California and Delaware, among other states. Nine percent of the company""s customers are based in California and 0.1 percent are based in Delaware. Therefore, the customer table for this company will contain 90,000 rows in which the customer""s state is California and only 1,000 rows in which the customer""s state is Delaware. If the company decides to build an index on the state column in its customer table, it might choose either a bitmap representation or a list representation.
Any bitmap representation will include 1,000,000 bits per key (state). Thus, both the California key and the Delaware key will have an associated bitmap of 1,000,000 bits each. Each of those bits is associated with one of the rows in the customer table. A value of 1 at a particular position in the bitmap indicates that the corresponding row of the table contains a record for the key associated with the bitmap. In the case of California, 90,000 of the bits will be 1s, and in the case of Delaware, only 1,000 of the bits will be 1s.
Alternatively, the index on the state column could be represented as a list. Assuming that each row ID in a list representation requires a storage space of six bytes (48 bits), then 4.3 million bits are required to store a list of the 90,000 row-IDs for California customers. Clearly, it is more efficient to have the California key index entry represented as a bitmap.
For Delaware, only 48,000 bits (48 bits/customerxc3x971000 customers) are required to store its index entry in a list representation. As this is significantly less than the 1,000,000 bits required for the bitmap representation, it would be more efficient to represent the value portion of the Delaware customers index entry as a list.
In view of the problem associated with data skew, an improved representation for target indexes would be desirable.
The present invention addresses this need by providing hybrid target indexes for data set domains. Such hybrid indexes contain multiple representations within a given index. The representations can be optimized for particular keys in highly skewed data set domains. For example, a hybrid target index on customer states might employ a bitmap to represent the index value for a California key and also employ, in the same index, an uncompressed list of row-IDs to represent the index value for a Delaware key.
Generally, the invention may be characterized as pertaining to a hybrid index on a domain of a data set (e.g., a collection of records in a database such as a relational database table). The hybrid index includes various entries, each of which includes a record list value associated with a key, with the value portion of an entry specifying a record or records in a data set that has the key for the entry. The hybrid index uses two or more representations (e.g., a bitmap representation and uncompressed record list representation) for the value portions of the entries.
The representation of a given key may depend upon the selectivity of that key in the domain. For example, the domain may include a first key and a second key (among others possibly), with the first key being more selective than the second key. In this case, the first key might have a value represented as a record list and the second key might have a value represented as a bitmap. In one embodiment, values are represented in increasing order of key selectivity as follows: (a) bitmaps, (b) compressed record lists, and (c) uncompressed record lists.
The hybrid index may include a record location area to facilitate locating an index entry for a particular key. The record location area may, in turn, include multiple  less than key:index location greater than  entries, each specifying the location in the hybrid index of an entry for a given key. The entries of the record location area may also specify the representation type (bitmap representation versus compressed list representation, for example) for the values associated with the keys in the hybrid index.
Another aspect of the invention pertains to a method of modifying a hybrid index when a record in its data set is inserted, deleted, or updated. The method may be characterized as including the following sequence: (a) identifying the hybrid index key of the record that is being inserted, deleted, or updated; (b) determining how to represent the value of the hybrid index entry for the key of the record that is being inserted, deleted, or updated; and (c) representing, as determined in (b), the value of the hybrid index entry for the key of the record being inserted, deleted, or updated.
In determining how to represent the value of the hybrid index entry, the system may determine the selectivity of the key of the record being inserted, deleted, or updated. More specifically, this may involve determining whether the selectivity of the key is above or below a predefined threshold. If it is above the threshold, the value may be represented as a bitmap. If it is below the threshold, the value may be represented as a list of row-IDs.
In some cases, the insertion, deletion or update will require changing a previous representation of the value to a new representation. For example, a value that may have originally been represented as a bitmap will, upon an update replacing its key in a record, be more appropriately represented as a list of row-IDs. This can result because a selectivity threshold has been crossed. When this occurs, it may be desirable to note the new representation in a record locator area for the hybrid index.
To prevent the value representation for a given key from frequently oscillating, it may be desirable to provide two distinct thresholds for changing a first representation to second representation. The first threshold would be associated with increasing the selectivity of the key and the second threshold would be associated with decreasing the selectivity of the key. In this case, the first threshold is greater than the second threshold.
Another aspect of the invention involves a method of creating the hybrid index in a computer system including a data set. This method may be characterized by the following sequence: (a) identifying the key of one or more records in the data set; (b) determining how to represent the value of the hybrid index entry for the key of the record or records; and (c) representing, as determined in (b), the value of the hybrid index entry for the key of the record or records. In a specific embodiment, determining how to represent the value at (b) involves determining whether the data set has less than a predefined number of records. When it does, the computer system specifies the representation based upon the fact that the data set is small (as opposed to basing the decision on the selectivity of a key).
These and other features and advantages of the present invention will be further described in the following descriptions of the invention with reference to the associated figures.