Data warehouses allow an organization to gather, store and make use of operational data, that is, information collected during normal operations. For example, a large retail store chain may warehouse cash register transactions to identify trends, e.g., what products are popular in different regions of the country. A typical data warehouse will receive a feed of detail dam Because of the large volume of detail data in the warehouse, many data warehouse features involve computing aggregate statistics of the detail data, and mining the aggregate statistics.
In many cases, it is desirable to retrieve specific records from the stored detail data For example, a telecommunications provider must be able to provide records of phone calls originated by or received by individuals in response to law enforcement requests. Since records of phone calls involving a particular phone number are very sparse in the data set, indexing is necessary to retrieve these records efficiently.
The detail data is usually collected continuously in time. The large size of the detail data set requires the use of horizontal partitioning. One reason is the limit on the maximum file size in many operating systems (typically 2 Gbytes). Another reason to use partitioning is to simplify data management. Keeping a rolling 2-year window of data is made easier if each partition corresponds to one day's worth of data finally, partitioning can cluster data relevant to answering a query (i.e., sum sales by day). A very large data table will be composed of hundreds to thousands of data partitions. For example, a typical conventional database 100 is shown in FIG. 1. Database 100 includes a number of data partitions, such as data partitions 102A-C. Each data partition includes a number of rows of data, such as rows 104A and 104B. Each row includes a number of keys, such as keys 106A and 106B.
A typical database organization is to have separate indexes for each data partition. Such an organization simplifies data management (indexes are dropped with data), reduces index size (which can be a significant fraction of the data size), and increases index concurrency (which can be a bottleneck). For example, in FIG. 1, each data partition 102A-C of database 100 has its own indexes 108A-C. These prior art indexes are termed dense indexes, because each index refers to every record in its data partition.
Data warehouses allow users to make sense of large quantities of detail data, typically by extracting summaries small enough for convenient manipulation. While most queries can be answered through the summary data, some queries can only be answered by accessing the detail data. For example, after using the summaries to identify a set of "interesting" customers, it may be desired to extract all detail records that describe an interaction with those customers. If the number of customers in the database is very large, it is likely that most of the data partitions of the table do not contain records describing an interaction with a particular customer. In a conventional database architecture, searching for these records requires that every index be searched for the key value. The cost of opening the index files and searching them can be much larger than the cost of retrieving the records that match the key value. The problem is that many index searches return a "not found" answer.
A need arises for a technique by which a very large database can be searched more quickly and with lower cost than can be achieved using a conventional database access technique.