1. Technical Field
This invention generally relates to computer systems, and more specifically relates to apparatus and methods for optimizing a database query when data skew is present.
2. Background Art
Since the dawn of the computer age, computers have evolved and become more and more powerful. In our present day, computers have become indispensable in many fields of human endeavor including engineering design, machine and process control, information storage and retrieval, and office computing. One of the primary uses of computers is for information storage and retrieval.
Database systems have been developed that allow a computer to store a large amount of information in a way that allows a user to search for and retrieve specific information in the database. For example, an insurance company may have a database that includes all of its policy holders and their current account information, including payment history, premium amount, policy number, policy type, exclusions to coverage, etc. A database system allows the insurance company to retrieve the account information for a single policy holder among the thousands and perhaps millions of policy holders in its database.
Retrieval of information from a database is typically done using queries. A query usually specifies conditions that apply to one or more columns of the database, and may specify relatively complex logical operations on multiple columns. The database is searched for records that satisfy the query, and those records are returned as the query result.
Queries that are frequently run may be optimized to enhance system performance. In one example of query optimization, an intermediate dataset is constructed that may satisfy frequently-executed queries. In order to determine when and how to construct an intermediate dataset, the query optimizer must be able to estimate the number of distinct values, i.e., the cardinality, in a column. This estimate may then be used by the query optimizer to determine how to build the intermediate dataset.
Various methods have been developed to estimate the number of distinct values in a column. However, the prior art assumes an even distribution of data. Thus, if a column has 100,000 records, and there are 10,000 distinct values in those 100,000 records, the prior art assumes that each distinct value has 10 records. Note, however, that in some circumstances the data in a database table may be skewed. Data skew occurs when a small number of distinct values represent a relatively large number of records in the table. For example, a column of last names would have an abnormally high occurrence of common last names like “Smith”, “Jones” or “Anderson”, while having an abnormally low occurrence of less common last names like “Abdo” or “Faunce”. When data skew is present, the prior art estimate of distinct values is incorrect, and results in optimizations that are based on incorrect information. Without a way to estimate cardinality in a manner that accounts for data skew, the computer industry will continue to suffer from inaccurate estimates of cardinality due to data skew when performing query optimizations.