1. Field of the Invention
This invention relates generally to relational data base management systems with query optimizers and, more particularly, to identifying values that occur frequently within data sets for use in generating query plans.
2. Description of the Related Art
A computer system data base contains information represented as values of data that typically are grouped as attributes of tuples in relational tables and are stored on a storage media, such as magnetic disks. A data base management system provides an interface for a user to access the information. In particular, a data base management system accepts a user query and retrieves information from the data base according to the specifications of the query. For example, in a data base having employee data, a user might wish to retrieve the names and addresses of all employees whose job title is "programmer". It is possible to formulate relatively complicated queries that might require cross-referencing multiple tables or groupings of data in the data base. For example, a user might wish to retrieve the names of all employees whose job title is "programmer", whose work site is "New York", that received a pay bonus in the last year, and whose tenure with the company is greater than five years. To respond to such a query, it might be necessary to perform table join operations and access an employee location table, employee pay table, and employee history table.
Even in the case of relatively simple queries, such as finding all employees whose job title is "programmer", it is helpful for a data base manager to know something about the size of the answer set before the information is actually retrieved from data storage. For example, it would be helpful to know the approximate size of the answer set so the appropriate memory allocations can be made. In the case of relatively complicated queries, the order in which data tables are accessed can have a significant affect on the response time, depending on the distribution of values in the data set. For example, a query might necessitate searching each of two data fields for a particular desired instance of a value in each field. If the desired instance has a high occurrence in one field but not in the other, then it is more efficient to first search the field of lower occurrence. This provides a lower number of records that must subsequently be searched.
To ensure that an efficient plan for responding to a query is implemented, including such details as the proper memory allocation and the order of searching, a data base management system includes a query optimizer that generates a search plan according to a set of optimization rules that consider size of the answer set and the frequency of occurrences. To implement the optimization rules and determine the frequency of occurrences in a data set, it is possible to scan each instance of each value and to create a frequency distribution of all the data values.
Scanning the data set can require multiple passes through the data set, which is exceedingly time consuming and is itself inefficient. In addition, scanning consumes valuable resources such as memory space and processor time, adding to the inefficiency. In general, inefficiency in the scanning approach is minimized if the data set values are uniformly distributed. For example, the query optimizer can be somewhat efficient with the scanning approach if the query concerns the age of employees and if the ages are stored within the data set in an ordered sequence from youngest to oldest, if the ages are uniformly distributed, and if the total size of the data set is known. Similar advantages can accrue if the query concerns the names of employees and if the employees are stored in alphabetical order, the names are uniformly distributed across the alphabet, and the total size is known. Scanning is generally unacceptable because, as might be expected, most data sets do not include uniformly distributed data. Rather, most occurrences of data values are highly skewed and therefore scanning on one field provides little or no benefit to the query optimizer on the other fields.
The paper "Probabilistic Diagnosis of Hot Spots" by K. Salem, et al., in Proceedings of the 1992 Data Engineering Conference at pages 30-39, describes techniques that can be used to identify frequently occurring values in a data set. Although the paper is directed to the problem of identifying data items that are more frequently accessed than others for memory buffer management techniques, it can be applied to identifying frequently occurring values for query optimizers. That is, the number of times an item is accessed can be assumed relevant to the number of times an item occurs and therefore the number of times it will be accessed in response to a query.
More particularly, the Salem paper describes using multiple functions that randomly map data values to counters. An independent mapping function is used to map each value to a counter. For each mapping function, a "hot" counter will be a counter with a count that exceeds a threshold value. A value that is mapped to a hot counter for every mapping function is reported as a frequent value. The mapping functions are also known as hashing functions and the counters are also known as hashing table entries. As with the scanning approach described above, for this scheme it is necessary to accumulate frequency statistics according to every occurrence of data items of interest.
Rather than scanning all values of a data set and generating a frequency distribution, it is possible to sample values from the data set and extrapolate a frequency distribution. Such frequency estimators are not as slow and wasteful of memory space as scanning approaches. Sampling frequency estimators eliminate the need for scanning every field of every record and therefore are not as slow, wasteful of memory space, or distribution dependent as the scanning approaches. Typical estimation methods, however, can be sensitive to the distribution of data values and also to the arrival distribution of the data values.
From the discussion above, it should be apparent that there is a need for a system that identifies values in a data set that occur more frequently than other values and estimates the number of times the frequently occurring values appear in the data, which permits a query optimizer to generate a query plan of improved efficiency, with improved accuracy and efficiency. The present invention satisfies this need.