1. Field of the Invention
The present invention relates generally to computer database systems, and more particularly to systems and methods for database query optimization.
2. Description of the Related Art
Computer-implemented database management systems have been provided for managing data storage and retrieval in response to user commands. Thus, a user may request that a computer return all stored records, referred to as tuples, that contain one or more fields, i.e., attributes, which satisfy one or more corresponding user-defined conditions, known as predicates, and the database management system must determine how best to deliver the requested records. As one example, the user may request from a personnel database a listing of all personnel records of people who are older than a certain age and who have an income above a certain amount. Such a request is referred to as a query, and it is the function of the database management system to respond to the query quickly and efficiently in returning the requested data.
In a database management system of the compiling kind, an application program containing database queries is processed for compilation prior to run time. During compilation, database queries are passed to the database management system for compilation by a system bind component. The bind component, essentially a database management system compiler, translates the queries contained in the application program into machine language.
Because most users do not input queries in formats that suggest the most efficient way for the database management system to address the query, a database compiler component referred to as a query optimizer typically transforms a query input by the user into an equivalent query that can be computed more efficiently. This is done at compile time, in advance of execution.
More particularly, to enable the computer to process a user's query, the query optimizer of the database management system first transforms the user's request into an equivalent machine-readable form that is commonly based on relational algebra.
Having determined the most efficient relational algebraic expression of the query, the query optimizer next determines a computationally efficient strategy for executing the expression to retrieve the requested data. The strategy is commonly called a "query plan". The query optimizer usually considers several query plans and evaluates the computational cost of each to determine which query plan is most efficient. The query optimizer then selects the query plan which entails the least computational cost to execute. Because the difference in computational time between an efficient query plan and an inefficient query plan can be great, it is worthwhile for the query optimizer to conduct the evaluations discussed above prior to accessing the storage media to respond to a user request.
To evaluate the likely cost of a particular query plan as measured in computational time, the query optimizer relies on statistics which relate to the number of tuples containing attributes that satisfy the user-defined predicate. An important statistic used by query optimizers is quantile information which relates to the number of tuples satisfying a predicate. Quantile information is also useful in database partitioning during parallel processing, and in database mining.
Relatedly, the p% quantile of a set of data values is defined as the value below which p percent of the data values lie. As an example, consider an employee database in which one of the fields is salary. Suppose that 90% of employees have a salary less than $85,000. Then $85,000 is the 90%-quantile for the salary values in the employee database.
Consider the user's query: "Find all employees whose salary is between $50,000 and $60,000". Let the 50%-quantile be $49,000 and the 60%-quantile be $62,000. If there were 100,000 employee records in the database then the query optimizer can be sure that the answer will have at most 10,000 records (10% of total employees). Manifestly, such information is important for generating good query plans.
When estimating a p%-quantile, either a single value is provided as the answer or, more often, two values are provided as the bounds within which the true p%-quantile is guaranteed to lie. In order to obtain these quantile values, the optimizer could read into memory all the information and manipulate the data in memory, and obtain the results. But this solution has the drawback of needing a memory at least the size of the database itself. This is impractical for very large databases containing millions of data records. The query optimizer must deal with the constraint of limited memory, i.e., a memory size which is substantially less than the size of the entire database such that the entire database cannot be read into memory all at once. Thus, it is important to determine quantile values with limited memory.
One method that can be used to obtain quantile values for tuples satisfying predicates is simply to sort the data in the database and then make a pass over the sorted data to find the desired quantiles. Unfortunately, such a sorting procedure must be performed on the data for each attribute for which quantile values are desired, making this method computationally expensive. Also, each sorting step requires multiple passes over the database, particularly when the database is very large compared to available memory.
As an alternative, random samples of the database can be obtained, sorted, and quantiles estimated, but such a method cannot deliver an absolute error bound on the estimated quantiles. In yet another alternative, several passes can be made over unsorted data to find quantiles, but this procedure is also computationally expensive.
Accordingly, it is an object of the present invention to provide an apparatus and method in a query optimizer for estimating the number of tuples in an unsorted database which satisfy a predicate using the quantile information obtained in only one pass over the data.
Another object of the present invention is to provide an apparatus and method for estimating the number of tuples in an unsorted database that satisfy a predicate which produces guaranteed and tight bounds for the number of estimated tuples.
Still another object of the present invention is to provide an apparatus and method for estimating the number of tuples in an unsorted database that satisfy a predicate which is easy to use and efficient in terms of the computational space required.