This invention relates to the field of database query optimizers, and more particularly, to an improved database query optimizer that can select a query plan based upon different performance goals, such as returning the first or last row of a query in the minimum amount of time, or minimizing the use of computer resources such as computer memory.
Computers have the capability of storing vast amounts of data. For example, computers can store and retain data related to thousands of employees of large multi-national corporations, including the departments in which they work, their employee numbers, salaries, job descriptions, geographical locations, etc. In order to extract selected pieces of that data from such large computerized databases, users can present a query to the database system in the form of an SQL statement. For example, an SQL statement may ask the database system to list the names of all employees having employee numbers 1001 to 2000. A properly structured SQL statement will result in a list of records that satisfy the question or "query." In this example, the query would produce the names of 1000 employees, assuming that the employees had sequential employee numbers.
Once the user inputs a query into the computer, an SQL compiler operates on the query to develop an efficient way to extract the desired information from the database. Typically, the compiler generates a large number of different, but logically equivalent, plans for executing the same query. These "plans" are typically represented in computer memory as query trees, wherein each node of the tree includes a relational operator, such as a "sort" or "merge" operator. "Relational operators" are operators that receive one or more tables as input and produce a new table as an output. Join, Union and Union All are examples of operators that receive two tables as inputs. Group-by and Sort are examples of relational operators that receive only one table as input, such as a "sort" or "merge" operator. The optimizer program selects the query tree with the lowest estimated cost to respond to the query. In database parlance, "cost" is usually measured in terms of the amount of computer resources utilized by the computer in executing the SQL statement, for example, the number of I/O's or CPU instructions.
A major problem with existing optimizers is that, in many cases, they do not properly estimate the cost of carrying out the query. For example, known optimizers first estimate the number of CPU instructions, I/O operations and, in distributed systems, the number of messages that would be needed to carry out the SQL statement. See, e.g., P. G. Selinger, et al., "Access Path Selection in a Relational Database Management System," Proceedings of the ACM-SIGMOD International Conference on Management of Data, June 1979, and L. F. Mackert and G. M. Lohman, "R* Optimizer Validation and Performance Evaluation for Distributed Queries," Proceeding of the Twelfth International Conference on Very Large Data Bases, Kyoto, Japan, August, 1986. Once the number of such instructions, I/O operators and messages is predicted, such optimizers assign a cost to each operator, add up the cost associated with the execution of each operator and thus produce a total predicted cost for the particular plan. The cost of various plans are calculated and compared. Then, the operator can select the lowest cost plan for execution.
Unfortunately, because the computer can conduct some operations in parallel while others must be conducted serially, these optimizers cannot properly predict which plan will produce the first row or the last row in a minimum amount of time. Minimizing the amount of time needed to produce the first or last row of a query is often more important than the cost of the query in terms of the physical resources utilized in executing the query. Thus, there is a need for an improved database cost model that accurately accounts for speed in returning an answer to a query as well as the resources utilized. In short, time may be the most important "cost" associated with a query.
Known prior art optimizers also do not properly account for memory utilization. In situations where the memory of the computer is limited and/or where the algorithm utilizes large amounts of data, memory utilization is an important parameter to consider in determining the cost of the query. It will be recognized that the execution of an SQL statement will cause the computer to utilize certain resources such as, for example, CPU instructions and disk seeks. These types of resources clearly have some elapsed time associated with their execution. Therefore, one can predict that, all else being equal, an SQL statement that requires more instructions and more seeks than another statement will take longer to return a row than the other statement. However, the relationship between memory utilization and elapsed time is much more complicated and not at all intuitive. For example, a statement that uses more memory may actually execute faster than one which uses less memory. This complicated relationship between memory utilization and elapsed time may be one of the reasons that prior art optimizers have failed to account for memory utilization.
In view of all of the above, there is a need for an improved optimizer. In particular, there is a need for an optimizer that can consider and account for flexible performance goals, such as how quickly a plan returns a first row, or a last row, or selecting a plan that will minimize the usage of certain types of computer resources, and one that properly accounts for memory utilization.