Databases are used to store information for an innumerable number of applications, including various commercial, industrial, technical, scientific and educational applications. As the reliance on information increases, both the volume of information stored in most databases, as well as the number of users wishing to access that information, likewise increases. Moreover, as the volume of information in a database, and the number of users wishing to access the database, increases, the amount of computing resources required to manage such a database increases as well.
Database management systems (DBMS's), which are the computer programs that are used to access the information stored in databases, therefore often require tremendous resources to handle the heavy workloads placed on such systems. As such, significant resources have been devoted to increasing the performance of database management systems with respect to processing searches, or queries, to databases.
Improvements to both computer hardware and software have improved the capacities of conventional database management systems. For example, in the hardware realm, increases in microprocessor performance, coupled with improved memory management systems, have improved the number of queries that a particular microprocessor can perform in a given unit of time. Furthermore, the use of multiple microprocessors and/or multiple networked computers has further increased the capacities of many database management systems.
From a software standpoint, the use of relational databases, which organize information into formally-defined tables consisting of rows and columns, and which are typically accessed using a standardized language such as Structured Query Language (SQL), has substantially improved processing efficiency, as well as substantially simplified the creation, organization, and extension of information within a database. Furthermore, significant development efforts have been directed toward query “optimization”, whereby the execution of particular searches, or queries, is optimized in an automated manner to minimize the amount of resources required to execute each query.
Through the incorporation of various hardware and software improvements, many high performance database management systems are able to handle hundreds or even thousands of queries each second, even on databases containing millions or billions of records. However, further increases in information volume and workload are inevitable, so continued advancements in database management systems are still required.
One area that has been a fertile area for academic and corporate research is that of improving the designs of the “query optimizers” utilized in many conventional database management systems. The primary task of a query optimizer is to choose the most efficient way to execute each database query, or request, passed to the database management system by a user. The output of an optimization process is typically referred to as an “execution plan,” “access plan,” or just “plan” and is frequently depicted as a tree graph. Such a plan typically incorporates (often in a proprietary form unique to each optimizer/DBMS) low-level information telling the database engine that ultimately handles a query precisely what steps to take (and in what order) to execute the query. Also typically associated with each generated plan is an optimizer's estimate of how long it will take to run the query using that plan.
An optimizer's job is often necessary and difficult because of the enormous number (i.e., “countably infinite” number) of possible query forms that can be generated in a database management system, e.g., due to factors such as the use of SQL queries with any number of relational tables made up of countless data columns of various types, the theoretically infinite number of methods of accessing the actual data records from each table referenced (e.g., using an index, a hash table, etc.), the possible combinations of those methods of access among all the tables referenced, etc. An optimizer is often permitted to rewrite a query (or portion of it) into any equivalent form, and since for any given query there are typically many equivalent forms, an optimizer has a countably infinite universe of extremely diverse possible solutions (plans) to consider. On the other hand, an optimizer is often required to use minimal system resources given the desirability for high throughput. As such, an optimizer often has only a limited amount of time to pare the search space of possible execution plans down to an optimal plan for a particular query.
There are a few SQL criteria, or clauses, that result in re-ordering a result set that is returned for a query. Examples of such clauses include the GROUP BY and ORDER BY clauses. A GROUP BY clause aggregates records in the result set that have a common value in a specified field or fields. An ORDER BY clause arranges the records of the result set in a specified order.
To illustrate the use of these clauses, consider the example of an Entertainment Agency database that includes a table for each entertainer and a table for each engagement. An example query that fetches the entertainer's name and contract price for each engagement might resemble:
SELECT Entertainers.StageName, Engagements.Month,
Engagements.ContractPrice
FROM Entertainers
INNERJOIN Engagements
ON Entertainers.ID=Engagements.ID
ORDER BY Entertainers.StageName, Engagement.Month
The result set returned would resemble:
StageNameMonthContractPriceAl BuckJanuary $200.00Al BuckJanuary $500.00Al BuckFebruary $185.00Al BuckMarch $200.00Al BuckMarch $110.00Carol TrioJanuary$1600.00Carol TrioFebruary $410.00Carol TrioFebruary $680.00Carol TrioFebruary $100.00In particular, the ORDER BY clause causes the result set to be returned in descending alphabetical order based on the entertainers stage name and then on the month (in calendar order). If the query is modified slightly to include a GROUP BY clause as below:
SELECT Entertainers.StageName, Engagements.Month,
SUM(Engagements.ContractPrice)
FROM Entertainers
INNERJOIN Engagements
ON Entertainers.ID=Engagements.ID
GROUP BY Entertainers.StageName, Engagements.Month
Then the result set is aggregated for each unique combination of fields within the GROUP BY clause. The result set would return:
StageNameMonthContractPriceAl BuckJanuary $700.00Al BuckFebruary $185.00Al BuckMarch $310.00Carol TrioJanuary$1600.00Carol TrioFebruary$1190.00
When the GROUP BY and ORDER BY clauses include references to more than one table as in the above example (i.e., Entertainers and Engagements), the optimizer must create a temporary file to hold the result set in order to perform the GROUP BY or ORDER BY operation. The creation of the temporary file, however, often slows the performance of the query.
Moreover, where queries are interactive, the need to create a temporary becomes even more problematic. Interactive queries often return initial records to a user while the query continues to concurrently execute to generate the entire result set. Thus, because queries like those above must finish running entirely and the temporary file created before the GROUP BY and ORDER BY operations can be performed, interactive queries often appear to be slow and unresponsive.
When a GROUP BY or ORDER BY clause references only one table, however, then creation of a temporary file is not needed in order to perform the re-ordering of the result set. Thus, the above performance penalties can be avoided. Accordingly, there is an unmet need in the prior art of optimizing SQL queries that reduces the number of different tables referenced in ORDER BY and GROUP BY clauses.
Queries which the optimizer can handle often include Join operations of various types. Join operations involve searching across two tables in various ways to identify records that match search conditions. One area that optimizers can particularly optimize a query plan having these join operations involves what is known as “join order”. A query plan can include a query that involves joining of three or more tables. Because a single join is limited to accessing two tables, such multi-table joins are performed in sequence according to a particular order. For example, a query that involves joining tables A, B and C can often be performed as a join of table A and B followed by a join of table A and C. Alternatively, in many instances, the same query can be performed as a join of Table A and C followed by the join of Table A and B.
When an ORDER BY or GROUP BY clause is present in a join operation, the optimizer is locked into a particular join order even if that order may not be optimal when performing the ordering or grouping according to an index. For example, in the above examples, the first field of the ORDER BY and GROUP BY clauses is Entertainers.StageName. Under these circumstances, the optimizer must lock Entertainers as the first table in the join order. This requirement prevents the optimizer from selecting a join order that may be more optimal. Accordingly, there is also an unmet need in the prior art of optimizing SQL by rewriting ORDER BY and GROUP BY clauses so as to expand the set of possible the join orders from which an optimizer can select.