One common form of information retrieval system for computerized record keeping is a database management system (such as DB2™ Universal Database Management System sold by IBM™). Between the actual database (i.e. the data as stored for use by a computer) and the users of the system is a software layer known as a relational database management system (“RDBMS” or “DBMS”). The DBMS is responsible for handling database transactions (such as database queries or simply ‘queries’, table updates, table indexes, table deletes, etc.) thus shielding users from the details of any specific computer hardware or database implementation. Using relational techniques, the DBMS stores, manipulates and retrieves data in the form of table-like relations typically defined by a set of columns or attributes of data types and a set of rows (i.e. records or tuples) of data. The columns may further comprise restrictions on their data content (i.e. valid domains) and may be designated as a primary key or unique identifier for the relation or a foreign key for one or more other relations.
The standard database query language for dealing with relational databases implemented by most commercial DBMSs is the Structured Query Language (“SQL”). SQL comprises both data definition operations and data manipulation operations. To maintain data independence, a query (i.e. a set of SQL commands) instructs the DBMS what to do but not how to do it. Thus, the DBMS comprises a query processor for generating various query plans of execution and choosing the least expensive plan with respect to execution costs. Due to the high-level nature of relational expressions and a variety of implementation techniques, automatic query optimization is possible and often necessary to ensure more efficient query processing.
In accordance with well-known query translation processes, an SQL query is processed in stages. For example, an initial stage may cast the source query into an internal form or model following the preliminary steps of lexing, parsing and semantic checking. One example of an internal form or model is query Graph Model (QGM). The goal of this model is to provide a more manageable representation of queries to reduce the complexity of query compilation and optimization. The internal model is a data structure for providing the semantic relationships of the query. These semantic relationships are used by the query translator and optimizer components to rewrite the query in a canonical form. In a next phase, a plan optimizer produces a query execution plan such as by generating alternate plans and choosing a best plan based on estimated execution costs. A plan refinement stage may then be employed to refine the optimum execution plan in accordance with run-time requirements.
Queries may involve “aggregation”. Aggregation occurs where the desired information is not an attribute of a single tuple, but is rather the result of applying a function over multiple tuples. Examples of aggregation functions comprise minimum (“MIN”), maximum (“MAX”), and average (“AVG”). In SQL, the functions MIN, MAX, and AVG are column functions. As such, they operate on a set of values in a column to derive a single result value. In particular, the function MIN returns the smallest value in a column or set of values. The function MAX returns the largest value in a column or set of values. The function AVG returns the sum of the values in a column divided by the number of values in that column. Predicates that involve an aggregated attribute are referenced as aggregation predicates. In general, a predicate specifies a condition that is true, false, or unknown with respect to a given row in a database table.
The solution to a query often involves the intermediate step of solving one or more sub-queries that comprise aggregation predicates. In advanced database applications such as decision-support systems, queries can become very complex due to their dependency on many sub-queries and views. Aggregation predicates are important in these large database applications. Complex decision-support queries may have to retrieve data by applying one or more different aggregation functions on a number of base relations and then applying many predicates on the aggregation views. While aggregation renders query optimization with known techniques difficult and in many cases impossible, aggregation is necessary and optimization of aggregation queries, which may entail manipulating aggregation predicates, is critical for large-scale systems.
Consider the following example of the use of a MAX function aggregation predicate in a simple query:
CREATE TABLE T1(C1 INT, C2 INT, C3 INT)SELECT C1, C2, C3 FROM T1  WHERE C1 = (SELECT MAX(C1) FROM T1)
In the above example, the SQL “CREATE TABLE” statement is used to create a base table T1 410 having three integer valued columns C1, C2, and C3. In general, a SELECT query in SQL may have a SELECT clause (or list), a FROM clause, and a SELECT clause. The SELECT clause indicates the columns from which the data responsive to the query may be found. The FROM clause indicates those tables from which the columns named in the SELECT clause may be found. The WHERE clause indicates the conditions to be applied to one or more of the columns named in the SELECT clause. In the above example, the SELECT clause returns a result table that comprises the rows of columns C1, C2, and C3 of table T1 where the value of column C1 is equal to the maximum value of C1 in table T1. The condition C1=(SELECT MAX(C1) FROM T1) is an aggregation predicate. In this example, the predicate comprises a sub-query, namely, SELECT MAX(C1) FROM T1.
The column function MAX is found in the WHERE clause as a sub-query as opposed to being directly implemented (e.g. WHERE C1=MAX(C1)). In SQL, column functions cannot be stated in a WHERE clause because of the order of operations. The WHERE clause may be thought of as being evaluated before the SELECT clause. Consequently, when the WHERE clause is being evaluated, the column function does not have access to the set of values. This set of values is selected at a later time by the SELECT clause. This restriction may, for example, limit the use of known query merging techniques for optimizing queries having aggregation predicates.
One disadvantage of present optimization methods is their inefficiency in optimizing queries having aggregation predicates. To process the SELECT query in the above example, a known DBMS may typically evaluate the aggregation sub-query prior to applying the result of this evaluation to the main query. To evaluate this query efficiently with current methods, an index on at least column C1 is required. The sub-query would generally be executed with an index scan or access and the main query would be executed subsequently. Thus, if an index on column C1 does not exist, the query may require two index or table scans to satisfy the query using present methods. This is often a computationally expensive operation. In general, when a table scan is performed all the rows in the subject table are examined, one by one, to find rows that satisfy the query. When an index scan is performed for a subject table having an index, index entries are examined to find row identifiers that are then used to retrieve rows that satisfy the query.
A need therefore exists for a system and associated method of optimizing SQL queries containing aggregation predicates such as maximum (MAX) and minimum (MIN) when an index does not exist on the subject column. Accordingly, a solution that addresses, at least in part, the above and other shortcomings is desired. The need for such system and method has heretofore remained unsatisfied.