The present invention relates to computer database systems and more particularly to processing embedded queries containing a grouping operator.
Relational databases store information in collections of tables, in which each table is organized into rows and columns. FIG. 6(a) illustrates an exemplary database containing two tables, customers table 600 and orders table 620, that is useful in recording information about a sales operation. The columns of customers table 600 hold attributes for customers, for example, a customer identifier CID 602, an a zip code ZIP 604, and each row 610-618 is an entry for a respective customer. For example, row 610 is an entry for customer with a CID of 2 and a ZIP of 19555. The orders table 620 holds information in columns for the orders that the customers place. Such information may include, for example, an order identifier OID 622, the customer identifier CID 624 of the customer placing the order, and a dollar amount PRICE 626.
A user retrieves information from the tables of a relational database by entering input that is converted to queries by a database application. The database application submits the queries to a database server. In response to receiving a query, the database server accesses the tables specified in the query to determine which information within the tables satisfies the query. The information that satisfies the queries is then retrieved by the database server and transmitted to the database application and ultimately to the user. Database servers are also capable of combining or xe2x80x9caggregatingxe2x80x9d information contained in the tables in response to a query. For example, a common query in the exemplary database is to total all the order amounts to compute a sales figure for the customers in a given ZIP code.
For any given database application, the queries must conform to the rules of a particular query language. Most query languages provide users with a variety of ways to specify information to be retrieved. For example, in the Structured Query Language (SQL), the following query requests the retrieval of sales information for customers in zip code 19555 by adding up the order amounts:
select v.CID, v.SALES
from CUSTOMERS,
(select CID, sum(PRICE) as SALES
from ORDERS
group by CID) v
where CUSTOMERS.CID=V.CID and ZIP=19555;
This query actually contains two queries, one of which is embedded in the other. The inner query is in the form of a view definition, and the view thus defined is referenced by the name xe2x80x9cvxe2x80x9d. In this example, the view v requests the database server to group all the rows in orders table 620 by the customer identifier CID 622 and sum all the values of the PRICE column 626 as a SALES figure for each group of rows. Referring to FIG. 6(b), view v 640 therefore contains two columns, CID 642 for the customer identifier and SALES 644 for the aggregate sales figure as a sum of the PRICE column 626 values for each of the groups of rows.
The outer query requests the database server to select those sales figures 644 of view v 640 in which the customer identifier CID 602 matches the view CID 642 and the zip code ZIP 604 of the customer is 19555. In the example, the outer query selects a customer identifier CID of 2 and a sales figure of 45 from row 650 of view v 640, because row 610, with a CID 602 of 2 and a ZIP 604 of 19555, satisfies the conditions of the WHERE clause.
This kind of query, which embeds an inner query with a grouping function, is fairly common in database applications because it is straightforward for users to write and understand. Executing this query in a straightforward manner, however, by first evaluating the view defined by the inner query and then evaluating the outer query, may not be scalable for tables containing a very large number of rows. Evaluating the view first results in performing a fill table scan of the table referenced in the FROM list of the inner query (the xe2x80x9cinnerxe2x80x9d table). In some environments, there can be millions of rows in the inner table, even though the outer query only selects a few rows of the inner table. Consequently, in such conventional approaches, the database server performs much unnecessary work in aggregating columns for the rows of the inner table that will never be selected by the outer query. As the number of rows in the inner table greatly increase, the performance of the processing this kind of query is substantially degraded. Therefore, it is desirable to process this type of query more efficiently.
One approach for optimizing this kind of query involves the use of a xe2x80x9cmagic set transformation.xe2x80x9d In a magic set transformation, a new, innermost query block is introduced into the inner query block to first select an appropriate subset of the data. In the example, a magic set transformation results in the following query, with the magic set query block labeled xe2x80x9cmagicxe2x80x9d and a predicate referencing the magic set marked in bold:
select v.CID, V.SALES
from CUSTOMERS,
(select CID, sum(PRICE) as SALES
from ORDERS,
(select CID from CUSTOMERS
where ZIP=19555) magic
where ORDERS.CID=magic.CID
group by CID) v
where CUSTOMERS.CID=v.CID and ZIP=19555;
Accordingly, when this transformed query is processed, the magic set query is evaluated first, selecting only one CID from customers table 600. In turn, this single CID selects only two rows 630 and 636 from the order tables 620 for view v, whose sales column is a summation of the price column 626 for aggregated rows. Consequently, a magic set transformation, which introduces an artificial entity that duplicates the outer table customers 600, is capable of reducing the amount of processing involved in grouping over the orders table 620.
Although the magic set transformation may be acceptable when the queries are very simple, e.g., involving only two tables, a disadvantage with the magic set transformation is particularly apparent when there is a much larger number of tables referenced in the outer query block and the inner query block. It is known that the performance of a join operation of multiple tables is highly dependent on the particular order in which the tables are joined. As a result, techniques have been developed in choosing an efficient join order, and generally these techniques are most effective when there are more options to consider, i.e., more tables to choose from.
The magic set transformation, however, reduces the number of options that are available for such techniques for choosing an efficient join order. For example, the magic set transformation maintains the separation of the inner tables of the inner query block from the outer tables of the outer query block, even though there may be a more efficient join order involving a combination of the inner tables and the outer tables.
Therefore, there is a need for a method of transforming a query containing an embedded query with a grouping operator that reduces the amount of processing to execute the entire query. There is also a need for a method of improving the processing time of such embedded queries that does not foreclose possible join orderings from consideration and, desirably, increases the number of possible join orderings. These and other needs are addressed by a query transformation that merges the embedded inner query block into the outer query block. By merging the embedded query, no particular join sub-ordering is dictated and inner tables are capable of being combined with the outer tables to form a efficient join ordering.
One aspect of the invention is a computer-implemented method and a computer-readable medium bearing instructions for processing a query in a database system by receiving an original query that contains an inner query block with a grouping operator such as GROUP BY or DISTINCT in SQL. The inner query block is embedded in an outer query block, for example, as a reference to a view or as a subquery. The original query is transformed into a transformed query by removing the inner query block from the original query and merging portions of the inner query block into the outer query block. The transformed query is then processed. Preferably, the transformed query is semantically equivalent to the original query, but includes the tables referenced in a FROM clause of the inner query block in a FROM clause of the outer query block.
Still other objects and advantages of the present invention will become readily apparent from the following detailed description, simply by way of illustration of the best mode contemplated of carrying out the invention. As will be realized, the invention is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the invention. Accordingly, the drawing and description are to be regarded as illustrative in nature, and not as restrictive.