Data base systems are widely used to store and access large quantities of data. System users can access portions of the stored data through queries that take advantage of the manner in which the stored data is organized. Many modern data base systems are relational data base systems. In such systems, the data is organized into a set of relations, each consisting of a table whose columns, referred to as attributes, define classes of data and whose rows, referred to as tuples, contain related values of the data. For example, information on a company's employees could be stored in an employee relation with four attributes. One attribute would contain the name of each employee, a second, each employee's salary, a third, each employee's position, and the fourth, each employee's starting date of employment. Each tuple, or row, would contain all four classes of information for a single employee.
Two kinds of relations exist in a relational data base system: base relations in which the data values are actually stored in the data base system; and views, which are constructed from base relations as an intermediate step in solving a particular query. A query posed to a relational data base system, indicates those relations from which the information responsive to the query can be found. The query also provides predicates which delimitate the scope of relevant information. For instance, a query asking for the names of all employees earning more than $30,000 has the predicate salary is greater than $30,000. If the data processor would look to the above exemplary employee relation for the answer to this query, it would select the relevant information from those tuples alone, whose salary attribute has a value greater than 30,000. Thus, "salary is greater than $30,000" is a constraint on the solution to the query.
Some queries involve aggregation. Aggregation occurs where the desired information is not an attribute of a single tuple, but is rather the result of applying a function over multiple tuples. Examples of aggregation functions include MIN, MAX, SUM, AVG and COUNT. A more complete list of aggregation functions can be found in Jim Melton and Alan R. Simon, Understanding the New SQL: A Complete Guide (Morgan Kaufmann Publishers Inc. 1993), incorporated by reference as if fully set forth herein. Predicates which involve a limit on an aggregated attribute are known as aggregation predicates. One example of an aggregation predicate is found in a query to find the number of employees earning more than $30,000. The solution to this query involves applying the COUNT function on the employee name attribute.
Often, the solution to a query involves the intermediate step of solving one or more subqueries. For instance, a company with ten divisions may have a database with an employee relation for each division. Given the above query, find the employee with the highest salary from all employees in the company, the data processor system cannot search one relation alone for the solution. Instead, the data processor system can search for the highest paid employee in each relation and create a view relation consisting of the highest paid employee in each division. This view relation is the solution to the subquery: find the highest paid employee in each division. The original query: find the highest paid employee in the company, can now be solved from the solution to the subquery, by selecting the tuple with the greatest value stored in the salary attribute.
Queries for most relational database systems are written in a language called SQL. A query in SQL will have a SELECT clause indicating the attributes from which the data responsive to the query can be found; a FROM clause, indicating those relations from which the attributes named in the SELECT clause, can be found; and a WHERE clause indicating the predicates to be applied on one or more of the attributes named in the SELECT clause.
The following is an example of a query in SQL, found in FIG. 4, based on the two base relations of FIG. 1, calls (10) and customers (11) whose solution requires the intermediate step of solving the two views of FIGS. 2 and 3, potential customers ("ptCustomers") (20) and wellCaUed (30). The query is a marketing query to gather information for a targeted mailing to customers who have been making long calls into those area codes where the longest calls coming in from anybody are relatively short. The query restricts its search to those customers who have been calling into the particular area code for a long time. The solution to this query includes first defining two view relations from the two base relations.
The two relations of FIG. 1, contain information relating to all telephone calls made over a telephone network during the previous year. The relation calls (10), stores information relating to calls placed on the network, including the source area code and number ("FromAC" and "FromTel"), the destination area code and number ("ToAC" and "ToTel"), and the length, and date of the call. Each number, including area code, on the network is associated with a tuple, and six attributes indicate the six types of information stored for each number. The name of each customer, their area code ("AC"), telephone number ("Tel") and membership level in the network (regular, silver or gold) ("MemLevel") are stored in customers (11). Accordingly, this relation comprises a tuple for each customer each with four attributes, indicating the four types of stored information.
ptCustomers (20) considers only those customers whose membership level is `silver`, and computes the maximum length call placed by each of those customers to every area code and the earliest date on which that customer placed a call to that area code. wellCalled (30) computes the longest call placed to each area code among all the calls made over the network during the previous year. The query of FIG. 4 chooses the tuples from ptCustomers (20) which identify customers whose first call to a particular area code was prior to Apr. 1, 1994 and whose longest call to that area codes was more than ten minutes, for those area codes to which no one has made a call lasting more than one hundred minutes. The maximum length and earliest date are computed by applying Max and Min aggregation functions on the Date and Length attributes of calls (10), respectively.
Both the query and the view definitions use Select-From-Where clauses. The query states: SELECT the information stored in attributes, AC, Tel, MaxLen and MinDate, FROM wellCalled (30) and ptCustomers (20), only for those tuples WHERE the information stored in the attribute TOAC of wellCalled (30) is the same information as is stored in the attribute ToAC of ptCustomers (20), the value stored in attribute MaxLen of wellCalled (30) is less than one hundred, the value stored in attribute MaxLen of ptCustomers (20) is greater than ten, and the information stored in the attribute MinDate of ptCustomers (20) is earlier than Apr. 1, 1994.
Similarly, the Select-From-Where clauses of FIGS. 2 and 3 define the views ptCustomers (20) and wellCalled (30), respectively, based on base relations calls (10) and customers (11). In addition to the Select-From-Where clauses, the views are also defined by a Groupby clause. The Groupby clause helps to define the tuples of the view by combining all the tuples of the defining relations which have the same value for an identified attribute, into a single tuple in the view relation. For example, in FIG. 2, all the tuples in calls (10) with the same information stored in attribute ToAC, and all the tuples in base relation customers (11) having the same information stored in attributes AC and Tel, will be combined into a single tuple in ptCustomers (20).
Since many data bases are very large, an efficient method for responding to queries is essential. The phrase "query optimization" refers to the techniques used to organize the necessary computations, to reduce the time and memory resources required to respond to a query. The strategy of optimization is to reduce the number of predicates that must be computed in solving the query and to apply those predicates as early as possible. Thus the example query above, find the highest paid employee, can be optimized by applying the maximum salary predicate when constructing the view relation.
A number of techniques for applying predicates as early as possible are well known and are generally referred to as query rewrite algorithms. One class of these techniques are known as "predicate push-down techniques," which are described more fully in, Jeffrey D. Ullman, Principles of Database and Knowledge-Base Systems, (Computer Science Press 1989), incorporated by reference as if fully set forth herein.
In a related application, Query Optimization by Predicate Move-Around, U.S. Ser. No. 08/254,215, filed on Jun. 6, 1994, hereby incorporated by reference as if fully set forth herein, a general optimization technique is disclosed to move predicates around a query graph, to the particular node where their application is most efficient. A query graph is a pictorial representation of a query in which each of the query's subqueries and views form a block, or node. The query forms the root of the graph. Use of query graphs is well-known in the art. See, Inderpal S. Mumick, et al., Magic is Relevant, in Proceedings of ACM SIGMOD, at 247-58 (International Conference on Management of Data, May 23-25, 1990); Hamid Pirahesh, et al., Extensible/Rule Based Query Rewrite Optimization, in Starburst, in Proceedings of ACM SIGMOD at 39-48 (International Conference on Management of Data, Jun. 2-5, 1992); Inderpal S. Mumick and Hamid Pirahesh, Implementation of Magic in Starburst, in Proceedings of ACM SIGMOD (International Conference on Management of Data, May 25-27, 1994); and Alon Levy, et al., Query Optimization by Predicate Movearound, in Proceedings of the 20th International Conference on Very Large Databases at 96-107 (Jorge Bocca, et al., eds. Sep. 12-15, 1994), the disclosures of which are hereby incorporated by reference as if fully set forth herein.
In advanced database applications such as decision-support systems, the queries can become very complex due to their dependency on many subqueries and views. Optimizing these complex queries with prior art methods is difficult because the query blocks representing these queries cannot always be merged and traditional cost-based plan optimizers, which can only handle one query block at a time, cannot be applied. In particular, query optimizers are especially ineffective in dealing with queries involving aggregation.
Aggregation predicates are very important in large database applications, where complex decision-support queries must retrieve data by applying one or more different aggregation functions on a number of base relations, and then applying many predicates on the aggregation views. While aggregation renders query optimization with known techniques difficult and in many cases impossible, aggregation is necessary and optimization of aggregation queries, which entails manipulating aggregation predicates, is critical for large scale systems.
Manipulating aggregation predicates is important not only in query optimization, but in other applications as well, including, but not limited to, logic programming, constraint programming, constraint databases and global information systems. For example, in global information systems, techniques for pushing predicates down a query graph are used in order to determine which of the many available external databases is relevant to a given query. Where aggregation predicates are involved, a method for manipulating them is necessary.
The prior art methods for optimizing queries involving aggregation have limited optimization capability. For instance, in Kenneth Ross, et al., Foundations of Aggregation Constraints, in, Principles and Practice of Constraint Programming (Alan Borning, ed. 1994. LNCS 874), incorporated by reference as if fully set forth herein, the authors consider only two specific aggregation cases. The first case involves a single groupby operation in a view and they provide an intra-relation inference procedure to infer predicates containing arbitrary linear constraints. In the second case, the authors provide a view-to-relation inference procedure to deal with the case of a single aggregation function in a grouping operation. However, the constraint language and the reasoning framework provided do not allow the inference of the various types of predicates needed for query optimization.
There has also been work on optimizing queries with aggregation in correlated subqueries by way of decorrelation, converting the subqueries into views, with a magic-sets transformation to follow. See, Won Kim, On Optimizing An SQL-Like Nested Query, in, ACM Transactions on Database Systems, (September 1982); Richard A. Ganski and Harry K. T. Wong., Optimization of Nested SQL Queries Revisited, in, Proceedings of ACM SIGMOD at 23-33 (International Conference on Management of Data, May 1987); Umeshwar Dayal, Of Nests and Trees: A Unified Approach to Processing Queries that Contain Nested Subqueries, Aggregates, and Quantifiers, in, Proceedings of the Thirteenth International Conference on Very Large Databases VLDB at 197-208 (Sep. 1-4, 1987); M. Muralikrishna, Improved Unnesting Algorithms for Join Aggregate SQL Queries, in, Proceedings of the Eighteenth International Conference on Very Large Databases VLDB at 91-102 (Aug. 23-27, 1992); Mumick, Magic is Relevant, supra; and Mumick, Implementation of Magic in Starburst, supra, the disclosure of which are hereby incorporated by reference as if fully set forth herein.
Surajit Chaudhuri and Kyuseok Shim, Including Groupby in Query Optimization, in, Proceedings of the 20th International Conference on Very Large Databases at 354-66 (Bocca et al., eds., Sep. 12-15, 1994), incorporated by reference as if fully set forth herein, teaches that it is often possible to perform a grouping operation before a join or selection operation in the same query block. Although decorrelation and commuting groupings with joins, changes the structure of the query graph given a set of predicates, it does not provide the most optimal predicates for solving a given query.