A complex language like SQL (Structured Query Language) is often riddled with redundancy. A query language is used to write a query, which is then submitted to an information retrieval system (sometimes called a database management system) so that the system may obtain a response to the query (such as data stored in a database). It will be appreciated that a query may be formed (structured) in many different ways. Users, query generators or application developers may not always write queries that perform well when executed exactly as written. The reasons may be that the query contains unintentional redundancy or that the user is not sufficiently knowledgeable to write the query in a more efficient way. Looking at this from a different angle, a common practice is to give users access to the database through predefined views of the database. These views may be predefined to hide the complexity of queries or to limit the data that may be viewed by users. Even though the queries may look simple, once these views are expanded and merged into the query, a database query compiler may have to process a very complex query. Such queries may also contain redundancy or be less efficient when executed in their raw form.
Database management systems (sometimes referred to as database engines or simply DBMS) often include a query rewrite mechanism (sometimes called a component) used for transforming a given query so as to make it more efficient. Removing redundancy is one of the primary goals of the query rewrite mechanism. Query languages continue to evolve and DBMS vendors continue to enhance their products. New versions of query languages may support more powerful language constructs, but existing queries are not always changed to reflect the evolution in query languages. In order to exploit these query language enhancements within the DBMS, automatic internal query rewrite technology should be able to take advantage of new features without forcing existing application programs (those programs that generate queries) to change.
In the context of subqueries, much of the literature is focused on the unnesting of subqueries, where the execution time is improved by suitably converting the subqueries into joins and/or using common subexpressions. Methods to decorrelate such queries have been proposed, and these are particularly beneficial in a massively parallel (shared-nothing) environment. As disclosed in W. Kim, “On Optimizing an SQL-Like Nested Query”, ACM Transactions on Database Systems, vol. 7, Sep. 1982, certain fixed forms of complex queries were recognized and rewritten. The work of U. Dayal, “Of Nests and Trees: A Unified Approach to Processing Queries that Contain Nested Subqueries, Aggregates and Quantifiers”, Proceedings of the Thirteenth International Conference on Very Large Databases (VLDB), pp. 197-208, 1987, improved on this technique by using an outer join to avoid the incorrect result that was otherwise produced when the result of the subquery was empty. As disclosed in R. Ganski and H. Wong, “Optimization of Nested SQL Queries Revisited”, Proceedings of ACM SIGMOD, San Francisco, Calif., U.S.A., 1987, pp 22-33, correlation values are collected in a temporary table and a distinct collection is projected before joining to the subquery.
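By way of illustration only, the empty-subquery issue that the outer join addresses can be sketched with Python's standard `sqlite3` module. The tables `parts` and `supply`, their columns and the sample rows are hypothetical, and the two rewrites are simplified stand-ins rather than the exact transformations of the cited works. The sketch assumes `pnum` is unique in `parts`.

```python
import sqlite3

def count_bug_demo():
    """Compare a correlated COUNT subquery with inner- and outer-join rewrites."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE parts  (pnum INTEGER PRIMARY KEY, qoh INTEGER);
        CREATE TABLE supply (pnum INTEGER, qty INTEGER);
        INSERT INTO parts  VALUES (1, 2), (2, 0), (3, 1);
        -- Part 2 has no supply rows, so its subquery result is empty.
        INSERT INTO supply VALUES (1, 5), (1, 7), (3, 9), (3, 4);
    """)
    # Original nested form: the subquery runs once per outer row.
    correlated = con.execute("""
        SELECT p.pnum FROM parts p
        WHERE p.qoh = (SELECT COUNT(*) FROM supply s WHERE s.pnum = p.pnum)
        ORDER BY p.pnum
    """).fetchall()
    # Naive unnesting with an inner join: parts without supply rows vanish.
    inner_join = con.execute("""
        SELECT p.pnum
        FROM parts p
        JOIN (SELECT pnum, COUNT(*) AS c FROM supply GROUP BY pnum) t
          ON t.pnum = p.pnum
        WHERE p.qoh = t.c
        ORDER BY p.pnum
    """).fetchall()
    # Outer-join unnesting: unmatched parts survive with a count of zero.
    outer_join = con.execute("""
        SELECT p.pnum
        FROM parts p LEFT OUTER JOIN supply s ON s.pnum = p.pnum
        GROUP BY p.pnum, p.qoh
        HAVING p.qoh = COUNT(s.pnum)
        ORDER BY p.pnum
    """).fetchall()
    con.close()
    return correlated, inner_join, outer_join
```

In this sketch the inner-join rewrite drops part 2, whose subquery is empty, while the outer-join rewrite returns the same rows as the original nested query.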
A technique is disclosed in the following three references:
(1) I. S. Mumick, H. Pirahesh, and R. Ramakrishnan, The Magic of Duplicates and Aggregates. In Proceedings, 16th International Conference on Very Large Data Bases, Brisbane, August 1990;
(2) C. Leung, H. Pirahesh, P. Seshadri and J. Hellerstein, Query Rewrite Optimization Rules in IBM DB2 Universal Database, in Readings in Database Systems, Third Edition, M. Stonebraker and J. Hellerstein (eds.), Morgan Kaufmann, pp. 153-168, 1998; and,
(3) P. Seshadri, H. Pirahesh and T. Y. C. Leung, Complex Query Decorrelation, Proceedings of the International Conference on Data Engineering (ICDE), Louisiana, USA, February 1996.
This technique, called magic decorrelation, extracts the relevant distinct values of the outer references and, based on these values, materializes all the possible results from the subquery. The materialized results are joined with the outer query block on the outer referenced values. Although the rewritten query introduces extra views, joins and duplicate removal, better performance is expected because the subquery is evaluated once against a consolidated temporary relation, which avoids a tuple-at-a-time communication overhead.
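The shape of such a rewrite can be sketched as follows, again using Python's standard `sqlite3` module. This is a minimal, hand-written illustration under stated assumptions, not the rewrite rules of the cited references: the table `emp`, its columns, the sample rows and the names `magic` and `sub` are hypothetical.

```python
import sqlite3

def magic_decorrelation_demo():
    """Run a correlated query and a hand-decorrelated equivalent."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER);
        INSERT INTO emp VALUES
            ('ann', 'A', 100), ('bob', 'A', 200),
            ('cat', 'B', 50),  ('dan', 'B', 50),
            ('eve', 'C', 70),  ('fay', 'C', 90);
    """)
    # Correlated form: the average is recomputed for every outer row.
    correlated = con.execute("""
        SELECT e.name FROM emp e
        WHERE e.salary > (SELECT AVG(e2.salary) FROM emp e2
                          WHERE e2.dept = e.dept)
        ORDER BY e.name
    """).fetchall()
    # Decorrelated form:
    #   magic = distinct values of the outer reference (dept)
    #   sub   = subquery results materialized once per magic value
    # The outer block then joins to sub on the outer referenced value.
    decorrelated = con.execute("""
        WITH magic AS (SELECT DISTINCT dept FROM emp),
             sub AS (SELECT m.dept, AVG(e2.salary) AS avg_sal
                     FROM magic m JOIN emp e2 ON e2.dept = m.dept
                     GROUP BY m.dept)
        SELECT e.name
        FROM emp e JOIN sub s ON s.dept = e.dept
        WHERE e.salary > s.avg_sal
        ORDER BY e.name
    """).fetchall()
    con.close()
    return correlated, decorrelated
```

Both forms return the same rows on this data. The decorrelated form carries the extra view, join and duplicate removal mentioned above in exchange for evaluating the aggregate once per distinct department.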
Decorrelation is not always possible and, even where possible, may not always be efficient. Jun Rao and Kenneth A. Ross, A New Strategy for Correlated Queries, Proceedings of the ACM SIGMOD Conference, pages 37-48, ACM Press, New York, 1998, discloses another technique in which the portion of the query that is invariant with respect to the changing outer values is cached. The cached result is reused in subsequent executions and combined with the new results from the changing portion of the subquery.
The recognition of redundancy and inefficiency when processing such queries in commercial databases is evident in the following disclosures:
(1) D. Chatziantoniou and K. A. Ross, Querying multiple features of groups in relational databases, in Proceedings of the 22nd International Conference on Very Large Databases, pages 295-306, 1996; and,
(2) D. Chatziantoniou and K. A. Ross, Groupwise processing of relational queries, in Proceedings of the 23rd International Conference on Very Large Databases, Athens, pp 476-485, 1997.
In these disclosures, an extension of the SQL syntax is proposed that allows more efficient processing to be done on a group-by-group basis. This makes the queries simpler and easier for the optimizer to handle. The SQL standard compliant window aggregate function syntax already implemented in DB2 Universal Database is a more powerful syntax. It also provides a way of expressing queries that reduces redundancy and inefficiency. The aim here is to transform queries automatically to exploit this relatively new feature.
C. A. Galindo-Legaria and M. Joshi, Orthogonal Optimization of Subqueries and Aggregation, in Proceedings of ACM SIGMOD, International Conference on Management of Data, Santa Barbara, Calif., U.S.A. 2001, describes decorrelation techniques adopted in the Microsoft® SQL Server product. The most relevant concept is one called SegmentApply. Whenever a join connects two instances of an expression and one of the instances has an extra aggregate and/or a filter, the technique attempts to generate a common sub-expression (CSE). The extra aggregation is done on one consumer of the CSE and is joined to all rows in that group from the other consumer of the CSE. This is done one group at a time. Pushing appropriate joins through the CSE is also considered.
So far, none of these automatic techniques involves outright elimination of subqueries. Subqueries are either decorrelated using additional joins, or a common sub-expression or run-time cache is used to share common processing. A more recent technique for handling nested subqueries, which are typically correlated although the technique can also handle non-correlated subqueries, is described by C. Zuzarte, H. Pirahesh, W. Ma, Q. Cheng, L. Liu, K. Wong, “WinMagic: Subquery Elimination Using Window Aggregation”, in Proceedings of ACM SIGMOD, International Conference on Management of Data, San Diego, Calif., U.S.A. 2003. WinMagic uses SQL window aggregation functions to replace regular aggregation functions and merges the outer query with the inner subquery, effectively eliminating the nested subquery. Here too, the focus is on nested subqueries.
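The idea of replacing a correlated aggregate subquery with a window aggregate can be illustrated as follows. This is a hand-written sketch, not the WinMagic algorithm itself: the table `emp`, its columns and the sample rows are hypothetical, and the example assumes the SQLite library bundled with Python is version 3.25 or later, since earlier versions lack window functions.

```python
import sqlite3

def window_rewrite_demo():
    """Compare a correlated aggregate subquery with a window-aggregate form."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER);
        INSERT INTO emp VALUES
            ('ann', 'A', 100), ('bob', 'A', 200),
            ('cat', 'B', 50),  ('dan', 'B', 50),
            ('eve', 'C', 70),  ('fay', 'C', 90);
    """)
    # Nested form: emp is referenced twice, once per query block.
    correlated = con.execute("""
        SELECT e.name FROM emp e
        WHERE e.salary > (SELECT AVG(e2.salary) FROM emp e2
                          WHERE e2.dept = e.dept)
        ORDER BY e.name
    """).fetchall()
    # Window form: one reference to emp; the per-department average is
    # attached to every row by AVG(...) OVER (PARTITION BY dept).
    windowed = con.execute("""
        SELECT name
        FROM (SELECT name, salary,
                     AVG(salary) OVER (PARTITION BY dept) AS avg_sal
              FROM emp)
        WHERE salary > avg_sal
        ORDER BY name
    """).fetchall()
    con.close()
    return correlated, windowed
```

The window form contains no nested subquery and no join back to `emp`, which is the kind of redundancy the rewrite is meant to remove.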
Also related is the work on redundant join elimination, which is useful for improving the performance of a query. Semantic query optimization techniques have been recognized to provide significant performance benefits, and redundant join elimination is one such technique that has been described previously. In particular, using Referential Integrity relationships, one can do redundant join elimination as disclosed in M. Zaharioudakis, R. Cochrane, G. Lapis, H. Pirahesh and M. Urata, Answering complex SQL queries using automatic summary tables, in SIGMOD 2000, pages 105-116. This automatic rewrite technique applies where a query contains a join whose result could be determined without performing the join. The fact that the two tables are related through a child-parent Referential Integrity constraint implies that every qualifying row from the child table matches one and only one row in the parent table. If the parent table columns are not required in the output of the query, the join is redundant and can be eliminated.
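A small sketch of this situation, using Python's standard `sqlite3` module, is given below. The tables `dept` (parent) and `emp` (child), their columns and the sample rows are hypothetical. The sketch additionally assumes the foreign key column is declared NOT NULL, so that every child row has a parent, and that the SQLite build enforces foreign keys when `PRAGMA foreign_keys = ON` is set.

```python
import sqlite3

def join_elimination_demo():
    """Show that a child-parent join adds nothing when no parent column is used."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        PRAGMA foreign_keys = ON;
        CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE emp  (id INTEGER PRIMARY KEY,
                           dept_id INTEGER NOT NULL REFERENCES dept(id),
                           salary INTEGER);
        INSERT INTO dept VALUES (1, 'A'), (2, 'B');
        INSERT INTO emp  VALUES (1, 1, 100), (2, 1, 200), (3, 2, 50);
    """)
    # Query as written: joins to the parent but selects only child columns.
    with_join = con.execute("""
        SELECT e.id, e.salary
        FROM emp e JOIN dept d ON d.id = e.dept_id
        ORDER BY e.id
    """).fetchall()
    # Rewritten query: the join is dropped entirely.
    without_join = con.execute(
        "SELECT id, salary FROM emp ORDER BY id"
    ).fetchall()
    # The constraint is what makes the rewrite safe: orphans are rejected.
    try:
        con.execute("INSERT INTO emp VALUES (4, 99, 10)")
        orphan_rejected = False
    except sqlite3.IntegrityError:
        orphan_rejected = True
    con.close()
    return with_join, without_join, orphan_rejected
```

Because each `emp` row matches exactly one `dept` row, the join neither filters nor duplicates child rows, so both queries return the same result.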
Another example is the elimination of redundant outer joins. Here the condition is less stringent. Given a view containing T1 LOJ T2 (a left outer join of T1 to T2), a unique index on T2 covering the join columns is sufficient to eliminate the join if the query on the view does not require data from T2.
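The role of the unique index can be illustrated with the following sketch, again using Python's standard `sqlite3` module. The tables `t1` and `t2`, the view `v`, the index name `t2_k` and the sample rows are hypothetical. The `unique` parameter is an illustrative switch: when it is false, a duplicate join key is inserted into `t2` to show why uniqueness is needed.

```python
import sqlite3

def outer_join_elimination_demo(unique: bool):
    """Query a LEFT OUTER JOIN view for t1 columns only, with and without uniqueness."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE t1 (id INTEGER, k INTEGER);
        CREATE TABLE t2 (k INTEGER, v TEXT);
        INSERT INTO t1 VALUES (1, 10), (2, 20), (3, 30);
        INSERT INTO t2 VALUES (10, 'x'), (20, 'y');
        CREATE VIEW v AS
            SELECT t1.id, t1.k, t2.v
            FROM t1 LEFT OUTER JOIN t2 ON t1.k = t2.k;
    """)
    if unique:
        # At most one t2 row can match each t1 row.
        con.execute("CREATE UNIQUE INDEX t2_k ON t2 (k)")
    else:
        # A duplicate join key lets one t1 row match two t2 rows.
        con.execute("INSERT INTO t2 VALUES (10, 'z')")
    # The query on the view needs no data from t2.
    via_view = con.execute("SELECT id FROM v ORDER BY id").fetchall()
    # Candidate rewrite with the outer join removed.
    direct = con.execute("SELECT id FROM t1 ORDER BY id").fetchall()
    con.close()
    return via_view, direct
```

With the unique index in place, the outer join preserves every `t1` row exactly once, so the rewrite is equivalent. Without it, the view returns `id` 1 twice and the join cannot be dropped.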
Accordingly, a solution that addresses, at least in part, these and other shortcomings is desired.