Many investigators have studied semantic query optimization for relational systems (e.g., Cheng et al, In Proceedings of VLDB, pg 687-698 (September 1999); Grant et al., In Proceedings of ICDE (1997); Chakravarthy et al., AC Transactions on Database Systems, 15(2):162-207 (1990), and the references cited therein). The techniques most frequently used, include: index introduction, joint elimination, scan reduction, join introduction, predicate elimination and detection of empty answers (Grant et al., 1997).
Both Tsatlos et al. (In Proceedings of 20th VLDB Conf., Santiago, Chile (1994)) and Fegaras et al. (In Proceedings of the 5th Int""l Workshop on Database Programming Languages (DBPL195), Umbria, Italy (1995)) have developed special constructs and types for representing physical structures, but the operations on them that can be used in a query plan (e.g., joins or comprehensions) do not explicitly distinguish them from relations/complex values. Research efforts investigating physical data independence as the central issue or closely related problems, have all recognized physical data independence as an optimization problem, that is how does one rewrite a query Q(xcex9) written against a logical schema xcex9 into an equivalent query plan Qxe2x80x2("PHgr") written against a physical schema "PHgr", given a semantic relationship between xcex9 and "PHgr". See, Chaudhuri et al., In Proceedings of ICDE, Taipei, Taiwan (1995); Levy et al., In Proceedings of PODS (1995); Chen et al., In Proceedings of the Int""l Conf. on Extending Database Technology (1994); Yang et al., In Proceedings of the 13th Int""l VLDB Conf. pg 245-254 (1987)); Keller and Basu, In Proceedings of the Int""l Conference on Parallel and Distributed Information Systems, 1994; Levy et al., In Proceedings of PODS, 1995; Qian, In Proceeding ICDE, pg 48-55, 1996; Rajarman et al., In Proceedings 14th ACM Symposium In Principles of Database System, pg 105-112 (1995); Fegaras et al., In Proceedings of the 5th Int""l Workshop on Database Programming Languages (DBPL95), Umbria, Italy (1995); Tsatlos et al., (1994); Yang et al., In Proceedings of the 13th Int""l VLDB Conference, pg 245-254 (1987).
Conventional relational optimization methods (Selinger et al., In Proceedings of ACM SIGMOD Int""l Conference on Management of Data, pg 23-34, 1979, Reprinted in Reading in Database System, Kaufmann, 1988), such as selection pushing and join reordering, have long relied on ad-hoc heuristics for introducing indexes into a plan. Gmaps (Tsatlos et al., 1994) have been proposed as an alternative, as have studies into object-oriented data independence (Kemper et al., In Proceedings of ACM SIGMOD Internat""l Conf. on Management of Data, pg 264-374 (1990), and into distributed, mediator-based systems (Wiederhold, IEEE Computer, pg 38-49 (1992) for information integration. However, the previously reported techniques (Chaudhuri et al., 1995; Levy et al., 1995; Qian, 1996; Rajarman et al., 1995) are neither general enough, nor flexible enough, to be adapted to the current problems.
Chase transformation was originally defined for conjunctive (tableau) queries and embedded implicational dependencies. Popa and Tannen (In Proceedings of ICDT, Jerusalem, Israel, January 1999) generalized the classical relational tableau chase procedure (Beeri et al., Journal of the ACM 31(4):718-741 (1984)) to work for the object-oriented model and dictionaries, and for dependencies that capture a large class of semantic constraints including referential integrity constraints, inverse relationships, nested functional dependencies, etc. Moreover, they have shown that classical tableau minimization (Chandra and Merlin, In Proceedings of 9th ACM Symposium on Theory of Computing, pg 77-90, Boulder, Colo., (1977; Aho et al., ACM Transactions on Database Systems, 4(4):435-454 (1979) can be generalized correspondingly, as chasing with xe2x80x9ctrivialxe2x80x9d (always true) constraints.
Limited use of the chase to path-conjunctive queries and dependencies, presented by Popa and Tannen, 1999, permit the capture of object-oriented queries and queries against Web-like interfaces described by dictionary (finite function) operations. Dictionaries also describe many physical access structures, giving succinct declarative descriptions of query plans, in the same language as the queries. Subsequently, Deutsch et al, in VLDB (September 1999), showed that the elements of the implementation mapping (physical access structures, materialized views, etc.) are uniformly captured by the same kind of constraints and that can use the chase (forwards and backwards) to find the solutions of the equation mentioned above.
Although in an earlier manuscript, Jarke et al., In Proceedings of ACM-SIGMOD, pg 316-325 (1984), described chasing with functional dependencies, tableau minimization and join elimination with referential integrity constraints, surprisingly few experimental results are actually reported in the prior art. Grant et al., (1997) report on join elimination in star queries that are less complex than the present experiments with EC2. Examples of SQO for OO systems appear in Beeri et al., Theoretical Computer Science, 116(1):59-94 (1993); Cherniack. et al., In Proceedings of 24th VLDB Conf., pg 239-250 (1998); Fegaras et al., 1995); and Grant et al., 1997. A general framework for SQO using rewrite rules that are expressed using OQL appears in the work of Florescu, xe2x80x9cDesign and implementation of the Flora Object Oriented Query Optimizer,xe2x80x9d PhD thesis, Universite of Paris (1996), and Florescu et al., Int""l Journal of Cooperative Information Systems 5(4) (1996).
Techniques for using materialized views in query optimization are discussed in the works of Chaudhuri, et al., 1995), both Florescu manuscripts (1996); Tsatalos et al., VLDB Journal 5(2):101-118 (1996) and Bello et al., In Proceedings of 24th VLDB Conf, pg 659-664 (1998). Also relevant is the work on join indexes by Valduriez, ACM Trans. Database Systems, 12(2):218-452 (1987), and on precomputed access support relations by Kemper et al., In Proceedings of ACM-SIGMOD Int""l Conf on Management of Data, pg 364-374 (1990).
However, the problem remains of how to optimize queries aimed at disparate targets, particularly in complex situations. The general problem is forced by data independence, i.e., how to reformulate a query written against a xe2x80x9cuserxe2x80x9d-level schema into a plan that also/only uses physical access structures and materialized views efficiently. The gmap approach (Tsatalos et al., 1996) works with a special case of conjunctive queries (PSJ queries). The core algorithm is exponential, but the restriction to PSJ is used to provide polynomial algorithms for the steps of checking relevance of views and checking a restricted form of query equivalence. However, in light of current findings, there is no measurable practical benefit from all these restrictions.
Moreover, the schemas, views and queries of Chaudhuri et al., 1995; Tsatalos et al., 1996; and Yang et al., 1987, lack significant complexity. Their experiments show that using views is possible, and in the case of Tsatalos et al., 1996, that it can produce faster plans. However, Yang et al. measured only optimization time, and Tsatalos et al. did not separate the cost of the optimization itself. Consequently, they do not offer value that can be compared with time reduction. Although Chaudhuri et al. showed a very good behavior of the optimization time as a function of plans produced, the findings are ineffective because the use of bag semantics restricts variable mappings to isomorphisms, thus greatly reducing the search space.
Deutsch et al. 1999, recently demonstrated the promising potential of the chase and backchase technique (CandB), but raised the natural question: Is this technique practical? This raises two sets of issues that until the present invention remained unanswered in the art:
1. Are these feasible implementations of technique? In particular:
(a) Is the chase phase feasible, given that even determining if a constraint is applicable requires searching among exponentially many variable mappings?
(b) Is the backchase feasible, given that even if each chase or backchase step is feasible, the backchase phase may visit exponentially many subqueries?
2. Is the technique worthwhile? That is, considering the significant cost of CandB optimization, is the cost of an alternative plan based only on the CandB technique, still better than the cost of a plan without CandB?
The present invention, also as set forth by Deutsch et al, in VLDB (September 1999) and Popa et al., In Proceedings of the 2000 ACM SIGMOD Int""l Conf. on Management of Data (May 2000) (each of which are herein incorporated by reference), amply demonstrates the development and application of the chase/backchase technique to systematically optimize generating alternative query plans, aimed at multiple disparate targets. It further provides a first optimization prototype that uses path-conjunctive query graphs internally.
The optimization method of this invention is referred to as the xe2x80x9cCandB technique,xe2x80x9d referring to chase and backchase, the two principal phases of the optimization algorithm. The optimization is complete and specified by a set of constraints, including constraints that capture physical access structures and materialized views. In the first phase, the original query is chased using applicable constraints into a universal plan that gathers all of the pathways and structures that are relevant for the original query and the constraints used in the chase. In the second phase, navigation through these subqueries is done by chasing backwards, trying to eliminate joins and scans. The search space for optimal plans consists of subqueries of this universal plan, and the size of the search space is advantageously reduced by xe2x80x9cstratifyingxe2x80x9d the sets of constraints used in the backchase. Each backchase step needs a constraint to hold, and the algorithm checks to confirm that it follows from the existing ones. Thus, the entire C and B technique is captured by constraints, and by only two generic rules.
The technique is definitely feasible for practical schemas and queries. The present invention shows that by using congruence closure and a homomorphism pruning technique, the chase can be implemented effectively in practice. The method efficiently and effectively reduces the size of either the query or the constraint set to overcome the problem presented when the backchase quickly becomes impractical when both query complexity and the size of the constraint set are increased. The inventive strategies will not only work in common situations, but that they are also complete for the case of path-conjunctive materialized views.
The method and techniques of the present invention capture and extend many aspects of semantic optimizations, physical data independence (use of primary and secondary indexes, join indexes, access support relations and gmaps), use of materialized views and cached queries, as well as generalized tableau-like minimization. Moreover, and most importantly, using a uniform representation with constraints (dependencies), the techniques make these disparate optimization principles cooperate easily. This presents a new class of optimization opportunities, such as the non-trivial use of indexes and materialized views enabled only by the presence of certain integrity constraints. The technique is valuable when only the presence of semantic integrity constraints enables the use of physical access structures or materialized views.
The CandB technique covers index and join introduction and, in fact, extends them by trying to introduce any relevant physical access structure. The technique also covers join elimination (at the same time as tableau-like minimization) as part of subquery minimization during the backchase.
In a embodiment of the invention, there is provided a method of systematically optimizing the processing of a query in a database management system that operates in a computer network, wherein the method comprises a) receiving the executable data query; b) producing a query plan by applying to the query a relational tableau chase procedure comprising logical constraints which capture all relevant elements for implementation mapping of the query; c) rewriting the query against a logical schema into an equivalent universal query plan written against a physical schema, given a semantic relationship between the logical schema and the physical schema that explicitly uses all relevant physical structures in the implementation; d) applying to the universal plan a sequence of backchase steps, which systematically combine use of indexes, materialized views, semantic optimization and minimization, to remove redundancies, joins and scans; e) generating an alternative cost-based optimal query plan; and f) executing the optimal query plan.
In another embodiment of the method of the present invention, the chase step utilizes applicable constraints to gather all path-conjunctive queries and physical access structures relevant to the original query. In yet another embodiment of this method, backchase step comprises removing data from the logical schema that is not in the physical schema. In a further embodiment of the method, the size of the search space for the backchase step is minimized by reducing the size of the query, comprising fragmenting the query and stratifying the constraints by relevance to each fragment. The size of the search space for the backchase step is also minimized in an additional embodiment, by reducing the size of the constraint set, comprising splitting the constraints independent of the query.
In an alternative embodiment of the method of the present invention, the size of the search space for the backchase step is minimized by reducing the size of both the query and the constraint set by partitioning into independent subparts, comprising (i) fragmenting the query and stratifying the constraints by relevance to each fragment, and (ii) splitting the constraints independent of the query.
Also provided in an embodiment of the invention is a database management system, comprising a) a database having a plurality of records; b) means for accessing and rewriting a query into a universal plan in accordance with a tableau chase procedure based on declarative constraints; c) an optimizer for optimizing a query accessing said relational database, by rewriting with the tableau chase procedure the query into a universal plan, followed by a sequence of backchase procedure steps based in part on declarative constraints which eliminate redundancies in the universal plan to reduce cost; wherein the rewrite rules are expressed in a high-level declarative language, which is an extension of ODMG/ODL and ODMG/OQL, and the query against a logical schema is rewritten into an equivalent universal query plan written against a physical schema, given a semantic relationship between the logical schema and the physical schema; wherein said optimizer uses said rewrite rules to generate query plans that are minimal in the number of scans and joins it performs; d) means for applying to said minimal query plans any cost-based query optimization procedure to produce a globally optimal plan; and e) means for executing said optimal plan.
In another embodiment of the present invention, there is provided a query optimization system for systematically optimizing a query invoking database tables, wherein the system comprises a) means for providing a high-level declarative language, which is an extension of ODMG/ODL and ODMG/OQL, and declarative chase and backchase rules to rewrite a query written against a logical schema into an equivalent universal query plan written against a physical schema, given a semantic relationship between the logical schema and the physical schema; b) means for generating alternative queries from the query and rewrite rules; c) means for generating query plans that is minimal in the number of scans and joins it performs; d) means for applying to said minimal query plans any cost-based query optimization procedure to produce a globally optimal plan; and e) means for executing this optimal plan.
In another embodiment of the present invention, there is provided an apparatus for use in systematically optimizing a query in a database system, wherein the apparatus comprises a) memory for storing at least one data item referenced by the query; b) and a processor coupled to the memory and operative to (i) systematically apply to the query a tableau chase procedures and a sequence of backchase procedure steps based in part on declarative constraints along with one or more algebraic rewritings based on explicit substitutions in order to generate a rewritten universal query plan and rewritten query plans that are minimal in the number of scans and joins performed, which are equivalent to the query, wherein each of at least a subset of equivalent substitutions represents a hypothetical database change; (ii) applying to the minimal query plans any cost-based query optimization procedure to produce a globally optimal plan; and (iii) execute this optimal plan.
Additional objects, advantages and novel features of the invention will be set forth in part in the description, examples and figures which follow, and in part will become apparent to those skilled in the art on examination of the following, or may be learned by practice of the invention.