A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
1. Field of the Invention
The present invention relates generally to information processing environments and, more particularly, to computer-implemented methodologies for query optimization in a data processing system, such as a Database Management System (DBMS).
2. Description of the Background Art
Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical database is an organized collection of related information stored as xe2x80x9crecordsxe2x80x9d having xe2x80x9cfieldsxe2x80x9d of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system or DBMS is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about underlying hardware-level details. Typically, all requests from users for access to the data are processed by the DBMS. For example, information may be added or removed from data files, information retrieved from or updated in such files, and so forth, all without user knowledge of underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level. The general construction and operation of a database management system is known in the art. See e.g., Date, C., xe2x80x9cAn Introduction to Database Systems, Volume I and II,xe2x80x9d Addison Wesley, 1990; the disclosure of which is hereby incorporated by reference.
DBMS systems have long since moved from a centralized mainframe environment to a de-centralized or distributed environment. One or more PC xe2x80x9cclientxe2x80x9d systems, for instance, may be connected via a network to one or more server-based database systems (SQL database server). Commercial examples of these xe2x80x9cclient/serverxe2x80x9d systems include Powersoft(copyright) clients connected to one or more Sybase(copyright) SQL Anywhere(copyright) Studio (Adaptive Server(copyright) Anywhere) database servers. Both Powersoft(copyright) and Sybase(copyright) SQL Anywhere(copyright) Studio (Adaptive Server(copyright) Anywhere) are available from Sybase, Inc. of Emeryville, Calif.
In today""s computing environment, database technology can be found on virtually any device, from traditional mainframe computers to cellular phones. Sophisticated applications, whether human resources information systems or sales force automation systems, can xe2x80x9cpushxe2x80x9d much of their complexity into the database itself. Indeed, this represents one of the main benefits of database technology. The challenge, however, is to support these applications, and the complex queries they generate, on small computing devices. At the same time, users expect the productivity and reliability advantages of using a relational DBMS.
Consider, for instance, the execution of a request for information from a relational DBMS. In operation, this request is typically issued by a client system as one or more Structured Query Language or xe2x80x9cSQLxe2x80x9d queries for retrieving particular data (i.e. data records meeting the query condition) from database tables on a server. For example, a simple SQL SELECT statement.
results in a list of the names of those employees earning $10,000, where xe2x80x9cemployeesxe2x80x9d is a table defined to include information about employees of a particular organization. The syntax of SQL is well documented, see e.g., the abovementioned xe2x80x9cAn Introduction to Database Systems.xe2x80x9d SQL has become the standard for relational database access, see, e.g. Melton, J. (ed.), xe2x80x9cAmerican National Standard ANSI/ISO/IEC 9075-2: 1999, Information Systemsxe2x80x94Database Languagexe2x80x94SQL Part2: Foundation,xe2x80x9d the disclosure of which is hereby incorporated by reference.
SQL queries express what results are requested but do not state how the results should be obtained. In other words, the query itself does not tell how the query should be evaluated by the database management system. Rather, a component called the optimizer determines the xe2x80x9cplanxe2x80x9d or the best method of accessing the data to implement the SQL query. The query optimizer is responsible for transforming an SQL request into an access plan composed of specific implementations of the algebraic operator selection, projection, join, and so forth. Typically, this is done by generating many different join strategies, evaluating the cost of each, and selecting the access plan with the lowest overall costs, where xe2x80x9ccostxe2x80x9d is a metric that measures a combination of factors, including, but not limited to, the estimated amount of computational overhead, the number of physical Input/Output (xe2x80x9cI/Oxe2x80x9d) operations, and the response time. However, producing an optimal access plan for any given SQL query is a complex problem.
Query optimization in a relational DBMS is largely based on the existence of conjunctive conditions in each query""s Where clause. Conjunctive conditions are useful because they must each evaluate to true in order for the query""s Where clause to be satisfied. Hence, such conjunctive conditions (or xe2x80x9cprime implicatesxe2x80x9d) can be naturally exploited by the query optimizer. However, traditionally the discovery of such prime implicates requires the database engine to first transform or xe2x80x9cnormalizexe2x80x9d the original search condition into conjunctive normal form and remove redundant terms.
The problem of simplifying arbitrary Boolean expressions is a well-known problem in the history of computer science. Originally, the motivation for the study of this problem was its role in the minimization of Boolean functions and switching circuits (see, e.g., W. V. Quine, xe2x80x9cThe problem of simplifying truth functions,xe2x80x9d American Mathematics Monthly, 59:521-531, 1952; W. V. Quine, xe2x80x9cA way to simplify truth functions,xe2x80x9d American Mathematics Monthly, 62:627-631, 1955; and James R. Slagle, Chin-Liang Chang, and Richard C. T. Lee, xe2x80x9cA new algorithm for generating prime implicants,xe2x80x9d IEEE Transactions on Computers, 19(4):304-310, 1970). Boolean state reduction has directly influenced power and material requirements by eliminating redundant gates and easing three-level NAND implementation. See, e.g., Olivier Coudert, xe2x80x9cOn solving covering problems,xe2x80x9d Proceedings of the 33rd Annual Conference on Design Automation, pages 197-202, Las Vegas, Nevada, June 1996. However, the difficulties of normalization and redundant term elimination in SQL query optimization have been mostly overlooked.
The goal of query optimization in a relational DMBS is not simply reducing the number of conditions to be examined. Although minimization of the number of conditions reduces the number of operations required during query processing, what is equally important to a relational DBMS engine is the discovery of prime implicates that can be exploited to deduce a cheaper data access strategy. These prime implicates may be redundant or subsumed by the original expression and therefore may be ignored or even pruned by a logic minimizer.
Theoretically, the most efficient SQL statementxe2x80x94in terms of the time required to execute itxe2x80x94could be determined by generating and costing all possible semantically-equivalent forms. However, in a real-world setting a database engine must generate useful prime implicates along with other semantic transformations without spending more time than is required to execute the unoptimized query. See, e.g., Shashi Shekhar, Jaideep Srivastava, and Soumitra Dutta, xe2x80x9cA formal model of trade-off between optimization and execution costs in semantic query optimization,xe2x80x9d Proceedings of the 14th International Conference on Very Large Databases, pages 457-467, New York, N.Y., August 1988.
Existing simplification algorithms described in the literature (see e.g., Olivier Coudert and Jean Christophe Madre, xe2x80x9cImplicit and incremental computation of primes and essential primes of Boolean functions,xe2x80x9d Proceedings of the 29th Annual Conference on Design Automation, pages 36-39, Anaheim, Calif., June 1992; H. R. Hwa, xe2x80x9cA method for generating prime implicants of a Boolean expression,xe2x80x9d IEEE Transactions on Computers, 23:637-641, June 1974; Peter Jackson, xe2x80x9cComputing prime implicates incrementally,xe2x80x9d Proceedings of the 11th International Conference on Automated Deduction, pages 253-267, Sarasota Springs, N.Y., June 1992; Peter Jackson and John Pais, xe2x80x9cComputing prime implicants,xe2x80x9d Proceedings, 10th International Conference on Automated Deduction, pages 543-557, Kaiserlautern, Germany, July 1990; Luigi Palopoli, Fiora Pirri, and Clara Pizzuti, xe2x80x9cAlgorithms for selective enumeration of prime implicants.xe2x80x9d Artificial Intelligence, 111 (1-2): 41-72, July 1999; W. V. Quine, xe2x80x9cA way to simplify truth functions,xe2x80x9d American Mathematics Monthly, 62:627-631, 1955; James R. Slagle, Chin-Liang Chang, and Richard C. T. Lee, xe2x80x9cA new algorithm for generating prime implicants,xe2x80x9d above; Rolf Socher, xe2x80x9cOptimizing the clausal normal form transformation,xe2x80x9d Journal of Automated Reasoning, 7(3): 325-336, September 1991; and Yalin Wang, xe2x80x9cTransforming normalized Boolean expressions into minimal normal forms,xe2x80x9d Master""s thesis, Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, 1992) are all predicated on normalized inputxe2x80x94that is the input expression is either in conjunctive normal form (xe2x80x9cCNFxe2x80x9d) or disjunctive normal form (xe2x80x9cDNFxe2x80x9d). Conjunctive normal form is a mathematical expression that describes the Boolean condition wherein each clause contains only ORs and these clauses are connected by ANDs. A query that is fully transformed into CNF has only xe2x80x9cANDsxe2x80x9d at the top row of the query tree and only xe2x80x9cORsxe2x80x9d beneath.
The common approach used by current normalization algorithms is to convert a CNF expression to DNF (or vice-versa) simplifying the output expression to minimize redundancy. The reverse transformation yields a simplified expression in the original form (CNF or DNF). Current optimization techniques are predicated on the fact that the search conditions have been normalized into conjunctive normal form.
In the literature relating to the simplification of Boolean expressions (See, e.g., Alex Kean and George K. Tsiknis, xe2x80x9cAn incremental method for generating prime implicants/implicates,xe2x80x9d Journal of Symbolic Computation, 9(2): 185-206, February 1990; W. V. Quine, xe2x80x9cThe problem of simplifying truth functions,xe2x80x9d above; W. V. Quine, xe2x80x9cA way to simplify truth functions,xe2x80x9d above; W. V. Quine, xe2x80x9cOn cores and prime implicants of truth functions,xe2x80x9d American Mathematics Monthly, 66:755-760, 1959; and James R. Slagle, Chin-Liang Chang, and Richard C. T. Lee, xe2x80x9cA new algorithm for generating prime implicants,xe2x80x9d above), the basic unit of any expression is termed a xe2x80x9cliteral.xe2x80x9d In this document a xe2x80x9cpredicatexe2x80x9d corresponds to a xe2x80x9cliteral.xe2x80x9d The usefulness of converting the search conditions to conjunctive normal form is that for a clause that consists of only a single predicate (i.e., not xe2x80x9cORed with anythingxe2x80x9d), for any row in the result of that query that predicate must be true. Thus, if there are two such conditions in a CNF expression, then for any row of the result, both conditions must be true. If those two predicates refer to either attributes in the database or constants, then mathematical rules can be used to show that other conditions also must be true for any row in the result given that the first conditions are both true. For example, if you have only two clauses in a CNF expression that consist of a single predicate such as X=Y and Y=Z, then you can also infer that X=Z. For a query optimizer, discovering such prime implicates enables the optimizer to discover different access strategies for the query that will result in improved performance compared to optimizing the query in a naive fashion. Such prime implicates may enable the optimizer to discover new indexes that it can use, may avoid the need to scan certain tables or indexes, or cause a change in the join order of a particular query, thereby improving performance.
However, discovery of such prime implicates presents particular challenges in a relational DMBS environment that are not addressed by existing methods. One difficulty is that any particular SQL query sent to a database server need not have a search condition that is already in CNF. Traditionally, the database engine is first required to convert (normalize) the original search condition into CNF prior to searching for prime implicates. Another difficulty is that ANSI SQL requires the support of three-valued logic meaning that several of the simplification techniques utilized by existing methods no longer hold true. The computational problem with these simplification techniques is that converting any particular SQL query to CNF and eliminating redundancy in the expression is an NP-hard problem (see, e.g., Luigi Palopoli, Fiora Pirri, and Clara Pizzuti, xe2x80x9cAlgorithms for selective enumeration of prime implicants,xe2x80x9d above). A problem is called NP (non-deterministic polynomial) if its solution (if one exists) can be guessed and verified in polynominal time; non-deterministic means that no particular rule is followed to make the guess. An NP-hard problem is any one of a class of computational problems for which no efficient solution algorithm has been found.
In particular, this process of normalization of a given SQL query can result in a combinatorial explosion in the number of terms of the expression. Consider, for example, TPC-D Query 19 (TPC-D Benchmark Version 2.0.0.12, Transaction Processing Performance Council) illustrated below where the search condition is presented generally in disjunctive normal form (i.e., top level of query tree is ORs and the second level is ANDs), which is exactly the opposite of what an optimizer would normally like to see.
Optimization of Query 19 in its original form will likely yield a very inefficient access plan since the query lacks a conjunctive join condition. However, a naive transformation of the original Where clause""s 24 predicates into CNF yields a Where clause now containing 378 predicates. The resulting computational and memory overhead of generating this many terms makes this transformation impracticable in many of the environments in which relational database management systems are used. In addition, the memory and execution costs for each SQL predicate must be considered. For instance, each predicate requires memory to store, evaluate, and determine its result for every comparison for every row. This results in a significant overhead on system performance. Moreover, once the query has been fully normalized, it is arguably impossible to simplify the query further, so the optimizer will be forced to evaluate each of those 378 predicates for every row. Thus, additional processing capacity is expended evaluating the exponential number of predicates derived from CNF normalization.
It is difficult to determine, when given any particular SQL query, how many predicates will result when the query is converted to CNF, because the size of the resulting CNF expression depends on how the predicates are connected through ANDs and ORs. Accordingly, it is difficult to apply heuristic rules to estimate the size of the normalized expression given an arbitrary SQL query. From a practical standpoint, the size of the normalized expression can only be determined during the process of converting it to CNF.
Because this process of fully normalizing an SQL query can be very expensive (and often detrimental to product performance), there is great need for an improved method for discovering useful conjunctive conditions from an arbitrary input expression without performing a full normalization of the query. Given the increasing amount of information stored in databases, there is great interest in an efficient and practical method for simplifying and optimizing the search conditions of complex queries, even in small-memory environments.
The following definitions are offered for purposes of illustration, not limitation, in order to assist with understanding the discussion that follows.
A or alpha: In this document the greek letter xe2x80x9calphaxe2x80x9d or xe2x80x9cAxe2x80x9d is used to denote a function whose input is a predicate and whose output is the name of the attributes involved in the predicate.
CNF: xe2x80x9cCNFxe2x80x9d stands for conjunctive normal form which is a logical formula consisting of a conjunction of disjunctions where no disjunction contains a conjunction. For example, the conjunctive normal form of (A and B) or C is (A or C) and (B or C). CNF is often termed product of sums form. A clause C is in partial CNF if it is a conjunction of clauses D1,D2, . . . Dn where at least one Di is not a disjunctive clause. Conjunctive normal form is a mathematical expression that describes the Boolean condition wherein each clause contains only ORs and these clauses are connected by ANDs. A query that is fully transformed into CNF has only xe2x80x9cANDsxe2x80x9d at the top row of the query tree and only xe2x80x9cORsxe2x80x9d beneath.
DNF: xe2x80x9cDNFxe2x80x9d stands for disjunctive normal form which is a logical formula consisting of a disjunction of conjunctions where no conjunction contains a disjunction. For example, the disjunctive normal form of (A or B) and C is (A and C) or (B and C). DNF is often termed sum of products form. In disjunctive normal form the top level of the query tree is ORs and the second level is ANDs.
Implicant: A conjunctive clause A is an xe2x80x9cimplicantxe2x80x9d of a search condition E if A does not contain redundant predicates and A implies E; hence all conjunctive clauses of a fully normalized condition E in DNF are its implicants. A is termed a xe2x80x9cprime implicantxe2x80x9d of E if A is an implicant of E, A is fundamental, and there does not exist another implicant Axe2x80x2 such that Axe2x80x2 implies A.
Implicate: A disjunctive clause A is an xe2x80x9cimplicatexe2x80x9d of a search condition E if A does not contain redundant predicates and E implies A; hence all disjunctive clauses of a fully normalized condition E in CNF are its implicates.
Literal: The basic unit of any expression is termed a xe2x80x9cliteral.xe2x80x9d In this document a xe2x80x9cpredicatexe2x80x9d corresponds to a xe2x80x9cliteral.xe2x80x9d
Normalization: In the context of database design, normalization is the process of organizing data to minimize redundancy. In this document, normalization refers to the process of converting a database query into a simpler form that can be more efficiently analyzed by the query optimizer to deduce better data access strategies.
Predicate: In the context of relational databases a xe2x80x9cpredicatexe2x80x9d is a truth-valued function, possibly negated, whose domain is a set of attributes and/or scalar values and whose range is {true, false, unknown}. In this document a xe2x80x9cpredicatexe2x80x9d corresponds to a xe2x80x9cliteral.xe2x80x9d
Prime Implicate: A clause A is termed a xe2x80x9cprime implicatexe2x80x9d of a search condition E if A is an implicate of E, A is fundamental and there does not exist another implicate Axe2x80x2 such that Axe2x80x2 implies A. In this document the term xe2x80x9cprime implicatexe2x80x9d shall refer generally to exploitable predicates which can be utilized during optimization and/or query execution to reduce the query""s overall execution cost.
sch: The term xe2x80x9cschxe2x80x9d is used in this document to denote a function whose input is a predicate and whose output is the correlation names (usually table names) from which the attributes referenced in the predicate are taken.
The improved normalization methodology of the present invention is based on Socher""s minimization methodology (see Socher, R., above). Socher""s minimization method uses a matrix as its basic data structure with the rows of the matrix representing unique literals and the columns representing the particular conjunct (or disjunct) in which the literals appear. Implicates and prime implicates are discovered by generating all paths through the matrix and discarding subsumed paths.
Existing query optimizers require fully normalized search conditions as input and assume the use of two-valued logic. The improved methodology of the present invention, on the other hand, features several improvements that are key to allowing it to operate with reduced memory requirements on non-normalized three-valued SQL search conditions.
The improved normalization methodology of the present invention comprises a preprocessing phase, in which expressions are simplified whenever possible, and a normalization phase, in which the simplified expression is analyzed and either fully converted to conjunctive normal form (xe2x80x9cCNFxe2x80x9d) or, alternatively, modified by appending prime implicates to the original expression as described below. In the preprocessing phase, tautological and contradictory conditions are removed from the input expression and other conditions are simplified when possible. This preprocessing phase includes several steps that are designed to simplify the original query expression, thereby simplifying the matrix processing occurring in the normalization phase. Pruning conditions from the input data structure reduces memory consumption by decreasing the size of the input to the normalization routines.
The normalization phase takes an unnormalized search condition as input and evaluates the eligibility of the expression or portions of the expression for normalization. Rather than fully normalizing the expression, which in many cases is inefficient and results in an explosion of terms that must be processed, the normalization phase of the present invention selectively examines each element of the SQL statement by carefully descending the logic tree of the query. The expression is examined in segments and for each segment a determination is made as to whether or not it is advantageous to convert that segment of the expression to CNF. Segments that are too xe2x80x9cexpensivexe2x80x9d to fully convert are not fully normalized to avoid a combinatorial explosion of terms and the resulting impact on system performance. Effectively, a limit is imposed on the depth to which sub-expressions are normalized to avoid an explosion in the number of terms.
If during this process of selective normalization of an expression, it is discovered that full conversion to CNF is too expensive, then the normalization and simplification process is altered. Instead of producing a fully normalized expression, the normalization method of the present invention saves only the useful prime implicates it can derive from the original input. These prime implicates are conjunctive conditions that can be exploited by the optimizer to deduce a more efficient data access strategy.
This process of selective normalization avoids the combinatorial explosion in terms that full normalization might entail. Moreover, it facilitates the discovery of exploitable prime implicates within these non-normalized expressions. The useful conjuncts (prime implicates) inferred from the normalization routines are unioned with the original statement. The result is a search condition that is semantically equivalent to the original expression, but that contains useful prime implicates that can be exploited by the optimizer. This enables better optimization of the original SQL query without the system overhead that would result from full normalization of the query.