Modern relational database systems continue to process increasingly complex queries. Several reasons contribute to this trend. Queries submitted to relational database systems are increasingly being generated by applications. These queries tend to be more redundant and complex than human-typed queries. Additionally, a recent focus on extending relational database management systems (DBMS) with advanced functionality, like data warehousing and data mining, places a demand for fast execution of queries with complicated predicates.
A straightforward way to evaluate a complex expression is to use a sequential scan on the table and evaluate the condition as a filter. When the selectivity of some of the predicates is small, a cheaper alternative is to use one or more indexes and combine their record identifiers (RIDs) using index-intersection and union (IIU) operators before fetching the data pages from disk. Thus, for example a condition of the form X AND Y (written as XY) can be evaluated by generating record identifier lists for each of X and Y and then intersecting the two lists. Similarly, a condition X OR Y (written as X+Y) can be evaluated by using the union of the two record identifier lists for X and Y. In general, any multi-predicate selection condition can be evaluated by a series of index intersections and unions (IIU), which eventually leave a list of record identifiers to be fetched from the data table.
There are many possible intersection and union (IIU) plans for a given selection query. For example, a condition C1 may have the form (A+B)(A+C). The evaluation of C1 requires an index access for each of the literals, two index-union operations (for generating record identifiers for A+B and A+C), followed by one intersection operation for merging the record identifiers corresponding to A+B and A+C. An equivalent condition C2 has the form A+BC. The corresponding index-based plan for condition C2 requires one less index access (only access A once), and one less index-union operation.
The preceding paragraph is a straightforward example of how a query plan can be improved if it is slightly restructured. However, the general problem of identifying better plans for an arbitrarily complex boolean expression is not easy. The following example shows that disjunctive normal form (DNF) does not guarantee the best plan. Consider another condition C3: AB+AC in disjunctive normal form. An equivalent conjunctive normal form (CNF) is C4: A(B+C). It is not immediately obvious as to which of C3 and C4 would be better. C4 may be better in several cases since it saves on an index access for A and performs one less index intersection. However, C3 could be better than C4 if the intermediate record identification (RID) size of B+C is considerably larger than that of A. This example establishes that neither of the standard normal forms, CNF and DNF, provide an optimal solution for all queries. The relative merits of different index intersection and union (IIU) plans depend on the selectivities of the different predicates, and the costs of accessing different indexes.
Existing query optimizers approach this problem in three ways. The first approach is to use a sequential scan followed by a filtered evaluation of the condition, particularly when the selection condition is very large and contains several disjuncts and conjuncts. The second approach is to generate an index intersection and union plan directly from the form in which the condition is represented in the query without searching the space of index intersection and union plans for the best plan. The third approach is to rewrite the query in conjunctive normal form or in the disjunct form, neither of which is optimal in all cases.
Minimizing boolean expressions through factorization is a problem in several areas of computer science. In compiler optimization it is useful for generating optimal programs for evaluating boolean expressions, in object-oriented databases for ad hoc evaluation of queries with boolean predicates, in relational database systems for factoring join relations on disjuncts and in VLSI circuit design for reducing floor area of circuits. Of these, the boolean minimization problem has been studied most extensively in the VLSI literature.
Factoring boolean formiae to minimize the total number of literals is an important problem in VLSI design because the area taken up by a circuit for a boolean formula is roughly proportional to the number of literals in the formula. This problem is NP-hard and computing the optimal solution is computationally infeasible even for relatively small functions. However, the practical relevance of the logic minimization problem in VLSI has led to the design of several algorithms with various levels of complexity. These approaches can be grouped into three categories—algebraic, boolean and graph-theoretic. None of these approaches offer any guarantees about the quality of the factorization. Of the three classes of factoring methods, algebraic factoring is the most popular since it provides very good results while being extremely fast. Query optimization is different from the VLSI logic minimization, because every literal a boolean formula that represents a query is associated with a different fixed cost that depends on the literal's selectivity and index. Also, unlike the problem of VLSI logic minimization, the size of intermediate results is also important for query optimization since it affects index-intersection and index-union cost.
Early relational database management systems rewrote query expressions as a conjunctive normal form expressions and exploited only one index per expression. Others rewrote the expressions as a disjunctive normal form expression and union-ed each disjunct evaluated independently. These simple approaches were augmented in later systems to exploit multiple indexes by evaluating them as arbitrary index intersection and union plans. These approaches attempted to choose the best subset of eligible indexes and sequence the record identifier mergings for best performance. However, these approaches operate on the condition as directly expressed in the query and do not explore the space of alternative rewrites.
Techniques for optimization of user-defined predicates with varying costs of evaluation and selectivity have been developed. These techniques attempt to reduce the number of invocations of the user-defined predicates while leveraging its selectivity. These techniques are for non-indexed access and CPU cost minimization. These techniques either concentrate on ANDs alone, are not applicable for indexed access or do not attempt to factorize common predicates across them. In these techniques, the focus is on reducing the number of invocations of the expensive functions.
There is a need for a method that identifies an optimized index intersection and union-based plan for a query, by searching the space of several possible rewrites of the boolean expression. The expression is rewritten using exact and/or relaxed factors to improve the cost of indexed access to a relation.