Large amounts of data are now typically stored in computer database systems using database management system software to access the data. A relational database management system logically organizes the data into relations. A relation can be thought of as a table where each row is a tuple and each column is a component of the relation designating an attribute. Data can be extracted by querying the database for all tuples in a relation that meet certain criteria using a query statement having the general form:
SELECT &lt;Operation Specification &gt; PA1 FROM &lt;List of Tables Referenced &gt; PA1 WHERE &lt;Boolean Predicate Expression &gt; PA1 SELECT Name, Salary, Manager, Location PA1 FROM EMPLOYEE, DEPARTMENT PA1 WHERE Employee.DNO=Department.DNO and Salary between 10000 and 20000 and Floor=5
The following example query statement on an EMPLOYEE and DEPARTMENT database requests the names, salaries, managers of employees and the location of their departments for those employees who earn between 10000 and 20000 and work in departments on the fifth floor:
The system performs a join operation to connect data from two or more relations whereby tuples with matching attributes are joined to form a new tuple. A join order is the order in which join operations are performed. A join method is a technique which for performing the join of two relations.
A query language provides a set of commands, such as that shown above, for storing, retrieving and deleting data. The language for relational database management systems do not require the user to specify the access path to be used to retrieve the information or to specify the order in which joins are to be performed. Rather, the relational database management system is provided with a program called an optimizer which chooses both join order and an access path for each table in the query statement. Of the many possible choices, the optimizer chooses the access path which minimizes the "total access cost" for performing the entire statement. In the example above, the system may choose to retrieve an EMPLOYEE tuple and find all matching DEPARTMENT tuples or retrieve a DEPARTMENT tuple and find all matching EMPLOYEE tuples or to scan two order tables simultaneously. Additional information about query statements, query optimization, cost evaluation and enumeration may be obtained from "Access Path Selection in a Relational Database Management System", P. G. Selinger et al, Proceedings of ACM SIGMOD Conference, May, 1979 and "Measuring the Complexity of Join Enumeration in Query Optimization", Ono and G. M. Lohman, Proceedings of Sixteenth International Conference on Very Large Data Bases, September 1990, Brisbane, Australia, pp. 314-324, both of which are incorporated herein by reference.
Thus, query optimizers are sophisticated, highly developed systems used in relational database management systems to translate non-procedural queries into a procedural plan for execution. As mentioned, optimizers do so by developing many alternative plans, estimating the cost of execution of each plan and choosing the plan having the lowest estimated cost. Increasing the set of feasible plans that the optimizer evaluates improves, but does not guarantee, the chances that the optimizer will find a better plan while increasing the cost for it to optimize the query.
One of the major decisions an optimizer must make is with respect to the order or sequence in which to join the tables referenced in a query. In most systems, the join operation is implemented as a diadic (2-way) operator. Thus, the optimizer must generate plans that achieve N-way joins as a sequence of 2-way joins. When joining more than a few tables, the number of possible sequences is the dominant factor in the number of alternative plans: N| different sequences are possible for joining N tables. Even when dynamic programming is used, as most systems do, theoreticians have used the exponential worst case complexity to argue that heuristic search methods should be used. However, these search methods cannot guarantee optimality of their solution, as can dynamic programming.
For this reason, many existing optimizers use heuristics within dynamic programming to limit the number of join sequences evaluated. Some systems exclude plans which exclude so-called composite inners in which the inner table (the second operand of a join) is a result which must be materialized in memory or, if it is too big, on disk. While the heuristic avoids this materialization, it may exclude better plans for certain queries. Some systems restrict the class of predicates which qualify to those which are of certain form and some do not derive implied predicates, i.e. predicates which are not specified but which are implied by the query given by the user. Some systems either exclude or always defer Cartesian products as late in the join sequence as possible on the assumption that they would result in large intermediate tables. That is because, by definition, there is no join predicate between Cartesian products that restricts the results and, hence, every row in one table is joined with every row in the other table. This avoids the need for the query optimizer to search many join orderings that are unlikely to be very good since, in general, Cartesian products increase rather than decrease the size of intermediate results and, thus, cause more work to be done later. Again, such heuristics may exclude the optimal plan for certain queries that can benefit from Cartesian products. For instance, if the tables to be joined are small, and especially where they contain one tuple each, a Cartesian product is quite inexpensive and its result may have columns forming a composite key for another, much larger table to be accessed later, thus making the Cartesian product more advantageous. Obviously, these exclusions and limitations are beneficial in most cases.
Database designers often encode wide columns in a large table and store the encodings for each such column in a separate table. For example, the large table might have one column for storing a two-letter abbreviation for a state and another table for storing all of the possible abbreviations of the states and their respected unabbreviated forms; another column for storing country abbreviations and another encoding table for storing all possible country abbreviations and their unabbreviated forms; and so on. There are no join predicates relating the state encoding table and the country encoding table. Since there are no join predicates relating any of the encoding tables to each other, joining the encoding tables to each other is not considered by most join enumerators, since to the general-purpose heuristic excludes or defers Cartesian products as discussed above.
Thus, when there are multi-column indexes on the encoded columns and a query has limiting predicates on the encoding tables, the general-purpose heuristic is actually counter-productive, because it will not consider the best execution strategy, which is to first join the encoding tables as Cartesian products, and then use the rows so formed to access the data table using the multi-column index and the join predicates as start/stop key conditions.