Statistics are frequently accumulated to describe data in a database, to facilitate accesses made to the data. Statistics are typically used by an optimizer to create an access plan for accessing the data. For example, when a query seeks records meeting multiple selection criteria, or a requires a join of tables, the optimizer may determine that the results may be assembled most efficiently by applying the steps in an appropriate order, or in an appropriate manner, or by allocating memory in a particular way.
For example, consider a database table (otherwise known as a relation) including columns (otherwise known as attributes) identifying vehicle owners by name and city, and the make, model, model year and other information about their vehicles. A table of this kind is shown in FIG. 1 of the drawings, which shows just eight selected rows of a much larger relation. It should also be noted that the rows of the relation are somewhat or completely randomly organized on disk with regard to all attributes, which is typical.
Queries directed to a relational database come in a number of forms. For example, a selection query seeks rows whose attributes meet specified criteria, e.g., a query that seeks rows (otherwise known as tuples) identifying the following attribute values: surname “Smith”, city name “Atlanta”, and vehicle manufacturer “VW”; that is, seeking Atlantans named Smith who own VW vehicles. This query involves forming the intersection or “AND” of the results of three selection criteria, “Smith”, “Atlanta”, and “VW”.
Another form of query is a join, which uses the content of one table to identify attributes of interest in another. For example, consider the table of FIG. 2A, which identifies the high school attended by various individuals, and the table of FIG. 2B, which identifies the current market value of various vehicles. (In both tables, only entries of interest are shown.) A user of the relational database may wish to identify the cars driven by students of, say, the “Lincoln” high school. To perform such a query, the tables of FIGS. 1 and 2A must be joined. Specifically, those tuples in the table of FIG. 2A having a high school value of “Lincoln” must be retrieved, and then used to probe the table of FIG. 1 for tuples that have a matching surname and given name.
Alternatively, a user may wish to identify the value of cars driven by the individuals in the table of FIG. 1. For this operation, the automobile identifications from the table of FIG. 1 must be used to probe the table of FIG. 2B to associate those automobiles with values.
In each join operation, there are two tables, the “outer” table, from which tuples to be matched are retrieved, and the “inner” table, which is probed for matching tuples. A query may, however, involve multiple join operations: for example, to determine the values of vehicles owned by students of “Lincoln” high school, the FIG. 2A table joined with the FIG. 1 table, and the FIG. 1 table joined with the FIG. 2B table.
A join operation is likely to require an extensive number of operations, and the efficiency with which the operation is completed may vary dramatically based upon the manner in which joins are performed, e.g., the amount of memory allocated for the join operation, the order in which values from the outer table are probed in the inner table, and, in the case of a multiple join operation, the order in which the operations are performed.
Various methods have been used for estimating the complexity, i.e., the computational cost, of a join operation. One typical costing method is to estimate the number of rows (tuples) in the outer table that will be retrieved and used to probe rows in the inner table. This estimate indicates the number of probes that will need to be made into the inner table, and thus approximates the cost of the join. Typically, one or more indeces describing the outer table are available to generate these estimates. However, this method is inaccurate, because it does not account for the manner in which data is stored in the physical implementation of the relational database system.
Unfortunately, existing methods for costing join operations suffer from a number of inaccuracies, leading to an inefficient ordering of multiple operations in the processing of a query. There remains a need of improving costing estimates of a join.