One of the main uses of computers is storing and retrieving large amounts of data efficiently. The computer systems used for this purpose are known as database systems, and the software that manages them is known as a database management system (DBMS). The DBMS facilitates the efficient management of data by (i) allowing multiple users concurrent access to a single database, (ii) restricting access to data to authorized users only, and (iii) providing recovery from system failures without loss of data integrity. The DBMS usually provides an easy-to-use, high-level query/data manipulation language, such as the Structured Query Language (SQL), as the primary interface to the underlying data.
SQL, the most commonly used language in modern-day DBMSs, is a declarative language. It thus shields users from the often complex procedural details of accessing and manipulating data. Statements or commands expressed in SQL are generally issued by the user directly, using a command-line interface. The advantage of declarative SQL is that a statement need only specify what answer is expected, not how that answer should be computed. The actual sequence of operations by which an SQL command is evaluated is known as the procedural Query Evaluation Plan (QEP). The procedural QEP for a given non-procedural SQL statement is generated by the DBMS and executed to produce the query result. Typically, for a given query, there are many alternative procedural QEPs that all compute the required result. Each QEP, however, has its own cost in terms of resource use and response time. This cost is usually expressed in terms of I/O operations, such as the number of disk reads and writes, and the amount of CPU work required to execute the QEP. The problem of devising the best procedural QEP for a query, so as to minimize this cost, is termed query optimization.
In brief, given a declarative SQL query, the DBMS's Query Optimizer module determines the best possible procedural QEP to answer it. To do this, the query optimizer uses a model of the underlying system to select an efficient plan, as quickly as possible, from a large set of candidate plans. The efficiency of a QEP is measured in terms of resource utilization and response time.
The cost incurred in evaluating a QEP is proportional to the number of operations, including disk reads, disk writes and CPU work, required to compute the final answer from the base relations. The size of the final result of a query, as well as the sizes of the base relations, will be the same regardless of which QEP is chosen by the query optimizer from among the many possible candidates. Hence the cost of a QEP depends on the sizes of the intermediate relations generated during the computation of the query; this is the single most important factor responsible for the difference in the costs of the various QEPs of a given query. By choosing a QEP that has smaller intermediate relations than other QEPs, it is therefore possible to minimize the cost involved in computing the final result. Although this is easy to explain, the large number of possible alternative QEPs makes computing the sizes of the intermediate relations accurately for each possible QEP a virtually impossible task. Hence, one approach is to estimate the sizes of the intermediate relations approximately.
A complete glossary can be found in many undergraduate database textbooks, including [Elmasri and Navathe, 1994], pp. 137-177.
2.1 Query Optimization: An Overview
Query optimization for relational database systems is a combinatorial optimization problem, which makes exhaustive search unacceptable as the number of relations in the query increases. The query optimization process is generally divided into three distinct phases, namely query decomposition, query optimization and query execution as shown in FIG. 1.
In the query decomposition module, the declarative SQL query is first scanned, parsed and validated. The scanner sub-module identifies the language components in the text of the query, while the parser sub-module checks the query syntax. The validator checks that all attribute and relation names are valid and semantically meaningful. The query is then translated into an internal format expressed in relational algebra, in the form of a query Operator Tree. Using this operator tree as its input, the query optimizer module searches for a procedural plan with an optimal ordering of the algebraic operators. This optimal procedural plan is represented by an annotated query tree. Such trees encode procedural choices such as the order in which operators are evaluated and the method for computing each operator. Each tree node represents one or several relational operators. Annotations on a node represent the details of how it is to be executed. For example, a join node may be annotated as being executed by a hash-join, and a base relation may be annotated as being accessed by an index-scan. The choice of execution algorithms is based on the database and system characteristics, for example, the sizes of the relations, the available memory, the type of indexes on a given attribute, etc.
This fully annotated operator tree with the optimal QEP is then passed on to the query execution engine where all the low level database operations are carried out and the answer to the query is computed. An example annotated query tree is shown in FIG. 2.
2.1.1 An Example
FIG. 3 shows an example relational database from a Property Tax Assessor's office.
This database consists of three relations, namely,
    TaxPayer(SIN, CityCode, Name, DOB, Balance),
    Property(Type, OwnerSIN, Tax) and
    City(CityCode, CityName, Population).

Assume that these relations have the following cardinalities:

    |TaxPayer| = 12×10⁶ tuples,
    |Property| = 3×10⁶ tuples, and
    |City| = 4000 tuples.

Given these pieces of information, it is possible to answer the following declarative query:
Query: Find the name, city and tax information of all property owners.
To illustrate the large difference in execution cost that exists among various QEPs, two QEPs are compared for this query using a simple cost model based on the number of I/O operations. Since disk I/O is usually the dominant factor¹ in the query response time, it is assumed that the cost of evaluating a QEP is given by the total number of tuples read and written to generate the final answer. In addition, the cost of reading the two input relations and then writing the resulting relation back to disk is also counted.
¹Actually, a realistic cost model should include many other factors, such as the various join algorithms, the availability of indexes and other auxiliary access methods, the effects of caching and available memory, data skew, etc. 
From the set of many alternatives, the two QEPs shown in FIG. 4 are analyzed below. These QEPs compute the answer to the query as follows:
Query Evaluation Plan 1:
In the first QEP, the relations TaxPayer and City are joined to determine, for each person, the information about the city where he or she lives. This join generates an intermediate relation, which is then joined with the relation Property to compute the final result.
The first join requires the relations TaxPayer and City to be read. This results in 12×10⁶+4000 reads. Assuming the result of this join is written back to the disk, it requires 12×10⁶ writes. Note that the size of the intermediate result is 12×10⁶ tuples. The second join requires the above intermediate relation and the relation Property to be read. This involves an additional 12×10⁶+3×10⁶=15×10⁶ reads. Since the final result contains 3×10⁶ tuples, it requires that many write operations. Hence the total number of operations required for the first QEP is N1=(12×10⁶+4000)+(12×10⁶)+(15×10⁶)+(3×10⁶)=42,004,000.
Query Evaluation Plan 2:
By joining the relations TaxPayer and Property, an intermediate relation containing all the property owners is obtained. This intermediate relation is then joined with the relation City to determine the city information for each tax payer.
In the second QEP, the first join requires 12×10⁶+3×10⁶ reads and 3×10⁶ writes. The second join requires another 3×10⁶+4000 reads and 3×10⁶ writes. Hence the total number of operations required for the second QEP is N2=(12×10⁶+3×10⁶)+(3×10⁶)+(3×10⁶+4000)+(3×10⁶)=24,004,000, which is almost half that of the first QEP.
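The tuple-count arithmetic of the two plans can be sketched as follows. This is only an illustration of the simple cost model above; the helper name join_cost is ours, not part of any DBMS:

```python
# Minimal sketch of the I/O cost model: every tuple read or written counts
# as one operation. Relation sizes are taken from the example database.
TAXPAYER = 12 * 10**6
PROPERTY = 3 * 10**6
CITY = 4000

def join_cost(left_size, right_size, result_size):
    """Cost of one join: read both inputs, write the result."""
    return left_size + right_size + result_size

# QEP 1: (TaxPayer join City) join Property; intermediate result is 12M tuples.
n1 = join_cost(TAXPAYER, CITY, TAXPAYER) + join_cost(TAXPAYER, PROPERTY, PROPERTY)

# QEP 2: (TaxPayer join Property) join City; intermediate result is 3M tuples.
n2 = join_cost(TAXPAYER, PROPERTY, PROPERTY) + join_cost(PROPERTY, CITY, PROPERTY)

print(n1)  # 42004000
print(n2)  # 24004000
```

The only difference between the two plans in this model is the size of the intermediate relation, which is exactly the point made in the text.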
This short example, with two different QEPs and a simple cost model, illustrates that the cost of one plan can be half that of another. Most real-world queries are complex, involving many relations, and with a more sophisticated, realistic cost model the generated QEPs can have substantially different costs. The task of a query optimizer is to judiciously analyze the possible QEPs and choose the one with minimal cost.
2.2 Goals of Query Optimization
The main goals of a query optimizer are to minimize both the response time and the resource consumption. These optimization goals often conflict. For example, a QEP that computes the result of a query quickly but requires all available memory and CPU resources is probably not desirable, because it would virtually deprive other users of access to the database.
Finding a good plan for a query is itself resource intensive, but can reduce the actual evaluation cost considerably. For example, let To be the time taken to find an optimal QEP, and let Tc be the time taken to execute the optimal QEP and obtain the results. If the average time taken to execute a random QEP is Tavg, then ideally To+Tc<<Tavg.
As queries grow more complex, in terms of the number of relations involved or the number of alternative algorithms for computing an operator, the number of potential QEPs that need to be considered explodes. The number of alternatives quickly rises into the millions, while the difference between the cheapest and the most expensive QEP can easily span several orders of magnitude. Even for simple string queries with n relations, there are (2(n−1))!/(n−1)! different join orders. For joins involving a small number of relations, this number is acceptable; for example, with n=5, the number is 1680. However, as n increases, this number rises quickly. With n=10, the number of different join orders is greater than 17.6 billion!
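The join-order count can be checked directly from the formula; the function name below is illustrative:

```python
from math import factorial

def join_orders(n):
    """Number of different join orders for n relations: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)

print(join_orders(5))   # 1680
print(join_orders(10))  # 17643225600, about 17.6 billion
```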
As a rule of thumb, for small queries with no more than four or five relations, all QEPs can be generated within a few seconds. In this case the optimization time To is often only a fraction of the response-time improvement gained.
Apart from the fact that the execution space of QEPs is usually very large, the computation of a single join operation is itself one of the most time-consuming tasks. To compute the join of N relations, (N−1) dyadic join operations have to be performed. Since the sizes of the relations being joined determine the cost of a single join operation, the order in which all N relations are joined is preferably chosen so as to minimize the overall cost of computing the join of the N relations.
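As a sketch of this point, a left-deep plan can be viewed as a fold over the N relations, performing N−1 dyadic joins, each costed with the simple tuples-read-plus-written model used earlier. The intermediate-result sizes here are supplied as assumptions rather than computed:

```python
# A left-deep plan as a fold: N relations, N-1 dyadic joins.
# result_sizes[i] is the assumed size of the i-th intermediate result.
def left_deep_cost(sizes, result_sizes):
    """Sum the simple tuples-read-plus-written cost over the N-1 joins."""
    cost, current = 0, sizes[0]
    for nxt, out in zip(sizes[1:], result_sizes):
        cost += current + nxt + out   # read both inputs, write the result
        current = out                 # the result feeds the next join
    return cost

# The two join orders from the example of Section 2.1.1:
print(left_deep_cost([12_000_000, 4_000, 3_000_000],
                     [12_000_000, 3_000_000]))  # 42004000  (QEP 1)
print(left_deep_cost([12_000_000, 3_000_000, 4_000],
                     [3_000_000, 3_000_000]))   # 24004000  (QEP 2)
```

The same relations joined in a different order give an almost two-fold cost difference, solely because of the intermediate-result sizes.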
Unfortunately, finding the optimal join order has been proven to be an NP-hard problem, while at the same time it is one area where considerable cost benefits can be derived. Exact, optimal solutions have been obtained only for specific join-ordering cases [Boral and Zaniolo, 1986; Ibaraki and Kameda, 1984]. In addition, most of the optimization techniques proposed in the literature cannot be applied to large queries. For example, two popular early database systems, System R and Ingres, use algorithms that essentially perform an exhaustive search over the entire QEP domain. This is probably adequate in their implementations because they do not allow large queries with more than 15 joins.
The problem of optimizing a query can be formally stated as follows:
Problem 1 Given a relational query Q, an execution space E and a cost function C(QEP) over elements of E, find the query evaluation plan QEP∈E such that,
(1) QEP computes Q
(2) There does not exist QEP′∈E such that QEP′ also computes Q and C(QEP′)<C(QEP).
where the execution space E is the space of QEPs considered by the optimizer.
Finding an optimal QEP is a computationally intractable problem, especially when the query involves more than, say, ten relations. One obvious approach is to decrease the size of the execution space E; although this reduces the time and/or space requirements, it has the drawback of reducing the chances of finding good plans. Another approach consists of estimating the costs of QEPs using the standard statistical information available in the database catalogue. In fact, most commercial DBMSs use some form of statistics on the underlying data, together with information about the available resources, in order to estimate the cost of a query plan approximately.

TABLE 1: Statistical information found in a typical DBMS catalogue

    Notation   Explanation
    N_R        Number of tuples in relation R
    b_R        Number of disk blocks storing relation R
    s_R        Size of a tuple of relation R in bytes
    bf_R       Blocking factor of relation R, i.e., the number of tuples of relation R that fit into one disk block
    δ(A, R)    Number of distinct values of attribute A in relation R
    φ(A, R)    Selection cardinality of attribute A in relation R. Given a relation R and an attribute A of the relation, φ(A, R) is the average number of records that satisfy an equality condition on attribute A

Since these statistics are used to provide approximate estimates of query costs, the validity of the optimizer's decisions may be affected. On the other hand, since optimizers use these costs only for comparison purposes, approximate estimates of the costs of the QEPs are usually sufficient, as long as these estimates are reasonably close to the actual cost of executing the QEP. For example, if the actual execution times of two plans P1 and P2 are 10 and 20 units respectively, and the estimated times are 5 and 15, the optimizer will still correctly pick P1 as the less expensive plan. On the other hand, if the estimated times are 5 and 4, the optimizer will pick the suboptimal plan P2. This is an extremely important issue, because the costs of plans often differ considerably, and choosing a suboptimal plan often results in severe performance degradation, defeating the very purpose of query optimization.
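As an illustration only, the catalogue statistics of Table 1 might be held in a structure like the following, together with the textbook uniformity-based estimate φ(A, R) ≈ N_R/δ(A, R). All names and numbers here are hypothetical and not drawn from any particular DBMS:

```python
from dataclasses import dataclass

@dataclass
class RelationStats:
    """Catalogue statistics of Table 1 (field names are illustrative)."""
    n_tuples: int     # N_R: number of tuples in relation R
    n_blocks: int     # b_R: number of disk blocks storing R
    tuple_bytes: int  # s_R: size of one tuple of R in bytes
    distinct: dict    # delta(A, R): attribute name -> distinct-value count

    @property
    def blocking_factor(self):
        """bf_R: number of tuples of R that fit into one disk block."""
        return self.n_tuples // self.n_blocks if self.n_blocks else 0

    def selection_cardinality(self, attr):
        """phi(A, R): average number of tuples matching an equality
        predicate on attr, assuming its values are uniformly distributed."""
        return self.n_tuples / self.distinct[attr]

# Hypothetical statistics for the TaxPayer relation of the earlier example.
taxpayer = RelationStats(n_tuples=12_000_000, n_blocks=600_000,
                         tuple_bytes=100, distinct={"CityCode": 4000})
print(taxpayer.blocking_factor)                     # 20
print(taxpayer.selection_cardinality("CityCode"))   # 3000.0
```

The uniformity assumption behind φ(A, R) is exactly what makes such estimates approximate when the data is skewed.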
Query optimizers make use of the statistical information stored in the DBMS catalogue to estimate the cost of a QEP. Table 1 lists some of the commonly used statistical information utilized in most of the current commercial DBMSs.
In addition to these pieces of information, most DBMSs also maintain information about the indices² in their catalogues. Such information includes values such as the average fan-out, the number of levels in an index, etc.
²Indices refer to the actual data structures used to store and retrieve the physical records of base and intermediate relations. 
Since the actual evaluation of all possible QEPs in the space E is very difficult, the approach taken here, like that of other investigators in this area, is to estimate the costs of QEPs approximately.
Consider the cost 𝒞 of joining two relations R1 and R2 on a common attribute X, using the nested-loop join algorithm [Selinger et al., 1979]. If bR1 and bR2 are the numbers of disk blocks required for relations R1 and R2 respectively, then the cost formula, in terms of the number of disk accesses, is

    𝒞 = bR1 + (bR1 × bR2) + (js × |R1| × |R2|) / bfR1R2

where js is the join selectivity³ of the join and bfR1R2 is the blocking factor of the resulting relation.
³The selectivity of an operator is the ratio of its result size to the product of its input sizes. For the join operator, the join selectivity is js = |R1 ⋈ R2| / (|R1| × |R2|). 
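The cost formula can be transcribed directly into code. The numeric values below are illustrative assumptions, not taken from the example database:

```python
# Direct transcription of the nested-loop join cost formula:
#   C = b_R1 + b_R1*b_R2 + (js * |R1| * |R2|) / bf_R1R2
# i.e. read R1 once, read R2 once per block of R1, and write the result.
def nested_loop_join_cost(b_r1, b_r2, r1_tuples, r2_tuples, js, bf_out):
    return b_r1 + b_r1 * b_r2 + (js * r1_tuples * r2_tuples) / bf_out

# Hypothetical inputs: 1000-block R1, 500-block R2, join selectivity 1e-5,
# and 50 result tuples per output block.
cost = nested_loop_join_cost(b_r1=1_000, b_r2=500,
                             r1_tuples=100_000, r2_tuples=50_000,
                             js=1e-5, bf_out=50)
print(cost)  # ~502000 disk accesses
```

Note that the b_r1 × b_r2 term dominates, which is why the smaller relation is usually chosen as the outer relation in practice.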
From the above cost formula, it is apparent that the following quantities are crucial for the query optimizer in order to estimate the cost of such an operation:
(a) Query Result Size: The cost formula for the above join algorithm depends on the sizes of the input relations, and this is true for nearly all relational operators. In a query with several operators, the input to an operator may itself be the result of one or more other operators. This shows the importance of developing techniques to estimate the result sizes of such operators.
(b) Query Result Distributions: The result size of a join, or of any other relational operator, depends mainly on the data distribution(s) of its input relation(s). Again, the input to an operator may itself be a relation that results from invoking one or more other operators. This means that techniques to estimate the distribution of a query result are also desirable.
(c) Disk Access Costs: This is the cost of searching for, reading, and writing data blocks that reside on secondary storage, mainly on disk. The cost of searching for records in a file depends on the type of access structures on that file, such as its ordering, hashing, and primary or secondary indexes. In addition, factors such as whether the file blocks are allocated contiguously on the same disk cylinder or scattered on the disk affect the access cost.
The design of a good query optimizer involves developing techniques to efficiently and accurately estimate the various query result sizes. The objective of this invention is to provide new techniques that would provide more accurate query result size estimation results than the state-of-the-art methods used in the current database systems. There are other topics of concern in query optimization such as implementation algorithms for various relational operations, physical access plans for retrieving records etc., but they are not addressed herein.
In order to solve the above estimation problem completely, it is useful to develop result size estimation techniques for each of the relational operators. Most current commercial DBMSs employ methods based on one or more of the following techniques to estimate the above quantities:
    1. Sampling-based techniques
    2. Parametric techniques
    3. Probabilistic counting techniques and
    4. Non-parametric or histogram-based techniques.
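As a minimal sketch of the fourth technique, the following equi-width histogram estimates the result size of a range selection under the usual assumption that values are uniformly distributed within each bucket. The class and data are illustrative:

```python
# Equi-width histogram sketch: the attribute domain [lo, hi) is split into
# n_buckets equal-width buckets, each storing only a tuple count.
class EquiWidthHistogram:
    def __init__(self, values, n_buckets, lo, hi):
        self.lo = lo
        self.width = (hi - lo) / n_buckets
        self.counts = [0] * n_buckets
        for v in values:
            idx = min(int((v - lo) / self.width), n_buckets - 1)
            self.counts[idx] += 1

    def estimate_le(self, x):
        """Estimated number of values <= x, assuming uniformity inside
        each bucket (partial buckets contribute a linear fraction)."""
        total = 0.0
        for i, c in enumerate(self.counts):
            b_lo = self.lo + i * self.width
            b_hi = b_lo + self.width
            if x >= b_hi:
                total += c
            elif x > b_lo:
                total += c * (x - b_lo) / self.width
        return total

values = list(range(100))                       # uniform data 0..99
h = EquiWidthHistogram(values, n_buckets=10, lo=0, hi=100)
print(h.estimate_le(25))                        # 25.0 (true count is 26)
```

With skewed data the within-bucket uniformity assumption breaks down, which is one source of the estimation errors discussed in the text.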
These techniques are briefly reviewed below. The accuracy of these estimates is often of critical importance, since the selection of an optimal QEP from an exponential number of alternatives depends solely on them. Despite the widespread use of the above techniques in many commercial database systems, their accuracy has not been studied extensively.