Structured Query Language (SQL) is a popular computer language employed to create, modify, retrieve and manipulate data from relational database management systems. In general, the SQL language has evolved beyond its original scope to support object-relational database management systems, for example. Another type of query language includes language integrated query (LINQ) which applies to a set of operating system framework extensions that encompass language-integrated query, set, and transform operations. For example, these framework extensions can extend C# and Visual Basic with native language syntax for queries and provide class libraries to take advantage of these capabilities. These and other query languages such as SQL fall under the broad umbrella of relational database systems.
In relational database systems, an optimizer is a module that builds a plan of how a query will be executed. Related to this concept is query cardinality estimation that plays an important role in the query optimization process. Inaccurate cardinality estimation can lead to suboptimal plans being selected, where execution times between optimal and sub-optimal plans may differ by orders of magnitude. The problem of query cardinality estimation is important for database systems, since query optimizers need accurate estimates of the sizes of intermediate result sets with respect to hypothetical query plans for the query. Query optimizers use size estimation methods to select a better query execution plan among numerous candidates. A query plan is, typically, a tree, where the estimation methods are usually called for each node of the tree. In order to estimate the outgoing data flow of each node, it is necessary to estimate the sizes of its sub-tree's data flow, which is the incoming flow for the node. Accuracy of estimation is important since compound error in estimation grows exponentially with respect to the number of join operations in a query, for example.
Another application for query cardinality estimation is with respect to sophisticated user interfaces. They provide an estimation of the query result size to the user before the actual query execution, where such information can help a system or administrator to allocate the right amount of memory for an application or to detect misconceptions about the query or the data. Such interfaces can also depend on the accuracy of the size estimation.
There are many effective statistical solutions for estimation problems. However, virtually none of them can be applied to the query size estimation problem. The reasons for this follow from the typical relational database architecture. For instance, at the optimizer stage, various query execution plans are considered. The optimizer generally selects one—hopefully a near optimal plan, for query execution. The optimizer is also a part of a static compilation, where the underlying data is not accessed at this stage. The goal of the optimizer is to select a suitable plan for execution, where the chosen plan may be not the absolute best, since the problem of finding the best plan may not be worth the time it takes, but it should be ultimately an acceptable plan. Expensive plans (to execute) should be avoided by the optimizer however.
The cost of the plan generally depends on the intermediate cardinalities in the plan nodes. This is why knowledge of intermediate query cardinalities in the query tree is employed for pruning among plans. The specifics of the query cardinality estimation problem in relational database systems are that the information should be available to the query optimizer at the query compilation stage. Another requirement is that query compilation should take a negligible percentage of the query processing time, and, in particular, the optimization time should be low. The data set is typically large and does not fit in main memory, and most of the data resides on the disk. Scanning the data requires multiple disk accesses, which can involve high I/O costs. Therefore, scanning the data is generally too expensive, and is usually avoided in the compilation stage. (Sampling-based techniques are an exception as they do access the data). Information about the underlying data distributions should be provided to the optimizer without need to access the data itself. In commercial database systems, histograms and other statistics are stored in catalog tables, created and maintained by the system for its own use.
During the query compilation stage, metadata is loaded into designated internal data structures. The statistics stored for optimization should be as compact as possible in order to use little of the main memory. The statistics are updated periodically and incrementally, usually off-line, using as few re-scans of data as possible. This makes usage of many existing statistical methods, applicable in other areas, inappropriate for the query cardinality estimation problem.
Ascending descending database column values are common in many applications (e.g., increasing order numbers, customer ids, date/time set to current date/time, and so forth). However, these type ascending (or descending) values pose a hard problem to optimizer statistics and cardinality estimation. Statistics generated on these columns are soon out-of-date since all new values are by definition outside of the known range of the statistics (e.g., histogram). In addition, it is very common to query for new values, where typical queries inquire for values which are often outside of the known range. This leads to poor cardinality estimates which can lead to slow execution plans.