A query is a standard way to retrieve one or more desired subsets of data from a data store. As shown by FIG. 13A, a query 1305 can originate from any computing environment 1300 where there is a framework to interface with data store 1320. In this regard, computing environment 1300 may include any number of computing devices and architectures, e.g., client/server, whether networked or standalone, which may implicate the execution of query 1305 against data store 1320. Historically, the execution of query 1305 has evolved to include an extra step prior to execution of the query: optimization of the query execution plan, i.e., optimizing the way the query is executed.
After receipt of a query 1305 from a computing environment 1300, there are a variety of optimization techniques that can be applied by a query optimizer 1310 to query 1305 to form an efficient query execution plan 1315. For instance, some existing techniques try to form a query execution plan 1315 that implicates a minimal number of page accesses, or reads, from the data store 1320 during execution of the query 1305. Other optimization techniques try to leverage pre-existing auxiliary data store structures 1330, such as indexes, histograms, re-ordered tables, alphabetized tables, etc., which are generated in advance and can be used to substantially improve the efficiency of query execution especially certain types of queries.
For a simple example showing the efficacy of auxiliary structures 1330, if a query 1305 requests all customers from a data store 1320 with a name starting with the letter “K,” one can appreciate that having access to a histogram including a “K” bucket, which shows that there are no customers having a name starting with the letter “K,” would help to streamline execution of the query. Clearly, consultation of the histogram first would be preferable over an exhaustive search of every row of the entire customer table in data store 1320 only to find that none of the rows include such entries. With the histogram, such information is obtainable directly.
In addition, the benefits of optimizing queries by using auxiliary data structures, such as histograms, tend to increase with the complexity of query 1305 itself and the corresponding amount of heavy manipulation of data that is involved. For instance, in the above example, if there are 50,000 customer rows, having a histogram that buckets according to letters will save a lot of time relative to an exhaustive search of all 50,000 customer rows when executing the query. This is because costs associated with memory access time and data computation time can become significant for more exhaustive, non-optimized approaches as the complexity of the query increases and the number of rows involved increases, making optimization comparatively worth the effort.
Accordingly, query optimizer 1310 can benefit significantly from the existence of auxiliary data store structures 1330, such as histograms, which can be used to forecast information about the data store 1320 quickly and accurately. However, due to time constraints and the like, the cost of creating an exact auxiliary data structure representing the entire population of data store 1320 may be too high. Thus, sometimes it is desirable to estimate the auxiliary data store structures 1330 based on sample population data, e.g., to estimate the hypothetical histogram representing the whole population based on a relatively small sample (e.g., 1-10%) of the population.
This is done, for instance, in the context of relational database systems that have query optimizers that attempt to estimate how many rows will be processed by a given query. In this regard, the number of rows that will be processed for a given query execution plan corresponds to how quickly a query can be executed, and thus this information can be used to compare query execution plans quickly. Today, these “number of row” estimations are made using statistical information on the data in the database, which is typically stored as histogram information that captures the distribution of values for particular objects, e.g., columns, such as “salary” or “order size”.
While there are many ways to represent a histogram, some with more information than others, for illustrative purposes, a hypothetical histogram 1360 for “order size” is provided as an example for consideration in FIG. 13B. As shown, histogram 1360 includes 7 uniformly spread buckets B1, B2, B3, B4, B5, B6 and B7, which represent buckets of orders of size 1-10 items, 11-20 items, 21-30 items, 31-40 items, 41-50 items, 51-60 and 61-70 items, respectively. The order size of 60 total orders is shown as distributed across the six buckets. The x-axis specifies ranges for the size of orders and the y-axis is the number of orders in the corresponding size range. For instance, histogram 1360 reflects that there are 24 orders that include anywhere from 21 to 30 items.
While creating a histogram for 60 total orders is not too time consuming, the cost of creating an exact histogram over an enormous amount of potentially changing data, e.g., 1,000,000+ rows, in an overall database population can be high, if not prohibitive. In such situations, histograms of the overall population data are generated based on estimates from representative samples in the database. For instance, as shown in FIG. 13C, a sample of Column C of a Table T is created by taking a random subset of pages from pages P1, P2, P3, P4, . . . , PN of Table T and examining the value of C for every row in each sampled page. Because the representative sample is taken from a random set of pages and not a random set of rows, it is called a page-level sample. In general, page-level samples provide a less representative picture of the data set than the same number of row-level samples, but page-level samples are much faster to produce because far fewer pages are loaded from disk.
Thus, histograms for distributions of columns of a table can be generated based on page-level samples and stored on disk as histogram data structures. For example, if a table has 1,000,000 rows and a sample has 10,000 rows, then the histograms for the columns of the table will each represent 10,000 rows. Then, when a database system estimates the number of rows from a table T that will qualify for a condition on a particular column C, e.g., as part of execution of a query, the system loads the histogram based on the page-level samples taken for C.
Then, according to a naïve algorithm, current systems scale the histogram produced based on the page-level samples for the column C to a representation that estimates the actual histogram for the entire set of data in C across all pages, i.e., current systems scale to an estimate of the actual histogram of all 1,000,000 rows of the table. If the representation is an accurate estimate of the actual histogram of the entire set of data for C, the estimate can then be used effectively to optimize execution of a query based on column C. However, today's approach to this scaling often results in a bad estimate, which causes queries to execute according to a sub-optimal query execution plan.
In addition to the number of samples in each bucket, histograms sometimes also represent the number of distinct values in each bucket. For instance, if a bucket for a histogram that ranges from 10 to 100 includes the values {25, 90, 25, 25, 3, 90}, then the bucket includes six values, but only three distinct values. In such cases, these numbers representing the distinct values per sample bucket are also scaled to estimates of the distinct values per bucket of the histogram of the overall population. Today, this scaling commonly consists of using the following formula to calculate the distinct value estimator D:
  d  =      D    ⁡          (              1        -                              (                          1              -                              1                D                                      )                    n                    )      where D represents the estimated number of distinct values in the entire population, d represents the number of distinct values in the sample population and n represents the size of the sample population.
Currently, given d and n, there is no closed form for calculating D. Accordingly, an iterative approach can be utilized to approximate a solution to the above formula. However, predicating scaling of page-level sample histograms to an overall population solely upon the above formula is a somewhat naïve approach. The naivety of that scaling results in inaccuracies in these estimations, particularly since the above scalar is applied to each step (i.e., each bin of the sample histogram). Accordingly, if the scalar based on the above formula for D is skewed, that skew is propagated to each of the steps of the histogram estimating the population. Better ways of scaling a histogram based on samples of a population to an estimate of the actual histogram of the entire population are thus desired.
These and other deficiencies in the state of the art of histogram scaling will become apparent upon description of the various exemplary non-limiting embodiments of the invention set forth in more detail below.