Since their invention, computers have been used to store extensive amounts of data in large databases. A database is defined as a collection of items, organized according to a data model and accessed via queries. For example, consider a computer database, also called a data warehouse, which comprises vast historical data, such as all sales transactions over the history of a large department store. For the purpose of decision making, such as determining whether or not to continue selling a particular item, users are often interested in analyzing the data by identifying trends rather than examining individual records in isolation. This process usually involves posing complex aggregate queries against large amounts of data in a database. In this case, it is often desirable for a user to access small statistical summaries of the data in order to answer aggregate queries approximately, as this is significantly more efficient than accessing the entire historical data.
A fundamental problem arising in many areas of database manipulation is the efficient and accurate approximation of large data distributions using a limited amount of memory space. For example, traditionally, histograms have been used to approximate database contents for selectivity estimation in query optimizers.
Selectivity estimation is the problem of estimating the result size (or selectivity) of a query on a database. Such estimates are important in several key components of Database Management Systems (DBMSs). In particular, query optimizers use estimates of the size of intermediate relations to estimate the cost of different query execution plans and to choose the one with minimum cost.
Some techniques for selectivity estimation include histograms, sampling, and parametric techniques. Of these, histogram-based techniques are the most widely used in current commercial DBMSs.
Histograms approximate the frequency distribution of one or several attributes by grouping the frequency values into buckets and approximating the frequencies inside a bucket by using certain statistics (e.g., the average or geometric mean of the frequencies) maintained for each bucket. Histograms have been studied extensively for a single attribute, and to a limited extent, for two or more attributes. The main advantages of histograms are their low time and memory space overheads, which allow for a fast and reasonable approximation of the frequencies of many common distributions.
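The bucketing described above can be illustrated with a minimal Python sketch. It assumes the simplest possible scheme, namely equal-width buckets over a list of per-value frequencies with the average frequency stored per bucket; the function names `build_equi_width_histogram` and `estimate_range_count` are hypothetical and chosen only for illustration, and the sketch is not taken from any particular DBMS.

```python
from typing import List, Tuple

# One bucket: (start index, end index exclusive, average frequency).
Bucket = Tuple[int, int, float]


def build_equi_width_histogram(freqs: List[float], num_buckets: int) -> List[Bucket]:
    """Group consecutive frequency values into buckets of (nearly) equal width
    and keep only the average frequency of each bucket."""
    n = len(freqs)
    if not 1 <= num_buckets <= n:
        raise ValueError("num_buckets must be between 1 and len(freqs)")
    buckets: List[Bucket] = []
    for b in range(num_buckets):
        start = b * n // num_buckets
        end = (b + 1) * n // num_buckets
        avg = sum(freqs[start:end]) / (end - start)
        buckets.append((start, end, avg))
    return buckets


def estimate_range_count(buckets: List[Bucket], lo: int, hi: int) -> float:
    """Estimate the total frequency of attribute values in [lo, hi) by assuming
    every value inside a bucket has that bucket's average frequency."""
    total = 0.0
    for start, end, avg in buckets:
        overlap = min(hi, end) - max(lo, start)
        if overlap > 0:
            total += overlap * avg
    return total


# Illustrative data only: frequencies of eight consecutive attribute values.
freqs = [10, 12, 11, 50, 52, 51, 2, 1]
histogram = build_equi_width_histogram(freqs, num_buckets=4)
estimate = estimate_range_count(histogram, 1, 3)  # true count is 12 + 11 = 23
```

Here the histogram stores four averages instead of eight frequencies. The estimate for the range [1, 3) differs from the true count because the second bucket mixes a low and a high frequency, which is the kind of approximation error discussed in the following paragraphs.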
The state of the art in the histogram-based approach to selectivity estimation, however, has a conceptual and technical shortcoming. When approximating data frequency distributions, there is a natural trade-off between the accuracy of the approximation and the amount of memory space needed for its representation (i.e., the number of buckets in the histogram). The greater the number of buckets used to approximate the entire data distribution, and the correspondingly greater the amount of memory space used, the greater the accuracy of the approximation, i.e., the smaller the approximation error.
All previous methods of approximation have focused on finding an approximation with minimal or small error, given a fixed amount of memory space; thus, the user has no direct means of specifying a desired error bound for the approximation. The user may instead wish to determine how much memory space is necessary in order to approximate a large database within a specified error bound.
This problem, namely, minimizing the memory space used by the histogram given an acceptable error level for approximating the distribution, is appropriate if there is no hard limit on the memory space, but there is a need for a guaranteed bound on the error. Even in the presence of a tight memory space constraint, understanding and exploiting the trade-off between memory space and accuracy is important to decide how to allocate the available memory space to the various attributes and their histograms. Allocating the same amount of memory space to all histograms may often be a bad idea, as different histograms will have different "sweet spots" in their space-accuracy trade-offs. This problem is particularly important in applications where statistics may require a significant amount of space, e.g., approximately answering complex queries on a very large data warehouse.
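The space-minimization formulation stated above can be made concrete with a small Python sketch. It is only an illustration under simplifying assumptions: buckets are equal-width, each bucket stores its average frequency, and the error measure is the maximum absolute difference between a true frequency and its bucket average. The helper names are hypothetical, and the brute-force search is not the method of the present invention; it merely shows what "fewest buckets for a given error bound" means.

```python
from typing import List


def bucket_averages(freqs: List[float], k: int) -> List[float]:
    """Approximate each frequency by the average of its equal-width bucket,
    using k buckets in total."""
    n = len(freqs)
    approx = [0.0] * n
    for b in range(k):
        start = b * n // k
        end = (b + 1) * n // k
        avg = sum(freqs[start:end]) / (end - start)
        for i in range(start, end):
            approx[i] = avg
    return approx


def max_abs_error(freqs: List[float], approx: List[float]) -> float:
    """Assumed error measure: largest absolute deviation over all values."""
    return max(abs(f - a) for f, a in zip(freqs, approx))


def min_buckets_for_error(freqs: List[float], max_error: float) -> int:
    """Return the smallest number of equal-width buckets whose approximation
    stays within max_error, trying bucket counts in increasing order."""
    if not freqs:
        raise ValueError("freqs must be non-empty")
    if max_error < 0:
        raise ValueError("max_error must be non-negative")
    for k in range(1, len(freqs) + 1):
        if max_abs_error(freqs, bucket_averages(freqs, k)) <= max_error:
            return k
    # Not reached: with one bucket per value the error is zero.
    return len(freqs)


# Illustrative data only: two plateaus need two buckets for a tight bound.
plateaus = [10, 10, 10, 10, 50, 50, 50, 50]
buckets_tight = min_buckets_for_error(plateaus, max_error=5)
buckets_loose = min_buckets_for_error(plateaus, max_error=20)
```

With a loose bound a single bucket suffices, while a tighter bound requires a second bucket. Running such a search per attribute indicates where each histogram's space-accuracy trade-off levels off, which is the information needed to divide a memory budget unevenly among histograms.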
Thus, problems still exist in the formulation of histogram-based techniques for selectivity estimation. The present invention has been designed to mitigate problems associated with histogram-based techniques for selectivity estimation.