Since their invention, computers have been used to store extensive amounts of data in large databases. A database is defined as a collection of items, organized according to a data model and accessed via queries. For example, consider a computer database, also called a data warehouse, which comprises vast historical data, such as all sales transactions over the history of a large department store. For the purpose of decision making, such as determining whether or not to continue selling a particular item, users are often interested in analyzing the data by identifying trends in the data rather than individual records in isolation. This process usually involves posing complex aggregate user queries to large amounts of data in a database. In this case, it is often desirable for a user to access small statistical summaries of the data for the purpose of answering aggregate queries approximately, as this is significantly more efficient than accessing the entire historical data.
In addition, the complexity of user queries in these applications is often much higher than in traditional database applications, making it even more important to provide highly accurate selectivity estimates in order to generate inexpensive query execution plans.
A fundamental problem arising in many areas of database implementation is the efficient and accurate approximation of large data distributions using a limited amount of memory space within a reasonable amount of time. Traditionally, histograms have been used to approximate database contents for selectivity estimation in query optimizers.
A histogram approximates the data distribution of an attribute by grouping values into subsets, known as buckets, and using the summary statistics of the values in the buckets. Histograms are the most commonly used form of statistics in commercially available database systems because they incur almost no run-time overhead and are effective even with a small memory budget.
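The bucket idea can be illustrated with a minimal sketch of an equi-width histogram used for range selectivity estimation. The names `Bucket`, `build_equi_width`, and `estimate_range_count` are hypothetical, and the assumption that values are spread uniformly within each bucket is a common simplification rather than anything prescribed by the text above.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    lo: float    # lower bound of the bucket's value range
    hi: float    # upper bound of the bucket's value range
    count: int   # summary statistic: number of values in the bucket


def build_equi_width(values, num_buckets):
    """Group values into num_buckets ranges of equal width."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / num_buckets or 1.0
    counts = [0] * num_buckets
    for v in values:
        # The maximum value is clamped into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return [
        Bucket(lo + i * width, lo + (i + 1) * width, counts[i])
        for i in range(num_buckets)
    ]


def estimate_range_count(buckets, a, b):
    """Estimate how many values fall between a and b.

    Assumes values are uniformly spread inside each bucket, so a
    partially covered bucket contributes a proportional share of
    its count.
    """
    total = 0.0
    for bk in buckets:
        overlap = min(b, bk.hi) - max(a, bk.lo)
        if overlap > 0:
            total += bk.count * overlap / (bk.hi - bk.lo)
    return total
```

Only the bucket boundaries and counts are consulted at estimation time, which is why such a summary adds almost no run-time overhead and fits in a small memory budget.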
There has been extensive work on histograms for one-dimensional data. The traditional histograms are equi-width and equi-depth. More recently proposed one-dimensional histograms such as V-Optimal, MaxDiff, Compressed and Spline-based are more accurate; their taxonomy and optimality for estimating various query operators are reasonably well understood. See Viswanath Poosala, "Histogram-based Estimation Techniques in Databases," Ph.D. Thesis, University of Wisconsin-Madison, 1997. While equi-width, equi-depth, MaxDiff and Compressed histograms are easy to compute, efficient methods for computing the V-Optimal or the Spline-based histograms have not yet been developed. These are important classes of histograms since they are well suited for estimating result sizes of queries such as equality joins, selections, and range queries.
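For contrast with equi-width bucketing, the following is a minimal sketch of an equi-depth histogram, in which each bucket holds roughly the same number of values rather than spanning the same value range. The function name `build_equi_depth` and the tuple layout are hypothetical, and the sketch makes the simplifying assumption that duplicate values may straddle adjacent buckets.

```python
def build_equi_depth(values, num_buckets):
    """Split sorted values into buckets holding about the same count.

    Returns a list of (min_value, max_value, count) tuples. Duplicate
    values may straddle two buckets in this simplified version.
    """
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for i in range(num_buckets):
        # Integer arithmetic spreads any remainder across the buckets.
        start = i * n // num_buckets
        end = (i + 1) * n // num_buckets
        if start < end:
            chunk = ordered[start:end]
            buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets
```

On skewed data, an equi-depth histogram places narrow buckets where values are dense and wide buckets where they are sparse, whereas an equi-width histogram may place most values in a single bucket.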
Approximating arbitrary, large data distributions by a small amount of "summary data", such that this summary data can then be used with no (or limited) further access to the entire distribution, creates problems in data processing, since some accuracy will usually be lost in the approximation process. However, the smaller summary data can be manipulated more efficiently in terms of processing time and memory space. A number of traditional and emerging applications, as described below, rely on such approximations.
One such application, which relies on approximating arbitrary, large data distributions by a small amount of summary data, is known as Selectivity Estimation for Query Optimization. Commercial Database Management Systems (DBMSs) rely on selectivity estimation, that is, estimates of the result size of individual operators, in order to choose the most appropriate query execution plan for complex queries. This calls for very fast and reasonably accurate estimation techniques. The most common solution is to use histograms, which approximate the data distribution with a limited number of buckets as previously described.
Another application which relies on approximating arbitrary, large data distributions by a small amount of summary data is known as Providing Security by Summarizing Data. Data distributions may be approximated for preserving privacy. The idea is to summarize the distribution over subsets of domain values and make only the summary information publicly available for statistical analysis. This is commonly practiced in disseminating Census Data.
The need to approximate data distributions using limited summary data also arises in other application areas, such as Statistics and Data Compression. Within the database community, a number of general approaches to the problem of approximating arbitrary, large data distributions by a small amount of summary data, such that the summary data can then be used with no or little further access to the entire distribution, have been developed and explored, mainly for selectivity estimation purposes. These include techniques that rely on precomputing information, such as histograms; parameterized mathematical distributions, such as Zipf; and techniques that collect information at run-time, such as random sampling. These fundamentally different techniques perform satisfactorily under different favorable settings. For example, histograms are known to perform best for data with uniform value domains, frequencies that are skewed very high or very low, or independence among the attributes. Parametric techniques perform best when the underlying data closely follows a known mathematical distribution such as Zipf, multifractals, or bounded degree polynomials.
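As an illustration of the run-time approach mentioned above, the following is a minimal sketch of selectivity estimation by random sampling. The function name `estimate_selectivity_by_sampling` and its parameters are hypothetical, and the sketch assumes the data fits in an in-memory list, which would not hold for a large database.

```python
import random


def estimate_selectivity_by_sampling(values, predicate, sample_size, seed=None):
    """Estimate the fraction of values satisfying predicate.

    Draws a uniform random sample without replacement and returns the
    fraction of sampled values for which predicate is true.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(values))
    if k == 0:
        return 0.0
    sample = rng.sample(values, k)
    return sum(1 for v in sample if predicate(v)) / k
```

Unlike a precomputed histogram, this estimate requires access to the data at query time, and its accuracy depends on the sample size.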
However, in general, data almost never satisfies such favorable assumptions in its entirety, although parts of it may well do so. Furthermore, depending on the particular query operator in selectivity estimation, many different measures of the error in approximating the data distribution may need to be optimized. In the absence of an elaborate understanding of the algorithms needed to solve such complex optimization problems, many known solutions for these techniques are based on heuristics. In general, they are either highly inaccurate or inefficient for approximating arbitrary data distributions for a variety of applications. Thus, the problem of approximating the data distribution of a database using a limited amount of memory space in a reasonable amount of time still remains.
The present invention has been designed to overcome some of the problems associated with approximating a large data distribution of a database. It allows a user to analyze the large data distribution by means of an approximation that is smaller and more manageable in terms of both memory space and time.