1. Field of the Invention
The present invention relates to relational databases and more specifically, to obtaining approximate answers to aggregate queries in relational databases.
2. Discussion of Related Art
Databases are traditionally defined as repositories of facts about some aspect of the real world and a database management system (DBMS) is primarily required to simply provide an environment that is both convenient and efficient to use in retrieving and storing database information. It is implied that the DBMS should provide accurate answers to queries on the data, i.e., answers that are consistent with the real world. In fact, accurate answers are essential in nearly all common applications of databases.
Although obtaining accurate answers to queries on the data is almost a prerequisite in a traditional setting, there is an increasing necessity for a DBMS to provide approximate answers to queries, i.e., answers that are only "nearly" consistent with the real world. The main motivation for approximate answers comes from two very different kinds of applications: decision support applications and real-time applications. Decision support applications, also known as on-line analytical processing (OLAP) applications, analyze the data in a data warehouse (often related to a business enterprise) to glean important trend information. This typically involves complex aggregate queries operating on large amounts of data thereby incurring large processing times. These delays are intolerable when users need to analyze the data interactively. Similarly, some real-time applications, such as telecom switches, need to make very quick decisions based on the data, which is often impossible if the entire data needs to be analyzed. Fortunately, many of these applications can readily tolerate "small" errors in query results in exchange for significantly lower response times, as illustrated below.
One example of "small" error tolerance may be given in the context of market analysis performed by a large multi-national corporation to analyze sales data over the past several years to find a nation with a potential market for the company's products. In such an application, the first step in the analysis is to compute an aggregate, such as the total dollar value of sales in order to rank the continents, and there is likely little interest in answers accurate to the last cent. Once an interesting continent is selected, a much more detailed analysis is performed on the nations within that continent, this time requiring an exact answer in order to prepare a report.
A second example may be given in the telecommunications area, where telecommunication switches are used to route calls based on current traffic load on various available channels. Obviously, speed is of the essence in this situation. Since the traffic data can be very large, the switches typically query a dynamically maintained summary of the current load. Due to the approximate nature of the data, the switch may sometimes select a sub-optimal route, but this is not a critical hazard to the switch's operation, as long as it is not too frequent.
These examples point out two different ways an approximate answer can be used (applicable to both kinds of applications): for pruning the queried data and as the ultimate answer. In either case, the response time is likely to be drastically reduced. The new requirement of a DBMS therefore is to provide approximate answers quickly but with reasonable accuracy. Fortunately, database research and most DBMSs already contain the necessary preliminary tools to solve this problem in the form of statistics.
Statistics, such as samples and histograms, are used extensively by many DBMSs to perform critical estimations. For example, query optimizers rely on accurate selectivity estimates in order to identify an optimal execution plan and parallel query execution engines benefit from estimates of query result distributions. In these applications, statistics are used to approximate the frequency distributions of attributes in the database relations, and almost never used for query answering.
Sampling-based statistical techniques for providing a series of increasingly accurate answers culminating in the correct answer to aggregate queries are known. Their main emphasis, however, is on efficient query processing techniques and probabilistic guarantees on the partial results of a given query, and not on using statistics to provide a single approximate answer to OLAP queries very quickly.
There are two reasons that make careful statistics selection critical and rewarding in OLAP. First, the size of the data cube (database relations) is very large and demands on the accuracy of an approximate query answering system are very high, requiring large amounts of space for the histograms. Second, the interplay between histograms on various sub-cubes allows distribution of space among the histograms such that the accuracy requirements are satisfied with minimal resources.
The above problem also arises for almost all kinds of statistics and in many applications. For example, a common challenge faced by the database administrator of a DBMS is identifying the correct set of attributes to build statistics on for accurate selectivity estimates. Unfortunately, most work related to database statistics focuses on identifying a specific instance of statistics to be used for an estimation problem (typically, selectivity estimation), but rarely deals with the collection of statistics on a relation.
Histograms are usually used to approximate the frequency distributions of one or more attributes of a relation by grouping the data into subsets (buckets) and making uniformity assumptions within each subset. A histogram is a bar graph in which the area of each bar is proportional to the frequency or relative frequency presented. The main advantages of histograms are that they incur almost no run-time overhead, they do not require the data to fit a probability distribution or a polynomial and, for most real-world databases, there exist histograms that produce low-error estimates while occupying reasonably small space. Hence, they are the most commonly used form of statistics in practice (e.g., they are used in DB2, Informix, Ingres, Oracle, and Sybase database systems), but are used for approximating mostly single-dimensional frequency distributions in the context of selectivity estimation.
OLAP users find it useful to organize data along several dimensions of a multidimensional data cube and perform aggregate analysis on (possibly subsets of) the dimensions. The cells of the data cube contain the corresponding value of a measured attribute. For a large relation, the data cube is very large (several gigabytes) and answering even simple queries requires significant amounts of time. This problem has been addressed by precomputing parts of the data cube, building indexes, and using efficient techniques for computing the cube and answering queries. All of this, however, deals with providing exact, not approximate, answers to queries.
There is a need, therefore, of an efficient method using statistics to summarize database data in order to provide quick and approximately accurate answers to aggregate queries. Furthermore, because an important issue in the use of statistics is determining a configuration that minimizes estimation error for queries in a database, there is a need for a method of determining a configuration that minimizes estimation errors, while at the same time maximizing the use of computer resources (e.g., space). Many of the known techniques for minimizing error given a set amount of space suffer from significant disadvantages.
An OPTIMAL algorithm for minimizing error given space, for example, operates as follows. The total number of possible histogram configurations that can be generated on N sub-cubes given a certain amount of space .beta. is combinatorial in .beta.,N. This value is very high for most realistic values of .beta.. The OPTIMAL algorithm identifies each possible allocation, computes its error Es (sum of Es over all sub-cubes), and finds the histogram configuration that minimizes the total error as the best. Although this algorithm is accurate, it requires huge resources, as it is clearly exponential in N.
Another known algorithm uses a standard randomized technique, such as iterative improvement. The optimum value over all starting points in this RANDOM algorithm is then picked as the "optimal" solution.
Moving away from optimality, a third known algorithm is a NAIVE space allocation scheme, which divides the available space, .beta., equally among the n histograms, one on each sub-cube, Si. This algorithm takes a constant amount of time but it does not take the interactions between histograms on different attribute combinations into account. A query on Si would be answered by the direct operation on histogram H(Si,.beta./n). The resultant configurations obtained through this algorithm, however, are not as good as other configurations because they do not take advantage of the dependencies between sub-cubes.
Both the OPTIMAL and NAIVE solutions are unsatisfactory, the former because it is computationally expensive to build, and the latter because of its unsatisfactory performance as far as accuracy of the configurations it generates is concerned.
There is a need, therefore, for a method to determine a histogram configuration that minimizes estimation errors, while at the same time maximizing the use of computer resources. More explicitly, there is a need to identify accurate histogram classes and distribute the space among the histograms on various sub-cubes such that the errors are minimized. This is particularly critical for large database approximations due to large data volumes and potentially high accuracy requirements.