1. Field of the Invention
The present invention relates to the field of databases. More particularly, the present invention relates to a method for generating histogram-based approximations in response to complex queries to a database.
2. Description of the Related Art
Extremely complex queries are posed to Database Management Systems (DBMSs) through software applications, such as decision support applications and experiment management applications. Given the exploratory nature of such queries, many of the queries require a significant amount of time to execute and then produce results that may be of no particular interest, even though the results are accurate. Answering such queries approximately has been proposed as a technique for reducing query response times when a precise answer is not necessary or when early feedback is helpful. Time can be saved when an approximate answer can be rapidly generated so that a decision can be made based on the approximate answer whether the complete execution of a query should be completed.
An approximate answer to a query is easily conceptualized when the answer to the query is, for example, an image. Instead of returning the actual image as the answer to the query, a compressed version of the image can be returned as an approximate answer. Alternatively, a series of compressed images can be retrieved, with each successive image being compressed less (more accurate) than the previously retrieved image, and with the last image retrieved being the actual image. This particular approach is provided by many existing systems, including web browsers, such as Netscape Navigator.
There has been extensive work in connection with approximate query answering ranging from theoretical foundations, to actual systems, to using query generalization for obtaining non-empty query answers. For background regarding established theoretical foundations, see, for example, P. Buneman et al., A semantics for complex objects and approximate queries, Proc. 7th ACM SIGMOD-SIGACT Symposium on Principles of Database Systems, pp. 305-314, April 1988, which is incorporated by reference herein. For background regarding actual systems, see, for example, Ozsoyoglu et al., Processing real-time non-aggregate queries with time-constraints in CASE-DB, Proc. 8th International Conference on Data Engineering, pp. 140-147, Tempe, Ariz., February 1992; and S. Vrbsky et al., APPROXIMATE: A query processor that produces monotonically improving approximate answers, IEEE Trans. on Knowledge and Data Engineering, 5(6), December 1993, each of which is incorporated by reference herein. For background regarding query generalization for obtaining non-empty query answers, see, for example, A. Motro, Query generalization: A method for interpreting null answers, L. Kerschberg, editor, Expert Database Systems, Proceedings from the First Inter-national Workshop, Benjamin/Cummings, Inc., Menlo Park, Calif., pp. 597-616, 1986, which is incorporated by reference herein. All of this work has been based on a subset/superset definition for approximations that has been obtained mostly through a partial query process.
Online aggregation is another technique used for approximate query answering, but is applicable only for queries that return aggregates, that is, individual values, not sets. The focus of online aggregation techniques has been to efficiently compute aggregates in an online fashion using an interactive interface providing continuous feedback relating to the expected distance of the current aggregate approximation from its actual value. Online aggregation has also been based on defining approximations as subset and/or supersets of an actual answer. For background regarding aggregate queries, see, for example, J. M. Hellerstein et al., Online aggregation, Proc. ACM SIGMOD Conference on the Management of Data, pp. 171-182, Tucson, Ariz., June 1997, which is incorporated by reference herein. For background regarding approximations as subsets and/or supersets of an actual answer, see, for example, P. Buneman et al., supra.
There has also been a considerable amount of work using statistical techniques for approximating data in databases, particularly in the context of selectivity estimation in query optimizers. Three widely studied classes of statistical techniques are sampling techniques, parametric techniques (approximating the data using a mathematical distribution), and histogram (or non-parametric) techniques.
For additional background regarding sampling techniques, see, for example, P. J. Haas et al., Sampling-based estimation of the number of distinct values of an attribute, Proc. of the 21st Int. Conf on Very Large Databases, pp. 311-322, 1995; R. J. Lipton et al., Practical selectivity estimation through adaptive sampling, Proc. of ACM SIGMOD Conf., pp. 1-11, May 1990; and S. Seshadri et al., Sampling issues in parallel database systems, Extending Database Technology (EDBT), pp. 328-343, March 1992, each of which is incorporated by reference herein. For background regarding parametric techniques, see, for example, C. M. Chen et al., Adaptive selectivity estimation using query feedback, Proc. of ACM SIGMOD Conf., pp. 161-172, May 1994, which is incorporated by reference herein. For background regarding histogram techniques, see, for example, Y. Ioannidis, Universality of serial histograms, Proc. of the 9th Int. Conf on Very Large Databases, pp. 256-267, December 1993; Y. Ioannidis et al., Optimal histograms for limiting worst-case error propagation in the size of join results, ACM TODS, 1993; Y. Ioannidis et al., Balancing histogram optimality and practicality for query result size estimation, Proc. of ACM SIGMOD Conf., pp. 233-244, May 1995; RP. Kooi,. The optimization of queries in relational databases, Ph.D. thesis, Case Western Reserve University, September 1980; M. V. Mannino et al., Statistical profile estimation in database systems, ACM Computing Surveys, 20(3):192-221, Sept 1988; M. Muralikrishna et al., Equi-depth histograms for estimating selectivity factors for multi-dimensional queries, Proc. of ACM SIGMOD Conf., pp. 28-36, 1988; G. Piatetsky-Shapiro et al., Accurate estimation of the number of tuples satisfying a condition, Proc. of ACM SIGMOD Conf , pp. 256-276, 1984; and V. Poosala et al., Improved histograms for selectivity estimation of range predicates, Proc. of ACM SIGMOD Conf, pp. 294-305, June 1996, each of which is incorporated by reference herein.
Of these particular techniques, histograms are probably the most widely used statistical technique in commercial database systems. For example, histograms are used in the DBMSs of Oracle, Sybase, Microsoft and IBM, because histograms occupy small amounts of space, do not incur significant overhead at estimation time and histograms are particularly suited for accurately approximating the skewed distributions arising in real-life.
Previously, several classes of histograms for building on one or more attributes have been identified. Additionally, techniques have been proposed for incrementally maintaining many of the classes of histograms up-to-date as the database is updated. For background regarding classes of histograms, see, for example, V. Poosala et al., Improved histograms for selectivity estimation of range predicates, Proc. of ACM SIGMOD Conf., pp. 294-305, June 1996; and V. Poosala et al., Selectivity estimation without the attribute value independence assumption, Proc. of the 23rd Int. Conf on Very Large Databases, August 1997, each of which is incorporated by reference herein. For background regarding techniques for maintaining many of the novel classes of histograms incrementally, see, for example, Phillip B. Gibbons et al., Fast incremental maintenance of approximate histograrms, Proc. of the 23rd Int. Conf on Very Large Databases, August 1997, which is incorporated by reference herein. Except for sampling none of the other techniques, however, have been previously studied in the context of approximate query answering.
Nevertheless, such conventional approaches are not particularly useful for many database applications because an answer to an SQL query is a relation, i.e., a (multi)set of tuples. (The term xe2x80x9csetxe2x80x9d, as used herein, means a multiset whenever no confusion can arise.) Improved insight into whether an accurate answer should be pursued is provided by a set having roughly the same number of tuples as the actual answer to a query and containing values that are approximations of the actual values (e.g., a numeric field having the approximate value of 10 instead of the actual value of 9). Additionally, the improved insight becomes more apparent when the query result is presented visually. In that regard, an analogy of a visually presented approximate answer can be drawn to an approximate (compressed) image because a large number of somewhat misplaced (approximate) points provide an improved approximation to an actual answer than a small number of actual points.
To illustrate this concept, consider an exemplary employee relational database EMP and a hypothetical query requesting values of the xe2x80x98salaryxe2x80x99, xe2x80x98agexe2x80x99, and xe2x80x98departmentxe2x80x99 attributes from the EMP database. Further assume that the query result is to be visually displayed as a set of points or elements (a starfield) in the salary-age space, such that each point represents an employee having the corresponding salary and age. For this example, the shade (or color) of each respective point represents the corresponding departmental attribute of the employee represented by the point. FIG. 1 shows an exemplary starfield display representing an actual answer to the hypothetical query. The abscissa of FIG. 1 shows the attribute salary, while the ordinate shows the attribute age. The departmental attribute for each respective element is represented by a hollow or solid point.
Consider two approximations to the query answer of FIG. 1 that is shown in FIGS. 2 and 3, and which each use the same point display scheme as FIG. 1. The first approximation, shown in FIG. 2, is a small subset of the actual answer obtained by a sample-based query processing and is based on a 20% sample. The second approximation, shown in FIG. 3, is a set having elements that are close to elements of the actual answer to the query (value proximity). In most cases, the approximation represented by FIG. 3 is preferable to the approximation represented by FIG. 2 because the approximation of FIG. 3 provides improved insight for the actual answer of FIG. 1. The natural question that arises from the example of FIGS. 1-3 is how can approximate query results of the form shown in FIG. 3 be generated.
The present invention provides a technique that can be used for generating histogram-based approximate query results. The advantages of the present invention are provided by a method for generating an approximate answer in response to a query to a database in which an SQL query Q for operating on a relation R in a database is received. According to the invention, relation R has an associated histogram H that is preferably a MaxDiff histogram. The SQL query Q is translated to be a query Qxe2x80x2 for operating on histogram H for obtaining a result histogram. The result histogram is then expanded into a relation having tuples containing approximate attribute values based on a frequency of each value. Translating the SQL query Q includes forming a relational algebra operator tree T that is equivalent to query Q. The relational algebra operator tree T includes at least one of a select operation, a project operation and a join operation. A tree Txe2x80x2 is formed by replacing each base relation in leaves of tree T by a corresponding histogram. Another histogram is generated by translating each operator into an SQL query that takes at least one histogram as an input based on children of the operator. When the SQL query Q includes a join operation, the join operation is translated into a sequence of two queries.