Aggregate queries play a fundamental role in many modern database and data mining applications. Range-sum query evaluation is one of the most basic problems of On-Line Analytical Processing (OLAP) systems. Range aggregate queries can give database users a high-level view of massive or complex datasets. These queries have been widely studied, but much of the research has focused on the evaluation of individual range queries. In many applications, whether the users are humans or data mining algorithms, range aggregate queries are issued in structured batches. Many recently proposed techniques can be used to evaluate a variety of aggregation operators, from simple COUNT and SUM queries to statistics such as VARIANCE and COVARIANCE. Most of these methods attempt to provide efficient query answering at a reasonable update cost for a single aggregation operation. An interesting exception is that all second-order statistical aggregation functions (including hypothesis testing, principal component analysis, and ANOVA) can be derived from SUM queries of second-order polynomials in the measure attributes. These SUM queries can be supported using any proposed OLAP method. Higher-order statistics can similarly be reduced to sums of higher-order polynomials.
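As an illustration of this reduction, the following minimal sketch derives VARIANCE and COVARIANCE over a range from SUM queries of polynomials of degree at most two. The record layout (one range key and two measures x and y), the helper names, and the brute-force scan standing in for an OLAP range-sum method are all assumptions made for the example; the population forms of the statistics are used.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical record layout: (key, x, y), where key is the dimension
# attribute being ranged over and x, y are two measure attributes.
Record = Tuple[float, float, float]


def range_sum(records: Sequence[Record], lo: float, hi: float,
              poly: Callable[[float, float], float]) -> float:
    """SUM of poly(x, y) over records with lo <= key <= hi.

    A linear scan stands in for whatever OLAP range-sum method is used.
    """
    return sum(poly(x, y) for key, x, y in records if lo <= key <= hi)


def range_variance_covariance(records: Sequence[Record],
                              lo: float, hi: float) -> Tuple[float, float]:
    """Population VAR(x) and COV(x, y) built only from polynomial range-sums."""
    n = range_sum(records, lo, hi, lambda x, y: 1.0)      # degree 0 (COUNT)
    if n == 0:
        raise ValueError("empty range")
    sx = range_sum(records, lo, hi, lambda x, y: x)       # degree 1
    sy = range_sum(records, lo, hi, lambda x, y: y)       # degree 1
    sxx = range_sum(records, lo, hi, lambda x, y: x * x)  # degree 2
    sxy = range_sum(records, lo, hi, lambda x, y: x * y)  # degree 2
    mean_x, mean_y = sx / n, sy / n
    var_x = sxx / n - mean_x * mean_x
    cov_xy = sxy / n - mean_x * mean_y
    return var_x, cov_xy


sample = [(1, 2.0, 1.0), (2, 4.0, 3.0), (3, 6.0, 5.0), (10, 100.0, -7.0)]
var_x, cov_xy = range_variance_covariance(sample, 1, 3)
```

Only the five sums are needed; no per-record access is required once they are available, which is why any method that answers polynomial range-sums also answers these statistics.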
One way to support polynomial range-sums of any degree using existing OLAP techniques is to treat each independent monomial up to the required degree as a separate measure and build a new data cube for each of these. However, this requires the measure attributes to be specified when the database is populated, and it results in unwieldy storage and maintenance costs for higher-order polynomials.
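To give a sense of how quickly this grows, the sketch below enumerates the monomials that would each need a separate measure. The function name and the exponent-tuple representation are assumptions for illustration; the count itself is the standard binomial coefficient C(d + k, k) for d measure attributes and maximum degree k, including the constant monomial used for COUNT.

```python
from itertools import product
from math import comb
from typing import List, Tuple


def monomial_exponents(num_measures: int, max_degree: int) -> List[Tuple[int, ...]]:
    """All exponent tuples (e1, ..., ed) with e1 + ... + ed <= max_degree.

    Each tuple names one monomial x1^e1 * ... * xd^ed, i.e. one separate
    measure (and one data cube) under the per-monomial approach.
    """
    return [e for e in product(range(max_degree + 1), repeat=num_measures)
            if sum(e) <= max_degree]


# Two measures, degree 2: 1, y, y^2, x, xy, x^2 -> 6 cubes.
second_order = monomial_exponents(2, 2)

# Three measures, degree 4: C(7, 4) = 35 cubes.
fourth_order = monomial_exponents(3, 4)
```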
As an example of such a structured batch, data consumers often specify a coarse partition of the domain of a data set and request aggregate query results from each cell of this partition. This provides a data synopsis, which is used to identify interesting regions of the data domain. Users then drill down into the interesting regions by partitioning them further and requesting aggregate query results on the new cells. The data consumer is interested in both the individual cell values that are returned and the way the query results change between neighboring cells.
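The partition-and-drill-down pattern can be sketched in one dimension as follows. The half-open cell convention, the function names, and the sample data are assumptions chosen for the example, and a plain scan stands in for the underlying query engine.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def partition_sums(data: Sequence[Tuple[float, float]],
                   boundaries: Sequence[float]) -> List[float]:
    """SUM of the measure in each cell [b_i, b_{i+1}) of a sorted partition.

    data holds (key, measure) pairs; keys outside the partition are ignored.
    """
    sums = [0.0] * (len(boundaries) - 1)
    for key, measure in data:
        cell = bisect_right(boundaries, key) - 1  # index of the cell holding key
        if 0 <= cell < len(sums):
            sums[cell] += measure
    return sums


def neighbor_differences(sums: Sequence[float]) -> List[float]:
    """Change in the aggregate between each pair of adjacent cells."""
    return [b - a for a, b in zip(sums, sums[1:])]


data = [(1, 5.0), (9, 1.0), (10, 2.0), (25, 4.0), (30, 100.0)]

coarse = partition_sums(data, [0, 10, 20, 30])  # synopsis over three cells
changes = neighbor_differences(coarse)          # how values move cell to cell
drill = partition_sums(data, [0, 5, 10])        # refine the first cell
```

Here the whole batch is a function of the partition, so both the cell values and their differences are part of what the user is asking for.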
Existing algorithms evaluate a batch of range aggregate queries by repeatedly applying any exact, approximate, or progressive technique designed to evaluate individual queries. While flexible, this approach has two major drawbacks: 1) I/O and computational overhead are not shared between queries; and 2) approximate techniques designed to minimize single-query error cannot control structural error in the result set. For example, such techniques cannot minimize the error of the difference between neighboring cell values.
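A small numerical sketch shows why per-query error bounds do not carry over to differences. The values are invented for illustration: two neighboring cells are each approximated to within one unit, yet the error in their difference is twice that, which is the worst case allowed by the triangle inequality.

```python
def difference_error(true_a: float, true_b: float,
                     approx_a: float, approx_b: float) -> float:
    """Absolute error of the approximate difference (b - a) between two cells."""
    return abs((approx_b - approx_a) - (true_b - true_a))


# Hypothetical exact and approximate answers for two neighboring cells.
true_a, true_b = 100.0, 101.0
approx_a, approx_b = 99.0, 102.0

err_a = abs(approx_a - true_a)  # single-query error of cell a
err_b = abs(approx_b - true_b)  # single-query error of cell b
err_diff = difference_error(true_a, true_b, approx_a, approx_b)
```

In this example each cell looks accurate on its own, but the reported change between the cells is 3 where the true change is 1.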