A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever. The following notice shall apply to this document: Copyright © 1999, Microsoft, Inc.
The present invention pertains generally to computer-implemented databases, and more particularly to summaries of data contained in such databases.
Online analytical processing (OLAP) is a key part of most data warehouse and business analysis systems. OLAP services provide for fast analysis of multidimensional information. For this purpose, OLAP services provide for multidimensional access and navigation of data in an intuitive and natural way, providing a global view of data that can be drilled down into particular data of interest. Speed and response time are important attributes of OLAP services that allow users to browse and analyze data online in an efficient manner. Further, OLAP services typically provide analytical tools to rank, aggregate, and calculate lead and lag indicators for the data under analysis.
In this context, a dimension is a structural attribute of a cube that is a list of members of a similar type in the user's perception of the data. For example, a time dimension can consist of days, weeks, months, and years, while a geography dimension can consist of cities, states/provinces, and countries. Dimensions act as indices for identifying values within a multi-dimensional array.
Databases are commonly queried for summaries of data rather than individual data items. For example, a user might want to know sales data for a given period of time without regard to geographical distinctions. These types of queries are efficiently answered through the use of data tools known as aggregations. Aggregations are precomputed summaries of selected data that allow an OLAP system or a relational database to respond quickly to queries by avoiding collecting and aggregating detailed data during query execution. Without aggregations, the system would need to use the detailed data to answer these queries, resulting in potentially substantial processing delays. With aggregations, the system computes and materializes aggregations ahead of time so that when the query is submitted to the system, the appropriate summary already exists and can be sent to the user much more quickly.
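As a purely illustrative sketch of the idea of a precomputed summary, the following Python fragment materializes sales totals by month, ignoring geography, so that a later query is answered by a lookup rather than by scanning the detailed data. The row layout and the names `detail_rows`, `materialize_by_month`, and `query_sales` are hypothetical and are not part of the disclosure.

```python
from collections import defaultdict

# Hypothetical detailed data: (month, city, sales amount).
detail_rows = [
    ("1999-01", "Seattle", 100),
    ("1999-01", "Portland", 50),
    ("1999-02", "Seattle", 70),
    ("1999-02", "Portland", 30),
    ("1999-02", "Boise", 20),
]


def materialize_by_month(rows):
    """Precompute total sales per month, dropping the geography dimension."""
    totals = defaultdict(int)
    for month, _city, amount in rows:
        totals[month] += amount
    return dict(totals)


# Computed ahead of time, before any query arrives.
sales_by_month = materialize_by_month(detail_rows)


def query_sales(month):
    """Answer a query from the aggregation without touching detail_rows."""
    return sales_by_month.get(month, 0)
```

In this sketch the aggregation holds one row per month, so its size depends on the number of distinct months rather than on the number of detail rows.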
Calculating these aggregations, however, can be costly, both in terms of processing time and in terms of disk space consumed. Therefore, in many situations, efficiencies can be realized by materializing only selected aggregations rather than all possible aggregations. The aggregations that are materialized or computed should be selected based on the implications of using or not using each aggregation. These implications include, for example, the cost associated with materializing and maintaining the aggregation. It is well known in the art that a direct relationship exists between the size of an aggregation and the cost involved in querying it.
Thus, the decision of whether to materialize a particular aggregation depends at least in part on the size of the aggregation. Some conventional solutions measure the size of an aggregation by reading and aggregating the detailed data underlying the aggregation. This approach gives an accurate result, but can itself consume considerable computing resources, especially if the aggregation summarizes a large amount of detailed data. Accordingly, a need continues to exist for a system that can estimate the size of an aggregation without reading and aggregating detailed data.
According to various example implementations of the invention, there is provided an efficient system for estimating the size of an aggregation in a database without reading and aggregating the detailed data underlying the aggregation, as described herein below. In particular, the invention provides, among other things, for using cardinalities of the levels of the aggregation and of the groups of related dimensions to estimate the aggregation size.
One particular implementation is directed to a method for estimating a size of an aggregation that aggregates detailed data in a database characterized by a plurality of dimensions. The aggregation is characterized by aggregation levels corresponding to the dimensions of the aggregation. Level cardinalities are determined for the aggregation levels of the aggregation. Dimension groups are identified that consist of dimensions that are related to each other. For each dimension group, a dimension group cardinality is determined. The size of the aggregation is estimated as a function of the level cardinalities, the dimension group cardinalities, and a size of the detailed data.
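The summary above states only that the estimate is a function of the level cardinalities, the dimension group cardinalities, and the size of the detailed data; it does not give the function. The following Python sketch shows one plausible way such quantities might be combined, under the assumption that each group of related dimensions contributes no more combinations than its group cardinality and that the aggregation cannot contain more rows than the detailed data. The function name, parameter names, and the specific use of `min` are assumptions made for illustration, not the claimed method.

```python
from math import prod


def estimate_aggregation_size(level_cardinalities, dimension_groups,
                              group_cardinalities, detail_row_count):
    """Estimate aggregation size without reading detailed data (assumed formula).

    level_cardinalities: dict mapping dimension name to the cardinality of
        the level chosen for that dimension in the aggregation.
    dimension_groups: list of lists of related dimension names.
    group_cardinalities: one cardinality per entry in dimension_groups.
    detail_row_count: number of rows in the detailed data.
    """
    grouped = set()
    estimate = 1
    for dims, group_card in zip(dimension_groups, group_cardinalities):
        # Assumption: related dimensions cannot produce more combinations
        # than the cardinality of their group.
        combos = prod(level_cardinalities[d] for d in dims)
        estimate *= min(combos, group_card)
        grouped.update(dims)
    for dim, card in level_cardinalities.items():
        if dim not in grouped:
            # Unrelated dimensions are treated as independent.
            estimate *= card
    # Assumption: the aggregation is never larger than the detailed data.
    return min(estimate, detail_row_count)
```

For example, with hypothetical level cardinalities of 12 (time), 50 (customer), and 10 (sales representative), a customer and sales representative group of cardinality 200, and one million detail rows, the sketch multiplies 12 by the capped group contribution of 200 rather than by the raw product 500.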
In another implementation, level cardinalities are determined for the aggregation levels of the aggregation. Dependent relationships are declared between the dimensions, and dimension groups are identified that consist of dimensions that have dependent relationships with each other. For each dimension group, a dimension group cardinality is determined as the level cardinality of a lowest level of a largest dimension of the aggregation. The size of the aggregation is estimated as a function of a product of the level cardinalities, the dimension group cardinalities, and a size of the detailed data.
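The grouping step described above can be sketched as follows. This Python fragment assumes that dependent relationships are declared as pairs of dimension names, that a dimension group is a connected set of such pairs, and that the "largest dimension" is the one whose lowest level has the greatest cardinality. These readings, together with the names `identify_dimension_groups` and `group_cardinality`, are assumptions made for illustration.

```python
def identify_dimension_groups(dimensions, dependencies):
    """Group dimensions linked by declared dependent relationships.

    dimensions: list of dimension names.
    dependencies: list of (dimension, dimension) pairs declared as dependent.
    Returns only groups that contain more than one dimension.
    """
    parent = {d: d for d in dimensions}

    def find(d):
        # Follow parent links to the group representative, compressing paths.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for a, b in dependencies:
        parent[find(a)] = find(b)

    groups = {}
    for d in dimensions:
        groups.setdefault(find(d), []).append(d)
    return [members for members in groups.values() if len(members) > 1]


def group_cardinality(group, lowest_level_cardinality):
    """Cardinality of the lowest level of the largest dimension in the group.

    Assumption: "largest" means the greatest lowest-level cardinality.
    """
    return max(lowest_level_cardinality[d] for d in group)
```

A union-find structure is used here only because it handles chains of declared dependencies (A with B, B with C) in a few lines; any method of finding connected sets of dimensions would serve equally well.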
Still other implementations include computer-readable media and apparatuses for performing these methods. The above summary of the present invention is not intended to describe every implementation of the present invention. The figures and the detailed description that follow more particularly exemplify these implementations.