A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever. The following notice shall apply to this document: Copyright (copyright) 1999, Microsoft, Inc.
The present invention pertains generally to computer-implemented databases, and more particularly to summaries of data contained in such databases.
Online analytical processing (OLAP) is a key part of most data warehouse and business analysis systems. OLAP services provide for fast analysis of multidimensional information. For this purpose, OLAP services provide for multidimensional access and navigation of data in an intuitive and natural way, providing a global view of data that can be drilled down into particular data of interest. Speed and response time are important attributes of OLAP services that allow users to browse and analyze data online in an efficient manner. Further, OLAP services typically provide analytical tools to rank, aggregate, and calculate lead and lag indicators for the data under analysis.
In this context, a dimension is a structural attribute of a cube that is a list of members of a similar type in the user""s perception of the data. Typically, there is a hierarchy associated with the dimension. For example, a time dimension can consist of days, weeks, months, and years, while a geography dimension can consist of cities, states/provinces, and countries. Dimensions act as indices for identifying a particular cell or range of cells within a multidimensional array. Each cell contains a value, also referred to as a measurement.
Databases are commonly queried for summaries of detail data rather than individual data items. For example, a user might want to know sales data for a given period of time without regard to geographical distinctions. These types of queries are efficiently answered through the use of data tools known as aggregations. Aggregations are precomputed summaries of selected detail data that allow an OLAP system or a relational database to respond quickly to queries by avoiding collecting and aggregating detailed data during query execution. Without aggregations, the system would need to scan all of the rows containing the detailed data to answer these queries, resulting in potentially substantial processing delays. With aggregations, the system computes and materializes aggregations ahead of time so that when the query is submitted to the system, the appropriate summary already exists and can be sent to the user much more quickly.
Calculating these aggregations, however, can be costly, both in terms of processing time and in terms of disk space consumed. Several conventional OLAP systems calculate all possible summaries of the data and suffer from substantial inefficiencies when working with large databases having many dimensions. Some other conventional OLAP systems allow the user to select specific pre-calculated aggregations, avoiding the delays associated with calculating all possible aggregations. Selecting an optimal set of aggregations for a given set of queries, however, is a complicated task that most end users would find difficult to perform at best. Still other OLAP systems do not create any aggregations at all. While this approach is workable for small data volumes, it is not efficient for use with large data volumes. Certain other OLAP systems implement algorithms for selecting aggregations, but fail to adequately consider the costs of creating and maintaining the aggregations.
In those system which allow for the design of aggregations, there is generally no assurance that the aggregations designed are ever actually used, nor is there a mechanism to determine which aggregations should be materialized based on how often queries using the aggregations are issued.
Thus there is a need for a system that automatically determines a set of desirable aggregations that reflects how users are actually using the system, and that make efficient use of computing resources based on the actual use of the system.
The above-mentioned shortcomings, disadvantages and problems are addressed by the present invention, which will be understood by reading and studying the following specification.
The systems and methods described herein implement a usage based optimization system for designing aggregations in an OLAP database system. One aspect of the system is that it collects and maintains data on queries as they are issued to the OLAP system. The data collected on queries includes a frequency count comprising a number of times the query has been issued, an internal representation used to efficiently store and identify the query, an execution time comprising the amount of time the OLAP system took to answer the query, and a user identification of the user who issued the query. The system uses the usage statistics to determine which aggregations are candidates for instantiation.
A further aspect of the system is that the usage statistics determine a weighted benefit/cost ratio that the system uses to determine which of the candidate aggregations should be instantiated. The weighted benefit/cost ratio comprises a measurement of the benefit provided by the aggregation in terms of savings in the number of more detailed rows that would have been scanned in the absence of the aggregation, over the cost in terms of rows needed to store the aggregation. The ratio is weighted by multiplying the benefit by the number of queries that can make use of the aggregation.
A still further aspect of the system is that an administrator can determine which of the usage statistics should influence the weighting of the benefit/cost ratio. For example, in one aspect, the administrator can determine that only the most recently issued queries should be used in the weighting. In a further aspect, the administrator can determine that only queries issued by a particular user should be used in the weighting. In a still further aspect, the administrator can determine that only queries that exceed a certain amount of time should be used in the weighting.
The present invention describes systems, clients, servers, methods, and computer-readable media of varying scope. In addition to the aspects and advantages of the present invention described in this summary, further aspects and advantages of the invention will become apparent by reference to the drawings and by reading the detailed description that follows.