1. Field of the Invention
The present invention is directed to creating a consistent copy of data.
The present invention is related to improving the performance of database queries.
2. Description of the Related Art
On-line analytical processing (OLAP) has become increasingly popular. Instead of reviewing piles of static reports printed on green-bar paper, an OLAP analyst can explore business results interactively, dynamically adjusting the view of the data and asking questions and getting answers almost immediately. This freedom from static answers to fixed questions on a fixed schedule allows business analysts to operate more effectively and to effect improvements in business operations.
Nigel Pendse introduced the term “FASMI” to characterize OLAP systems. The FASMI characteristics are: Fast, Analysis, Shared, Multidimensional, and Information. For further information, see N. Pendse, “What Is OLAP?” The OLAP Report.
As for fast, in keeping with the spirit of the “0” in OLAP, such systems need to provide results very quickly usually in just a few seconds, and seldom in more than 20 or 30 seconds. This level of performance is key in allowing analysts to work effectively without distraction.
As for analysis, considering the “A” in OLAP, OLAP systems generally provide rich analytic functions appropriate to a given application, with minimal programming.
As for shared, an OLAP system is usually a shared resource. This means that there is a requirement for OLAP systems to provide appropriate security and integrity features. Ultimately, this can mean providing different access controls on each cell of a database.
As for multidimensional, multidimensionality is the primary requirement for an OLAP system. OLAP products present their data in a multidimensional framework. Dimensions are collections of related identifiers, or attributes (product, market, time, channel, scenario, or customer, for example) of the data values of the system. The identifiers (“The Lord of the Rings-DVD,” “San Jose, Calif.,” “2002,” “Retail Rental,” and “John Q. Public,” for example) belonging to the collection for a particular dimension generally have some sort of structure, such as hierarchical. Sometimes there is more than one natural structure for these identifiers.
The multidimensional characteristic means that an OLAP system can quickly switch among various orientations of dimensions, as well as among various subsets and structural arrangements of a dimension. Because of the multidimensional nature of OLAP systems, the collections of data that they implement are referred to as cubes. As for information, OLAP systems store and calculate information. Data for OLAP systems often come from one or more operational systems. Analytical models are applied to these data, and the results are either stored in the system or generated at query time. The quantity of information that a particular OLAP system can manage is one characteristic of that system.
Enterprises have been storing multidimensional data, using a star or snowflake schema, in relational databases for many years. Over time, relational database vendors have added optimizations that enhance query performance on these schemas. During the 1990s many special purpose databases were developed that could handle added calculational complexity and that generally performed better than relational engines.
OLAP systems perform analysis of data that typically comes from relational databases. There are different types of OLAP systems: relational OLAP (ROLAP), hybrid OLAP (HOLAP), and multidimensional OLAP (MOLAP). The different types of OLAP systems vary in the degree to which they use relational databases. ROLAP systems issue queries directly against relational databases and analyze the results. MOLAP products have a proprietary data store, which they populate by reading from a relational database. Then, the MOLAP product responds to queries by reading from the data store. HOLAP products route selected queries to the relational database to obtain data that does not fit in the limited MOLAP data store.
Multidimensional OLAP (MOLAP) refers to the family of OLAP systems in which special-purpose file systems or indexes are used to store cube data. Express Web Publisher, Essbase™, TM1, and Pilot Suite are a few examples of products based on special-purpose storage and indexing technology. Microsoft's OLAP offering also includes a MOLAP engine. These systems are often read-only systems that are loaded with base data periodically, then derived results are calculated, stored, and indexed. Scalability of MOLAP systems is often limited by the size of the batch window within which derived results are calculated and stored. To improve scalability, such systems often have a means for deferring calculation of some derived results until query time.
For relational OLAP (ROLAP), star schemas have been used for many years as a means for representing multidimensional data in a relational database. Many commercial software development companies, such as MicroStrategy, Brio, Business Objects, Metacube, Hyperion, and Metaphor, have developed batch or interactive multidimensional reporting and exploration interfaces for relational star schemas. These systems were all designed and implemented before super aggregate operators were added to the Structured Query Language (SQL) language definition.
In particular, until a few years ago, relational databases allowed the calculation of aggregates at only a single level per query. For example, one SELECT statement with a GROUP BY clause would be used to retrieve a result set at a quarter level (i.e., for a set of quarters), while another SELECT statement with a GROUP BY clause would be used to retrieve a result set at a month level (i.e., for a set of months). This forced relational OLAP systems to run multiple queries against the database in order to calculate cells at varying levels.
To facilitate OLAP-type query creation and provide more advanced optimizations, a DB2® Relational Database Management System (RDBMS), available from International Business Machines Corporation, implemented three new super aggregate operators that were added to the SQL standard to allow a single query to generate multiple aggregates: ROLLUP, CUBE, and GROUPING SETS. These super aggregate operators are extensions to the GROUP BY clause and specify that aggregates be generated at multiple levels. For example, one SELECT statement may be used to obtain a result set of calculations of aggregates at multiple levels (e.g., both quarter and month).
Note that these super aggregate operators are more than mere shorthand for generating multiple grouping sets. Because multiple grouping sets are requested in a single statement, the DB2® RDBMS can build an execution plan that generates the grouping sets in such a way that each input row needed for the calculation is referenced only once. This can result in performance improvements of orders of magnitude, especially when the set of input rows does not fit in the buffer pool (i.e., cache).
Prior art systems are designed to produce multidimensional reports showing results with different levels of granularity by issuing multiple queries. Multiple result sets are obtained for the multiple queries, and the result sets are merged to form a single report. Such systems depend on some sort of description (metadata) of the roles for the tables and columns in a star schema for generating the necessary SQL to retrieve the data to produce the multidimensional reports. The precise metadata varies from product to product.
Multidimensional on-line analytical processing (OLAP) systems (e.g., from companies such as Hyperion, Cognos, and Microsoft) are designed to return multidimensional result sets naturally, when given sets of members for each edge of a multidimensional cube. The multidimensional OLAP systems are also designed to compute some or all of the results in advance of any query.
Multidimensional analysis has been done using SQL since the introduction of relational databases, but relational OLAP systems have not been able to return multidimensional results sets naturally or compute some or all of the results in advance of a query.
OLAP systems issue multidimensional queries. These queries may request data to be aggregated at various levels of the dimension hierarchies. The relational database system reads the data and aggregates the data to the appropriate level. In some cases, most of the time spent in responding to a query is spent reading and aggregating data rather than in returning the data. Relational databases provided materialized views in which data can be aggregated at particular levels and made persistent. A materialized view is a table that represents a view and that is generated prior to receiving a query referencing the view. If a query requests data from that level, the RDBMS can directly read the data from the materialized view rather than reading and aggregating large quantities of lower level data. If queries refer to data that is in higher levels of the hierarchies than that already aggregated, the RDBMS may be able to make use of the existing aggregation. The use of materialized views to store pre-calculated aggregates improves performance.
Determining the most cost effective materialized views to be created is based on several factors (e.g., overhead in terms of storage vs. query performance improvement) and is a complex and time consuming task.
Thus, there is a need in the art to improve the performance of a relational database management system (RDBMS) used by an OLAP system.