1. Field of the Invention
The present invention relates to a method of and system for aggregating data elements in a multi-dimensional database (MDDB) supported upon a computing platform and also to provide an improved method of and system for managing data elements within a MDDB during on-line analytical processing (OLAP) operations.
2. Description of the Related Art
The ability to act quickly and decisively in today's increasingly competitive marketplace is critical to the success of organizations. The volume of information that is available to corporations is rapidly increasing and frequently overwhelming. Those organizations that will effectively and efficiently manage these tremendous volumes of data, and use the information to make business decisions, will realize a significant competitive advantage in the marketplace.
Data warehousing, the creation of an enterprise-wide data store, is the first step towards managing these volumes of data. The Data Warehouse is becoming an integral part of many information delivery systems because it provides a single, central location where a reconciled version of data extracted from a wide variety of operational systems is stored. Over the last few years, improvements in price, performance, scalability, and robustness of open computing systems have made data warehousing a central component of Information Technology CIT strategies. Details on methods of data integration and constructing data warehouses can be found in the white paper entitled “Data Integration: The Warehouse Foundation” by Louis Rolleigh and Joe Thomas.
Building a Data Warehouse has its own special challenges (e.g. using common data model, common business dictionary, etc.) and is a complex endeavor. However, just having a Data Warehouse does not provide organizations with the often-heralded business benefits of data warehousing. To complete the supply chain from transactional systems to decision maker, organizations need to deliver systems that allow knowledge workers to make strategic and tactical decisions based on the information stored in these warehouses. These decision support systems are referred to as On-Line Analytical Processing (OLAP) systems. OLAP systems allow knowledge workers to intuitively, quickly, and flexibly manipulate operational data using familiar business terms, in order to provide analytical insight into a particular problem or line of inquiry. For example, by using an OLAP system, decision makers can “slice and dice” information along a customer (or business) dimension, and view business metrics by product and through time. Reports can be defined from multiple perspectives that provide a high-level or detailed view of the performance of any aspect of the business. Decision makers can navigate throughout their database by drilling down on a report to view elements at finer levels of detail, or by pivoting to view reports from different perspectives. To enable such full-functioned business analyses, OLAP systems need to (1) support sophisticated analyses, (2) scale to large numbers of dimensions, and (3) support analyses against large atomic data sets. These three key requirements are discussed further below.
Decision makers use key performance metrics to evaluate the operations within their domain, and OLAP systems need to be capable of delivering these metrics in a user-customizable format. These metrics may be obtained from the transactional databases precalculated and stored in the database, or generated on demand during the query process. Commonly used metrics include:
(1) Multidimensional Ratios (e.g. Percent to Total):
“Show me the contribution to weekly sales and category profit made by all items sold in the Northwest stores between July. 1 and July. 14.”
(2) Comparisons (e.g. Actual vs. Plan, This Period vs. Last Period):
“Show me the sales to plan percentage variation for this year and compare it to that of the previous year to identify planning discrepancies.”
(3) Ranking and Statistical Profiles (e.g. Top N/Bottom N, 70/30, Quartiles):
“Show me sales, profit and average call volume per day for my 20 most profitable salespeople, who are in the top 30% of the worldwide sales.”
(4) Custom Consolidations:
“Show me an abbreviated income statement by quarter for the last two quarters for my Western Region operations.”
Knowledge workers analyze data from a number of different business perspectives or dimensions. As used hereinafter, a dimension is any element or hierarchical combination of elements in a data model that can be displayed orthogonally with respect to other combinations of elements in the data model. For example, if a report lists sales by week, promotion, store, and department, then the report would be a slice of data taken from a four-dimensional data model.
Target marketing and market segmentation applications involve extracting highly qualified result sets from large volumes of data. For example, a direct marketing organization might want to generate a targeted mailing list based on dozens of characteristics, including purchase frequency, size of the last purchase, past buying trends, customer location, age of customer, and gender of customer. These applications rapidly increase the dimensionality requirements for analysis.
The number of dimensions in OLAP systems range from a few orthogonal dimensions to hundreds of orthogonal dimensions. Orthogonal dimensions in an exemplary OLAP application might include Geography, Time, and Products.
Atomic data refers to the lowest level of data granularity required for effective decision making. In the case of a retail merchandising manager, “atomic data” may refer to information by store, by day, and by item. For a banker, atomic data may be information by account, by transaction, and by branch. Most organizations implementing OLAP systems find themselves needing systems that can scale to tens, hundreds, and even thousands of gigabytes of atomic information.
As OLAP systems become more pervasive and are used by the majority of the enterprise, more data over longer time frames will be included in the data store (i.e. data warehouse), and the size of the database will increase by at least an order of magnitude. Thus, OLAP systems need to be able to scale from present to near-future volumes of data.
In general, OLAP systems need to (1) support the complex analysis requirements of decision-makers, (2) analyze the data from a number of different perspectives (i.e. business dimensions), and (3) support complex analyses against large input (atomic-level) data sets from a Data Warehouse maintained by the organization using a relational database management system (RDBMS).
Vendors of OLAP systems classify OLAP Systems as either Relational OLAP (ROLAP) or Multidimensional OLAP (MOLAP) based on the underlying architecture thereof. Thus, there are two basic architectures for On-Line Analytical Processing systems: The ROLAP Architecture, and the MOLAP architecture.
Overview of the Relational OLAP (ROLAP) System Architecture
The Relational OLAP (ROLAP) system accesses data stored in a Data Warehouse to provide OLAP analyses. The premise of ROLAP is that OLAP capabilities are best provided directly against the relational database, i.e. the Data Warehouse.
The ROLAP architecture was invented to enable direct access of data from Data Warehouses, and therefore support optimization techniques to meet batch window requirements and provide fast response times. Typically, these optimization techniques include application-level table partitioning, pre-aggregate inferencing, denormalization support, and the joining of multiple fact tables.
As shown in FIG. 1A, a typical prior art ROLAP system has a three-tier or layer client/server architecture. The “database layer” utilizes relational databases for data storage, access, and retrieval processes. The “application logic layer” is the ROLAP engine which executes the multidimensional reports from multiple users. The ROLAP engine integrates with a variety of “presentation layers,” through which users perform OLAP analyses.
After the data model for the data warehouse is defined, data from on-line transaction-processing (OLTP) systems is loaded into the relational database management system (RDBMS). If required by the data model, database routines are run to pre-aggregate the data within the RDBMS. Indices are then created to optimize query access times. End users submit multidimensional analyses to the ROLAP engine, which then dynamically transforms the requests into SQL execution plans. The SQL execution plans are submitted to the relational database for processing, the relational query results are cross-tabulated, and a multidimensional result data set is returned to the end user. ROLAP is a fully dynamic architecture capable of utilizing precalculated results when they are available, or dynamically generating results from atomic information when necessary.
Overview of MOLAP System Architecture
Multidimensional OLAP (MOLAP) systems utilize a proprietary multidimensional database (MDDB) to provide OLAP analyses. The main premise of this architecture is that data must be stored multidimensionally to be accessed and viewed multidimensionally.
As shown in FIG. 1B, prior art MOLAP systems have an Aggregation, Access and Retrieval module which is responsible for all data storage, access, and retrieval processes, including data aggregration (i.e. preaggregation) in the MDDB. As shown in FIG. 1B, the base data loader is fed with base data, in the most detailed level, from the Data Warehouse, into the Multi-Dimensional Data Base (MDDB). On top of the base data, layers of aggregated data are built-up by the Aggregation program, which is part of the Aggregation, Access and Retrieval module. As indicated in this figure, the application logic module is responsible for the execution of all OLAP requests/queries (e.g. ratios, ranks, forecasts, exception scanning, and slicing and dicing) of data within the MDDB. The presentation module integrates with the application logic module and provides an interface, through which the end users view and request OLAP analyses on their client machines which may be web-enabled through the infrastructure of the Internet. The client/server architecture of a MOLAP system allows multiple users to access the same multidimensional database (MDDB).
Information (i.e. basic data) from a variety of operational systems within an enterprise, comprising the Data Warehouse, is loaded into a prior art multidimensional database (MDDB) through a series of batch routines. The Express™ server by the Oracle Corporation is exemplary of a popular server which can be used to carry out the data loading process in prior art MOLAP systems. As shown in FIG. 2B an exemplary 3-D MDDB is schematically depicted, showing geography, time and products as the “dimensions” of the database. The multidimensional data of the MDDB is organized in an array structure, as shown in FIG. 2C. Physically, the Express™ server stores data in pages (or records) of an information file. Pages contain 512, or 2048, or 4096 bytes of data, depending on the platform and release of the Express™ server. In order to look up the physical record address from the database file recorded on a disk or other mass storage device, the Express™ server generates a data structure referred to as a “Page Allocation Table (PAT)”. As shown in FIG. 2D, the PAT tells the Express™ server the physical record number that contains the page of data. Typically, the PAT is organized in pages. The simplest way to access a data element in the MDDB is by calculating the “offset” using the additions and multiplications expressed by a simple formula:Offset=Months+Product*(#of_Months)+City*(#of_Months*#of_Products)
During an OLAP session, the response time of a multidimensional query on a prior art MDDB depends on how many cells in the MDDB have to be added “on the fly”. As the number of dimensions in the MDDB increases linearly, the number of the cells in the MDDB increases exponentially. However, it is known that the majority of multidimensional queries deal with summarized high level data. Thus, as shown in FIGS. 3A and 3B, once the atomic data (i.e. “basic data”) has been loaded into the MDDB, the general approach is to perform a series of calculations in batch in order to aggregate (i.e. pre-aggregate) the data elements along the orthogonal dimensions of the MDDB and fill the array structures thereof. For example, revenue figures for all retail stores in a particular state (i.e. New York) would be added together to fill the state level cells in the MDDB. After the array structure in the database has been filled, integer-based indices are created and hashing algorithms are used to improve query access times. Pre-aggregation of dimension D0 is always performed along the cross-section of the MDDB along the D0 dimension.
As shown in FIG. 3C2, the primarily loaded data in the MDDB is organized at its lowest dimensional hierarchy. As shown in FIGS. 3C1 and 3C3, the results of the pre-aggregations are stored in the neighboring parts of the MDDB.
As shown in FIG. 3C2, along the TIME dimension, weeks are the aggregation results of days, months are the aggregation results of weeks, and quarters are the aggregation results of months. While not shown in the figures, along the GEOGRAPHY dimension, states are the aggregation results of cities, countries are the aggregation results of states, and continents are the aggregation results of countries. By pre-aggregating (i.e. consolidating or compiling) all logical subtotals and totals along all dimensions of the MDDB, it is possible to carry out real-time MOLAP operations using a multidimensional database (MDDB) containing both basic (i.e. atomic) and pre-aggregated data. Once this compilation process has been completed, the MDDB is ready for use. Users request OLAP reports by submitting queries through the OLAP Application interface (e.g. using web-enabled client machines), and the application logic layer responds to the submitted queries by retrieving the stored data from the MDDB for display on the client machine.
Typically, in MDDB systems, the aggregated data is very sparse, tending to explode as the number of dimension grows and dramatically slowing down the retrieval process (as described in the report entitled “Database Explosion: The OLAP Report”, incorporated herein by reference). Quick and on line retrieval of queried data is critical in delivering on-line response for OLAP queries. Therefore, the data structure of the MDDB, and methods of its storing, indexing and handling are dictated mainly by the need of fast retrieval of massive and sparse data.
Different solutions for this problem are disclosed in the following US Patents, each of which is incorporated herein by reference in its entirety: [0036] U.S. Pat. No. 5,822,751 “Efficient Multidimensional Data Aggregation Operator Implementation” [0037] U.S. Pat. No. 5,805,885 “Method And System For Aggregation Objects” [0038] U.S. Pat. No. 5,781,896 “Method And System For Efficiently Performing Database Table Aggregation Using An Aggregation Index” [0039] U.S. Pat. No. 5,745,764 “Method And System For Aggregation Objects”
In all the prior art OLAP servers, the process of storing, indexing and handling MDDB utilize complex data structures to largely improve the retrieval speed, as part of the querying process, at the cost of slowing down the storing and aggregation. The query-bounded structure, that must support fast retrieval of queries in a restricting environment of high sparsity and multi-hierarchies, is not the optimal one for fast aggregation.
In addition to the aggregation process, the Aggregation, Access and Retrieval module is responsible for all data storage, retrieval and access processes. The Logic module is responsible for the execution of OLAP queries. The Presentation module intermediates between the user and the logic module and provides an interface through which the end users view and request OLAP analyses. The client/server architecture allows multiple users to simultaneously access the multidimensional database.
In summary, general system requirements of OLAP systems include: (1) supporting sophisticated analysis, (2) scaling to large number of dimensions, and (3) supporting analysis against large atomic data sets.
MOLAP system architecture is capable of providing analytically sophisticated reports and analysis functionality. However, requirements (2) and (3) fundamentally limit MOLAP's capability, because to be effective and to meet end-user requirements, MOLAP databases need a high degree of aggregation.
By contrast, the ROLAP system architecture allows the construction of systems requiring a low degree of aggregation, but such systems are significantly slower than systems based on MOLAP system architecture principles. The resulting long aggregation times of ROLAP systems impose severe limitations on its volumes and dimensional capabilities.
The graphs plotted in FIG. 5 clearly indicate the computational demands that are created when searching an MDDB during an OLAP session, where answers to queries are presented to the MOLAP system, and answers thereto are solicited often under real-time constraints. However, prior art MOLAP systems have limited capabilities to dynamically create data aggregations or to calculate business metrics that have not been precalculated and stored in the MDDB.
The large volumes of data and the high dimensionality of certain market segmentation applications are orders of magnitude beyond the limits of current multidimensional databases.
ROLAP is capable of higher data volumes. However, the ROLAP architecture, despite its high volume and dimensionality superiority, suffers from several significant drawbacks as compared to MOLAP: [0048] Full aggregation of large data volumes are very time consuming, otherwise, partial aggregation severely degrades the query response. [0049] It has a slower query response [0050] It requires developers and end users to know SQL [0051] SQL is less capable of the sophisticated analytical functionality necessary for OLAP [0052] ROLAP provides limited application functionality.
Thus, improved techniques for data aggregation within MOLAP systems would appear to allow the number of dimensions of and the size of atomic (i.e. basic) data sets in the MDDB to be significantly increased, and thus increase the usage of the MOLAP system architecture.
Also, improved techniques for data aggregation within ROLAP systems would appear to allow for maximized query performance on large data volumes, and reduce the time of partial aggregations that degrades query response, and thus generally benefit ROLAP system architectures.
Thus, there is a great need in the art for an improved way of and means for aggregating data elements within a multi-dimensional database (MDDB), while avoiding the shortcomings and drawbacks of prior art systems and methodologies.