Database applications are commonly used to store large amounts of data. One branch of database applications that is growing in popularity is Online Analytical Processing (OLAP) applications. OLAP involves the use of computers to extract useful trends and correlations from large databases of raw data. It may involve consolidating and summarizing huge databases containing millions of items (e.g., sales figures from all branches of a supermarket chain) and making this data viewable along multidimensional axes, while allowing the variables of interest to be changed at will in an interactive fashion. As such, the processing and memory load on OLAP servers is very high.
Typically, a multidimensional database stores and organizes data in a way that better reflects how a user would want to view the data than is possible in a two-dimensional spreadsheet or relational database file. Multidimensional databases are generally better suited to handle applications with large volumes of numeric data and that require calculations on numeric data, such as business analysis and forecasting, although they are not limited to such applications.
A dimension within multidimensional data is typically a basic categorical definition of data. Other dimensions in the database allow a user to analyze a large volume of data from many different perspectives. Each dimension may have a hierarchy associated with it. For example, a product group dimension may have a sublevel in the hierarchy that includes entries such as drinks and cookies. The drinks entry may then have its own sublevel of individual product identifiers for each type of drink sold. Each hierarchy may have any number of levels.
For each event, measures may be recorded. In a sales example, this may include sales amount, product identifier, location of purchase, etc. This raw information is known as input level data. This data may be stored in a multidimensional cube. This cube may be extremely large given the number of dimensions and variables typical to businesses, but it may also be extremely sparse, in that there are large gaps where no information is stored. This is because only a small percentage of the possible combinations of variables will actually be used (e.g., no customer is going to purchase every single item in stock over their lifetime, let alone in a single day).
Users typically will issue queries to the database in order to get information they want or need. These queries will typically ask for a summary of data across certain dimensions. In many applications, querying a single cell in the database is rarely needed, as the user typically would not be interested in that fine a detail. For example, in a supermarket chain database, a user may be interested in overall sales for various stores in the month of January, or sales of a certain soft drink in the Southwest over the last year, but they would probably not be interested in how much of a particular product a single customer bought on a single day. Each query run results in what is called a view of the data. As there are a large number of possible queries that can be run, there also are a large number of possible views.
In a relational database, these queries are executed dynamically at runtime, at which point the appropriate data is aggregated. While this method requires the least amount of dedicated storage space, it can be slow, especially as the size of the cube increases. Users typically aren't willing to allow a significant amount of time to run a query.
One solution is to pre-run every single possible query, and thus pre-load every possible view into a memory. While this certainly reduces runtime delays, for very large cubes it can take up a significant amount of memory and processing power. In fact, for cubes typical in many businesses, such a solution would require years of processing time, which is obviously impractical.
Therefore, it becomes beneficial to pre-load only a selected number of key views, perhaps the ones most likely to be used, into the memory. A decision must therefore be made as to which views should be selected.
Most of the previous work in view selection is based on a hypercube lattice structure. FIG. 1 is a diagram illustrating an example hypercube lattice structure. Each node of the structure represents a possible view. Here, the structure is based on a 3-dimensional cube having no hierarchies. The numbers associated with the nodes represent the number of rows for that view. These numbers are typically derived by some sort of view size estimation process. Here the number of rows of affected records is used as a rough estimate. The root node 100 in the structure represents the raw input data. A given view can be calculated from any materialized ancestor view.
In one approach to selecting which views to materialize, an iterative process is executed. At each stage in the iterative process, the benefits of selecting each possible view are calculated, and the most beneficial view is selected for materialization. The calculation of these benefits includes examining the effects on descendants of the view. This process is repeated (with the selected views removed from the set of possible views), selecting a new view to materialize at each iteration. This process repeats until a predetermined number of materialized views has been reached.
The drawback to this approach is that it requires the evaluation of every unselected view during each iteration, and each evaluation considers the effect on every descendant. This process then consumes on the order of kn2 time, where k is equal to the number of views to select and n is the number of nodes (views) in the set. While this order of complexity might at first look good as it is on the order of polynomial time, the result is misleading. The number of nodes (views) in a set is actually exponential relative to the number of dimensions in the cube. In a database with no hierarchies, that would still result in 2d possible views, where d is the number of dimensions. Thus the time complexity of the iterative approach described above is actually on the order of k22d. This results in an impractical amount of time required simply to make the selection of which views to materialize. The processing power and memory capacity required to make such calculations for large cubes is such that any system with that level of resources wouldn't need to select only certain views in the first place—it could afford to materialize all views.
Thus, it has been proposed that the iterative approach be modified to allow for a two step process. In this new approach, each iteration is divided into a nomination phase and a selection phase. The first phase nominates promising views into a candidate set. The second phase estimates the benefits of materializing each candidate. As will be seen, this approach essentially finds the path of least resistance starting from the root view (input).
This is accomplished by first, in the nomination stage, starting at the root, and then examining the direct children of the root. In FIG. 1, this would involve examining views 102, 104, and 106 to determine which one would have the least query time (assuming all already selected views have been materialized). Here 102 had the least query time. Then view 102 would be added to the list of candidates, and the direct children of 102 would be examined (108, 110). Here, 108 would have the least query time, and thus is also added to the candidate set. Once the bottom of the lattice is reached, the nomination stage ends. Here, this results in a candidate set 102, 108, 112. FIG. 2 is a diagram illustrating an example of the nomination stages of two iterations of this process.
In the selection phase, each view in the candidate set is evaluated, and the view that appears to yield the most benefit is materialized. Here, the evaluation for each candidate is performed by taking the difference with the smallest ancestor selected for materialization, and multiplying the savings by the estimated number of nodes affected. The number of nodes affected may be estimated in two steps. First, the number of descendants is calculated, including the candidate itself. Then the process looks for materialized views smaller than the candidate. If any such view is found, the process accounts for the effect of the view with the largest number of descendants in common with the candidate. The overlapping descendants are subtracted from the count of views affected by materializing the candidate, since the benefits are undercut by the other view. FIG. 3 is a diagram illustrating an example of the selection phase of the first iteration.
The view 102 is then selected for materialization during the first iteration, and is removed from the candidates. Then the second iteration may begin with another nomination phase. This continues until a predetermined number of views have been selected.
This modified approach results in on the order of (kN)2 processing time, where k is the number of views and N is the number of potential aggregates. The problem with the modified approach, however, is that it was designed without hierarchies in mind. While a number of dimensions can certainly create a moderately sized lattice, the addition of hierarchies makes even this processing time on a non-exponential order unwieldy. The nomination stage grows significantly complicated, as at each node all direct children must be examined. What is needed is a solution that further reduces the order of complexity for processing time for selecting views to be materialized for a multidimensional cube.