Database applications are commonly used to store large amounts of data. One branch of database applications that is growing in popularity is Online Analytical Processing (OLAP) applications. OLAP involves the use of computers to extract useful trends and correlations from large databases of raw data. It may involve consolidating and summarizing huge databases containing millions of items (e.g., sales figures from all branches of a supermarket chain) and making this data viewable along multidimensional axes, while allowing the variables of interest to be changed at will in an interactive fashion. As such, the processing and memory load on OLAP servers is very high.
Typically, a multidimensional database stores and organizes data in a way that better reflects how a user would want to view the data than is possible in a two-dimensional spreadsheet or relational database file. Multidimensional databases are generally better suited to handle applications with large volumes of numeric data and that require calculations on numeric data, such as business analysis and forecasting, although they are not limited to such applications.
A dimension within multidimensional data is typically a basic categorical definition of data. Other dimensions in the database allow a user to analyze a large volume of data from many different perspectives. Each dimension may have a hierarchy associated with it. For example, a product group dimension may have a sublevel in the hierarchy that includes entries such as drinks and cookies. The drinks entry may then have its own sublevel of individual product identifiers for each type of drink sold. Each hierarchy may have any number of levels.
For each event, measures may be recorded. In a sales example, this may include sales amount, product identifier, location of purchase, etc. This raw information is known as input level data. This data may be stored in a multidimensional cube. This cube may be extremely large given the number of dimensions and variables typical to businesses, but it may also be extremely sparse, in that there are large gaps where no information is stored. This is because only a small percentage of the possible combinations of variables will actually be used (e.g., no customer is going to purchase every single item in stock over their lifetime, let alone in a single day).
Users typically will issue queries to the database in order to get information they want or need. These queries will typically ask for a summary of data across certain dimensions. In many applications, querying a single cell in the database is rarely needed, as the user typically would not be interested in that fine a detail. For example, in a supermarket chain database, a user may be interested in overall sales for various stores in the month of January, or sales of a certain soft drink in the Southwest over the last year, but they would probably not be interested in how much of a particular product a single customer bought on a single day.
In a relational database, these queries are executed dynamically at runtime, at which point the appropriate data is aggregated. While this method requires the least amount of dedicated storage space, it can be slow, especially as the size of the cube increases. Users typically aren't willing to allow a significant amount of time to run a query.
One solution is to pre-run every single possible query, and materialize the results on disk. While this certainly reduces runtime delays, for very large cubes it can take up a significant amount of memory and processing power. In fact, for cubes typical in many businesses, such a solution would require years of processing time, which is obviously impractical.
Therefore, it is beneficial to materialize only those queries where the benefit of materialization outweighs its costs. As the number of candidate queries for materialization is exorbitant, the queries are typically organized into groups called views. A view is the set of all 1-cell queries such that for each dimension, each query requests information at the same level in the hierarchy. A decision must therefore be made as to which views should be materialized. A key to this decision is knowing how long a particular query will take to execute.
One abstract solution proposed to estimate how long a particular query will take to execute is to look solely at the number of “affected records”, i.e., the number of existing and missing tuples in the materialized view that correspond to the tuples in the query view. While in the abstract this would appear to make sense, in actual practice it results in wildly incorrect estimations. This is due to many factors. One factor is that in real operation, only a small percentage of “affected records” would need to be examined when running a query, due to the sparsity of the data. Another factor is that the order in which the records are stored can have a dramatic effect on the speed at which a query is run. If for example, all of the required records are very close to each other, it may only require one page fetch in order to grab all of them. What is needed is a solution that provides for a better estimate of the time needed to run a particular query in a multidimensional database.