Database applications are commonly used to store large amounts of data. One branch of database applications that is growing in popularity is Online Analytical Processing (OLAP). OLAP involves the use of computers to extract useful trends and correlations from large databases of raw data. It may involve consolidating and summarizing huge databases containing millions of items (e.g., sales figures from all branches of a supermarket chain) and making this data viewable along multidimensional axes, while allowing the variables of interest to be changed at will in an interactive fashion. As a result, the processing and memory loads on OLAP servers are very high.
Typically, a multidimensional database stores and organizes data in a way that better reflects how a user would want to view the data than is possible in a two-dimensional spreadsheet or relational database file. Multidimensional databases are generally better suited to applications that involve large volumes of numeric data and require calculations on that data, such as business analysis and forecasting, although they are not limited to such applications.
A dimension within multidimensional data is typically a basic categorical definition of data. Having multiple dimensions in the database allows a user to analyze a large volume of data from many different perspectives. Each dimension may have a hierarchy associated with it. For example, a product group dimension may have a sublevel in the hierarchy that includes entries such as drinks and cookies. The drinks entry may then have its own sublevel of individual product identifiers for each type of drink sold. Each hierarchy may have any number of levels.
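The product group hierarchy described above can be illustrated with a small tree structure. The following is only a minimal sketch: the `Member` class, its methods, and the product identifiers such as `SKU-1001` are hypothetical names chosen for illustration and are not taken from any particular OLAP system.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Member:
    """One entry in a dimension hierarchy (hypothetical structure)."""
    name: str
    children: list[Member] = field(default_factory=list)

    def depth(self) -> int:
        """Number of hierarchy levels at and below this member."""
        return 1 + max((child.depth() for child in self.children), default=0)

    def leaves(self) -> list[str]:
        """Names of the lowest-level members under this member."""
        if not self.children:
            return [self.name]
        return [leaf for child in self.children for leaf in child.leaves()]


# Product group dimension: top level -> product groups -> product identifiers.
# The identifiers are made up for the example.
product_group = Member("all products", [
    Member("drinks", [Member("SKU-1001"), Member("SKU-1002")]),
    Member("cookies", [Member("SKU-2001")]),
])

levels = product_group.depth()
products = product_group.leaves()
```

Because each member simply holds a list of child members, the same structure accommodates a hierarchy with any number of levels.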
For each event, measures may be recorded. In a sales example, these may include sales amount, product identifier, location of purchase, etc. This raw information is known as input level data. This data may be stored in a multidimensional cube. This cube may be extremely large given the number of dimensions and variables typical of businesses, but it may also be extremely sparse, in that there are large gaps where no information is stored. This is because only a small percentage of the possible combinations of variables will actually be used (e.g., no customer is going to purchase every single item in stock over their lifetime, let alone in a single day).
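The sparsity of such a cube can be illustrated by storing only the cells that actually hold data and comparing their count with the number of possible cells. This is a sketch under stated assumptions: the dimension names, the member lists, and the `record_sale` helper are hypothetical, and a dictionary keyed by coordinate tuples is just one simple way to represent a sparse cube.

```python
from math import prod

# Hypothetical dimensions and their members.
dimensions = {
    "product": ["cola", "water", "cookies", "crackers"],
    "store": ["north", "south", "east"],
    "day": ["mon", "tue", "wed", "thu", "fri"],
}

# Sparse cube: only coordinates with recorded events take up storage.
cube: dict[tuple[str, str, str], float] = {}


def record_sale(product: str, store: str, day: str, amount: float) -> None:
    """Add a sales amount to the cell at (product, store, day)."""
    key = (product, store, day)
    cube[key] = cube.get(key, 0.0) + amount


record_sale("cola", "north", "mon", 12.5)
record_sale("cola", "north", "mon", 7.5)   # same cell, amounts accumulate
record_sale("cookies", "east", "thu", 3.0)

# Every combination of members is a possible cell; few are populated.
possible_cells = prod(len(members) for members in dimensions.values())
density = len(cube) / possible_cells
```

Even in this toy example only 2 of the 60 possible cells are populated, and the fraction tends to shrink further as dimensions are added, since the number of possible cells grows multiplicatively while the number of recorded events does not.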
It is becoming increasingly common to have databases with a large number of dimensions, anywhere from 10 to 35 or more. Unfortunately, when dealing with that many dimensions, it is difficult for a user to visualize or understand relationships or patterns within the data. Most users cannot visualize more than a few dimensions. Sparsity compounds this problem: when the data is sparse, most views, especially at the more granular levels, reveal cells that are mainly empty.
There are several ways to reduce the apparent dimensionality of the data in order to help users understand and analyze it, depending upon how much is known. If the variables/data/measures of interest to the user are known (and are numeric), it is possible to rank the dimensions in terms of their correlation with changes to the values of those variables. It is then further possible to select only those dimensions of high rank as candidates for display along the axes of a grid interface. This presents to the user only the dimensions that are most likely to aid in their analysis.
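The ranking approach described above can be sketched as follows. The text does not specify how the correlation is computed, so this example assumes one common choice for a categorical dimension and a numeric measure: the correlation ratio (eta squared), which is the share of the measure's variance explained by grouping on the dimension. The function names, record layout, and sample values are hypothetical.

```python
from collections import defaultdict


def correlation_ratio(categories, values):
    """Eta squared: fraction of variance in values explained by the grouping."""
    overall_mean = sum(values) / len(values)
    total_ss = sum((v - overall_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0  # the measure never changes, so nothing can be explained
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    between_ss = sum(
        len(group) * (sum(group) / len(group) - overall_mean) ** 2
        for group in groups.values()
    )
    return between_ss / total_ss


def rank_dimensions(records, dimension_names, measure):
    """Return (dimension, score) pairs sorted from highest to lowest score."""
    values = [record[measure] for record in records]
    scores = {
        name: correlation_ratio([record[name] for record in records], values)
        for name in dimension_names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up input level data with three dimensions and one numeric measure.
records = [
    {"product": "drinks", "store": "north", "weekday": "mon", "sales": 100.0},
    {"product": "drinks", "store": "south", "weekday": "tue", "sales": 110.0},
    {"product": "cookies", "store": "north", "weekday": "tue", "sales": 20.0},
    {"product": "cookies", "store": "south", "weekday": "mon", "sales": 30.0},
]

ranking = rank_dimensions(records, ["product", "store", "weekday"], "sales")
# Keep only the two highest-ranked dimensions as candidates for the grid axes.
top_dimensions = [name for name, _ in ranking[:2]]
```

In this sample the product dimension explains nearly all of the variation in sales, the store dimension very little, and the weekday dimension none, so product and store would be offered as grid axes. One caveat of this particular statistic is that dimensions with many distinct members relative to the number of records tend to score high, so a practical implementation may need to adjust for cardinality.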
However, there are many times when variables are not numeric, or when specific variables of interest are not known. What is needed is a solution that can reduce the apparent dimensionality of the data set, and thus make it easier for users to comprehend, even when variables of interest are not numeric or are unknown.