Database applications are commonly used to store large amounts of data. One branch of database applications that is growing in popularity is Online Analytical Processing (OLAP). OLAP typically involves designing databases for fast access. Using specialized indexing techniques, OLAP applications process queries that pertain to large amounts of data and to multidimensional databases much faster than traditional techniques.
Typically, a multidimensional database stores and organizes data in a way that better reflects how a user would want to view the data than is possible in a two-dimensional spreadsheet or relational database file. Multidimensional databases are generally better suited to applications that involve large volumes of numeric data and require calculations on that data, such as business analysis and forecasting, although they are not limited to such applications.
A dimension within multidimensional data is typically a basic categorical definition of data. Other dimensions in the database allow a user to analyze a large volume of data from many different viewpoints or perspectives. Thus, a dimension can also be described as a perspective or view of a specific dataset. A different view of the same data is referred to as an alternative dimension.
One drawback to multidimensional databases is that they become sparse in large applications. The sparsity of a database refers to a relative lack of density of the values in the database. The fewer values a multidimensional database holds in relation to its total number of cells, the more sparse the database is said to be. Typically, as the number of dimensions grows, so does the sparsity. Sparse databases take up a large amount of space relative to the amount of actual data stored. As such, techniques for reducing the dimensionality of a database to arrive at denser cubes within the database may be utilized. One such technique is called clustering.
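The notion of sparsity described above can be made concrete with a short sketch. The function name `sparsity`, the example members, and the added `quarter` dimension are all illustrative assumptions rather than part of any particular database system; the sketch simply compares the number of populated cells with the total number of cells in the cube.

```python
from math import prod


def sparsity(dimensions, populated_cells):
    """Return the fraction of cells in the cube that hold no value.

    `dimensions` maps a dimension name to its list of members;
    `populated_cells` is a set of member tuples, one entry per dimension.
    Both names are hypothetical and used only for this illustration.
    """
    total_cells = prod(len(members) for members in dimensions.values())
    return 1.0 - len(populated_cells) / total_cells


# A small three-dimensional cube: 3 * 4 * 2 = 24 cells, 4 of them populated.
dimensions = {
    "product": ["Skis", "Boots", "Poles"],
    "location": ["USA", "Mexico", "U.K.", "Canada"],
    "purchases": ["Sales", "Profit"],
}
cells = {
    ("Skis", "USA", "Sales"),
    ("Poles", "USA", "Sales"),
    ("Boots", "USA", "Profit"),
    ("Skis", "Canada", "Profit"),
}
base = sparsity(dimensions, cells)

# Adding a dimension (here an assumed four-member "quarter" dimension)
# multiplies the cell count while the number of stored values stays the same.
wider_dimensions = {**dimensions, "quarter": ["Q1", "Q2", "Q3", "Q4"]}
wider_cells = {cell + ("Q1",) for cell in cells}
wider = sparsity(wider_dimensions, wider_cells)

print(f"3 dimensions: {base:.3f}, 4 dimensions: {wider:.3f}")
```

With the same four values, the four-dimensional cube has 96 cells instead of 24, so a larger share of the cube is empty.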
In clustering, similar cells having values in them may be grouped into a single region, resulting in a database having a number of regions, wherein each region represents a “dense” portion of the database. This eliminates or at least reduces the need to handle the sparse or empty areas of the database during storage, aggregation, and other functions.
Earlier attempts to form regions utilize time-consuming algorithms that examine the entire data set at once and separate the set into a number of regions. These algorithms require that the entire data set be known “a priori”. This can be a very difficult restriction because data tends to grow incrementally, depending upon the operations within a multidimensional database system. Utilizing existing clustering algorithms means that the clustering of data for the entire data set needs to be re-computed whenever a single data point is added or deleted.
As such, clustering algorithms typically are not used in multidimensional databases as a data-storage structure. The cost of computing these regions for the whole data set during each update outweighs any benefit received from the optimized storage mechanism.
Additionally, the focus in earlier clustering techniques is on forming geometrically compact regions. In other words, the decision to include a point in a region is based on its geometric distance from other points within the region. However, the geometry of a multidimensional cube can usually be easily altered. For example, it is typically quite easy to reorganize members of a dimension so that points that were geometrically proximate are now further apart. Thus, prior solutions fail to consider that the geometric distance between points is less important than whether the cross product of the respective dimensions creates a dense population of cells.
FIG. 1 is a graph diagram illustrating an example of a small multidimensional database. Here, a product dimension 100 lists product names (Skis, Boots, Poles), a location dimension 102 lists countries (USA, Mexico, U.K., Canada), and a purchases dimension 104 lists monetary benefits (Sales, Profit). Also depicted in the graph are four points, labeled P1, P2, P3, and P4. P1 represents (Skis, USA, Sales), P2 represents (Poles, USA, Sales), P3 represents (Boots, USA, Profit), and P4 represents (Skis, Canada, Profit). Note that the dotted lines represent projections along the purchases dimension 104.
According to the prior art techniques, the points that are geometrically close to each other are clustered together in a region. P1, P2, and P3 are equidistant from one another. This results in one region representing P1, P2, and P3, and a second region representing P4. Yet this is due partially to what is essentially an arbitrary ordering of elements within each dimension. For example, FIG. 2 represents the same points if we had chosen to swap Mexico and Canada in the ordering of the location dimension, and Boots and Skis in the product dimension. As can be seen, P1 and P2 are now an equal distance apart, as are P3 and P4, resulting in a separate clustering of P1 and P2 in one region and P3 and P4 in the second region. This occurs despite the fact that the underlying data is no different from FIG. 1 to FIG. 2.
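The dependence of geometric distance on member ordering can be sketched as follows. The sketch assumes, purely for illustration, that a member's coordinate is its position in its dimension's ordering; the figures may be drawn to a different scale, so the distances computed here need not match those depicted in FIG. 1 and FIG. 2. The helper names `coordinates` and `pairwise_distances` are hypothetical.

```python
from itertools import combinations
from math import dist


def coordinates(point, orderings):
    """Map each member of a point to its index in the matching ordering."""
    return tuple(order.index(member) for member, order in zip(point, orderings))


def pairwise_distances(points, orderings):
    """Euclidean distance between every pair of labeled points."""
    coords = {label: coordinates(p, orderings) for label, p in points.items()}
    return {
        (a, b): dist(coords[a], coords[b])
        for a, b in combinations(sorted(coords), 2)
    }


# The four points named in the text: (product, location, purchases).
points = {
    "P1": ("Skis", "USA", "Sales"),
    "P2": ("Poles", "USA", "Sales"),
    "P3": ("Boots", "USA", "Profit"),
    "P4": ("Skis", "Canada", "Profit"),
}

# Ordering as listed for FIG. 1.
first_ordering = (
    ["Skis", "Boots", "Poles"],
    ["USA", "Mexico", "U.K.", "Canada"],
    ["Sales", "Profit"],
)
# Same members with Mexico/Canada and Boots/Skis swapped, as for FIG. 2.
second_ordering = (
    ["Boots", "Skis", "Poles"],
    ["USA", "Canada", "U.K.", "Mexico"],
    ["Sales", "Profit"],
)

before = pairwise_distances(points, first_ordering)
after = pairwise_distances(points, second_ordering)

for pair in before:
    print(pair, f"{before[pair]:.3f} -> {after[pair]:.3f}")
```

The cells and their values are identical in both runs; only the order in which members are listed has changed, yet several pairwise distances differ, which is enough to change the output of a distance-based clustering.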
Furthermore, the dependence on geometric distance as a criterion for clustering ignores perhaps the most relevant information for efficiency of the system: the information the user is interested in. For example, the regions generated by prior art techniques for FIG. 1 are made even worse in the instance that the user is most interested in Profit numbers, as the regions group the two profit points separately. It would have been more efficient if both profit points were in the same region, given the user's interest in those numbers.
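The efficiency concern above can be illustrated by counting how many regions a query must visit. This is a minimal sketch under stated assumptions: the function `regions_touched`, the region labels, and the alternative grouping are hypothetical and serve only to compare the grouping described for FIG. 1 with one that keeps the two Profit points together.

```python
def regions_touched(regions, cells, predicate):
    """Count the regions containing at least one cell that matches the query."""
    return sum(
        1
        for members in regions.values()
        if any(predicate(cells[label]) for label in members)
    )


# The four points named in the text: (product, location, purchases).
cells = {
    "P1": ("Skis", "USA", "Sales"),
    "P2": ("Poles", "USA", "Sales"),
    "P3": ("Boots", "USA", "Profit"),
    "P4": ("Skis", "Canada", "Profit"),
}

# Grouping described for FIG. 1 under distance-based clustering.
distance_based = {"R1": ["P1", "P2", "P3"], "R2": ["P4"]}
# Assumed alternative that keeps the Profit points in one region.
profit_together = {"R1": ["P1", "P2"], "R2": ["P3", "P4"]}


def is_profit(cell):
    """Query predicate: the user is interested in Profit values only."""
    return cell[2] == "Profit"


visited_distance_based = regions_touched(distance_based, cells, is_profit)
visited_profit_together = regions_touched(profit_together, cells, is_profit)
print(visited_distance_based, visited_profit_together)
```

Under the distance-based grouping a Profit query must visit both regions, whereas the alternative grouping confines it to a single region.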
Therefore, what is needed is a clustering solution that does not require that the entire data set be known a priori. Additionally, what is needed is a clustering solution that does not need to use geometric distance as the criterion to form regions.