Data clustering is a widely used technique in data management for storing data in a relational database system. Tuples of data are grouped on the basis of their logical similarity and co-located in nearby storage on a storage device. Data clustering reduces the number of physical input/output (I/O) operations and thereby the access time during processing. Data clustering can be performed in a single dimension, when data is grouped using one logical similarity criterion, or in a plurality of dimensions (i.e. multidimensional data clustering (MDC)), when more than one logical criterion for data grouping is used. Multidimensional data clustering, driven by business intelligence, online analytical processing (OLAP), and batch application processing, has become more popular in data warehousing.
Although this technology has proven to be useful, it would be desirable to present additional improvements. A cost of providing multidimensional data clustering for more effective data processing can be data storage expansion. More specifically, data clustering is typically performed by logical units or cells, where each cell represents a unique value of a clustering key. Each cell that contains data is composed of one or more physical storage blocks, each having a block size of one or more pages of memory. Thus, if the block size selected is too large or the cell data too scant, the result is a plethora of partially filled blocks and a waste of storage space. Consequently, clustering criteria must be selected carefully for their density and distribution across cells in order to use disk space effectively and avoid space wastage.
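The effect of block size on wasted space can be illustrated with a minimal sketch. The function name, the sample clustering key values, and the block capacity of four rows are hypothetical and chosen only for illustration; the sketch assumes each non-empty cell is allocated whole blocks that it does not share with other cells.

```python
import math
from collections import Counter


def blocks_for_cells(cell_keys, rows_per_block):
    """Return (blocks_used, wasted_row_slots) when every cell owns whole blocks.

    cell_keys is a list holding the clustering key value of each row.
    """
    rows_per_cell = Counter(cell_keys)
    # Each cell rounds its row count up to a whole number of blocks.
    blocks_used = sum(math.ceil(n / rows_per_block) for n in rows_per_cell.values())
    # Unused row slots are the capacity allocated minus the rows stored.
    wasted_row_slots = blocks_used * rows_per_block - len(cell_keys)
    return blocks_used, wasted_row_slots


# Hypothetical data: 14 rows spread unevenly over three cells.
keys = ["east"] * 10 + ["west"] * 3 + ["north"] * 1
blocks_used, wasted = blocks_for_cells(keys, rows_per_block=4)

# Blocks needed if the same rows were packed densely, ignoring clustering.
dense_blocks = math.ceil(len(keys) / 4)
```

In this example the sparsely populated cells each occupy a block of their own, so the clustered layout uses more blocks than a dense layout of the same rows would.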
The problem of efficient disk space usage is exacerbated in a multidimensional clustering space, where each dimension contributes to the sparsity of the joined space. For example, consider a multidimensional table with clustering criteria that include query dimensions A, B and C. Before data clustering, dimensions A, B and C may be stored as a table of data that has sufficient distribution and density so that each of A, B and C would be a useful clustering dimension by itself, leaving hardly any partially filled blocks. However, when A, B and C are used jointly as clustering dimension criteria, each unique combination of A, B and C results in a new cell. At least some, and possibly many, of the resulting multidimensional cells will necessarily have fewer records per cell than would be the case had the clustering key been composed of only one dimension. The result is cells that are less densely filled, which leads to partially filled blocks and therefore to storage expansion.
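The following sketch illustrates the A, B and C example under stated assumptions. The table contents (four distinct values per dimension, two rows per combination), the block capacity of eight rows, and the helper name are hypothetical; the sketch assumes each non-empty cell is allocated whole blocks of its own.

```python
import math
from collections import Counter
from itertools import product

ROWS_PER_BLOCK = 8  # hypothetical block capacity, in rows


def blocks_needed(rows, key_columns):
    """Count the blocks used when rows are clustered on the given key columns."""
    # One cell per unique combination of the clustering key values.
    cells = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return sum(math.ceil(n / ROWS_PER_BLOCK) for n in cells.values())


# Hypothetical table: 4 x 4 x 4 value combinations, two rows each (128 rows).
rows = [
    {"A": a, "B": b, "C": c}
    for a, b, c in product(range(4), repeat=3)
    for _ in range(2)
]

single = blocks_needed(rows, ["A"])           # 4 cells of 32 rows each
joint = blocks_needed(rows, ["A", "B", "C"])  # 64 cells of 2 rows each
expansion = joint / single
```

With these assumed figures, clustering on A alone fills every block completely, while clustering on A, B and C jointly leaves each block only one quarter full, so the same rows occupy four times as many blocks.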
Data storage expansion typically results in additional expenses related to the cost of acquiring and maintaining the additional physical storage devices. Furthermore, it is desirable to know the amount of expansion before physical data clustering is performed. Thus, there is a need for an awareness of the expansion amount for specific clustering criteria to facilitate selection among the criteria. Database efficiency can thereby be increased while an unsuitable database size is prevented. The need for such a system has heretofore remained unsatisfied.