Data clustering is a storage methodology in which like or similar data records are grouped together. Multidimensional clustering (“MDC”) allows data to be ordered simultaneously along different dimensions. MDC is motivated to a large extent by the spectacular growth of relational data, which has spurred the continual research and development of improved techniques for handling large data sets and complex queries. In particular, online analytical processing (“OLAP”) has become popular for data mining and decision support. OLAP applications are characterized by multidimensional analysis of compiled enterprise data, and typically involve transactional queries that use group-by and aggregation on star and snowflake schemas, multidimensional range queries, and cube, rollup, and drilldown operations.
The performance of multidimensional queries (e.g., group-by queries and range queries) and of complex decision support queries, which typically access a significant number of data records, is often improved through data clustering, as input/output (I/O) costs may be reduced significantly and processing costs may be reduced modestly. Thus, MDC techniques may offer significant performance benefits for complex workloads.
However, for any significant dimensionality, the space of possible solutions is combinatorially large, and there are complex design tradeoffs to be made in the selection of clustering dimensions. As a result, a database clustering schema can be difficult to design even for experienced database designers and industry experts. A poor choice of clustering dimensions and coarsification can be disastrous, potentially reducing performance rather than enhancing it, and expanding storage requirements and associated costs by orders of magnitude. Conversely, a judicious selection of clustering dimensions and coarsification may yield substantial performance benefits while limiting storage expansion to an acceptable level.
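To give a rough sense of how quickly the solution space grows, the following minimal Python sketch counts candidate clustering designs under a simplifying assumption that is not taken from the text above: each candidate column is either excluded or clustered at exactly one of a fixed number of coarsification levels. The function name `count_clustering_designs` and the example figures (10 columns, 4 levels each) are hypothetical and serve only as illustration.

```python
from math import prod


def count_clustering_designs(levels_per_column):
    """Count candidate designs under an assumed, simplified model.

    Each column is either left out of the clustering or used as a
    dimension at exactly one of its coarsification levels.
    """
    # Each column contributes (1 + levels) choices:
    # "not a dimension" plus one choice per coarsification level.
    total = prod(1 + levels for levels in levels_per_column)
    # Exclude the single degenerate design in which no column is chosen.
    return total - 1


# Hypothetical table: 10 candidate columns, 4 coarsification levels each.
designs = count_clustering_designs([4] * 10)
print(designs)  # 5**10 - 1 = 9765624
```

Even this small hypothetical case yields nearly ten million candidate designs, which suggests why exhaustive manual evaluation is impractical.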
Thus, what is needed is a more systematic and autonomic approach to designing a database clustering schema.