In database terminology, a multi-dimensional clustering (MDC) table is a table in which data is logically, and sometimes also physically, clustered on disk according to a specified key referred to as the “dimension key.” Each distinct value in the domain of key values represents a “cell” of the MDC table. On disk, each cell comprises one or more fixed-size “blocks,” where each block may contain data for only one cell. It is not necessary for all blocks associated with a cell to be stored contiguously on disk.
For example, consider an MDC table SALES having the following definition:

    SALES (INT STORE_ID, INT CUST_ID, DATE SALE_DATE,
           INT SALE_MONTH GENERATED ALWAYS AS MONTH(SALE_DATE))
        DIMENSIONS (STORE_ID, SALE_MONTH)
The DIMENSIONS clause of the SALES table defines the dimension key that indicates how the data is to be clustered on disk. In this example, data for the SALES table is to be clustered by STORE_ID and SALE_MONTH. The physical blocks on disk for this table may appear as follows:

    block1 (contains data for cell (1, 1), i.e., STORE_ID “1” and month “1”):
        1, 20, 2002/01/01, 1
        1, 21, 2002/01/03, 1
        1, 21, 2002/01/04, 1
        1, 22, 2002/01/05, 1
        :
    block2 (contains data for cell (2, 1), i.e., STORE_ID “2” and month “1”):
        2, 43, 2002/01/01, 1
        2, 10, 2002/01/02, 1
        2, 52, 2002/01/15, 1
        :
    block3 (contains more data for cell (1, 1), i.e., STORE_ID “1” and month “1”):
        1, 25, 2002/01/15, 1
        1, 31, 2002/01/15, 1
        1, 32, 2002/01/15, 1
        1, 45, 2002/01/16, 1
        1, 12, 2002/01/16, 1
        :
Each comma-delimited line above denotes a single row; groups of rows may form fixed-size pages (i.e., sub-blocks or other sub-divisions) within the fixed-size blocks.
A challenge that arises when large quantities of data are bulk-loaded into an MDC table is ensuring favorable input/output (I/O) characteristics (i.e., minimizing read and write operations to disk) in the face of an unknown distribution of input data. The term “unknown distribution” as used herein refers to the absence of any prior knowledge, on the part of the loading algorithm, of the degree to which the input data is already clustered with respect to the operative dimension key.
At least two approaches to loading data into an MDC table are known. These are referred to herein as “Algorithm 1” and “Algorithm 2”.
The steps of Algorithm 1 may be summarized as follows (a sketch appears below):

    1. Read all input data rows (i.e., records) from disk. As each row is read, perform an in-memory “bucket sort” of the row according to its dimension key. Whenever a block is filled, write it to disk.
    2. When all rows have been processed, write to disk any partially filled blocks that are still in memory.
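By way of illustration only, the following is a minimal Python sketch of Algorithm 1. It assumes rows shaped like the SALES rows above; the names BLOCK_SIZE, dimension_key, write_block, and load_algorithm_1 are hypothetical and do not come from the source, and the block size is measured in rows rather than bytes for simplicity.

    BLOCK_SIZE = 4  # rows per block; real blocks are fixed-size disk extents (simplifying assumption)

    def dimension_key(row):
        # Hypothetical extractor: for SALES, the dimension key is (STORE_ID, SALE_MONTH).
        return (row[0], row[3])

    def write_block(key, rows):
        # Stand-in for writing one full or partial block of cell 'key' to disk.
        print(f"writing block for cell {key}: {rows}")

    def load_algorithm_1(input_rows):
        buckets = {}  # step 1: one in-memory block per cell encountered so far
        for row in input_rows:
            key = dimension_key(row)
            block = buckets.setdefault(key, [])
            block.append(row)
            if len(block) == BLOCK_SIZE:   # block filled: write it out,
                write_block(key, block)    # vacating that memory for re-use
                buckets[key] = []
        for key, block in buckets.items():  # step 2: flush partially filled blocks
            if block:
                write_block(key, block)

With well-clustered input, few entries of buckets hold partial blocks at any time; with rows of N cells interleaved randomly, buckets can grow to N partial blocks, which is the memory blow-up discussed below.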
Algorithm 1 exhibits optimum I/O characteristics regardless of input data distribution because no blocks are written to disk prematurely during loading; thus, no blocks ever need to be read back from disk and re-written in order to be updated. If the input data is well clustered, the amount of memory needed by Algorithm 1 will be minimal, since blocks in memory will be repeatedly filled and written to disk (thus vacating memory for re-use), with few incomplete blocks residing in memory at any given time during loading.
As cell data becomes more and more randomly distributed throughout the input, however, Algorithm 1 requires more and more memory to maintain its optimum I/O characteristics. In the worst case, if N cells are represented in the input data, N blocks of memory are required (one partially filled block per cell). If N is large, the amount of memory required may be infeasible for a practical implementation.
Algorithm 2 modifies the approach of Algorithm 1 slightly to accommodate a memory constraint. When a predetermined upper limit of memory is reached and space for a new block is needed, Algorithm 2 chooses a partially filled block in memory (perhaps based on age) and writes, or “spills,” that block to disk. An example of such a buffer replacement strategy is the “least recently used” (LRU) approach known in the art. Under this approach, when data is perfectly clustered, only one block-sized chunk of memory is required, holding a single block that is repeatedly filled and written to disk.
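A sketch of Algorithm 2 follows, reusing the hypothetical BLOCK_SIZE, dimension_key, and write_block helpers from the previous sketch and adding an assumed cap of MAX_BLOCKS in-memory blocks with an LRU victim policy; reading a spilled partial block back in when its cell reappears is omitted for brevity, and spill_block and MAX_BLOCKS are likewise illustrative names.

    from collections import OrderedDict

    MAX_BLOCKS = 2  # assumed memory constraint: at most this many in-memory blocks

    def spill_block(key, rows):
        # Stand-in for spilling a partially filled block of cell 'key' to disk.
        print(f"spilling partial block for cell {key}: {rows}")

    def load_algorithm_2(input_rows):
        buckets = OrderedDict()  # iteration order tracks recency of use
        for row in input_rows:
            key = dimension_key(row)
            if key in buckets:
                buckets.move_to_end(key)  # mark this cell's block most recently used
            else:
                if len(buckets) >= MAX_BLOCKS:  # memory limit reached: evict the
                    victim, partial = buckets.popitem(last=False)  # least recently used block
                    spill_block(victim, partial)
                buckets[key] = []
            buckets[key].append(row)
            if len(buckets[key]) == BLOCK_SIZE:  # a filled block is always written out
                write_block(key, buckets[key])
                del buckets[key]
        for key, block in buckets.items():  # flush whatever partial blocks remain
            spill_block(key, block)

With perfectly clustered input, only one entry of buckets is ever live at a time, matching the single block-sized chunk of memory noted above.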
When data is well, but not perfectly, clustered, the likelihood that further input rows will carry the dimension key of a partially filled block already spilled to disk is low, although not zero. It is therefore improbable that a spilled partial block will ever require updating, and near-optimum I/O characteristics may be achieved.
An alternative approach to the two algorithms described above involves sorting all the input data according to the dimension key prior to loading. Sorting results in data that is perfectly clustered according to the dimension key. When data is randomly distributed in the input file, sorting is likely (on average) to produce the best I/O characteristics possible.
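For comparison, a sketch of this alternative under the same illustrative assumptions, again reusing the hypothetical helpers above (load_with_presort is likewise an illustrative name); once the rows are fully sorted, a single in-memory block suffices because all rows of a cell arrive consecutively:

    def load_with_presort(input_rows):
        # The full sort up front is the extra cost: it both clusters the rows
        # and totally orders them, although only clustering is actually required.
        sorted_rows = sorted(input_rows, key=dimension_key)
        current_key, block = None, []
        for row in sorted_rows:
            key = dimension_key(row)
            if key != current_key and block:  # cell boundary: flush the open block
                write_block(current_key, block)
                block = []
            current_key = key
            block.append(row)
            if len(block) == BLOCK_SIZE:
                write_block(key, block)
                block = []
        if block:
            write_block(current_key, block)

The full sort performed on the first line of this sketch is precisely the extra work criticized in the paragraphs that follow.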
However, there are at least two significant drawbacks to pre-sorting the data. First, when the input data is already well clustered, sorting may constitute a significant waste of processing. Second, sorting not only clusters the data according to the dimension key but also totally orders all blocks according to the dimension key. Since blocks associated with a cell need not be stored contiguously on disk, this extra ordering constitutes an unnecessary waste of resources.
Additionally, there are at least two further drawbacks to using sorting to cluster the data. First, sorting typically requires additional I/O: unless the entire input fits into main memory, partially sorted runs must be spilled to disk and read back in during a merge phase, incurring an additional performance penalty. Second, the sorting approach does not parallelize easily; the sort must process the entire input before the actual loading can begin, which limits the ability to pipeline the two operations.
What is therefore needed is a method of loading data into multi-dimensional clusters that addresses at least some of the above-noted disadvantages. The need for such a system has heretofore remained unsatisfied.