1. Field of the Invention
The present invention relates to the field of database systems, and more particularly to a system and associated method for efficient adaptive clustering of parallel data when loading a data structure containing data that is clustered along one or more dimensions, such as a multi-dimensional clustering (MDC) table.
2. Description of the Related Art
Aspects of loading data into a table using data clustering are discussed in U.S. Pat. No. 7,080,206, which is incorporated herein by reference.
In database terminology, a multi-dimensional clustering (MDC) table is a table in which data is logically clustered (and sometimes also physically clustered) on disk according to a specified key referred to as the “dimension key.” Each distinct value in the domain of key values represents a “cell” of the MDC table. On disk, each cell comprises one or more fixed size “blocks,” where each block may only contain data for one cell. It is not necessary for all blocks associated with a cell to be contiguously stored on disk.
For example, consider an MDC table SALES having the following definition: SALES (STORE_ID INT, CUST_ID INT, SALE_DATE DATE, SALE_MONTH INT GENERATED ALWAYS AS MONTH(SALE_DATE)) DIMENSIONS(STORE_ID, SALE_MONTH)
The DIMENSIONS clause of the SALES table defines the dimension key that indicates how the data is to be clustered on disk. In this example, data for the SALES table is to be clustered by STORE_ID and SALE_MONTH. The physical blocks on disk for this table may appear as follows:
block1 (contains data for cell (1, 1), i.e., STORE_ID “1” and month “1”):
1, 20, 2002/01/01, 1
1, 21, 2002/01/03, 1
1, 21, 2002/01/04, 1
1, 22, 2002/01/05, 1
block2 (contains data for cell (2, 1), i.e., STORE_ID “2” and month “1”):
2, 43, 2002/01/01, 1
2, 10, 2002/01/02, 1
2, 52, 2002/01/15, 1
block3 (contains more data for cell (1, 1), i.e., STORE_ID “1” and month “1”):
1, 25, 2002/01/15, 1
1, 31, 2002/01/15, 1
1, 32, 2002/01/15, 1
1, 45, 2002/01/16, 1
1, 12, 2002/01/16, 1
Each comma-delimited line above denotes a single row, groups of which may form fixed-size pages (e.g., sub-blocks or other sub-divisions) within fixed-size blocks.
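The cell-and-block organization described above can be sketched in Python as follows. This is a minimal illustration, not the claimed system: the BLOCK_SIZE constant, the sample rows, and the helper names are hypothetical.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # hypothetical capacity: rows per fixed-size block

# Sample SALES rows as (STORE_ID, CUST_ID, SALE_DATE); SALE_MONTH is derived.
rows = [
    (1, 20, "2002/01/01"),
    (2, 43, "2002/01/01"),
    (1, 21, "2002/01/03"),
]

def dimension_key(row):
    """Dimension key (STORE_ID, SALE_MONTH) identifying the row's cell."""
    store_id, _cust_id, sale_date = row
    sale_month = int(sale_date.split("/")[1])  # MONTH(SALE_DATE)
    return (store_id, sale_month)

# Group rows by cell: each distinct dimension-key value is one cell.
cells = defaultdict(list)
for row in rows:
    cells[dimension_key(row)].append(row)

# Pack each cell's rows into fixed-size blocks; a block holds data for
# exactly one cell, and a cell may span several (non-contiguous) blocks.
blocks = {
    cell: [cell_rows[i:i + BLOCK_SIZE]
           for i in range(0, len(cell_rows), BLOCK_SIZE)]
    for cell, cell_rows in cells.items()
}
```

Note that nothing in this sketch requires a cell's blocks to be adjacent in the `blocks` mapping or on disk, mirroring the non-contiguity property described above.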
Table partitioning is a data organization scheme in which table data is divided across multiple storage objects called data partitions or ranges according to values in one or more table columns. Each data partition is stored separately. These storage objects can be in different table spaces, in the same table space, or a combination of both.
Table data may be partitioned as specified in the PARTITION BY clause of the CREATE TABLE statement. The columns used in this definition are referred to as the table partitioning key columns.
For example, the following creates a table customer where rows with l_shipdate >= '01/01/2006' and l_shipdate <= '03/31/2006' are stored in table space ts1, rows with l_shipdate >= '04/01/2006' and l_shipdate <= '06/30/2006' are stored in table space ts2, and so on.
CREATE TABLE customer (l_shipdate DATE, l_name CHAR(30))
IN ts1, ts2, ts3, ts4, ts5
PARTITION BY RANGE(l_shipdate) (STARTING FROM ('01/01/2006')
ENDING AT ('12/31/2006') EVERY (3 MONTHS))
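The range-to-table-space mapping defined by the PARTITION BY clause above can be modeled with a short sketch. This is an illustrative assumption, not the database engine's implementation; the four quarterly ranges below follow from STARTING, ENDING, and EVERY (3 MONTHS), and the `target_table_space` helper name is hypothetical.

```python
from datetime import date

# Quarterly partitions implied by the example's PARTITION BY clause:
# ('01/01/2006' to '12/31/2006' EVERY 3 MONTHS yields four ranges).
partitions = [
    ("ts1", date(2006, 1, 1), date(2006, 3, 31)),
    ("ts2", date(2006, 4, 1), date(2006, 6, 30)),
    ("ts3", date(2006, 7, 1), date(2006, 9, 30)),
    ("ts4", date(2006, 10, 1), date(2006, 12, 31)),
]

def target_table_space(l_shipdate):
    """Return the table space whose range contains l_shipdate."""
    for ts, start, end in partitions:
        if start <= l_shipdate <= end:
            return ts
    return None  # row falls outside every defined range
```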
A challenge that arises when large quantities of data are bulk-loaded into an MDC or partitioned table is ensuring favorable input/output characteristics (i.e., minimizing read/write operations to disk as much as possible) in the face of an unknown distribution of input data. The term “unknown distribution” as used herein refers to the absence of any prior knowledge by the input data loading algorithm of the degree to which the input data is already clustered with respect to an operative dimension key.
At least two approaches to loading data into an MDC table are known. These are referred to herein as “Algorithm 1” and “Algorithm 1′”.
The steps of Algorithm 1 may be summarized as follows:
1. Read all input data rows (i.e., records) from disk. As each row is read, perform an in-memory “bucket sort” of the row according to its dimension key. Whenever a block is filled, write it to disk.
2. When all rows have been processed, write to disk any partially filled blocks that are still in memory.
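The two steps above can be sketched as follows. This is a simplified model under stated assumptions: the `writes` list stands in for disk I/O, BLOCK_CAPACITY is deliberately tiny, and the function name is illustrative.

```python
from collections import defaultdict

BLOCK_CAPACITY = 2  # rows per block; small value for illustration only

def load_algorithm_1(input_rows, key_fn):
    """In-memory bucket sort by dimension key: a block is written to disk
    only when full (step 1); leftovers are flushed at end of input (step 2)."""
    buckets = defaultdict(list)  # one in-memory block per cell
    writes = []                  # stands in for blocks written to disk
    for row in input_rows:
        key = key_fn(row)
        buckets[key].append(row)
        if len(buckets[key]) == BLOCK_CAPACITY:
            writes.append((key, buckets.pop(key)))  # full block: write, free memory
    for key, partial in buckets.items():            # step 2: flush partial blocks
        writes.append((key, partial))
    return writes
```

Because a block leaves memory only when full (or at end of input), no block is ever read back for update, which is the optimal-I/O property discussed below; the cost is that memory grows with the number of cells simultaneously open.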
Algorithm 1 exhibits optimum input/output (I/O) characteristics regardless of input data distribution because no blocks are written to disk prematurely during loading, thus no blocks need to be read back from and re-written to disk in order to be updated. If input data is well clustered, the amount of memory needed by Algorithm 1 will be minimal, since blocks in memory will be repeatedly filled and written to disk (thus vacating memory for re-use), with few incomplete blocks residing in memory at any given time during loading.
As input cell data becomes more and more randomly distributed throughout the input data, however, Algorithm 1 requires more and more memory to maintain its optimum I/O characteristics. In the worst case, if N cells are represented in the input data, N blocks of memory are required. If N is large, the amount of memory required may be infeasible for practical implementation.
Algorithm 1′ modifies the approach of Algorithm 1 slightly to accommodate a memory constraint. When a predetermined upper limit of memory is reached and space for a new block is needed, Algorithm 1′ chooses a partially filled block in memory (perhaps based on age) and writes, or spills, the partially filled block to disk. An example of such a buffer replacement strategy is the “least recently used” approach known in the art. Under this approach, when data is perfectly clustered, only one block-size chunk of memory is required to store a block that is repeatedly filled and written to disk.
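The memory-constrained variant with a least-recently-used spill policy can be sketched as follows. Again a simplified model: MAX_BLOCKS_IN_MEMORY, the tagged `writes` list, and the function name are illustrative assumptions, not the claimed method.

```python
from collections import OrderedDict

BLOCK_CAPACITY = 2        # rows per block; small value for illustration
MAX_BLOCKS_IN_MEMORY = 2  # the predetermined upper limit of memory

def load_algorithm_1_prime(input_rows, key_fn):
    """Bucket sort as in Algorithm 1, but spill the least recently used
    partially filled block when the memory limit is reached."""
    buckets = OrderedDict()  # recency order supports the LRU choice
    writes = []
    for row in input_rows:
        key = key_fn(row)
        if key not in buckets:
            if len(buckets) == MAX_BLOCKS_IN_MEMORY:
                victim, partial = buckets.popitem(last=False)  # LRU victim
                writes.append(("spill", victim, partial))
            buckets[key] = []
        buckets.move_to_end(key)  # mark this cell's block most recently used
        buckets[key].append(row)
        if len(buckets[key]) == BLOCK_CAPACITY:
            writes.append(("full", key, buckets.pop(key)))
    for key, partial in buckets.items():  # flush remaining partial blocks
        writes.append(("flush", key, partial))
    return writes
```

On perfectly clustered input, the sketch never spills: each cell's block is repeatedly filled and written, so a single block-size chunk of memory suffices, matching the behavior described above.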
When data is well clustered but not perfectly clustered, the likelihood of further input data rows having the dimension key of a partially filled block that is spilled to disk is low, although not zero. It is therefore improbable that a partially filled block that has been spilled to disk will ever require updating in the future, and near-optimum I/O characteristics may be achieved.
An alternative approach to the two algorithms described above involves sorting all the input data according to the dimension key prior to loading. Sorting the data will result in the data being perfectly clustered according to the dimension key. When data is randomly distributed in the input file, sorting is likely (on average) to produce the best I/O characteristics possible.
However, there are at least two drawbacks to pre-sorting the data. First, when the input data is already well clustered, sorting may constitute a significant waste of processing. Second, sorting may not only cluster the data according to the dimension key but may also order all blocks according to the dimension key. Since blocks associated with a cell do not need to be stored contiguously on disk, such extra processing constitutes an unnecessary waste of resources.
Additionally, there are at least two further drawbacks to using sorting to cluster the data. First, in most cases sorting requires additional I/O: unless the entire input fits into main memory, partially sorted runs must be spilled to disk and read back during a merge phase, an additional performance penalty. Second, the sorting approach does not parallelize easily; the sort must process the entire input before the actual loading can start, which limits the ability to pipeline the operations.
One approach to these problems is described in U.S. Pat. No. 7,080,206, in which a system for adaptively loading input data into a multi-dimensional clustering table is disclosed. In this approach, first-level clustering agents cluster data into lists of rows, so that each list generated by a first-level clustering agent is itself clustered.
Data generated by the first-level clustering agents is passed to a second-level clustering agent for processing. The second-level clustering agent processes the lists from each first-level clustering agent sequentially, attempting to cluster the data into blocks that can be written to secondary storage. The data being clustered is stored in a cache as more data is processed. As the cache memory fills, it is victimized, with the contents of the cache written to disk. When additional data matching the clustering characteristics of previously victimized data is encountered, the previously victimized data is retrieved from disk and the clustering in the second-level clustering agent continues.
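The victimize-and-retrieve behavior of the second-level clustering agent described above can be modeled with a short sketch. This is a simplified illustration, not the patented implementation: CACHE_LIMIT, the dictionary standing in for disk, and the `io_ops` counter are assumptions made for exposition.

```python
from collections import OrderedDict

CACHE_LIMIT = 2  # cells the second-level agent's cache can hold at once

def second_level_cluster(lists_of_rows, key_fn):
    """Model of the second-level agent: cluster rows per cell, victimize
    the oldest cell's data to 'disk' under memory pressure, and retrieve
    victimized data when more rows for that cell arrive. Returns the
    number of victimize/retrieve I/O operations incurred."""
    cache, disk = OrderedDict(), {}
    io_ops = 0
    for rows in lists_of_rows:  # one list per first-level clustering agent
        for row in rows:
            key = key_fn(row)
            if key not in cache:
                if key in disk:
                    cache[key] = disk.pop(key)  # re-read victimized data
                    io_ops += 1
                else:
                    cache[key] = []
                if len(cache) > CACHE_LIMIT:
                    victim, data = cache.popitem(last=False)
                    disk[victim] = data         # victimize oldest cell
                    io_ops += 1
            cache.move_to_end(key)
            cache[key].append(row)
    return io_ops
```

Running the sketch on input whose cell values are diffusely distributed shows `io_ops` climbing as the same cells are victimized and retrieved repeatedly, which is precisely the bottleneck identified in the following paragraph.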
While the method described in U.S. Pat. No. 7,080,206 does cluster input data for loading into a table, large data streams with like cell values distributed diffusely within the data file are processed inefficiently. Repeated victimization and retrieval from disk by the second-level clustering agent results in a bottleneck in the input/output system that slows the load process.