The speed and efficiency with which a computer having a fixed processing capability accomplishes a delineated task depend directly on the quantity of data being processed, with larger data sets taking longer to process. To accomplish tasks more quickly, some conventional processing methods partition a data set in a database into a plurality of smaller data sets which can be processed together more quickly than the non-partitioned data set from which they are derived, thereby increasing the speed at which the data is processed. One widely used method for processing data in this manner is to construct a histogram approximation of a data set comprising a plurality of numbers by partitioning the data set into a plurality of subsets, i.e., tiles, and then calculating the average value of the numbers in each tile, which average values are used for processing purposes.
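By way of illustration only, the histogram approximation described above may be sketched as follows for a one-dimensional data set. The function name `histogram_approximation`, the use of equal-width contiguous tiles, and the sample data are assumptions made for this sketch and are not taken from any particular conventional method.

```python
def histogram_approximation(values, num_tiles):
    """Split `values` into contiguous tiles and average each tile.

    Returns a list of (start, end, average) tuples, where `end` is exclusive.
    Equal-width tiles are an assumption of this sketch; a real method may
    place tile boundaries anywhere.
    """
    n = len(values)
    tiles = []
    for i in range(num_tiles):
        start = i * n // num_tiles
        end = (i + 1) * n // num_tiles
        if start < end:  # skip empty tiles when num_tiles > n
            tile = values[start:end]
            tiles.append((start, end, sum(tile) / len(tile)))
    return tiles


# Hypothetical data set of eight numbers summarized by four tile averages.
data = [2, 4, 6, 8, 10, 10, 1, 3]
summary = histogram_approximation(data, 4)
```

In this sketch the eight numbers are replaced by four averages, so any later processing step operates on half as many values, at the cost of losing the individual numbers within each tile.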
The three most widely used types of partitions constructed for two-dimensional data sets are: an arbitrary partition, shown in FIG. 1A, which has no restrictions on the arrangement of tiles; a hierarchical partition, shown in FIG. 1B, in which an array is vertically or horizontally separated into two disjoint tiles which are each further hierarchically partitioned; and a p×p partition, shown in FIG. 1C, in which the rows and columns are each partitioned into disjoint tiles.
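As a minimal sketch of the third partition type, the following example cuts the rows and the columns of a small two-dimensional array at given boundaries and averages the numbers in each resulting tile. The function name `pxp_partition`, the boundary-list representation, and the sample array are hypothetical and chosen only for illustration.

```python
def pxp_partition(array, row_cuts, col_cuts):
    """Average each tile of a grid partition of a 2-D array.

    `row_cuts` and `col_cuts` are sorted boundary indices that include 0
    and the array dimension, so p intervals need p + 1 boundaries.
    Returns a p-by-p nested list of tile averages.
    """
    averages = []
    for r0, r1 in zip(row_cuts, row_cuts[1:]):
        row_of_tiles = []
        for c0, c1 in zip(col_cuts, col_cuts[1:]):
            cells = [array[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row_of_tiles.append(sum(cells) / len(cells))
        averages.append(row_of_tiles)
    return averages


# Hypothetical 4x4 array split into a 2x2 grid of tiles.
grid = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [3, 3, 9, 9],
    [3, 3, 9, 7],
]
tile_averages = pxp_partition(grid, [0, 2, 4], [0, 2, 4])
```

Because every row cut spans all columns and every column cut spans all rows, a p×p partition is fully described by two short boundary lists, whereas an arbitrary partition would need each tile to be stored individually.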
A partitioning metric for evaluating the partition is used to construct the histogram, wherein the metric is selected to be less than a fixed performance constraint value δ which varies depending upon the particular task to be performed. An algorithm is then used to determine the partition to be used. Optimal partitioning is obtained by constructing tiles having a minimum variation between the average value of the numbers in the tiles and each of the numbers themselves.
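The relationship between the metric and the constraint value δ may be sketched as follows. The text above does not fix a particular metric, so this example assumes, purely for illustration, that the variation within a tile is the sum of squared differences between each number and the tile average, and that the metric for the partition is the largest such value over all tiles. The names `tile_error`, `partition_metric`, and `satisfies_constraint` are hypothetical.

```python
def tile_error(tile):
    """Sum of squared differences between each number and the tile average.

    This choice of error measure is an assumption of the sketch.
    """
    avg = sum(tile) / len(tile)
    return sum((x - avg) ** 2 for x in tile)


def partition_metric(values, boundaries):
    """Largest tile error over all tiles defined by sorted `boundaries`."""
    return max(
        tile_error(values[start:end])
        for start, end in zip(boundaries, boundaries[1:])
    )


def satisfies_constraint(values, boundaries, delta):
    """True when the partition metric is less than the constraint delta."""
    return partition_metric(values, boundaries) < delta


data = [2, 4, 6, 8, 10, 10, 1, 3]
fine = [0, 2, 4, 6, 8]   # four tiles of two numbers each
coarse = [0, 4, 8]       # two tiles of four numbers each
fine_ok = satisfies_constraint(data, fine, delta=5)
coarse_ok = satisfies_constraint(data, coarse, delta=5)
```

Under these assumptions the four-tile partition meets the constraint while the two-tile partition does not, which illustrates why a partitioning algorithm must trade the number of tiles against the variation within each tile.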
Partitioning of data sets is used for various purposes including, but not limited to, scheduling when a computer will perform various tasks, as well as for selectivity estimation purposes such as determining how many people in a given population data set fall within each one of a plurality of age distributions. For scheduling tasks, δ represents the maximum time within which a task is to be performed, while for selectivity estimation purposes, δ represents the upper limit on the acceptable error of the result. The memory available for processing a data set determines the number of partitions into which the data set can be divided, and thus also serves as a limitation on the selected metric.
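For the selectivity estimation case, the following sketch estimates how many people fall within an age range using only the tile averages, and compares the estimate with the exact count. The per-age counts, the function names `build_histogram` and `estimate_range`, and the assumption that counts are spread uniformly within a tile are all hypothetical and serve only to illustrate how an estimation error arises.

```python
def build_histogram(counts, num_tiles):
    """Summarize per-age counts as (start, end, average) tiles, end exclusive."""
    n = len(counts)
    hist = []
    for i in range(num_tiles):
        start, end = i * n // num_tiles, (i + 1) * n // num_tiles
        if start < end:
            hist.append((start, end, sum(counts[start:end]) / (end - start)))
    return hist


def estimate_range(hist, lo, hi):
    """Estimate the total count for indices lo..hi-1 from tile averages.

    Assumes the counts are uniform within each tile (an assumption of
    this sketch), so each overlapped index contributes the tile average.
    """
    total = 0.0
    for start, end, avg in hist:
        overlap = min(end, hi) - max(start, lo)
        if overlap > 0:
            total += overlap * avg
    return total


# Hypothetical number of people at each of eight consecutive ages.
counts = [10, 12, 14, 16, 30, 30, 28, 32]
hist = build_histogram(counts, 2)        # memory budget allows only two tiles
estimate = estimate_range(hist, 2, 6)    # query: ages at indices 2 through 5
exact = sum(counts[2:6])
error = abs(estimate - exact)
```

In this sketch the available memory is modeled by the number of tiles that can be stored; with fewer tiles each average covers more ages, and the estimation error that must stay below δ tends to grow.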
Conventional methods for partitioning data sets suffer from significant drawbacks. Specifically, such methods partition the data at very slow speeds, even for small data sets, and require large amounts of computer memory to process the data.