One way to decrease the time it takes for a computer system to perform a task involves splitting the task up into subtasks and executing the subtasks in parallel using a set of concurrently executing processes. One example of a task that is often performed in parallel within a database system is a sort operation. During sort operations, the data items that belong to a set of data are arranged in sorted order. Within a relational database system, the data items are typically rows, and the set of data in which they reside is a table.
To be processed in parallel, the data items are distributed among a set of processes that will process the data items. Depending on the operation involved, the distribution may be random or based on values associated with the data items. For example, in a sort operation, the distribution will generally be based on the sort key, where data items with a sort key in a particular sub-range will be sent to a corresponding “bucket.” In this context, a bucket is a destination in a data distribution operation. Thus, the act of distributing the data items to the parallel processes involves establishing a set of buckets and assigning the data items to the buckets.
For parallel operations to work efficiently, the number of data items assigned to each of the buckets should be substantially the same (or “even”) so that each process can start and end the operation at about the same time. Otherwise, one process, after finishing its operation, may become idle waiting for other processes to finish their corresponding operations. In situations in which the workload is not evenly distributed, the benefits of parallelism are diminished.
During parallel operations, data items are typically assigned to buckets based on values contained in the data items. The values used to assign data items to buckets are referred to herein as distribution keys. For parallel sort operations, the distribution keys are typically the values by which the data items are to be sorted. To distribute the data items based on the distribution keys, each bucket is typically assigned a value range, and all data items that have distribution keys that fall within that value range are assigned to that bucket. In particular, the buckets have an order, and the ranges are assigned to the buckets in a monotonically increasing, left-to-right fashion. One way to ensure that work is evenly distributed among the parallel processes involves assigning to the buckets value ranges that cover approximately the same number of data items.
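For illustration only, the range-based assignment described above may be sketched as follows. The function name assign_to_buckets and the representation of the ranges as a list of inclusive upper bounds are assumptions made for this sketch and are not drawn from any particular database system.

```python
from bisect import bisect_left

def assign_to_buckets(keys, upper_bounds):
    """Assign each distribution key to the bucket whose value range contains it.

    upper_bounds (hypothetical representation) lists the inclusive upper bound
    of every bucket except the last, in increasing order. The last bucket
    receives every key greater than the final bound.
    """
    buckets = [[] for _ in range(len(upper_bounds) + 1)]
    for key in keys:
        # bisect_left returns the index of the first bound >= key, which is
        # the index of the bucket whose range covers the key.
        buckets[bisect_left(upper_bounds, key)].append(key)
    return buckets

# Four buckets covering 0-24, 25-49, 50-74, and 75 and above.
ages = [3, 24, 25, 49, 50, 74, 75, 99]
buckets = assign_to_buckets(ages, [24, 49, 74])
```

Because the bounds are kept in increasing order, the bucket for a given key can be located by binary search rather than by comparing the key against every range.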
To attempt to evenly distribute data items during a parallel sort operation performed on a set of data, a database server could be configured to read the first N data items in the set of data (or in each partition of a partitioned set of data) that is being fed into the sort. The sort key distribution reflected by the data items collected in this manner can be assumed to reflect the distribution of sort values within the entire set of data. Based on this assumption, ranges may be assigned to the buckets in a manner that would evenly distribute the entire set of data items among the buckets if the sort value distribution reflected by the collected data items accurately reflects the distribution of sort values within the entire set of data.
Unfortunately, the sort value distribution reflected by the first N data items in a set of data does not always accurately reflect the distribution of sort values within the entire set of data. This is especially true, for example, when the data items have been loaded into the set of data in a batch fashion, in which the set of data may contain locally sorted clusters of data items.
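The first-N approach and the skew it can produce on batch-loaded data may be illustrated with the following sketch. The helper names bounds_from_first_n and bucket_sizes, the quantile rule used to choose the bounds, and the sample data are all assumptions made for illustration; the sketch also assumes that n is at least as large as the number of buckets.

```python
from bisect import bisect_left

def bounds_from_first_n(keys, n, num_buckets):
    """Derive inclusive upper bounds for the buckets from the first n keys only."""
    sample = sorted(keys[:n])
    # Use the sample value at each 1/num_buckets quantile as an upper bound.
    return [sample[(i * len(sample)) // num_buckets - 1]
            for i in range(1, num_buckets)]

def bucket_sizes(keys, upper_bounds):
    """Count how many keys fall into each bucket under the given bounds."""
    sizes = [0] * (len(upper_bounds) + 1)
    for key in keys:
        sizes[bisect_left(upper_bounds, key)] += 1
    return sizes

# Batch-loaded data: three locally sorted clusters appended one after another.
batch_loaded = list(range(0, 100)) + list(range(500, 600)) + list(range(200, 300))

bounds = bounds_from_first_n(batch_loaded, n=100, num_buckets=4)
sizes = bucket_sizes(batch_loaded, bounds)
print(bounds)  # [24, 49, 74]
print(sizes)   # [25, 25, 25, 225]
```

In this example the first 100 data items all come from the first cluster, so the derived ranges divide that cluster evenly while every later data item falls into the last bucket, leaving one process with 225 of the 300 data items.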
Another problem with reading the first N data items of each partition is that N may be too large to be efficient for sorting small sets of data, and too small to be efficient for sorting large sets of data. Specifically, if N is 100 and there are only 200 data items in each partition, then the amount of work performed to determine the ranges to assign to the buckets will be far greater than is justified for the relatively small amount of work involved in sorting the data. On the other hand, if there are several million data items in the set of data, it may be desirable to perform more work up front to increase the likelihood that the data items will be evenly distributed to the buckets.
Selecting unrepresentative data samples has negative consequences in various circumstances. For example, assume that the sort operation is being performed to find the data of the ten most senior persons in a particular set of data that includes fifty persons of ages from 0 to 99. For illustration purposes, it shall be assumed that four sort processes are to be used to perform the sort operation, and that the ranges assigned to their buckets are 0–24, 25–49, 50–74 and 75–99. In this example, if the data is evenly distributed among the buckets, then the data for the ten most senior persons must be in the fourth bucket (the bucket of data of persons having ages from 75–99). Consequently, the sort process for the fourth bucket simply provides the ten desired data items of persons who are the most senior in this fourth bucket, and thus in all buckets.
However, if the data is not evenly distributed among the buckets, for example, if bucket four contains data for only eight persons, then bucket three must be sorted and processed to provide data for the next two most senior persons. The eight persons in bucket four and the two most senior persons in bucket three constitute the ten most senior persons in the entire set. If bucket four and bucket three do not provide all of the desired data, then bucket two must be sorted and accessed to provide the missing desired data. Similarly, if bucket four, bucket three, and bucket two do not provide all of the desired data, then bucket one must be sorted and accessed.
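The fallback from bucket four to the lower buckets may be sketched as follows. The function name top_k_from_buckets and the sample ages are hypothetical, and the sketch processes the buckets sequentially in a single process purely to show how many buckets must be sorted; it does not model the parallel sort processes themselves.

```python
def top_k_from_buckets(buckets, k):
    """Return the k largest items and the number of buckets that had to be sorted.

    buckets is ordered by increasing value range, so the last bucket holds the
    largest values. Lower buckets are sorted only when the higher buckets have
    not yet supplied k items.
    """
    result = []
    buckets_sorted = 0
    for bucket in reversed(buckets):
        if len(result) >= k:
            break
        buckets_sorted += 1
        needed = k - len(result)
        result.extend(sorted(bucket, reverse=True)[:needed])
    return result, buckets_sorted

# Bucket four (ages 75-99) holds only eight persons, so bucket three is needed.
buckets = [
    [5, 12, 20],                       # ages 0-24
    [30, 41],                          # ages 25-49
    [52, 60, 66, 71, 73],              # ages 50-74
    [75, 78, 80, 83, 85, 88, 91, 97],  # ages 75-99
]
most_senior, buckets_sorted = top_k_from_buckets(buckets, 10)
print(most_senior)     # [97, 91, 88, 85, 83, 80, 78, 75, 73, 71]
print(buckets_sorted)  # 2
```

Had bucket four contained at least ten persons, only that bucket would have been sorted and the remaining buckets would not have been accessed.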
Based on the foregoing, it is clearly desirable to provide a better mechanism to evenly distribute data items of a particular set of data to corresponding buckets for use in parallel operations, such as parallel sorts.