Sorting is germane to many algorithms in computer science, and software applications frequently rely upon efficient sorting primitives for good performance. A top-down, divide-and-conquer sorting method operates by repeatedly applying a partial ordering to a sequence of keys, creating two or more bins, which can then be sorted independently. More particularly, the simplest top-down sorting approaches operate by recursively partitioning the input until each bin contains only a single key (and is thus trivially sorted).
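The recursive scheme described above can be sketched in Python; the function name and the pivot-based partial ordering shown here (a quicksort-style three-way split) are illustrative choices, not taken from any particular implementation:

```python
def top_down_sort(keys):
    # Base case: a bin with zero or one keys is trivially sorted.
    if len(keys) <= 1:
        return keys
    # Apply a partial ordering around a pivot, creating bins
    # that can then be sorted independently (and in parallel).
    pivot = keys[len(keys) // 2]
    lower  = [k for k in keys if k < pivot]
    equal  = [k for k in keys if k == pivot]
    higher = [k for k in keys if k > pivot]
    return top_down_sort(lower) + equal + top_down_sort(higher)
```

Because each bin can be sorted without reference to its siblings, the recursion naturally exposes independent work for parallel execution.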
Most practical implementations switch to another “block-sorting” method when the bin size falls below a certain threshold B. The block-sorting component is typically designed and optimized for problems that fit entirely within the storage resources of a single processor core (viz., registers, cache, shared memory, etc.).
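A hybrid of this kind might look as follows; the threshold value and the use of insertion sort as the block-sorting stand-in are illustrative assumptions, since real implementations tune both to the target core's registers and cache:

```python
B = 32  # illustrative block-sorting threshold

def block_sort(keys):
    # Stand-in for a cache/register-optimized block sorter;
    # here, a simple in-place insertion sort on a small bin.
    for i in range(1, len(keys)):
        k, j = keys[i], i - 1
        while j >= 0 and keys[j] > k:
            keys[j + 1] = keys[j]
            j -= 1
        keys[j + 1] = k
    return keys

def hybrid_sort(keys):
    # Switch to the block-sorting method once a bin fits
    # within the threshold B; otherwise keep partitioning.
    if len(keys) <= B:
        return block_sort(keys)
    pivot = keys[len(keys) // 2]
    lower  = [k for k in keys if k < pivot]
    equal  = [k for k in keys if k == pivot]
    higher = [k for k in keys if k > pivot]
    return hybrid_sort(lower) + equal + hybrid_sort(higher)
```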
To provide good load balance among parallel processing elements, top-down approaches performed across multiple threads and/or processor cores generally strive to construct uniformly sized bins for block-sorting, regardless of input size and distribution. Otherwise, the majority of processing elements handling small or average-sized bins may quickly finish their partitioning work and then wait idly for a small minority to process much larger bins, underutilizing processing resources.
To these ends, comparison-based implementations (e.g., sample sort, quicksort, etc.) have traditionally focused on choosing a dynamic set of “splitting points” that approximate the input sequence's key distribution. This can be done, e.g., by sampling the input keys, sorting the samples, and then selecting splitters from the sorted samples at regular intervals. However, for processor genres having wide parallelism per core (e.g., deep multithreading, SIMD/SIMT/vector, etc.), dynamically adjustable splitting points may be difficult to implement due to the complexities of cooperation between fine-grained parallel processing elements.
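The sample-then-select splitter scheme described above can be sketched as follows; the oversampling ratio and function name are illustrative assumptions:

```python
import random

def choose_splitters(keys, s, oversample=8):
    # Draw an oversampled subset of keys, sort it, and pick
    # s - 1 splitters at regular intervals; the splitters
    # approximate the input's key distribution, so the s
    # resulting bins are roughly uniform in size.
    sample = sorted(random.sample(keys, min(len(keys), s * oversample)))
    step = len(sample) // s
    return [sample[i * step] for i in range(1, s)]
```

Larger oversampling ratios yield splitters that track the distribution more closely, at the cost of a larger sample to sort.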
Furthermore, such parallel processor genres can incur steep performance penalties for bin sizes that fall short of the ideal block-sorting threshold B. That is, processor efficiency is maximized when the utilization of fast, local storage resources is maximized. In some cases, under-filled bins are caused by 1) non-uniform key distributions that result in uneven bin sizes, and 2) mismatches between the input problem size N and the bin-splitting factor S. In particular, “top-down” sorting operations performed in parallel using the most significant digit (MSD) radix sorting method are prone to producing bin sizes substantially smaller than the block-sorting threshold B.
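A minimal sketch of one MSD radix partitioning step illustrates the under-filling problem; the digit width and function name are illustrative assumptions:

```python
def msd_radix_bins(keys, shift, radix_bits=8):
    # Partition keys into S = 2**radix_bits bins according to
    # the digit at the given bit offset (the most significant
    # digit on the first pass).
    S = 1 << radix_bits
    bins = [[] for _ in range(S)]
    for k in keys:
        bins[(k >> shift) & (S - 1)].append(k)
    return bins
```

For example, partitioning N = 1,000 keys with S = 256 bins averages roughly 4 keys per bin, far below a block-sorting threshold of, say, B = 512; a skewed key distribution can leave most bins emptier still.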
One commonplace method of achieving final block sizes approaching B elements is to dynamically adjust the splitting factor S for each recursive partitioning step. However, it is often difficult to dynamically adjust the splitting factor on such parallel architectures due to the complexities of cooperation between fine-grained parallel processing elements.
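One way such an adjustment might be computed, as a sketch under the assumption that S is a power of two (as in radix sorting), with a clamp to a hardware-friendly maximum digit width:

```python
import math

def choose_radix_bits(n, B, max_bits=8):
    # Pick a digit width d so that the S = 2**d bins produced
    # from an n-key input average roughly B keys each, clamped
    # to the maximum digit width the hardware handles well.
    if n <= B:
        return 0  # small enough to block-sort directly
    d = max(1, math.ceil(math.log2(n / B)))
    return min(d, max_bits)
```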