Various mechanisms exist for sorting large amounts of information on computing devices. Sorting a list of numbers is a fundamental problem in computer science, and sorting is probably the most widely used kernel, spanning a wide range of applications. The current trend in processor design is to fit more and more cores on a chip, thereby increasing compute power and improving performance. However, memory bandwidth is not increasing at a proportional rate, so sorting applications become memory-bandwidth-bound for list sizes greater than 1-2 million elements, while practical list sizes range from 100 million to 1 billion elements.
Simultaneous merging of data on multiple processors has been proposed in the past as a way to reduce bandwidth requirements. However, those algorithms carry a certain computational overhead and have not been applied in practice. In addition, they are not SIMD (single instruction, multiple data) friendly. Furthermore, it is not evident that they scale to a large number of cores on chip multiprocessors.
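By way of illustration only, the general idea of merging on multiple processors can be sketched as follows. The sketch assumes one common approach, in which a binary search splits two sorted lists into independent pieces so that each piece could be merged by a different core; it is not a reproduction of any specific previously proposed algorithm. The names co_rank and partitioned_merge are hypothetical, and the pieces are merged sequentially here for simplicity.

```python
from heapq import merge


def co_rank(k, a, b):
    """Return (i, j) with i + j == k such that a[:i] and b[:j] together
    hold the k smallest elements of the two sorted lists a and b."""
    lo = max(0, k - len(b))
    hi = min(k, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        j = k - i
        # Too few elements taken from a if a[i] still precedes b[j - 1].
        if a[i] < b[j - 1]:
            lo = i + 1
        else:
            hi = i
    return lo, k - lo


def partitioned_merge(a, b, workers):
    """Merge sorted lists a and b by cutting the output into `workers`
    nearly equal pieces. Each piece depends only on its own slices of
    a and b, so on a multi-core machine each could go to its own core."""
    n = len(a) + len(b)
    cuts = [co_rank(n * w // workers, a, b) for w in range(workers + 1)]
    out = []
    for (i0, j0), (i1, j1) in zip(cuts, cuts[1:]):
        # Sequential stand-in for one worker's share of the merge.
        out.extend(merge(a[i0:i1], b[j0:j1]))
    return out
```

The binary search in co_rank is an example of the extra computation that such partitioning adds on top of the merge itself, and its data-dependent branching does not map naturally onto SIMD lanes.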