Sorting long sequences of numbers is an important task in many applications, such as searching, pairing, uniqueness determination, frequency distribution algorithms, and sparse matrix algebra. Merge sort is one of the sorting algorithms that work well with very long lists or sequences of numbers. A merge sorter can be efficiently implemented with serial data storage technologies that store and read one data entry at a time, such as commercial memory integrated circuits or chips.
A conventional merge sorter can be used to sort long sequences of numbers by using a recursive divide-and-conquer approach. The merge sorter divides the sequence into two shorter subsequences of equal or near-equal length. These two subsequences are sorted independently. The sorted subsequences are then merged to produce the sorted result. Each of the two subsequences can itself be divided into still shorter subsequences, which are sorted and merged recursively using the same merge sort algorithm. The process of dividing subsequences into still shorter subsequences can continue until each subsequence reaches atomic length (i.e., a length of one number), at which point it is trivially sorted.
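The recursive procedure described above can be illustrated with a minimal Python sketch. The function names merge and merge_sort are hypothetical and chosen only for illustration; the sketch assumes an in-memory list of comparable numbers and is not intended to represent any particular hardware implementation.

```python
def merge(left, right):
    """Merge two already-sorted lists into one sorted list."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller head element of the two inputs.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One input is exhausted; append whatever remains of the other.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(seq):
    """Sort a sequence by recursive divide-and-conquer (hypothetical helper)."""
    # A subsequence of atomic length (one number) or less is already sorted.
    if len(seq) <= 1:
        return list(seq)
    # Divide into two subsequences of equal or near-equal length.
    mid = len(seq) // 2
    # Sort each half independently, then merge the sorted halves.
    return merge(merge_sort(seq[:mid]), merge_sort(seq[mid:]))
```

For example, merge_sort([5, 2, 9, 1]) divides the input into [5, 2] and [9, 1], sorts each half, and merges the sorted halves.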
FIG. 1 shows an example of conventional merge sorting, in which 16 data items 10 are sorted in four steps 12, 14, 16, and 18. On the bottom row, the sequence to be sorted has been divided into 16 sequences, each having a length equal to one. Each step 12, 14, 16, and 18 merges pairs of sorted sequences (referred to as a 2-way merge sort). The fourth step 18 produces the final sorted result 20. The merge sort algorithm for conventional merge sorting can be implemented with a conventional general-purpose processor or digital signal processor working with random access memory. Where the length of the sequence to be sorted is n, this merge sort requires on the order of n log₂ n processor cycles and on the order of 2n memory locations.
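The stepwise 2-way merging of FIG. 1 can be approximated in software by a bottom-up merge sort. The following Python sketch is illustrative only: the function name bottom_up_merge_sort and the use of two equal-sized buffers (to reflect the memory requirement of order 2n) are assumptions made for this example, not details taken from the figure.

```python
def bottom_up_merge_sort(items):
    """Sort by repeated 2-way merge steps; return (sorted_list, step_count)."""
    n = len(items)
    src = list(items)   # buffer holding the current sorted runs
    dst = [None] * n    # second buffer, so about 2n locations are used in total
    width = 1           # current run length; runs of length one are sorted
    steps = 0
    while width < n:
        # One step: merge each adjacent pair of runs of length `width`.
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            i, j, k = lo, mid, lo
            while i < mid and j < hi:
                if src[i] <= src[j]:
                    dst[k] = src[i]
                    i += 1
                else:
                    dst[k] = src[j]
                    j += 1
                k += 1
            # Copy the remainder of whichever run is not yet exhausted.
            while i < mid:
                dst[k] = src[i]
                i += 1
                k += 1
            while j < hi:
                dst[k] = src[j]
                j += 1
                k += 1
        # The output of this step becomes the input of the next step.
        src, dst = dst, src
        width *= 2
        steps += 1
    return src, steps
```

With 16 input items, the run length doubles from 1 to 2, 4, and 8, so the sketch performs four merge steps, matching the four steps of FIG. 1. Each step touches all n items once, which is consistent with a total cost of order n log₂ n.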
For many practical applications, the time to complete the sorting is important. When the sequence is relatively short, simple hardware accelerators can be designed to do the sorting quickly. For example, the entire tree-shaped recursive sorting structure for merge sorting shown in FIG. 1 can be embedded in a custom hardware accelerator, such as a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC). If a maximally parallel processing architecture is used, such hardware accelerators can provide the sorted result in on the order of n + log₂ n clock cycles, using on the order of n − 1 merge sort processing nodes.
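The resource figures stated above can be checked with a short calculation. The Python sketch below is a rough estimate only: the function name merge_tree_estimates is hypothetical, n is assumed to be a power of two, and constant factors and implementation-specific overheads are ignored.

```python
def merge_tree_estimates(n):
    """Return (levels, nodes, cycles) for a full binary merge tree over n items.

    Assumes n is a power of two; the values are order-of-magnitude estimates.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    # Number of merge levels in the tree, i.e. log2(n).
    levels = n.bit_length() - 1
    # Level 1 has n/2 merge nodes, level 2 has n/4, and so on down to 1.
    nodes = sum(n >> level for level in range(1, levels + 1))
    # Roughly n cycles to stream the data plus one cycle of latency per level.
    cycles = n + levels
    return levels, nodes, cycles
```

For the 16-item example of FIG. 1, this gives 4 levels, 15 merge nodes (n − 1), and about 20 clock cycles (n + log₂ n). For n = 1024, the node count already grows to 1023, which suggests why a fully parallel tree becomes impractical for long sequences.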
However, when an application requires sorting long sequences (e.g., sequences having thousands, hundreds of thousands, or millions of data items), the silicon area of a single chip is unlikely to be large enough to implement the entire merge sort tree. And although multiple chips could be used to implement the entire merge sort tree, multiple chips generally increase the size, weight, power, and cost of the hardware. To minimize the size, weight, power, and cost of the hardware, it is often desirable to use one (or a few) processor chips with commercial static or dynamic memory chips that provide high density at low cost. Nevertheless, building a parallel processing solution around such memory devices is difficult because commercial memory chips are usually accessed serially, one byte or word of data at a time.