Sorting algorithms may be used at different stages in many data processing systems. In many applications, the efficiency of the sorting algorithm determines the throughput and execution speed of the data processing system. Methods and algorithms for implementing high-speed sorting in hardware are often based on Batcher's Odd/Even sort algorithm or the Bitonic sort algorithm, as described in “Sorting Networks and their Applications,” K. E. Batcher, Proceedings of AFIPS Spring Joint Computer Conference, Vol. 32, 307-314, 1968.
Some sorting algorithms, such as Quicksort and Heapsort, that are efficient in software are not suitable for hardware implementation because they have high algorithmic complexity and their execution may be limited to a single comparison operation at a time. Simpler sorting algorithms, which utilize the parallelism available in hardware, perform better than these complex algorithms when implemented in hardware.
Batcher's Odd/Even sort algorithm is based on Merge sort and is data independent, i.e., the same comparisons are performed regardless of the actual data. Merge sorting is normally done by sorting the two halves of the input and then merging the two sorted halves. When sorting N elements, Batcher's algorithm has a complexity of the order of N×(log N)² and a latency of the order of (log N)² because of the logic depth. Logic depth in a digital circuit is the maximum number of basic gates (AND, OR, INV, etc.) that a signal needs to travel through from a source flip-flop to a destination flip-flop.
FIG. 1 shows the application of Batcher's Odd/Even sorting algorithm for sorting four elements. In-place sorting may be easily performed using comparators and multiplexers.
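The data-independent comparison sequence described above can be illustrated with a short software sketch. The following Python code is not taken from FIG. 1; it is a minimal sketch of the commonly published recursive construction of Batcher's odd-even merge sort network, assuming the number of inputs is a power of two. The function names (odd_even_merge_sort, apply_network, is_sorting_network) are hypothetical and chosen for illustration only.

```python
import itertools

def _odd_even_merge(lo, hi, r):
    """Yield comparators merging the two sorted halves of lines lo, lo+r, ... up to hi."""
    step = r * 2
    if step < hi - lo:
        yield from _odd_even_merge(lo, hi, step)       # merge the even-indexed lines
        yield from _odd_even_merge(lo + r, hi, step)   # merge the odd-indexed lines
        for i in range(lo + r, hi - r, step):          # final column of comparators
            yield (i, i + r)
    else:
        yield (lo, lo + r)

def _sort_range(lo, hi):
    """Yield comparators that sort lines lo..hi (both inclusive)."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from _sort_range(lo, mid)
        yield from _sort_range(mid + 1, hi)
        yield from _odd_even_merge(lo, hi, 1)

def odd_even_merge_sort(n):
    """Return the fixed comparator list for n inputs (assumption: n is a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return list(_sort_range(0, n - 1))

def apply_network(values, comparators):
    """Run every compare-and-swap in order; the sequence never depends on the data."""
    out = list(values)
    for i, j in comparators:
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out

def is_sorting_network(n, comparators):
    """Check all 0/1 inputs; by the zero-one principle this covers all inputs."""
    return all(apply_network(bits, comparators) == sorted(bits)
               for bits in itertools.product([0, 1], repeat=n))
```

For four inputs this sketch produces five comparators: (0, 1), (2, 3), (0, 2), (1, 3), (1, 2). The first two are independent of each other, as are the next two, so in hardware they could be evaluated in parallel.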
There are other sorting algorithms based on Merge sort, such as the Bitonic and Shell sorting algorithms, that have a similar complexity of N×(log N)² for sorting N elements. However, Batcher's Odd/Even merge sorting algorithm requires fewer comparators than the Bitonic sorting algorithm and the Shell sorting algorithm.
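The comparator counts behind this comparison can be sketched with the standard recurrences for the two merge networks. The following Python code is a minimal sketch, assuming N is a power of two; the recurrences are well-known textbook results rather than material from this document, the function names are hypothetical, and the Shell sorting algorithm is not modeled.

```python
def odd_even_merge_count(n):
    """Comparators needed to odd-even merge two sorted lists of n/2 elements each."""
    if n < 2:
        return 0
    if n == 2:
        return 1
    # Two half-size merges plus a final column of n/2 - 1 comparators.
    return 2 * odd_even_merge_count(n // 2) + n // 2 - 1

def bitonic_merge_count(n):
    """Comparators needed to bitonic-merge n elements."""
    if n < 2:
        return 0
    # One column of n/2 comparators followed by two half-size merges.
    return 2 * bitonic_merge_count(n // 2) + n // 2

def sort_count(n, merge_count):
    """Comparators in a merge-based sorting network for n inputs (n a power of two)."""
    if n < 2:
        return 0
    return 2 * sort_count(n // 2, merge_count) + merge_count(n)

def compare_counts(max_log2=4):
    """Return (N, odd-even count, bitonic count) for N = 2, 4, ..., 2**max_log2."""
    return [(2 ** p,
             sort_count(2 ** p, odd_even_merge_count),
             sort_count(2 ** p, bitonic_merge_count))
            for p in range(1, max_log2 + 1)]
```

Under these assumptions, compare_counts() yields 5 versus 6 comparators for N = 4, 19 versus 24 for N = 8, and 63 versus 80 for N = 16.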
The complexity of Batcher's Odd/Even sorting algorithm increases rapidly with the number of elements to be sorted. For large values of N, an excessive number of parallel comparisons may have to be performed. One method to overcome this drawback is to group the N values into disjoint sets of fewer elements and use resource-sharing techniques to reduce the complexity at the cost of reduced throughput. To operate at a higher clock frequency, a pipelining technique may be used to reduce the critical path delay due to the logic depth. Registering intermediate results at each stage introduces latency. This method produces high throughput only when an independent set of N elements is sorted at each iteration. However, pipelining may not be suitable for progressively sorting successive groups of N inputs, because the result of each iteration has to be merged with the previously sorted results. In that case, the pipelining delay may have a direct impact on the throughput.
The Insertion sorting method uses cascaded sorting units. A sorting unit comprises basic compare and swap units organized in such a way that input data is sorted as it streams through the pipeline. A single such sorting unit is shown in FIG. 2. Each sorting unit is connected to its two neighbors and to the new input element Rin. Let the data present in sorting unit X be denoted by Rx. Each unit retains the smaller of Rx and Rin and shifts the larger of the two to the neighboring sorting unit that follows it in the cascade. After all N elements have been inserted, the first unit in the cascade holds the minimum value. The expression Rx ≤ R(x+1), where R(x+1) denotes the data in the unit that follows unit X, is true at every time instant.
The structure is easily scalable and requires minimal control circuitry to control the data movement. For example, to select the M most significant elements out of N elements, M basic Insertion sort units are cascaded as shown in FIG. 3. Prior to the insertion process, the register within each sorting unit is initialized to the maximum value that it can hold. Elements from the input data queue are inserted one at a time. Inserting an element into any of the registers is equivalent to selectively placing the new input element into the set of M most significant elements. After each insertion, one of the previously selected M elements may be discarded. This process continues until all N elements present in the input data queue have been selectively inserted into the array of M sorting units. At the end, registers R1 through RM hold the M most significant elements.
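The behavior of the cascade described above can be sketched in software. The following Python code is a minimal behavioral model, not a description of the circuit in FIG. 2 or FIG. 3: it processes the units sequentially, whereas the hardware units act within the same clock cycle. The function name streaming_select and the assumed 16-bit register width are hypothetical.

```python
def streaming_select(stream, m, width=16):
    """Behavioral model of M cascaded insertion sort units.

    Assumption: registers are `width` bits wide and are initialized to the
    maximum value they can hold, so the model keeps the M smallest inputs.
    """
    max_value = (1 << width) - 1
    regs = [max_value] * m              # R1..RM, initialized to the maximum value
    for rin in stream:                  # one element is inserted per step
        carry = rin
        for x in range(m):
            # Each unit retains the smaller value and passes the larger one on.
            if carry < regs[x]:
                regs[x], carry = carry, regs[x]
        # The value leaving the last unit is discarded.
    return regs
```

For example, streaming_select([7, 3, 9, 1, 8, 2], 3) leaves the registers holding [1, 2, 3], with the minimum in the first unit.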
The above Insertion sort method is capable of selecting the M most significant elements from the incoming elements. The total number of elements N may be finite, or the input elements may arrive continuously in a streaming manner. The method continuously selects the M most significant elements from all the input elements received up to any given time, and it is therefore referred to as a streaming sorter. However, the above architecture is capable of inserting only one element at a time. This method takes N clock cycles to sort N input elements.
Each insertion operation involves comparing Rin with the element present in each sorting unit, i.e., M comparisons. Note that as the insertion process progresses, each element is inserted into an array that is already sorted. Hence, most of the comparison operations performed are redundant. After all N elements have been inserted, a total of N×M comparisons may have been performed.
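The redundancy noted above can be made concrete with a small sketch. The following Python code is an illustrative model only, assuming every unit compares Rin with its own register on each insertion and that registers start at an assumed maximum value; the function names are hypothetical. Because the registers are always sorted, the M comparison results for one insertion form a run of "false" followed by a run of "true", so only the position of the boundary carries information.

```python
def insert_and_log(regs, rin):
    """Compare rin with all M registers, insert it, and return the M results."""
    flags = [rin < r for r in regs]     # one comparison per sorting unit
    if True in flags:
        pos = flags.index(True)         # boundary between "keep" and "shift"
        regs.insert(pos, rin)           # larger values shift toward the end
        regs.pop()                      # the value leaving the last unit is dropped
    return flags

def count_comparisons(stream, m, max_value=2**16 - 1):
    """Return (total comparisons, whether every result vector was monotone, registers)."""
    regs = [max_value] * m              # assumed initial register contents
    total = 0
    all_monotone = True
    for rin in stream:
        flags = insert_and_log(regs, rin)
        total += len(flags)
        # False sorts before True, so a sorted vector is "false... then true...".
        all_monotone = all_monotone and flags == sorted(flags)
    return total, all_monotone, regs
```

For the six-element stream [7, 3, 9, 1, 8, 2] and M = 3, this model performs 6×3 = 18 comparisons, and every result vector is monotone.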
Selecting the M most significant elements out of N elements is a common problem in many data processing systems. When N is small, Batcher's Odd/Even sorting algorithm may be used to obtain the desired performance. For large values of N, the Insertion sort logic shown in FIG. 3 may have simpler hardware. However, it accepts only one input element at a time. In many applications, it is required to extract the M smallest or largest elements from a set of N elements. In general, the total number of elements N in a set may be infinite in theory or very large in practice. The number of smallest or largest elements M is generally much smaller. A method and apparatus are disclosed that enable a high-throughput, lower-complexity streaming sorter.