Data sorting is a common operation that is frequently used in data analysis in industrial and commercial domains.
In general, in a single-processor data processing system as shown in FIG. 1, data sorting mainly involves three stages: 1) collecting the data to be sorted from the main storage; 2) using the processing core to sort the fetched data; 3) distributing the sorted data back to the main storage.
As semiconductor process technology approaches its limits, increasing the number of processing nodes in a data processing system is, for the near future, more feasible than continuing to improve the capability of a single processor through advances in the semiconductor process.
FIG. 2 shows the architecture of a conventional multi-processor system. As shown in FIG. 2, the multi-processor system generally has a plurality of processors operating on a shared main memory, including one core CPU and a plurality of Accelerated Processing Units (APUs).
For example, the Cell Broadband Engine (CBE) is a single-chip multi-processor system, which has 9 processors operating on a shared main memory, including a Power Processing Unit (PPU) and 8 Synergistic Processing Units (SPUs). With such a system architecture, the CBE can provide outstanding data computing capability. Thus, for data sorting on a large data set, if a multi-processor system such as the CBE is used, the performance of the sorting process could be significantly improved.
In a multi-processor system such as the CBE, however, for a plurality of accelerators to perform data sorting in parallel on a data set to be sorted, the data set needs to be partitioned evenly to fit the size of the local storages of the accelerators. The main process stages therefore comprise: 1) partitioning the data set to be sorted and distributing it from the main memory to the plurality of accelerators; 2) the plurality of accelerators performing sorting in parallel on their respective data; 3) converging the data sorting results of the plurality of accelerators into the main memory.
However, in the above process, how specifically to partition the data set and how to converge the data sorting results remain open challenges.
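As an illustration only, the three stages described above can be sketched in Python. This is a minimal sketch under stated assumptions, not a solution to the partitioning and converging problems: the function names `partition_evenly` and `sort_in_three_stages` and the parameter `chunk_capacity` (the number of elements assumed to fit in one accelerator's local storage) are hypothetical, the per-chunk sort runs sequentially as a stand-in for parallel accelerators, and a k-way merge is assumed for the converging stage.

```python
import heapq
from typing import List


def partition_evenly(data: List[int], chunk_capacity: int) -> List[List[int]]:
    """Split data into near-equal chunks, none larger than chunk_capacity."""
    if chunk_capacity <= 0:
        raise ValueError("chunk_capacity must be positive")
    if not data:
        return []
    # Smallest number of chunks that keeps every chunk within capacity.
    num_chunks = -(-len(data) // chunk_capacity)  # ceiling division
    base, extra = divmod(len(data), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional element.
        size = base + (1 if i < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def sort_in_three_stages(data: List[int], chunk_capacity: int) -> List[int]:
    # Stage 1: partition the data set to fit the assumed local storage.
    chunks = partition_evenly(data, chunk_capacity)
    # Stage 2: sort each chunk (sequential stand-in for parallel accelerators).
    sorted_chunks = [sorted(chunk) for chunk in chunks]
    # Stage 3: converge the sorted chunks with a k-way merge.
    return list(heapq.merge(*sorted_chunks))
```

In this sketch the converging stage must still read every sorted chunk back through a single merge, which is one reason the choice of partitioning and converging strategy matters for overall efficiency.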
Furthermore, in a multi-processor system, the capacity of the local storage of each accelerator is generally limited, since it would be very costly to equip each accelerator with a large local storage. For example, in the CBE, the capacity of the local storage of each SPU is 256 KB, which is not sufficient for a large data set.
Thus, if the data set is not well partitioned, then when a plurality of accelerators perform their respective sorting tasks in parallel, data swap operations may need to be performed repeatedly between each of the accelerators and the main memory by using DMA operations. A large number of data swap operations will make the main memory operation less efficient, since the memory bandwidth between the main memory and the plurality of accelerators is generally limited. For example, in the CBE, the memory bandwidth between the SPUs and the main memory can only be maintained at about 25.6 GB/s, which is shared by the 8 SPUs.
In addition, if the data set is not well partitioned, it is also possible that, when the plurality of accelerators perform their respective sorting tasks, each of the accelerators needs to perform data communication with other accelerators, which will also lower the sorting efficiency.
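To make the figures above concrete, the following rough calculation is offered as a sketch. It assumes that the bandwidth is divided equally among the SPUs and that each sort key occupies 4 bytes; neither assumption comes from the text above, and the helper names are hypothetical. The key count is an upper bound, since part of the local storage is also needed for program code and buffers.

```python
LOCAL_STORE_BYTES = 256 * 1024    # local storage of each SPU (from the text)
AGGREGATE_BANDWIDTH_GB_S = 25.6   # SPU-to-main-memory bandwidth (from the text)
NUM_SPUS = 8                      # number of SPUs (from the text)
KEY_SIZE_BYTES = 4                # assumption: 32-bit sort keys


def per_spu_bandwidth_gb_s() -> float:
    """Bandwidth per SPU, assuming an equal share for each SPU."""
    return AGGREGATE_BANDWIDTH_GB_S / NUM_SPUS


def max_keys_per_local_store() -> int:
    """Upper bound on the keys held if the whole local storage stored keys."""
    return LOCAL_STORE_BYTES // KEY_SIZE_BYTES


print(per_spu_bandwidth_gb_s())    # about 3.2 GB/s per SPU
print(max_keys_per_local_store())  # 65536 keys at most
```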
Therefore, when data sorting is performed on a multi-processor system such as the CBE, data swap operations between the main memory and the accelerators, and between the accelerators themselves, should be reduced.
Furthermore, a typical data sorting algorithm contains a large number of branch (compare) operations; however, in a multi-processor system such as the CBE, the capability for branch operations is relatively weak. This is also a problem that should be taken into consideration when performing data sorting on a multi-processor system such as the CBE.
Based on the above considerations, there is a need for a data partitioning and sorting solution that is suitable for a multi-processor system such as the CBE.
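One generally known way to reduce data-dependent branching, shown here only as an illustrative sketch and not as the approach of this document, is a sorting network built from compare-exchange steps. The sequence of positions compared depends only on the input length, never on the data values. The function names below are hypothetical, and Python's `min` and `max` stand in for the branch-free minimum and maximum (select) instructions assumed to be available on the accelerator hardware.

```python
from typing import List


def compare_exchange(values: List[int], i: int, j: int) -> None:
    """Place the smaller of values[i], values[j] at i and the larger at j."""
    lo = min(values[i], values[j])
    hi = max(values[i], values[j])
    values[i], values[j] = lo, hi


def odd_even_transposition_sort(values: List[int]) -> List[int]:
    """Sort a copy of values using a fixed schedule of compare-exchange steps."""
    result = list(values)
    n = len(result)
    # n phases are sufficient; the index pairs visited depend only on n.
    for phase in range(n):
        # Even phases compare pairs (0,1), (2,3), ...; odd phases (1,2), (3,4), ...
        for i in range(phase % 2, n - 1, 2):
            compare_exchange(result, i, i + 1)
    return result
```

The compare-exchange steps within a single phase touch disjoint pairs, so they are independent of one another and could in principle be executed together on wide vector hardware.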