Sorting an array consisting of a large number of data elements according to the value of a key included in each data element is employed in many applications. Approaches for carrying out such sorting are generally classified into two types. One approach is to sort the data directly. The other is to pair the key value of each data element with an index representing the position of the data element within the array, sort the pairs, and rearrange the actual data elements according to the result of sorting the pairs. In the latter approach, all the pairs are sorted first, and the actual data elements are then rearranged based on the indices, which are now arranged in sorted order.
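The two approaches can be contrasted with a minimal sketch. It is illustrative only: the `Record` type and the function names `sort_direct` and `sort_by_pairs` are assumptions introduced here and do not come from any cited technique.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical data element: a sort key plus a payload."""
    key: int
    payload: str


def sort_direct(records):
    """First approach: sort the data elements themselves by key."""
    return sorted(records, key=lambda r: r.key)


def sort_by_pairs(records):
    """Second approach: sort (key, index) pairs, then rearrange the data."""
    # Pair each key with the position of its data element in the array.
    pairs = [(r.key, i) for i, r in enumerate(records)]
    # Only the small pairs move during sorting; the data elements stay put.
    pairs.sort()
    # Final step: gather the data elements in the order given by the indices.
    return [records[i] for _, i in pairs]


records = [Record(3, "c"), Record(1, "a"), Record(2, "b"), Record(1, "d")]
by_pairs = sort_by_pairs(records)
direct = sort_direct(records)
```

Both functions return the same ordering here. The difference lies in what is moved while sorting: whole data elements in the first approach, and only key-index pairs in the second, with the data elements gathered once at the end.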
For example, a technique for implementing merge sort is disclosed in JP2000-56947A. In the technique, internal sorting is performed in each of input nodes for data to be sorted, which are distributed and stored in input local disks, and the results of the internal sorting are stored as multiple sorted sequences in a shared disk connected between the input nodes and an output node. When the output node receives a merge instruction from all of the input nodes, it reads and merges the sorted sequences from the shared disk before outputting the result of the overall sorting of the entire input data to an output local disk.
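The general flow of sorting locally and then merging the sorted sequences can be sketched as follows. This is a simplified single-process illustration and not the implementation disclosed in JP2000-56947A: plain lists stand in for the input local disks and the shared disk, and the names `internal_sort` and `merge_sorted_runs` are hypothetical.

```python
import heapq


def internal_sort(local_data):
    """Stand-in for one input node sorting its locally stored data."""
    return sorted(local_data)


def merge_sorted_runs(runs):
    """Stand-in for the output node merging all sorted sequences."""
    # heapq.merge performs a k-way merge over already sorted inputs.
    return list(heapq.merge(*runs))


# Assumed example data: three "input local disks" holding unsorted values.
input_local_disks = [[9, 4, 7], [3, 8, 1], [6, 2, 5]]

# Each input node writes its sorted sequence to the "shared disk".
shared_disk = [internal_sort(chunk) for chunk in input_local_disks]

# The output node merges the sorted sequences into the overall result.
result = merge_sorted_runs(shared_disk)
```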
The above-mentioned approach of sorting data directly requires repeated copying of the actual data within the memory during sorting, leading to a large memory-copy overhead. Additionally, since the keys are scattered throughout the memory, accesses to the keys are non-contiguous, making it difficult to adopt an acceleration technique that employs Single Instruction Multiple Data (SIMD) instructions, which process multiple data items with a single instruction. In contrast, the aforementioned approach of rearranging data based on the result of pair sorting is suited for use with SIMD instructions, because it does not require movement of the actual data during sorting and involves only simple sorting of integers based on the keys. However, in the final process, in which the data elements are rearranged after the pairs have been sorted, random accesses that directly access the data locations are performed in parallel. Consequently, an enormous number of cache misses, that is, cases in which the required data is absent from the cache memory, occur, resulting in an increased processing time. In addition, since a large number of memory accesses take place in parallel in the final process, the memory bandwidth between the central processing unit (CPU) and the memory becomes a bottleneck, and the benefit of the SIMD technique cannot be obtained.