The present invention relates generally to the field of parallel computation, and more specifically, to a data driven parallel sorting system and method.
A parallel sorting algorithm is an algorithm that improves sorting efficiency by using the parallel computation capability of a computer. Parallel sorting is applicable in fields such as databases, extraction-transformation-load (ETL), etc. A parallel sorting algorithm typically adopts a divide and conquer approach. That is, a parallel sorting algorithm divides a sequence to be sorted into a certain number of sub-sequences, orders each sub-sequence, and then merges the ordered sub-sequences to produce an entirely ordered sequence.
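By way of illustration only, the divide and conquer approach described above may be sketched as follows. This is a minimal sequential sketch, not a definitive implementation; the function names `merge` and `divide_and_conquer_sort` and the parameter `num_subsequences` are assumptions introduced for this example.

```python
def merge(left, right):
    """Merge two ordered lists into one ordered list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of the two lists still has elements left over.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def divide_and_conquer_sort(sequence, num_subsequences=4):
    """Divide, order each sub-sequence, then merge (hypothetical helper)."""
    # Divide: cut the sequence into contiguous sub-sequences of roughly equal size.
    size = max(1, -(-len(sequence) // num_subsequences))  # ceiling division
    subsequences = [sequence[k:k + size] for k in range(0, len(sequence), size)]
    # Order each sub-sequence independently.
    ordered = [sorted(s) for s in subsequences]
    # Merge the ordered sub-sequences into one entirely ordered sequence.
    result = []
    for s in ordered:
        result = merge(result, s)
    return result
```

In this sketch the sub-sequences are ordered one after another; because each sub-sequence is ordered independently of the others, that step is the part that lends itself to parallel execution.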
When parallel sorting is used, data is often distributed to multiple partitions. Each partition corresponds to a sorting process, which is, for example, a process or a thread. For each partition, the sorting process sorts the data that was distributed to that partition. The sorting processes of the respective partitions are performed in parallel. Merge sorting is then applied to the ordered data across all partitions to complete the sorting of all data. The merge sorting may utilize any of various existing merge sorting algorithms, as long as the algorithm merges a plurality of ordered sequences into one ordered sequence.
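The partition-based flow described above may be sketched, by way of example only, as follows. The sketch assumes round-robin distribution of data to partitions and one worker thread per partition; the name `parallel_sort` and the parameter `num_partitions` are hypothetical. It uses `heapq.merge` from the Python standard library as one existing way to merge a plurality of ordered sequences.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def parallel_sort(records, num_partitions=4):
    """Sort records using one sorting worker per partition (illustrative sketch)."""
    # Distribute the data to partitions (round-robin distribution is an assumption).
    partitions = [records[p::num_partitions] for p in range(num_partitions)]
    # Each partition is sorted by its own worker; the workers run concurrently.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        ordered_partitions = list(pool.map(sorted, partitions))
    # Merge the ordered data across all partitions into one ordered sequence.
    return list(heapq.merge(*ordered_partitions))
```

Threads are used here only to keep the sketch short. In CPython, CPU-bound sorting in threads is limited by the global interpreter lock, so an implementation seeking actual speedup might use separate processes (for example, `concurrent.futures.ProcessPoolExecutor`) instead.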
Parallel sorting is often applied in a data warehouse. For example, it may be used to sort input stream data from a plurality of databases residing in the data warehouse. The input stream data is composed of data records, which may be sorted according to a particular field. In such an application, the volume of data can be very large, and it may not be possible to hold all the data records in memory at the same time during sorting.