This application includes Appendix A, which forms a part of this specification and which is herein incorporated by reference.
This invention relates generally to data processing and, specifically, to a method and apparatus that partitions data in conjunction with, for example, a parallel sorting method.
As data processing has advanced in recent years, the amount of data stored and processed by computer and other data processing systems has grown enormously. Current applications, such as data mining systems and systems that perform data operations on very large databases, often need to process huge amounts of data (called a "data set"). Such large data sets can often be larger than the memory of the computer or computers that process them. For example, current data sets are often in the range of several terabytes (2^40 bytes) or more, and it is anticipated that data sets will be even larger in the future. Current data processing systems require parallel external sorting techniques.
Various conventional methods have been devised to sort very large amounts of data, including data that is larger than the memory of the system doing the sorting. The standard text of Knuth, "The Art of Computer Programming, Vol. 3, Sorting and Searching," Addison Wesley Longman Publishing, second edition, 1998, pp. 252-380, discloses several conventional external sorting methods. In order to perform a parallel sort, it is necessary to determine a set of sort key values that will be used to divide the sorted data between the multiple processes or CPUs involved in the sort. This problem is called "partitioning" or "selection." Several conventional parallel sorts use a sampling method to determine the keys for the multiple processes.
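By way of illustration only, the conventional sampling approach mentioned above can be sketched as follows. This is a minimal sketch and not the method of the invention; the function name `sample_splitters`, its parameters, and the use of evenly spaced sample elements are assumptions made for the example. The sketch also assumes the data is non-empty and held in memory.

```python
import random

def sample_splitters(work_files, num_partitions, sample_size, seed=0):
    """Pick num_partitions - 1 approximate splitting keys from a random sample.

    Hypothetical helper: work_files is a list of in-memory key lists.
    """
    rng = random.Random(seed)
    all_keys = [key for wf in work_files for key in wf]
    # Draw a sample without replacement and sort it.
    sample = sorted(rng.sample(all_keys, min(sample_size, len(all_keys))))
    # Evenly spaced sample elements serve as approximate splitting keys.
    return [sample[(i * len(sample)) // num_partitions]
            for i in range(1, num_partitions)]
```

Because the keys come from a sample, the resulting partitions are only approximately equal in size, which is one reason sampling may behave poorly on some data distributions.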
As data sets grow ever larger, however, conventional sorting methods are often not fast enough and are not always efficient for all distributions of data. In addition, certain conventional methods do not work when the data to be sorted contains variable length records. What is needed is a new method of parallel sorting that is faster and more efficient than conventional parallel sorting methods and that operates correctly on a wide range of data distributions, as well as on variable length records.
An embodiment of the present invention provides a method and apparatus for sorting very large data sets using a parallel merge sort. A described embodiment of the invention operates in a clustered computer system, although it is contemplated that the invention can be implemented for any appropriate distributed (or shared memory) computer system, such as a computer network or the internet. The method of the present invention can also be used to locate database quantiles or to partition other types of keys in near-minimum time (as discussed in further detail below). The method of the present invention can also be used to perform a distribution sort, as described in Appendix A, which is a part of this specification and is herein incorporated by reference.
Given sorted work files S1, . . . , SP, produced by P processes, the described embodiment of the method effectively implements a parallel merge onto respective output partitions O1, . . . , OP of the P processes. Because each of these output partitions Oj (1 ≤ j ≤ P) has a finite size, the invention must quickly determine "splitting keys" for each output partition Oj in such a way that the data in the work files will be split between the multiple output partitions Oj without overrunning the size of any of the partitions Oj. Once the splitting keys for each partition are determined, the processes exchange data so that the output partition of each process contains the data between the splitting keys associated with that output partition.
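The constraint that no output partition is overrun can be illustrated with a single-process sketch. It substitutes a plain binary search over the sorted work files for the parallel procedure of the described embodiment, and it assumes each partition other than the last is filled to capacity. The names `splitting_key`, `cut_positions`, `partition_cuts`, and `capacities` are hypothetical.

```python
from bisect import bisect_left, bisect_right

def count_le(work_files, key):
    """Number of records with key <= `key` across all sorted work files."""
    return sum(bisect_right(wf, key) for wf in work_files)

def splitting_key(work_files, rank):
    """Smallest key k such that at least `rank` records are <= k (rank >= 1)."""
    best = None
    for wf in work_files:
        lo, hi = 0, len(wf)
        while lo < hi:  # first index in wf whose key satisfies the rank test
            mid = (lo + hi) // 2
            if count_le(work_files, wf[mid]) >= rank:
                hi = mid
            else:
                lo = mid + 1
        if lo < len(wf) and (best is None or wf[lo] < best):
            best = wf[lo]
    return best

def cut_positions(work_files, rank):
    """Per-file cut indices with exactly `rank` records left of the cuts."""
    if rank == 0:
        return [0] * len(work_files)
    key = splitting_key(work_files, rank)
    cuts = [bisect_left(wf, key) for wf in work_files]
    spare = rank - sum(cuts)  # records equal to `key` still needed
    for i, wf in enumerate(work_files):
        take = min(spare, bisect_right(wf, key) - cuts[i])
        cuts[i] += take
        spare -= take
    return cuts

def partition_cuts(work_files, capacities):
    """Cut indices separating partition j from partition j+1, for each j."""
    total = sum(len(wf) for wf in work_files)
    if total > sum(capacities):
        raise ValueError("data does not fit in the output partitions")
    result, rank = [], 0
    for cap in capacities[:-1]:
        rank = min(rank + cap, total)  # fill each partition to capacity
        result.append(cut_positions(work_files, rank))
    return result
```

For three work files of four records each and capacities of four records per partition, `partition_cuts` returns one list of cut indices per boundary, and duplicate keys that straddle a boundary are divided so that the capacity is met exactly.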
In accordance with the purpose of the invention, as embodied and broadly described herein, the invention relates to a method of parallel sorting a large amount of data, performed by a plurality of processes of the data processing system and comprising: providing, for each process, a work file, each work file containing a respective portion of the data to be sorted, where the data within each work file is in sorted order; determining an initial upper and lower bound associated with each process; sending, by each of the processes in parallel, a plurality of messages to each of the other processes indicating current upper bounds of the sending process to determine an upper bound for the sending process; and performing, by the processes, a merge in which each of the processes creates an output partition containing data within its upper and lower bounds.
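The final merge step recited above, in which each process creates an output partition containing data within its upper and lower bounds, might look like the following sketch. It does not reproduce the message exchange by which the bounds are determined; it assumes the bounds are already known, that the lower bound is exclusive and the upper bound inclusive, and that `None` denotes an open bound. The function name `build_output_partition` is hypothetical.

```python
import heapq
from bisect import bisect_right

def build_output_partition(work_files, lower, upper):
    """Merge the records with lower < key <= upper from every sorted work file."""
    slices = []
    for wf in work_files:
        # Binary search locates the in-bounds run within each sorted file.
        start = 0 if lower is None else bisect_right(wf, lower)
        stop = len(wf) if upper is None else bisect_right(wf, upper)
        slices.append(wf[start:stop])
    # A k-way merge of the sorted slices yields a sorted output partition.
    return list(heapq.merge(*slices))
```

In a distributed setting each slice would be sent by the process that owns the work file to the process that owns the output partition; here all work files are simply in-memory lists.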
Advantages of the invention will be set forth, in part, in the description that follows and, in part, will be understood by those skilled in the art from the description herein. The advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims and equivalents.