The present invention relates generally to efficient bundle sorting.
External memory sorting is an extensively researched area. Many efficient in-memory sorting algorithms, such as merge sort, have been adapted for sorting in external memory, and much of the recent research in external memory sorting has been dedicated to improving running-time performance. Over the years, numerous authors have reported the performance of their sorting algorithms and implementations (cf. [Aga96, BBW86, BGK90]). We note a recent paper [ADADC+97] which shows external sorting of 6 GB of data in under one minute on a network of workstations. For the problem of bundle sorting where k < N/B, we note that our algorithm reduces the number of I/Os that all these algorithms perform and can hence be utilized in benchmarks. We also consider a more performance-sensitive model of external memory in which, rather than simply counting I/Os to determine performance, sequential I/Os are assigned a reduced cost compared to random-access I/Os. We study the tradeoffs in that model and show how our bundle sorting algorithm is adapted to arrive at an optimal algorithm in that model. We also note that another recent paper [ZL98] shows in detail how to improve the merge phase of the external merge sort algorithm, a phase that is completely avoided by using our in-place algorithm.
In the general framework of external memory algorithms, Aggarwal and Vitter showed a lower bound of Ω((N/B) log_{M/B}(N/B)) on the number of I/Os needed in the worst case for sorting [AV88, Vit99]. In contrast, since the performance of our algorithm depends on the number k of distinct keys, we are able to circumvent this lower bound when k << N/B. Moreover, we prove a matching lower bound for bundle sorting, which shows that our algorithm is optimal.
Finally, sorting is used not only to produce sorted output, but also in many sort-based algorithms such as grouping with aggregation, duplicate removal, sort-merge join, as well as set operations including union, intersect, and except [Gra93, IBM95]. In many of these cases the number of distinct keys is relatively small and hence bundle sorting can be used for improved performance. We identify important applications for bundle sorting, but note that since sorting is such a common procedure, there are probably many more applications for bundle sorting that we did not consider.
Many data sets to be sorted consist of a limited number of distinct keys. Sorting such data sets can be thought of as bundling together identical keys and placing the bundles in order; we therefore denote this as bundle sorting. We describe an efficient algorithm for bundle sorting in external memory that requires at most c (N/B) log_{M/B} k disk accesses, where N is the number of keys, M is the size of internal memory, k is the number of distinct keys, B is the transfer block size, and 2 < c < 4. For moderately sized k, this bound circumvents the Θ((N/B) log_{M/B}(N/B)) I/O lower bound known for general sorting. We show that our algorithm is optimal by proving a matching lower bound for bundle sorting. The improved running time of bundle sorting over general sorting can be significant in practice, as demonstrated by experimentation. An important feature of the new algorithm is that it is executed "in-place", requiring no additional disk space.
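To make the two bounds above concrete, the following Python sketch evaluates both expressions for given parameters. It is an illustration only: the function names are hypothetical, the value c = 3.0 is an arbitrary choice within the stated range 2 < c < 4, and the general-sorting expression omits its hidden constant factor, so the two numbers indicate how the bounds scale rather than giving an exact comparison.

```python
import math


def log_base(x, base):
    """Logarithm of x to the given base, computed via base-2 logs."""
    return math.log2(x) / math.log2(base)


def general_sort_ios(n, m, b):
    """(N/B) * log_{M/B}(N/B): general sorting bound, constant factor omitted."""
    return (n / b) * log_base(n / b, m / b)


def bundle_sort_ios(n, m, b, k, c=3.0):
    """c * (N/B) * log_{M/B}(k): bundle sorting bound.

    c = 3.0 is an assumed value; the text only states 2 < c < 4.
    """
    return c * (n / b) * log_base(k, m / b)


if __name__ == "__main__":
    # Assumed example parameters, all measured in keys:
    # N = 2**30 keys, M = 2**20 keys of memory, B = 2**10 keys per block.
    n, m, b = 2**30, 2**20, 2**10
    for k in (4, 32, 1024):
        print(k, bundle_sort_ios(n, m, b, k), general_sort_ios(n, m, b))
```

Under these assumed parameters the bundle sorting expression grows with log k while the general expression stays fixed, which is why the advantage is claimed only for moderately sized k.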
The present invention discloses a method of sorting data sets that include a predetermined number of distinct keys. The method comprises, for example, two steps. The first step comprises bundling the data sets such that substantially identical keys, having substantially identical key values, are bundled together. The second step comprises ordering the bundles in a predetermined order, with respect to the order defined by the substantially identical key values of each bundle. The method is preferably performed using external memory.
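The two steps described above can be sketched in Python as follows. This is a minimal in-memory illustration of the bundling and ordering idea only, not the in-place external-memory algorithm of the invention; the function name bundle_sort and its key parameter are hypothetical, and the sketch uses additional memory proportional to the input.

```python
from collections import defaultdict


def bundle_sort(records, key=lambda r: r):
    """Group records by key, then emit the groups in key order.

    In-memory sketch only: it does not model block transfers and is
    not in-place.
    """
    # Step 1: bundle together records whose keys are identical.
    bundles = defaultdict(list)
    for rec in records:
        bundles[key(rec)].append(rec)

    # Step 2: order the bundles by key value. Only the k distinct keys
    # are compared here, not all N records.
    ordered = []
    for k in sorted(bundles):
        ordered.extend(bundles[k])
    return ordered


if __name__ == "__main__":
    rows = [("b", 1), ("a", 2), ("b", 3), ("c", 4), ("a", 5)]
    print(bundle_sort(rows, key=lambda r: r[0]))
```

Because records are appended to their bundle in input order, records with equal keys keep their relative order in this sketch.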