Sorting involves placing elements in a certain order. Most commonly, sorting is performed by numerical or lexicographical order. Efficient sorting is important to optimize search and merge operations, which are very common in a large number of computational activities. For example, sorting is one of the most important operations in a database system. The most prominent cases are queries that include an ORDER BY clause, which specifies the sort order for query output. A variety of other operations, such as aggregation, joins or building of indices may also leverage sorting of data to achieve better performance.
One important aspect of sorting in a large database is that the input data may be significantly larger than the available physical memory. Therefore, in-memory sorting cannot be used. To circumvent this problem, database systems use external sort operations (e.g., operations performed on disk). An external sort initially generates sorted subdivisions, also referred to as “runs”. The runs on different disks are then merged to form a final sort result.
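The run-generation and merge steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular database system: the function name `external_sort`, the `run_size` parameter, and the one-integer-per-line temporary file format are all hypothetical, and the final result is collected into a list only to keep the example short.

```python
import heapq
import os
import tempfile


def external_sort(values, run_size):
    """Sort integers by spilling sorted runs to temp files, then merging them.

    `run_size` (hypothetical parameter) stands in for the amount of data
    that fits in memory at once.
    """
    run_paths = []
    buffer = []

    def spill():
        # Sort the in-memory buffer and write it out as one run.
        buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in buffer)
        run_paths.append(path)
        buffer.clear()

    # Phase 1: generate sorted runs.
    for v in values:
        buffer.append(v)
        if len(buffer) >= run_size:
            spill()
    if buffer:
        spill()

    # Phase 2: merge the runs; heapq.merge reads each run lazily.
    files = [open(p) for p in run_paths]
    try:
        streams = [(int(line) for line in f) for f in files]
        # A real system would stream this output instead of building a list.
        return list(heapq.merge(*streams))
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)


print(external_sort([5, 3, 9, 1, 7, 2, 8], run_size=3))
```

Only one run's worth of values is held in the buffer during the first phase, and the merge keeps a single pending value per run.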
Research and commercial development of external sort operations have focused on reducing input/output operations because input/output cost is typically considered the dominant performance factor. Increasingly, this trade-off is changing: modern large-scale database systems have very large aggregate input/output bandwidth.
Moreover, in On-Line Analytical Processing (OLAP) applications, the characteristics of data differ from more traditional database applications, such as On-Line Transactional Processing (OLTP). For example, while most sort operations in an OLTP system are executed on only a small number of sorting values, i.e., columns to sort on, it is common for OLAP queries to have tens or even hundreds of sorting values. Often, such a sort on complex data is dominated by CPU cost instead of input/output cost.
Where there are many sorting values, the number of columns that must be considered during a comparison depends upon the progress of the sorting algorithm. In the beginning, typically only the first several values are compared. However, as sorting progresses, more and more values need to be extracted and compared. From the user's perspective, the sorting algorithm becomes progressively slower as it runs. Data extraction and comparison may add significant overhead.
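To make the per-comparison cost concrete, the following sketch counts how many columns a lexicographic comparison has to inspect before it can decide. The function name `compare_rows` and the `stats` dictionary are hypothetical names introduced only for this illustration.

```python
def compare_rows(a, b, stats):
    """Lexicographic comparison that records how many columns it inspects.

    Returns -1, 0, or 1. `stats` is a hypothetical counter dictionary.
    """
    for x, y in zip(a, b):
        stats["columns_inspected"] += 1
        if x != y:
            return -1 if x < y else 1
    return 0


stats = {"columns_inspected": 0}

# Rows that differ in the first column: one column is enough.
compare_rows((1, 2, 3, 4), (2, 0, 0, 0), stats)
print(stats["columns_inspected"])  # 1

# Rows that tie on the first three columns: four columns are inspected.
stats["columns_inspected"] = 0
compare_rows((1, 2, 3, 4), (1, 2, 3, 5), stats)
print(stats["columns_inspected"])  # 4
```

The longer the prefix two rows share, the more values have to be extracted and compared before their order is known.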
Heap sort is a common approach to sorting. A heap sort relies upon a data structure called a heap, which is a special type of binary tree. Heap sort consists of two phases: a build phase, in which data is organized into a heap data structure, and a retrieval phase, in which elements are retrieved in sorted order. During the build phase, a data list is formed into a heap whose root node is guaranteed to be the smallest (or largest) element. The remaining elements are organized so that no child element is smaller (or greater) than its parent; this property is referred to as the heap property. During the retrieval phase, the root is removed and placed in a sorted list, and the heap is rearranged so that the smallest (or largest) remaining element moves to the root. The removal of the root node and the subsequent reorganization of the heap are repeated until all elements have been removed from the heap.
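The two phases can be sketched with Python's standard `heapq` module, which maintains a smallest-at-root heap stored in a list. The helper names `heap_sort` and `is_min_heap` are hypothetical and chosen only for this illustration.

```python
import heapq


def is_min_heap(heap):
    """Check the heap property on the array layout used by heapq:
    the parent of index i is at index (i - 1) // 2."""
    return all(heap[(i - 1) // 2] <= heap[i] for i in range(1, len(heap)))


def heap_sort(items):
    # Build phase: organize the data into a heap.
    heap = list(items)
    heapq.heapify(heap)
    assert is_min_heap(heap)

    # Retrieval phase: repeatedly remove the root; heappop restores
    # the heap property after each removal.
    result = []
    while heap:
        result.append(heapq.heappop(heap))
    return result


print(heap_sort([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]
```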
FIG. 1 illustrates processing associated with a heap sort. When a new value arrives (New Value: YES 102), it is processed 104 and inserted into the heap so that the heap property holds (the following examples reference only the smallest value). If necessary, previously inserted values are reorganized 100 to place the smallest value at the root.
If no new value arrives (New Value: NO 102), it is determined at block 106 whether the process is completed. If so (Done: YES), the sorted values are loaded into a final sorted list 108. If not (Done: NO), control returns to block 102 until a new value arrives.
These operations are more fully appreciated with reference to FIG. 2, which illustrates a heap. In particular, the heap is ordered with the lowest entry at the root; entries are compared on the first value and then, when the first values tie, on the second value (i.e., (first value, second value)).
Suppose now that a new value (4, 8) arrives. The new value (4, 8) is inserted at a position that preserves the heap property, as shown in FIG. 3.
Suppose no new values arrive for processing. The current root value (2, 3) is removed and placed in the final sorted list. The heap is reorganized according to the heap property: (2, 7) is placed at the root and (4, 8) moves up accordingly, as shown in FIG. 4. Next, the root value (2, 7) is removed from the heap and added to the sorted list. Now (4, 4) becomes the new root and (4, 6) moves up, as shown in FIG. 5. This processing is repeated until the heap contains no more values. At this point all elements have been retrieved in the desired sort order. The sort process is complete.
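The walk-through above can be reproduced in a short sketch. It assumes the heap of FIG. 2 holds exactly the four values named in the text before (4, 8) arrives; the figures themselves may contain additional values. The function name `sort_arrivals` is hypothetical, and the internal layout produced by `heapq` need not match the tree shapes drawn in the figures, although the retrieval order is the same.

```python
import heapq


def sort_arrivals(initial, arrivals):
    """Insert arriving values into a min-heap, then drain it in sorted order."""
    heap = list(initial)
    heapq.heapify(heap)

    # A new value arrives: insert it so that the heap property holds.
    for value in arrivals:
        heapq.heappush(heap, value)

    # No more values arrive: remove the root repeatedly.
    final_sorted = []
    while heap:
        final_sorted.append(heapq.heappop(heap))
    return final_sorted


# Tuples compare on the first value, then on the second value.
result = sort_arrivals([(2, 3), (2, 7), (4, 4), (4, 6)], [(4, 8)])
print(result)  # [(2, 3), (2, 7), (4, 4), (4, 6), (4, 8)]
```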
While heap sorting is well known and widely exploited, it suffers from various performance problems. For example, strictly restoring the heap property after each insertion and removal requires a large number of comparisons.
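One way to observe the comparison cost is to wrap each key in an object that counts how often it is compared. The class `Counted` and the function `count_heap_sort_comparisons` are hypothetical names for this sketch, and the measured count reflects the behavior of Python's `heapq` module rather than any specific database implementation.

```python
import heapq
import random


class Counted:
    """Wraps a key and counts every less-than comparison made on it."""

    comparisons = 0

    def __init__(self, key):
        self.key = key

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.key < other.key


def count_heap_sort_comparisons(keys):
    """Heap-sort `keys` and return (sorted_keys, number_of_comparisons)."""
    Counted.comparisons = 0
    heap = [Counted(k) for k in keys]
    heapq.heapify(heap)
    ordered = [heapq.heappop(heap).key for _ in range(len(heap))]
    return ordered, Counted.comparisons


random.seed(0)
keys = [random.randrange(1_000_000) for _ in range(1000)]
ordered, n_comparisons = count_heap_sort_comparisons(keys)
print(len(ordered), n_comparisons)
```

When each comparison must itself walk through many sorting values, this count is multiplied by the per-comparison cost.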
Therefore, it would be desirable to develop a more computationally efficient form of heap sorting.