Sort is a pervasively used operator in database queries. Usually, the sort algorithm in a database has good computational (time) complexity in both the average case and the worst case. Another attribute of a database sort is that one cannot assume that the entire input will fit in memory, or that the size of the input is known exactly. For this reason, a database sort usually includes a merge phase that merges subsets of the result that have been sorted separately.
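As a minimal illustration of the merge phase described above, the following Python sketch sorts an input of unknown length in bounded-size runs and then merges the sorted runs. The function name `sort_in_runs` and the parameter `run_size` are hypothetical, and the runs are kept in a list for brevity, whereas an actual database sort would typically spill each run to temporary storage.

```python
import heapq
from itertools import islice


def sort_in_runs(records, run_size):
    """Sort an iterable of unknown length using bounded runs and a merge phase.

    Illustrative sketch only: each run is kept in memory here, whereas a
    database sort would typically write each sorted run to temporary storage.
    """
    iterator = iter(records)
    runs = []
    while True:
        # Read at most run_size records, so the working set stays bounded.
        chunk = list(islice(iterator, run_size))
        if not chunk:
            break
        chunk.sort()          # sort this subset on its own
        runs.append(chunk)
    # Merge phase: lazily combine the separately sorted runs.
    return heapq.merge(*runs)


result = list(sort_in_runs(iter([5, 1, 4, 2, 8, 7, 3, 6, 0]), run_size=4))
print(result)
```

The input is passed as an iterator to emphasize that its total size does not need to be known before sorting begins.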
Often, the input data has certain characteristics. One such characteristic is that the input records are almost sorted (sometimes referred to as clustered). This can be because the data is stored in such a way, or because previous operations in the query processing provide such an input. Such clustering can be known before a sort happens via, for example, database statistics. For example, in some databases, high clustering implies that data accessed in index key order is somewhat sorted.
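To make the notion of "almost sorted" concrete, the sketch below computes a simple proxy: the fraction of adjacent record pairs that are already in order. This is an assumption chosen for illustration only. It is not the clustering statistic of any particular database, and the name `adjacent_order_ratio` is hypothetical.

```python
def adjacent_order_ratio(keys):
    """Return the fraction of adjacent pairs that are in nondecreasing order.

    Hypothetical proxy for "almost sorted"; not an actual DBMS statistic.
    """
    keys = list(keys)
    if len(keys) < 2:
        return 1.0  # an empty or single-record input is trivially sorted
    in_order = sum(1 for a, b in zip(keys, keys[1:]) if a <= b)
    return in_order / (len(keys) - 1)


print(adjacent_order_ratio([1, 2, 4, 3, 5]))  # one out-of-order pair of four
```

A value near 1.0 suggests an input that is close to sorted, while a value near 0.0 suggests an input that is close to reverse order.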
Existing approaches that have good average-case and worst-case complexity do not generate the best performance for input records with such clustering characteristics. There are algorithms that are efficient for sorted input, such as an insertion sort and a library sort. However, such algorithms require the entire input to fit in memory. The library sort also allocates a large amount of extra memory to avoid its worst-case complexity of O(n²); such an allocation is often not realistic for databases and does not scale to large data sets. Some other algorithms also use extra memory to reduce computational (time) complexity.
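The behavior of an insertion sort on sorted versus unsorted input can be seen in the following sketch, which counts key comparisons. It is a textbook in-memory insertion sort written for illustration, and the comparison counter is added here only to expose the cost.

```python
def insertion_sort(values):
    """Return (sorted_list, comparison_count) using a textbook insertion sort.

    The whole input is copied into memory, which is the limitation noted above.
    """
    items = list(values)
    comparisons = 0
    for i in range(1, len(items)):
        key = items[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if items[j] <= key:
                break              # key is already in position relative to items[j]
            items[j + 1] = items[j]  # shift the larger element to the right
            j -= 1
        items[j + 1] = key
    return items, comparisons


n = 1000
_, sorted_cost = insertion_sort(range(n))            # already sorted input
_, reversed_cost = insertion_sort(range(n, 0, -1))   # reverse-sorted input
print(sorted_cost, reversed_cost)
```

On already sorted input the count grows linearly with n (n - 1 comparisons), while on reverse-sorted input it grows quadratically (n(n - 1)/2 comparisons).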
Also, a database management system (DBMS) engine can maintain statistics on the data in tables. The statistics can be used for selecting an efficient query execution plan, and they also provide hints about the characteristics of the input. However, existing approaches do not use the statistics for a database query sort or for specializing a query sort. Consequently, a “one size fits all” sorting algorithm does not perform optimally, since it cannot pick the best sorting strategy. For example, existing approaches for sorting data do not give special handling to data that is clustered (that is, almost sorted), as indicated by a “cluster ratio of an index.”
Additionally, existing approaches do not use statistics to launch a more efficient algorithm for “almost sorted” input data. Some existing approaches include algorithms that can achieve O(N) complexity (O refers to “Big O” notation, as would be appreciated by one skilled in the art) on the best-case input sequence, but these algorithms require all input data to fit in memory, some have O(N²) complexity in the worst case, and some use significantly more extra memory (as much as N times more) to maintain O(N) complexity.
Consequently, there is a need for an approach that can handle an ideal input most efficiently and a non-ideal input well without consuming more memory than the operation is allowed.