The present invention relates generally to data processing environments and, more particularly, to systems and methods for improved sorting of data records in those environments.
Perhaps one of the most fundamental tasks to the operation of computers is sorting--the process of arranging a set of similar information into a desired order. While employed in virtually all database programs, sorting routines are also extensively used in many other areas. Common examples include compilers, interpreters, and operating system software. And in many instances, the quality and performance of such software is judged by the efficiency of its sorting techniques. Since sorting methodology plays such an important role in the operation of computers and other data processing systems, there has been much interest in seeking ways to improve existing systems and methods. Historically, techniques for sorting information are divided into three general methods: exchange, selection, and insertion. Each will now be reviewed in turn.
To sort by exchange, a system swaps or "exchanges" out-of-order information until all data members are ordered. Perhaps the best-known example of exchange sorting is the infamous "bubble sort." The general methodology behind the bubble sort is that of repeated comparisons and attendant exchanges of adjacent members. In this manner, the method is analogous to bubbles in water, where each bubble percolates to its proper level.
As shown by the following C language example, a bubble sort method keeps passing through a set of members, exchanging adjacent elements as needed.
______________________________________
/* Sorts a[1..N] (1-based) into ascending order. */
void bubble(int a[], int N)
{
    int i, j, t;
    for (i = N; i >= 1; i--)
        for (j = 2; j <= i; j++)
            if (a[j - 1] > a[j])
                { t = a[j - 1]; a[j - 1] = a[j]; a[j] = t; }
}
______________________________________
When no more exchanges are required, the set is sorted. Observe that the number of comparisons for a bubble sort is always the same; in particular, the two "for" loops repeat a fixed number of times, regardless of when the list becomes ordered. This observation may be generalized as follows: the bubble sort always performs 1/2(n^2 - n) comparisons for "n" elements to be sorted. In other words, the outer loop makes n-1 passes that perform comparisons, while the inner loop executes n/2 times per pass on average.
Having considered the number of possible comparisons, next one should consider the possible number of exchanges required by the bubble sort. For an already sorted list (best case), no exchanges are required (i.e., the number of exchanges equals zero). As a list becomes less ordered, however, the number of exchanges approaches the number of comparisons. The end result is that the execution time approaches a multiple of the square of the number of elements, making the bubble sort impractical for large sorts.
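The comparison and exchange counts discussed above may be checked with an instrumented variant of the bubble sort. The following is a minimal sketch only; the `sort_counts` type and the `bubble_counted` name are hypothetical, and the array is indexed from 0 rather than from 1 as in the listing above.

```c
#include <stddef.h>

/* Hypothetical helper type: tallies of the work done by one sort. */
struct sort_counts {
    unsigned long comparisons;
    unsigned long exchanges;
};

/* Bubble sort over a[0..n-1] (0-based, an assumption of this sketch).
   Returns how many comparisons and exchanges were performed. */
struct sort_counts bubble_counted(int a[], size_t n)
{
    struct sort_counts c = {0, 0};
    size_t i, j;

    for (i = n; i >= 1; i--) {
        for (j = 1; j < i; j++) {
            c.comparisons++;            /* one comparison of adjacent members */
            if (a[j - 1] > a[j]) {
                int t = a[j - 1];
                a[j - 1] = a[j];
                a[j] = t;
                c.exchanges++;          /* one exchange of adjacent members */
            }
        }
    }
    return c;
}
```

For five elements the comparison count is 1/2(25 - 5) = 10 whether the input is ordered or not, while the exchange count ranges from zero for an ordered list to 10 for a list in reverse order.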
A selection sort, perhaps one of the simplest sorting algorithms, proceeds as follows. A system continually chooses or "selects" a data member from one extreme of possible values, such as the lowest-value member, until all members have been selected. Because the system always selects the lowest-value member from those remaining, the set will be ordered from lowest to highest-value member when the process is completed. The sort may be implemented by the following C code:
______________________________________
/* Sorts a[1..N] (1-based) into ascending order. */
void selection(int a[], int N)
{
    int i, j, min, t;
    for (i = 1; i < N; i++) {
        min = i;
        for (j = i + 1; j <= N; j++)
            if (a[j] < a[min]) min = j;
        t = a[min]; a[min] = a[i]; a[i] = t;
    }
}
______________________________________
As shown by this code snippet, the method first finds the lowest-value element in an array and exchanges it with the element in the first position. Next, the second smallest element is located and exchanged with the element in the second position. The process continues in this way until the entire array is sorted.
Like the bubble sort, the outer loop above executes n-1 times, while the inner loop executes 1/2(n) times on average. Thus, the number of comparisons grows on the order of n^2 (roughly 1/2 n^2), making the technique also too slow for processing a large number of items.
In a sort by insertion, the system examines a data member and places or inserts that member into a new set of members, always inserting each member in its correct position. The sort is completed when the last member has been inserted. This sort technique may be implemented as follows:
______________________________________
/* Sorts a[1..N] (1-based); the j > 1 test stops the scan at the front. */
void insertion(int a[], int N)
{
    int i, j, v;
    for (i = 2; i <= N; i++) {
        v = a[i]; j = i;
        while (j > 1 && a[j - 1] > v)
            { a[j] = a[j - 1]; j--; }
        a[j] = v;
    }
}
______________________________________
Unlike the previous two sorting techniques, however, the number of comparisons that occur with this technique depends on the initial order of the list. More particularly, the technique possesses "natural" behavior; that is, it does the least work when the list is already sorted and the most work when the list is in reverse order, thus making it useful for lists which are almost in order. Also, the technique does not disturb the order of equal keys. If a list already ordered on one key is sorted by insertion on a second key, members having equal second keys remain ordered on the first key; the list is thus sorted for both keys after the insertion sort.
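The dependence of the comparison count on the initial order may be illustrated with an instrumented insertion sort. This is a sketch under stated assumptions: the name `insertion_counted` is hypothetical, the array is indexed from 0, and an explicit bounds test is used in place of any sentinel value.

```c
#include <stddef.h>

/* Insertion sort over a[0..n-1] (0-based, an assumption of this sketch).
   Returns the number of key comparisons performed. */
unsigned long insertion_counted(int a[], size_t n)
{
    unsigned long comparisons = 0;
    size_t i, j;

    for (i = 1; i < n; i++) {
        int v = a[i];
        j = i;
        while (j > 0) {
            comparisons++;
            if (a[j - 1] <= v)      /* equal keys are not moved past each other */
                break;
            a[j] = a[j - 1];        /* shift the larger member one place right */
            j--;
        }
        a[j] = v;
    }
    return comparisons;
}
```

For five elements, an already ordered list costs 4 comparisons (n-1), while a list in reverse order costs 10 comparisons (1/2(n^2 - n)).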
A particular concern for any sort method is its speed, that is, how fast a particular sort completes its task. The speed with which an array of data members can be sorted is directly related to the number of comparisons and the number of exchanges which must be made. Related to the characteristic of speed is the notion of "best case" and "worst case" scenarios. For instance, a sort may have good speed given an average set of data, yet unacceptable speed given highly disordered data.
One technique for reducing the penalty incurred by exchanging full records is to employ a method which operates indirectly on a file, typically using an array of indices, with rearrangement done afterwards. In this manner any of the above sorting methods may be adapted so that only n "exchanges" of full records are performed. One particular approach is to manipulate an index to the records, accessing the original array only for comparisons. In other words, it is more efficient to sort an index to the records than to incur the cost of repeatedly moving large records.
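The indirect approach may be sketched as follows. The `record` layout (a small key with a 256-byte payload) and the `index_sort` name are hypothetical and chosen only for illustration; an insertion sort is used as the underlying method, although any of the methods above could be adapted in the same way.

```c
#include <stddef.h>

/* Hypothetical record layout: a small key and a bulky payload. */
struct record {
    int key;
    char payload[256];
};

/* Fills idx[0..n-1] so that recs[idx[0]], recs[idx[1]], ... are in
   ascending key order. The records themselves are only read, never moved. */
void index_sort(const struct record recs[], size_t idx[], size_t n)
{
    size_t i, j;

    for (i = 0; i < n; i++)
        idx[i] = i;                 /* start with the identity ordering */

    for (i = 1; i < n; i++) {
        size_t v = idx[i];
        j = i;
        /* Comparisons read the original array; only indices are shifted. */
        while (j > 0 && recs[idx[j - 1]].key > recs[v].key) {
            idx[j] = idx[j - 1];
            j--;
        }
        idx[j] = v;
    }
}
```

Each shift in this sketch moves one index value rather than a record of some 260 bytes; the full records may be rearranged afterwards in a single pass over the index, if rearrangement is needed at all.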
Since all of the simple sorting techniques above execute in time proportional to n^2, their usefulness for sorting files with a large number of records is limited. In other words, as the amount of data to be sorted increases, the execution time of the technique grows quadratically, becoming at some point too slow to use. Thus, there has been great interest in developing improved techniques for sorting information.
Perhaps the best-known improved sorting technique is quicksort, invented in 1960. Quicksort's popularity is due in large part to its ease of implementation and general applicability to a variety of situations. Based on the notion of exchange sorting, it adds the additional feature of "partitions", which will now be reviewed.
With quicksort, a value or "comparand" is selected for partitioning the array into two parts. Those elements having a value greater than or equal to the partition value are stored on one side, and those having a value less than the partition value are stored on the other side. The process is repeated for each remaining part until the array is sorted; as such, the process is essentially recursive. The quicksort "divide-and-conquer" method of sorting may be implemented by the following recursive function:
______________________________________
/* Sorts a[l..r]; partition() is not shown here. */
void quicksort(int a[], int l, int r)
{
    int i;
    if (r > l) {
        i = partition(l, r);
        quicksort(a, l, i - 1);
        quicksort(a, i + 1, r);
    }
}
______________________________________
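The listing above relies on a partition routine which is not shown. One possible form is sketched below, under stated assumptions: it is a Lomuto-style partition which takes the rightmost element as the comparand, it receives the array as an explicit argument (unlike the call in the listing above), and it indexes the array from 0. Other partitioning schemes would serve equally well.

```c
/* Lomuto-style partition of a[l..r] (inclusive bounds), an assumption of
   this sketch. Members less than the comparand end up to its left, members
   greater than or equal to it end up to its right. Returns the final index
   of the comparand. */
int partition(int a[], int l, int r)
{
    int pivot = a[r];               /* comparand: the rightmost member */
    int i = l;
    int j, t;

    for (j = l; j < r; j++) {
        if (a[j] < pivot) {
            t = a[i]; a[i] = a[j]; a[j] = t;
            i++;
        }
    }
    t = a[i]; a[i] = a[r]; a[r] = t;    /* place the comparand between the parts */
    return i;
}

/* Recursive quicksort over a[l..r], following the structure shown above. */
void quicksort(int a[], int l, int r)
{
    if (r > l) {
        int i = partition(a, l, r);
        quicksort(a, l, i - 1);
        quicksort(a, i + 1, r);
    }
}
```

An array of n members would be sorted by the call quicksort(a, 0, n - 1).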
Quicksort is not without its disadvantages, however. Being recursive in nature, the technique usually requires that a significant amount of stack-based memory be reserved. Moreover, the technique, which is particularly sensitive to long common substrings, exhibits nonlinear behavior. This nonlinearity may be summarized as follows: c_1 * n * log_2(n). The constant c_1 is approximately proportional to the average compare length, that is, the average point where two records differ. In the case of many common substrings in the data, or just many duplicates, the average compare length is fairly large, thus affecting the total sort time accordingly. In particular, every character in every record in the first "average compare length number of" characters is used an average of log_2(n) times.
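The notion of compare length may be made concrete with a comparison routine that tallies the characters it inspects. The following is an illustrative sketch only; the names `chars_examined` and `counted_compare` are hypothetical, and keys are assumed to be NUL-terminated character strings.

```c
#include <stddef.h>

/* Hypothetical global tally of character positions inspected so far. */
unsigned long chars_examined = 0;

/* Compares two NUL-terminated keys byte by byte, counting every position
   inspected. Returns a negative, zero, or positive value in the manner
   of strcmp. */
int counted_compare(const char *x, const char *y)
{
    size_t k = 0;

    for (;;) {
        unsigned char cx = (unsigned char)x[k];
        unsigned char cy = (unsigned char)y[k];
        chars_examined++;
        if (cx != cy)
            return cx < cy ? -1 : 1;    /* first point where the keys differ */
        if (cx == '\0')
            return 0;                   /* keys are identical */
        k++;
    }
}
```

Comparing "apple" with "banana" inspects a single position, whereas comparing "customer_0001" with "customer_0002" inspects thirteen, since the keys differ only in their final character. Keys sharing long common substrings therefore raise the cost of every one of the roughly n * log_2(n) comparisons.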
The basic theory and operation of these and other sorting and search techniques are well documented in the technical and trade literature. A general introduction to the topic may be found in Sedgewick, R., Algorithms in C, Addison-Wesley, 1990. A more detailed analysis of the topic may be found in Knuth, D., Sorting and Searching, The Art of Computer Programming: Vol. 3, Addison-Wesley, 1973.
More advanced techniques are described in the patent literature. For instance, Sorting Method and Apparatus, U.S. Pat. No. 4,809,158, describes a method for sorting records where the records are placed in various bins depending on the character on which the record is presently being sorted. The bins, in turn, are linked together. The records from a first bin are then sorted again on the next letter of the record, and so on until the records are fully ordered and placed in a "Done" area. Next, records from the second bin are put into final order and placed into the "Done" area, being appended to the already sorted records from the first bin. The process continues taking records from successive bins, ordering the subgroup, and appending it to the "Done" group, until the entire collection is sorted. Despite its advantages over the quicksort technique, the described method has a pronounced limitation. In particular, the linking together of records incurs a substantial cost in terms of memory requirements. For instance, sorting one million records would require an extra four megabytes of memory, if linked.
The disclosure of each of the foregoing references is hereby incorporated by reference.
The sorting methodology employed in many commercial products today is based on the above-described quicksort approach. A particular problem with quicksort (and its variants), however, is the number of comparisons required to properly sort data, especially as the number of data records in a set grows in size. In large scale databases, such as those employing millions of records, a correspondingly large number of comparisons is highly problematic. This basic problem is exacerbated by the tendency to support data fields storing longer key values (e.g., character or text strings of 256 bytes or larger), as well as the problems attendant to comparisons across different types of data. In this regard, even an individual comparison (i.e., of one data record to another) often entails a high number of comparisons of the individual units (e.g., characters, bytes, and the like) which comprise the data values being compared. Still further, a comparison might entail multiple keys, thus requiring a system to involve multiple columns in the comparison operation.
Another aspect of quicksort-based routines is their recursive nature. Since the quicksort methodology is designed to recursively invoke itself, a system must have sufficient stack space available for supporting such a recursive operation. For powerful (and expensive) workstations, such a requirement does not pose a particular problem. More modest systems, such as low-end workstations or personal computers, on the other hand, might have insufficient memory resources to support a recursive operation once the underlying data set reaches a large size.
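The stack requirement may be observed by recording how deeply a recursive quicksort nests. The following sketch is illustrative only: the names `max_depth`, `partition_last`, and `quicksort_depth` are hypothetical, the partition step is a Lomuto-style partition on the rightmost element (an assumption, not a scheme taken from the text above), and arrays are indexed from 0.

```c
/* Hypothetical instrumentation: current and deepest nesting level seen. */
static int depth = 0;
int max_depth = 0;

/* Lomuto-style partition of a[l..r] on the rightmost member (an assumption). */
static int partition_last(int a[], int l, int r)
{
    int pivot = a[r];
    int i = l;
    int j, t;

    for (j = l; j < r; j++) {
        if (a[j] < pivot) {
            t = a[i]; a[i] = a[j]; a[j] = t;
            i++;
        }
    }
    t = a[i]; a[i] = a[r]; a[r] = t;
    return i;
}

/* Recursive quicksort over a[l..r] which records its deepest nesting level. */
void quicksort_depth(int a[], int l, int r)
{
    depth++;                        /* one more invocation is active */
    if (depth > max_depth)
        max_depth = depth;
    if (r > l) {
        int i = partition_last(a, l, r);
        quicksort_depth(a, l, i - 1);
        quicksort_depth(a, i + 1, r);
    }
    depth--;                        /* this invocation is finished */
}
```

With this particular partition choice, seven members which are already in order nest seven invocations deep, whereas the arrangement 1, 3, 2, 5, 7, 6, 4 nests only four deep. In unfavorable cases the nesting thus grows in proportion to the number of records, which is the source of the stack-space concern noted above.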