The sorting of records located in a computer's various memory locations, and/or input to a computer, can require the appropriation of a large percentage of the computer's resources during the sorting operation for a prolonged period of time. Similarly, the merging of files requires significant computer resources for a long period of time. For these and other reasons, there is a need to develop sorting and merging methods which minimize both the physical resources which must be dedicated to the operations and the elapsed time.
The sorting process, stated broadly, involves four distinct stages or phases of operation: the initialization phase; the reading and string sorting phase including the alternating stages of reading the records into the computer and sorting the records into strings; the merge and output phase involving merging the strings and writing the records to the designated output file in sorted order; and the clean-up phase. To facilitate an understanding of both the background technology and the present invention, several of the uniquely-defined terms of the art are outlined below:
string sort: a process or subprocess the primary function of which is to take a set of data records and a specified key and rearrange the records to produce a sorted string;
key: a field, or collection of fields, which may reside in each of a collection of computer data records or be appended thereto in accordance with the user specification, the value of which determines the desired order of the records (an example being: the first character of a data record, to be sorted in alphabetical order);
section: a portion of the input file, whose order with respect to the key field is unknown;
string: a collection of records sorted in the order of the key;
merge: a process or subprocess whose primary function is to take multiple strings which were previously ordered on the same key and combine them into a single string ordered on that key;
internal sort: a sort in which all of the records to be sorted are contained within the computer's internal memory at one time;
external sort: a sort in which the space required for all the records to be sorted exceeds the available computer memory space; whereby sections of the input data must be read into the computer, sorted into strings and the strings stored in a temporary file, later to be merged with the other sorted strings;
pass: the merging of some number of sorted strings, which may be less than or equal to the total number of sorted strings; and
multiple pass merge: a merge phase that requires more than one pass in order to completely merge all data, the output of intermediate passes being sent to a temporary file.
A text which details the techniques of computer sorting is The Art of Computer Programming, Vol. 3, subtitled "Sorting and Searching", by Donald E. Knuth, Addison-Wesley Publishing Co., Inc. (Menlo Park, Calif., 1973), the teachings of which are hereby incorporated by reference.
The sorting process steps, as mentioned above, commence with an initialization phase. Initialization involves the planning or selecting of an I/O (input/output) strategy and a sort strategy based upon two kinds of information: the user-supplied information, namely the names of the input and sorted output files, the number and size of input records to be sorted, and the fields in the input data by which the records will be sorted (i.e. the key); and the "computer-supplied" information, namely the amount of memory space available for records and strings of records.
The next phase of the sort operation is the reading and string sorting phase. The records are read into the computer's memory, unless already resident therein, and are ordered, or sorted, according to their key value and thereby "assembled" into strings. In this way, the string sorting phase generates one or more strings. If all of the records cannot be held in the computer's internal memory at one time, then some of the generated strings must be stored in a temporary file while successive sections of the input file are processed.
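The reading and string sorting phase described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the method of any particular sort product: the function name `generate_strings` and the `capacity` parameter (standing in for the number of records that fit in internal memory) are hypothetical, and the strings are kept in an in-memory list where a real external sort would write them to a temporary file.

```python
from itertools import islice


def generate_strings(records, key, capacity):
    """Read sections of at most `capacity` records and sort each into a string.

    `capacity` is a hypothetical stand-in for the memory available for records.
    The returned list stands in for the temporary file of sorted strings.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    source = iter(records)
    strings = []
    while True:
        # Read the next section of the input; its order on the key is unknown.
        section = list(islice(source, capacity))
        if not section:
            break
        # Sort the section on the key, producing one string.
        strings.append(sorted(section, key=key))
    return strings


records = ["MASON", "AARON", "HOYLE", "BROWN", "ZELDA"]
strings = generate_strings(records, key=lambda record: record, capacity=2)
print(strings)  # [['AARON', 'MASON'], ['BROWN', 'HOYLE'], ['ZELDA']]
```

With a capacity at least as large as the input, the sketch yields a single string, which corresponds to the internal sort case in which no merge is needed.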
When the reading and string sorting phase produces more than one string, the merge phase is performed. Some number of data records from each of the sorted strings is read (if external), merged and sent to the output phase. This process is repeated until all records have been merged. The output phase sends the records to the output destination, as specified by the user, which destination may be a file, user-program, or peripheral.
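As a rough illustration of a single-pass merge, the following Python sketch combines several sorted strings into one ordered stream using the standard library's `heapq.merge`. The function name `merge_strings` is hypothetical, and the output destination is simply a list here; a file, user-program, or peripheral would consume the same stream.

```python
import heapq


def merge_strings(strings, key=None):
    """Yield the records of several sorted strings in one sorted order.

    Each input must already be ordered on the same key. heapq.merge reads
    lazily from each input, so only a few records per string are held at once.
    """
    yield from heapq.merge(*strings, key=key)


strings = [
    ["AARON", "HOYLE"],
    ["BROWN", "JACOB"],
    ["LOWRY", "MASON"],
]
# The list below stands in for the output destination.
output = list(merge_strings(strings))
print(output)  # ['AARON', 'BROWN', 'HOYLE', 'JACOB', 'LOWRY', 'MASON']
```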
When there are more strings input to the merge phase than can be merged together at one time, it is necessary to perform several passes in the merge phase in order to merge all of the sorted strings. In this case, the output of each intermediate pass may be sent to a temporary file, later to be merged with other strings. In the case of multiple passes, the final pass is as described above, with the output of the merge of intermediate strings being sent to the output destination. It is beneficial to reduce the number of strings entering the merge phase, because a larger number of strings can require a multiple pass merge phase, and because a larger number of strings necessitates a larger number of compare operations per record in the merge phase. [See Knuth, supra, Chapter 5.3].
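The effect of the number of strings on the number of passes can be seen in a small Python sketch. It assumes a hypothetical limit `max_way` on how many strings can be merged at one time, groups the strings in the order given, and keeps intermediate results in memory where a real merge would write them to a temporary file. This simple grouping is only one possible merge pattern and is not claimed to be optimal.

```python
import heapq


def multipass_merge(strings, max_way):
    """Merge sorted strings, at most `max_way` at a time, counting the passes.

    Returns (merged_records, number_of_passes). `max_way` is a hypothetical
    limit imposed by the memory available for merge buffers.
    """
    if max_way < 2:
        raise ValueError("max_way must be at least 2")
    current = [list(s) for s in strings]
    passes = 0
    while len(current) > 1:
        # One pass: merge each group of up to max_way strings into one string.
        current = [
            list(heapq.merge(*current[i:i + max_way]))
            for i in range(0, len(current), max_way)
        ]
        passes += 1
    return (current[0] if current else []), passes


six_strings = [[1, 7], [2, 8], [3, 9], [4, 10], [5, 11], [6, 12]]
merged, passes_two_way = multipass_merge(six_strings, max_way=2)
_, passes_six_way = multipass_merge(six_strings, max_way=6)
print(passes_two_way, passes_six_way)  # 3 1
```

In this sketch, six strings merged two at a time need three passes (six strings become three, then two, then one), while a merge that can accept all six at once needs only one.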
One technique that has been used in the art to reduce the number of strings to be merged is to concatenate strings for which the highest key value of one string is lower than the lowest key value of another. For example, as conceptually illustrated in FIG. 3A, consider six strings input to the merge: AARON through HOYLE, BROWN through JACOB, LOWRY through MASON, MORSE through OCEAN, MYERS through SMITH and ROGER through ZELDA. Line 40 indicates that the six strings will be treated as two strings. Specifically, AARON through HOYLE, LOWRY through MASON, MORSE through OCEAN and ROGER through ZELDA can be concatenated into a single string. Similarly, BROWN through JACOB and MYERS through SMITH can be concatenated. Thus, only two strings remain to be merged rather than six. The information as to the highest and lowest key values for each string may have been saved in a list during the sort phase, as illustrated at 16 in FIG. 1, or may be gathered at the start of the merge phase by reading in the first and last records of each string.
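The concatenation technique can be sketched in Python using only the lowest and highest key of each string. The greedy first-fit grouping shown here is an assumption chosen for illustration, since the text does not specify how the groups are selected, and the function name `chain_strings` is hypothetical. Each string is represented by a `(lowest_key, highest_key)` pair.

```python
def chain_strings(key_ranges):
    """Group strings into chains that can be concatenated without merging.

    `key_ranges` holds one (lowest_key, highest_key) pair per sorted string.
    A string joins a chain only if the chain's current highest key is strictly
    lower than the string's lowest key. The greedy first-fit choice of chain
    is an illustrative assumption.
    """
    chains = []
    # Consider the strings in order of their lowest key.
    for low, high in sorted(key_ranges):
        for chain in chains:
            if chain[-1][1] < low:
                chain.append((low, high))
                break
        else:
            # No existing chain ends below this string, so start a new one.
            chains.append([(low, high)])
    return chains


key_ranges = [
    ("AARON", "HOYLE"),
    ("BROWN", "JACOB"),
    ("LOWRY", "MASON"),
    ("MORSE", "OCEAN"),
    ("MYERS", "SMITH"),
    ("ROGER", "ZELDA"),
]
chains = chain_strings(key_ranges)
for chain in chains:
    print(chain)
# [('AARON', 'HOYLE'), ('LOWRY', 'MASON'), ('MORSE', 'OCEAN'), ('ROGER', 'ZELDA')]
# [('BROWN', 'JACOB'), ('MYERS', 'SMITH')]
```

Applied to the six strings of the example, the sketch produces the same two concatenated strings described above, so the merge receives two inputs rather than six.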
The present invention addresses the merge phase of the overall sort process and provides a superior method for merging the sorted strings, the method being equally applicable to the merging of files in a computer merging operation. For the sake of clarity of description, the file merging example will not be referred to continually; rather, in the use of the term "strings", the analogous "files" will be understood to be included. Where applicable, the generic term "sets" will be used to include both strings of sorted records and files of sorted records. The invention provides a method which can beneficially reduce the number of compares to which any single record is subjected, reduce the number of merge passes required to completely merge the records, and decrease the number of I/O operations necessary to complete the sort and/or merge.
It is therefore an objective of the present invention to provide a method of conducting a sorting operation using fewer of a computer's resources and less time than prior art sorting techniques.
It is another objective of the invention to provide a new method of conducting the merge phase of a sort process.
It is still another objective of the invention to provide a new method of conducting a computer file merge operation.
Another objective of the invention is to provide a merge process with a reduced number of passes in a multiple pass merge.
A further objective is to increase the buffer size available during a merge operation or merge phase.
It is yet another objective to reduce the number of strings that are merged to an intermediate storage location, thereby reducing the reading and writing of data during a multiple pass merge phase.
Still another objective of the invention is to consider only a subset of the data at one time during the merge phase, thereby reducing the number of comparisons during the merge phase.