Numerous methods are known for sorting information records. Typically, in such methods, one or more fields of each record of a sort file are designated as sort fields and the input records are ordered by an appropriate algorithm according to the data in the sort field of the records.
Most sorting methods require the data stored in a sort field to be of fixed length. Padding characters are prefixed to the stored data as necessary to achieve this property. For example, a number in a numeric sort field might be padded with leading zeros if the representation of the number requires less than the full length of the sort field. Alphanumeric information might, on the other hand, be padded with spaces.
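As an illustration of why fixed-length padding matters for numeric sort fields, the following minimal Python sketch compares character-by-character ordering before and after padding with leading zeros. The helper name `pad_numeric` and the field width of 4 are assumptions chosen only for this example.

```python
def pad_numeric(value: str, width: int) -> str:
    """Pad a numeric string with leading zeros to a fixed field width (hypothetical helper)."""
    if len(value) > width:
        raise ValueError("value does not fit in the sort field")
    return value.zfill(width)


raw = ["9", "10", "2"]

# Character-by-character comparison of unpadded data does not follow numeric order.
unpadded_order = sorted(raw)

# After padding to a fixed length, character order and numeric order agree.
padded_order = sorted(pad_numeric(v, 4) for v in raw)
```

Here `unpadded_order` places "10" ahead of "2" and "9", whereas `padded_order` lists the values in numeric order.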
Techniques are also known for ordering (sorting) records according to multiple, variable-length sort fields. Most such techniques are complex because of the need to account for the variable length of the sort strings in each record. One relatively simple method is described in an article entitled "An Encoding Method for Multifield Sorting and Indexing," Communications of the ACM, November 1977, Volume 20, Number 11, by M. W. Blasgen et al. In this method, data in the individual sort fields are encoded. The encodings for each record are concatenated to form a sort key, and the records are sorted by a character-by-character comparison of their sort keys. The algorithm for forming the sort keys is as follows. For each record, each sort field data string is partitioned into substrings of fixed character length L. A continuation character formed with binary 1's is inserted into the string after each partition that is followed by a further partition. Fill characters "0" are appended at the end of a string to lengthen the last partition to L characters, if necessary. In this event, an additional numeric character is appended to the string specifying the number of real characters of the string that are in the last partition. The remaining sort fields of a record are encoded in a similar manner, and the encodings are concatenated to form the sort key for the record. The sort keys are then compared on a character-by-character basis to perform the sort.
As an example of the Blasgen algorithm, assume that a record contains two sort fields having the respective strings "abcdef" and "xyz" and that the partition length L is 4. The concatenated sort key would then be "abcdCef002xyz03", where "abcd" forms the first partition, "C" is a continuation character, "ef" are the remaining characters of the first sort string, "00" are padding characters, "2" is the number of real characters ("ef") in the last partition, "xyz" is the first (and only) real character string in the second sort field, "0" is a padding character and "3" is the number of real characters in the partition for the second sort string.
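The encoding in the foregoing example can be sketched in Python as follows. This is only an illustrative sketch, not the published algorithm itself. The function names `encode_field` and `make_sort_key` are hypothetical. The printable "C" and "0" stand in for the continuation and fill characters as in the example. The handling of an empty string, and of a string whose length is an exact multiple of L (where the count L is still appended), are assumptions not stated above.

```python
def encode_field(value: str, part_len: int = 4, cont: str = "C", pad: str = "0") -> str:
    """Encode one sort field string in the style of the example above (hypothetical helper)."""
    if not 1 <= part_len <= 9:
        # A single decimal digit is used for the count, as in the example.
        raise ValueError("part_len must be between 1 and 9 for a one-digit count")
    # Partition the string into substrings of length part_len.
    # Assumption: an empty string yields one empty partition.
    parts = [value[i:i + part_len] for i in range(0, len(value), part_len)] or [""]
    encoded = []
    # Every partition except the last is full and is followed by the continuation character.
    for part in parts[:-1]:
        encoded.append(part + cont)
    # The last partition is filled to part_len and followed by its count of real characters.
    last = parts[-1]
    encoded.append(last.ljust(part_len, pad) + str(len(last)))
    return "".join(encoded)


def make_sort_key(fields, part_len: int = 4) -> str:
    """Concatenate the encodings of all sort fields of one record."""
    return "".join(encode_field(field, part_len) for field in fields)


key = make_sort_key(["abcdef", "xyz"], part_len=4)
```

With the two strings of the example and L equal to 4, `key` is "abcdCef002xyz03". Sorting records by such keys then reduces to an ordinary character-by-character string comparison.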
The Blasgen encoding algorithm preserves the lexicographic order of the original sort field data in the sort key. However, this algorithm and other known algorithms have the disadvantage that they are unable to cope with sort fields containing more than one sort object. For purposes of this discussion, a sort object is one occurrence of a coherent data entity in a sort field. For example, a sort field entitled TELEPHONE NUMBER might have none, one, or more than one telephone number in the field. Each number in the field is considered to be a sort object.
As increasing amounts of information are stored and manipulated by data processing systems, the need arises for improved sorting techniques. For example, there is an increasing need for a sorting method which simply and economically allows sorting of information records containing multiple, variable-length sort fields and in which each sort field is allowed to contain a variable number of sort objects.
Up to the present time, no method is known that allows this sophistication and flexibility.