The invention relates to encoding data records so that the encoded records can be sorted correctly.
Alphanumeric records of a data base may contain variable length fields and null fields. Variable length fields cannot simply be catenated prior to collating, because the characters of a long field may be compared against the characters of the field that immediately follows a short field, thereby destroying the correct sort sequence. For example, consider the names "Franz, Fred" and "Franzen, Fred". If simple catenation is used, the strings "FRANZFRED" and "FRANZENFRED" result, but "FRANZENFRED", which contains the longer of the last names, collates before "FRANZFRED" because its sixth character, "E", is compared against the "F" of "FRED". This is clearly incorrect.
Also, at least one data base manipulation language, Structured Query Language (SQL), defines the concept of a null field. A null field should sort before any field which has an actual value, even a field consisting of all zeros. Since no sequence of bytes is less than all zeros, an encoding method must be provided to represent nulls. Furthermore, this encoding scheme should cause the names "Franz, Fred" and "Franzen, Fred" to collate correctly. An encoding method is therefore desirable which provides a single string representing the variable length and null data fields of a data base record while preserving the correct sort sequence among multiple records. The encoding must also be reversible, so that the original fields can be recovered from the encoded string.
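The catenation problem described above can be reproduced in a few lines. The following is a minimal illustrative sketch, not part of any disclosed method; the helper name `naive_key` and the use of Python tuples to stand for field-by-field comparison are assumptions made only for this illustration.

```python
# Illustrative sketch: why simple catenation breaks the sort order.
# `naive_key` is a hypothetical helper, not taken from the source text.

def naive_key(last: str, first: str) -> str:
    """Build a sort key by simply catenating the two fields."""
    return (last + first).upper()

records = [("Franz", "Fred"), ("Franzen", "Fred")]

# Field-by-field comparison (tuple ordering) gives the intended sequence.
by_fields = sorted(records)

# Catenated keys compare the "E" of FRANZEN against the "F" of FRED.
by_concat = sorted(records, key=lambda r: naive_key(*r))

print(by_fields)
print(by_concat)
```

With field-by-field comparison "Franz" precedes "Franzen", whereas the catenated keys place "FRANZENFRED" first, which is the incorrect sequence noted above.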
One problem associated with current presort encoding techniques, such as that shown in IBM Technical Disclosure Bulletin Vol. 19 No. 9, Feb. 1977, pages 3582-3583, Multifield Encoding For Unrestricted Strings, is that the resulting encoded strings can be very long. In this method, an integer value parameter, N, is chosen. Each field to be encoded is padded with binary zeros so that its length is a multiple of N. Within a field, each N bytes are separated from the next by a single `trigger` character, which is `FF`X in this case. If the last N-byte substring of the field (i.e. the substring in which any padding occurred) is being processed, then instead of `FF`X a byte which indicates the number of non-padded characters in this substring is appended. This byte indicates the end of the field. To encode the two fields `ABCDEF`,`XYZ` assuming N=4 and using `//` to indicate catenation (`//` does not appear in the actual data), we get "`C1C2C3C4`//`FF`//`C5C6`//`0000`//`02`//`E7E8E9`//`00`//`03`" in the EBCDIC hexadecimal or base 16 notation of expressing alphanumeric data. As can be seen from the description and from this example, the encoded string carries considerable excess baggage: nine bytes of data become fifteen bytes of encoded string. In an environment where data is paged in and out of main storage, a sorting operation can take a long time because the data is spread out over more pages, which must be retrieved from a relatively slow storage device. This encoding technique also does not handle null fields.
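The N-byte chunking scheme described above can be sketched as follows. This is a hedged reading of the prose description, not the implementation from the cited bulletin. The function names `encode_field`, `encode_record` and `decode_record` are hypothetical, the treatment of an empty field as one fully padded chunk with a count of zero is an assumption, and Python's `cp037` codec is used as the EBCDIC code page (in which X, Y and Z encode as E7, E8 and E9).

```python
# Sketch of the prior-art N-byte chunk encoding, as read from the description.
# All names here are hypothetical; empty-field handling is an assumption.

TRIGGER = 0xFF  # marks "another N-byte chunk of this field follows"

def encode_field(field: bytes, n: int) -> bytes:
    """Encode one field as N-byte chunks, each followed by a marker byte."""
    if not 1 <= n < TRIGGER:
        raise ValueError("n must be between 1 and 254")
    # Assumption: an empty field is treated as a single empty chunk.
    chunks = [field[i:i + n] for i in range(0, len(field), n)] or [b""]
    out = bytearray()
    for chunk in chunks[:-1]:
        out += chunk
        out.append(TRIGGER)
    last = chunks[-1]
    out += last.ljust(n, b"\x00")   # pad the final chunk with binary zeros
    out.append(len(last))           # count of non-padded bytes ends the field
    return bytes(out)

def encode_record(fields, n: int) -> bytes:
    """Catenate the encodings of all fields of a record."""
    return b"".join(encode_field(f, n) for f in fields)

def decode_record(data: bytes, n: int):
    """Recover the original fields from an encoded record."""
    fields, current = [], bytearray()
    for i in range(0, len(data), n + 1):
        chunk, marker = data[i:i + n], data[i + n]
        if marker == TRIGGER:
            current += chunk
        else:
            current += chunk[:marker]
            fields.append(bytes(current))
            current = bytearray()
    return fields

fields = ["ABCDEF".encode("cp037"), "XYZ".encode("cp037")]
encoded = encode_record(fields, 4)
print(encoded.hex().upper())  # C1C2C3C4FFC5C6000002E7E8E90003
```

In this sketch every N data bytes cost one extra marker byte, and the final chunk of each field is padded to full width, so short fields grow the most. Note that nothing in the scheme distinguishes a null field from a field with a value.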