In the current age of information technology, massive volumes of data are generated, stored, and processed to meet innumerable needs. Over the years, much effort has been devoted to developing better data storage and sorting technologies, in order to handle the expansion that has occurred in both the volume and the use of data.
One aspect of efforts to manage the ever-expanding volume of and reliance on data involves the evolution of database technology and, specifically, relational database technology. In relational databases, rows are composed of multiple columns (c0, c1, c2, . . . ). For example, FIG. 1 is a block diagram that illustrates a database table 100, containing rows (data items) that have multiple columns (“SURNAME”, “FIRST NAME”, “SOCIAL SECURITY NO”). During processing of data from a database, the data items extracted from rows are frequently ordered by more than one associated column or field. The fields by which data items are sorted are referred to as sort keys. For example, with table 100 (named “emp”), a query on the table may be as follows:
    select * from emp order by surname, first_name, social_security_no
In this example, the surname, first_name, and social_security_no fields are all sort keys.
Requests for sorting of data often include various options for one or more of the sort keys, such as (1) whether data items having a sort key with the value “null” are ordered first or last relative to that sort key; and (2) whether data items are ordered in ascending or descending order relative to sort key values. For example:
    select * from emp order by surname nulls last, first_name, social_security_no descending
This example specifies that nulls are ordered last for the surname key and that the data items are to be sorted in descending order for the social_security_no key.
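The effect of these two options can be illustrated with a minimal Python sketch. It is not how a database engine implements the query. It relies on the stability of Python's sort and applies the keys from least to most significant. The sample rows, the three-digit stand-ins for social security numbers, and the function name `order_rows` are hypothetical, and only the surname is assumed to be nullable.

```python
# Hypothetical sample rows: (surname, first_name, social_security_no).
rows = [
    ("Smith", "Thomas", "111"),
    (None, "Alexander", "222"),
    ("Smith", "Thomas", "333"),
    ("Jefferson", "Alexander", "444"),
]


def order_rows(rows):
    """Order by surname (nulls last), first_name, social_security_no descending."""
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields the multi-key ordering.
    out = sorted(rows, key=lambda r: r[2], reverse=True)  # descending key
    out.sort(key=lambda r: r[1])                          # ascending key
    # A null surname maps to (True, ""), which orders after every (False, name).
    out.sort(key=lambda r: (r[0] is None, r[0] or ""))
    return out


ordered = order_rows(rows)
```

With the sample rows, the two “Smith” rows tie on surname and first name, so the descending key places “333” ahead of “111”, and the row with the null surname comes last.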
Most columns are byte orderable. A column is byte orderable when the values in the column can be represented as an array of bytes, and the order between any two values in the column can be determined by comparing bytes of the arrays that represent the values, at the same index into the arrays, until an index is found at which the bytes differ.
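As an illustration only, the comparison just described can be sketched in Python as follows. The function name `compare_bytes` is hypothetical, and the rule that a shorter array orders first when it is a prefix of a longer one is an assumption that the text above does not address.

```python
def compare_bytes(a: bytes, b: bytes) -> int:
    """Return -1, 0, or 1 by comparing bytes at the same index until they differ."""
    for x, y in zip(a, b):
        if x != y:
            # First index at which the arrays differ decides the order.
            return -1 if x < y else 1
    # Every shared index matched. Assumption: the shorter array orders first.
    return (len(a) > len(b)) - (len(a) < len(b))
```

For example, `compare_bytes(b"Smith", b"Smyth")` returns -1, because the arrays first differ at index 2, where the byte for “i” is smaller than the byte for “y”.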
In operations in which rows are sorted by more than one sort key (e.g., more than one column), where each sort key is separately byte orderable, there are two general approaches for comparing values from the sort keys. The first approach is to compare values for one sort key at a time, from each row, until the rows do not have the same value for a given key. The second approach is to concatenate the bytes that represent respective values for the sort keys for each row, where the bytes for each key are ordered within the array based on the parameters of the sort request (e.g., a database query), thereby creating a contiguous array of bytes for each row. Then, one byte-wise comparison is performed for the concatenated byte arrays for two rows being compared. The second approach (referred to hereafter as the “concatenated key” approach) is typically more efficient than the first approach, and enables other optimizations.
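The two approaches can be contrasted with a short Python sketch, offered as an illustration under simplifying assumptions: every key is non-null, is sorted ascending, and is padded to a fixed width. The widths, the sample values, and all function names are hypothetical.

```python
from functools import cmp_to_key

WIDTHS = (10, 10, 9)  # hypothetical fixed widths for the three sort keys


def pad(value: str, width: int) -> bytes:
    """Encode a value and pad it with zero bytes to a fixed width."""
    return value.encode("ascii").ljust(width, b"\x00")


def encode_row(row):
    return tuple(pad(v, w) for v, w in zip(row, WIDTHS))


def compare_key_by_key(r1, r2) -> int:
    """First approach: compare one sort key at a time until two values differ."""
    for k1, k2 in zip(encode_row(r1), encode_row(r2)):
        if k1 != k2:
            return -1 if k1 < k2 else 1
    return 0


def concatenated_key(row) -> bytes:
    """Second approach: one contiguous byte array per row."""
    return b"".join(encode_row(row))


rows = [
    ("Smith", "Thomas", "123456789"),
    ("Jefferson", "Alexander", "987654321"),
    ("Smith", "Adam", "555555555"),
]
by_first = sorted(rows, key=cmp_to_key(compare_key_by_key))
by_second = sorted(rows, key=concatenated_key)
```

Under these assumptions both approaches produce the same order. The concatenated form needs only a single byte-wise comparison per pair of rows.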
For example, with reference to FIG. 1, the bytes that represent the values in the “surname” field (“field 1”) for the data items associated with “Thomas Smith” and “Alexander Jefferson” can be compared when sorting and ordering the data items. For data items being compared, arrays of bytes representing values for one or more fields are compared until the bytes do not match, at which point a determination is made as to which data item orders higher with respect to the other data item, for that particular key field. Such a determination is based on parameters governing the sort, such as “nulls order first” versus “nulls order last,” and ascending versus descending order.
Likewise, the bytes that represent the values in the “first_name” field (“field 2”) and the “social_security_no” field (“field 3”) in one data item can be compared with the bytes that represent values for those fields in another data item to determine the relative order of data items based on those key fields. Using the concatenated key approach, arrays of bytes, each of which represents values in all of the key fields of a given data item, can be compared to order all the data items involved in the sort operation in response to the request.
When the sorting columns each have a fixed width, the concatenated key approach works well. Unfortunately, when a sorting column is variable width or may be null, the concatenated key approach fails. For example, with reference to FIG. 1, the array of bytes associated with “Alexander Jefferson” contains a null byte (field 1), which cannot be compared to the corresponding byte (field 1) for “Thomas Smith,” which contains a non-null value. For another example, FIG. 1 depicts that field 2 is a variable-width field, where the number of bytes that represent the values in the “first_name” field varies from data item to data item. Thus, because it requires two bytes (for example only) to encode “Alexander” but only one byte to encode “Thomas”, a comparison of the byte(s) to order the data items based on field 2 yields incorrect results. This is because the later bytes of the longer value will be compared to the first bytes of the value for the next field. In the example, ignoring for now the fact that field 1 contains a null, the second byte of “Alexander” will be compared to the first byte of “Jefferson”.
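The variable-width failure can be reproduced with a small Python sketch. The two-key rows and the function name `naive_concat` are hypothetical and are chosen only to make the misalignment visible.

```python
def naive_concat(row) -> bytes:
    """Concatenate the encoded key values with no padding or delimiters."""
    return b"".join(v.encode("ascii") for v in row)


row_a = ("ab", "z")
row_b = ("abc", "a")

# Key by key, row_a orders first because "ab" is a prefix of "abc".
key_by_key_a_first = row_a < row_b

# Concatenated, the arrays are b"abz" and b"abca". At index 2 the byte "z"
# from row_a's second key is compared with the byte "c" from row_b's first key.
concat_a_first = naive_concat(row_a) < naive_concat(row_b)
```

Here `key_by_key_a_first` is True while `concat_a_first` is False: the unpadded concatenation orders the two rows incorrectly.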
One solution to the variable-width column challenge is described in “An Encoding Method for Multifield Sorting and Indexing,” by Blasgen, Casey and Eswaran, CACM, November 1977, Volume 20, Number 11. This article describes a method for encoding variable-width columns by using extra “marker” bytes. That is, for every (N−1) bytes of a variable-width column, a marker byte is written that indicates whether there is more data from the variable-width column. For example, assume N=4, the variable-width column contains the value “123” for one data item and “1234” for another data item, and the value of the marker byte is “M” when there is more data from the column and “D” when there is no more data from the column. The value in the column for the one data item is encoded with four bytes as “123D” and the value in the column for the other data item is encoded with five bytes as “123M4”. However, this approach depends on an accurate estimate of the average width of the encoded column. If the estimate is poor, or if the width of the column varies significantly, then excessive space may be used to encode the column.
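The following Python sketch reproduces only the two encodings given in the example above and is not a faithful implementation of the cited article. The function name `encode_with_markers` is hypothetical. How a trailing group shorter than N−1 bytes is padded or terminated is not stated in the text, so the sketch leaves such a group as-is, which is an assumption.

```python
def encode_with_markers(value: bytes, n: int = 4) -> bytes:
    """Write a marker byte after every full group of (n - 1) data bytes."""
    group = n - 1
    out = bytearray()
    for start in range(0, len(value), group):
        chunk = value[start:start + group]
        out += chunk
        if len(chunk) == group:
            # "M" means more data follows; "D" means the value is done.
            more = start + group < len(value)
            out += b"M" if more else b"D"
        # Assumption: a trailing partial group is emitted without padding
        # or a closing marker, matching the five-byte example in the text.
    return bytes(out)


short_value = encode_with_markers(b"123")
long_value = encode_with_markers(b"1234")
```

Because the byte for “D” is smaller than the byte for “M”, the encoded “123” compares lower than the encoded “1234” in a byte-wise comparison, which matches the order of the unencoded values.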
Based on the foregoing, there is room for improvement in techniques for sorting information efficiently. Specifically, there is room for improvement in techniques for encoding data to be sorted according to the various ordering options described above.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.