Techniques for efficiently storing and searching large databases are increasingly important in a variety of communications and electronic data storage applications. Extremely large databases are logically structured to minimize the search times expected for typical searches of the database. For example, data may be stored in records in one portion of the database, but, to avoid searching each record in order to find one in particular, a key may be provided which points to the physical location of each record, or to the location of a particular type of data within that record. These keys are arranged in ordered indices using configurations, such as B-trees, which may be efficiently searched. When a particular piece of data in a particular record is sought, the key index is searched to identify the desired record, and then the record is searched to find the desired piece of data. Because this form of search is so much more efficient than a linear search through the entire database, large databases normally include a number of ordered indices designed to facilitate searching in all of the commonly expected modes.
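The two-step lookup described above (search the ordered key index, then fetch the record it points to) can be illustrated with a minimal sketch. It substitutes a sorted list searched with binary search for a B-tree, which is an assumption made to keep the example short; the record fields and the name `find_record` are hypothetical.

```python
from bisect import bisect_left

# Hypothetical records; the field layout is assumed for illustration only.
records = [
    {"name": "THOMAS", "dept": "SALES"},
    {"name": "MARY", "dept": "LEGAL"},
    {"name": "JOHN", "dept": "R&D"},
]

# Ordered index of (key, position in `records`), sorted by key.
# A sorted list stands in for a B-tree here.
index = sorted((rec["name"], pos) for pos, rec in enumerate(records))
keys = [key for key, _ in index]


def find_record(name):
    """Binary-search the ordered index, then fetch the record it points to."""
    i = bisect_left(keys, name)
    if i < len(keys) and keys[i] == name:
        return records[index[i][1]]
    return None  # key not present in the index
```

The index is only useful while its keys remain in sorted order, which is why any transformation applied to the keys, including compression, has to preserve that order.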
In addition, physical data compression techniques are often employed to reduce hardware costs, storage space requirements, and data transfer times. They are particularly attractive for suppressing runs of repeating characters or patterns, such as the trailing blanks or space-holder characters at the end of a fixed-length data field. Since each data item must be first compressed, and later decompressed, a compression algorithm which is cumbersome to implement will offset many of the efficiencies provided by the reduction in size of the database, and, in the worst cases, may actually reduce system performance. Run-length compression is a popular compression technique because it provides significant data compaction for repeating characters or patterns using very simple compression and decompression algorithms. In conventional run-length compression, sequential repetitions of a target pattern beyond a predetermined threshold are replaced by a special character which signals that compression follows, the compressed pattern, and a count of the number of repetitions. Frequent choices for compression are "spacer" characters, such as asterisks, underscores, dashes and space characters. Thus, a run-length compression of the spaces (represented as x's for easier visualization) in the data sequences:
BENxxxxxxxTHOMASxxxx
BEN-SUNGxxMARYxxxxxx
BENTLEYxxxJOHNxxxxMD
BENTLEYxxxJOHNxxxPHD
would yield (using \ as the special compression character) the much shorter:
BEN\x7THOMAS\x4
BEN-SUNG\x2MARY\x6
BENTLEY\x3JOHN\x4MD
BENTLEY\x3JOHN\x3PHD
However, conventional run-length compression cannot be used with ordered data indices or other ordered data because it does not preserve the natural collating sequence of the data. This issue arises because the characters are represented, and manipulated, in the computer as binary codes, each of which corresponds to a numerical value. Different computers use different character set representations. For example, FIG. 3 shows the 8-bit ASCII character set representation, a standard representation commonly used in the industry. The bit representation is shown on the horizontal and vertical axes, with bits 1-4 on the vertical axis and bits 5-7 on the horizontal axis. The character represented by each combination of bits is shown in the grid, while the three small numbers next to each character in the grid are the bit values in octal, decimal, and hexadecimal respectively. For example, the character A is represented by the bit sequence 1000001, which is <101> in octal, <65> in decimal and <41> in hexadecimal. For convenience, decimal values are used in this description, but this limitation is not required to implement the invention. Sequences of data may be collated in their "natural" sequence, which would be the numerical order of their bit values, or in a specially programmed collating sequence which maps the bit values to a different order. An example of a specially programmed sequence would be one in which the values <65>,<66>,<97>,<98> were always collated in the order <65>,<97>,<66>,<98> so that the intuitive (to people) ordering A,a,B,b was adhered to by the machine. Specially programmed collating sequences require added time and space in comparison with natural collating sequences, and also require that the natural collating order of the character set be known in advance to the programmer.
However, the characters used as spacers and special compression characters tend not to collate naturally when concatenated with the alphanumeric characters, and tend not to be assigned the same relative character values in different character sets.
In the example shown above, the natural collating sequence of the compressed data would depend on what value the \ character was assigned relative to the alphanumeric characters. In the ASCII character set shown in FIG. 3, the capital letters have the character values <65> to <90>, the numerals have the values <48> to <57>, the dash (-) has the value <45>, space has the value <32>, and \ has the value <92>. If the natural collating sequence of this character set were adhered to, the compressed sequence would collate as:
BEN-SUNG\x2MARY\x6
BENTLEY\x3JOHN\x3PHD
BENTLEY\x3JOHN\x4MD
BEN\x7THOMAS\x4
Because the special compression character has a higher value than the letters or the dash, and because the compression count is collated as if it were a character in the sequence, the natural collating order would not be preserved.
If the special compression character were omitted because only one character was being compressed, a common situation in databases where there is a single spacer character which repeats much more frequently than any other character, the sequence would compress as:
BENx7THOMASx4
BEN-SUNGx2MARYx6
BENTLEYx3JOHNx4MD
BENTLEYx3JOHNx3PHD
and, again using ASCII as an example, would collate as:
BENx7THOMASx4
BEN-SUNGx2MARYx6
BENTLEYx3JOHNx3PHD
BENTLEYx3JOHNx4MD
Again, the natural collating order is lost due to the compression. In this instance, it is the run count, rather than the special character, which interferes with the natural collating sequence. Collating errors caused by the run count are very difficult to correct by programming a special collating sequence, since the count can vary over a broad range of values.
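Both collation failures can be reproduced with a short sketch of conventional run-length compression applied to the four example records, using real space characters in place of the x's. The function names, the threshold of two, and the encoding of the count as a decimal digit string are assumptions made for illustration, not details taken from any particular implementation.

```python
import re

MARKER = "\\"   # special compression character, ASCII value 92
THRESHOLD = 2   # assumed minimum run length worth compressing


def rle_spaces(text, use_marker=True):
    """Replace each run of spaces with [marker,] one space, and the run count."""
    def encode(match):
        run = len(match.group())
        if run < THRESHOLD:
            return match.group()  # short runs are left uncompressed
        return (MARKER if use_marker else "") + " " + str(run)
    return re.sub(r" +", encode, text)


def sort_order(items):
    """Indices of `items` in natural (code-point) collating order."""
    return sorted(range(len(items)), key=lambda i: items[i])


# The four 20-character example records, built with ljust to avoid miscounting.
records = [
    "BEN".ljust(10) + "THOMAS".ljust(10),
    "BEN-SUNG".ljust(10) + "MARY".ljust(10),
    "BENTLEY".ljust(10) + "JOHN".ljust(8) + "MD",
    "BENTLEY".ljust(10) + "JOHN".ljust(7) + "PHD",
]

print(sort_order(records))                                         # [0, 1, 2, 3]
print(sort_order([rle_spaces(r) for r in records]))                # [1, 3, 2, 0]
print(sort_order([rle_spaces(r, use_marker=False) for r in records]))  # [0, 1, 3, 2]
```

With the marker, the BEN record moves to the end because \ collates above the dash and the letters; in both variants the two BENTLEY records swap because the counts 3 and 4 are compared as ordinary characters.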
The collating sequence limitation significantly restricts the possible areas of application for run-length compression and forces database and telecommunications system designers to rely on significantly less efficient compression techniques for database key indices and other ordered or sortable database structures.
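For comparison, the specially programmed collating sequence mentioned earlier (A, a, B, b) can be sketched as a per-character rank table. The names `ORDER`, `RANK` and `collation_key` are hypothetical, and placing non-letters after all letters is an arbitrary assumption. The sketch shows the extra table and per-character lookup such a sequence requires; a fixed mapping of this kind has no obvious way to account for a run count whose value varies from record to record.

```python
import string

# Assumed collating order for illustration: A, a, B, b, ..., Z, z.
ORDER = "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
)
RANK = {ch: rank for rank, ch in enumerate(ORDER)}


def collation_key(text):
    """Map each character to its programmed rank; non-letters sort after letters."""
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in text]


words = ["b", "B", "a", "A"]
print(sorted(words))                     # natural order: ['A', 'B', 'a', 'b']
print(sorted(words, key=collation_key))  # programmed order: ['A', 'a', 'B', 'b']
```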