Many applications, databases among them, deal with large amounts of data organizable as tuples. As the amount of data to be dealt with by such applications increases, their performance becomes constrained by the speed at which data can be read or written. Such an application is said to be input-output (I/O) bound. Unfortunately, technological progress in the computing arena has produced dramatic improvements in nearly every aspect except I/O. I/O-bound applications are therefore among the hardest to design and manage.
The usual approach when confronted with an I/O-bound application is to reduce the amount of I/O required. This goal may be realized sometimes by cleverly designing the application, so that it computes some of the data instead of reading it. However, this approach is limited in applicability and is often impossible to realize. A far more effective approach is usually to reduce the volume of data to be read by compressing it prior to I/O. The information content of the data is preserved, but the volume it occupies is greatly reduced.
A number of approaches are available for compressing data. Unfortunately, they are generally unsuitable for database-like applications, which require random access to data. Our method is specially designed to work for this class of applications. Most other existing methods assume that data are produced and consumed serially in a pipelined fashion. That is not always a valid assumption, and is definitely invalid in the database domain.
The present invention provides a method to compress and store data in relational databases that overcomes some of the deficiencies of prior art systems. In our method, each record R_i is converted to a number n_i. These numbers (or records) are next sorted according to some predetermined ordering rule (usually ascending or descending order). Next, for each record R_i, we compute the difference d_i between the number n_i and the preceding number n_(i-1). Each such record R_i is then represented by the corresponding difference d_i.
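The steps above can be illustrated with a minimal sketch. It rests on assumptions not stated in this summary: each field value is taken to be a small non-negative integer drawn from a known domain, a record is mapped to a number by treating its fields as digits of a mixed-radix numeral, and the first record is differenced against zero. The function names and the `domain_sizes` parameter are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Record = Tuple[int, ...]


def record_to_number(record: Record, domain_sizes: Sequence[int]) -> int:
    """Map a record to a number (assumed mixed-radix mapping, leftmost field most significant)."""
    n = 0
    for value, size in zip(record, domain_sizes):
        if not 0 <= value < size:
            raise ValueError("field value outside its domain")
        n = n * size + value
    return n


def number_to_record(n: int, domain_sizes: Sequence[int]) -> Record:
    """Invert record_to_number by peeling off the least significant field first."""
    values = []
    for size in reversed(domain_sizes):
        n, value = divmod(n, size)
        values.append(value)
    return tuple(reversed(values))


def encode(records: Sequence[Record], domain_sizes: Sequence[int]) -> List[int]:
    """Sort the record numbers in ascending order and keep only successive differences."""
    numbers = sorted(record_to_number(r, domain_sizes) for r in records)
    diffs = []
    previous = 0  # assumption: the first record is differenced against zero
    for n in numbers:
        diffs.append(n - previous)
        previous = n
    return diffs


def decode(diffs: Sequence[int], domain_sizes: Sequence[int]) -> List[Record]:
    """Recover the sorted records by accumulating the differences."""
    records = []
    total = 0
    for d in diffs:
        total += d
        records.append(number_to_record(total, domain_sizes))
    return records


if __name__ == "__main__":
    sizes = (4, 5, 6)
    table = [(2, 1, 3), (0, 4, 1), (2, 1, 4)]
    diffs = encode(table, sizes)
    print(diffs)
    print(decode(diffs, sizes))
```

In this sketch the two records that agree on their first two fields end up adjacent after sorting, and the difference between them is much smaller than either record number.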
This method exploits the characteristic that records share common field values. Arranging them in this fashion makes explicit the amount of similarity among records; records that are closer together share more field values. Such commonality represents a redundancy that can be eliminated by capturing the distances among records. Thus, the set of records is replaced by the distances between them.
The invention has the following advantages: (1) differences between records are smaller than the records themselves, so using differences requires fewer bits of storage, achieving compression; (2) the original records can all be recovered, so information is not lost; (3) the encoding and decoding processes can be localized, so that only the records relevant to a database query need be decoded and processed, avoiding costly decoding of the entire table when only a small subset of records is needed; (4) the method continues to support standard database operations such as insertions, deletions and updates; and (5) the encoding and decoding processes are efficient, so that fast retrieval is possible.
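Advantage (3) can be illustrated with a short sketch of one possible way to localize decoding. This summary does not specify the mechanism, so the sketch assumes that the sorted record numbers are split into fixed-size blocks, that each block stores its first number in full followed by differences, and that a lookup decodes only the one block that could hold the target. The names `build_blocks`, `decode_block`, `contains` and the `block_size` parameter are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

# A block holds its first number in full, followed by successive differences.
Block = Tuple[int, List[int]]


def build_blocks(sorted_numbers: Sequence[int], block_size: int) -> List[Block]:
    """Split sorted record numbers into blocks and difference-encode each block on its own."""
    blocks = []
    for start in range(0, len(sorted_numbers), block_size):
        chunk = sorted_numbers[start:start + block_size]
        diffs = [b - a for a, b in zip(chunk, chunk[1:])]
        blocks.append((chunk[0], diffs))
    return blocks


def decode_block(block: Block) -> List[int]:
    """Recover the record numbers of a single block by accumulating its differences."""
    first, diffs = block
    numbers = [first]
    for d in diffs:
        numbers.append(numbers[-1] + d)
    return numbers


def contains(blocks: Sequence[Block], target: int) -> bool:
    """Test membership while decoding at most one block."""
    firsts = [first for first, _ in blocks]
    idx = bisect_right(firsts, target) - 1  # last block whose first number <= target
    if idx < 0:
        return False
    return target in decode_block(blocks[idx])


if __name__ == "__main__":
    blocks = build_blocks([3, 8, 9, 20, 21, 40, 41], block_size=3)
    print(blocks)
    print(contains(blocks, 21), contains(blocks, 10))
```

Under these assumptions, the block size trades compression against lookup cost: larger blocks store fewer full numbers but require more differences to be accumulated per query.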