1. The Field of the Invention
This invention relates to systems, methods, and computer program products for compressing data in a database system.
2. Background and Relevant Art
Many conventional database systems use “B-Trees”, or some other similar “page”-based structure, to store collections of structured data. One reason for this is that B-Tree systems generally provide efficient methods to store and access large amounts of dynamic data on slow media, such as tape or hard disk (“sub-storage”). Such data is typically more than would ordinarily fit in Random Access Memory (“RAM”). To enable this level of efficiency, however, B-Tree systems make no assumption about what type of data is being stored, allowing the B-Tree systems to be flexible enough for most kinds of data. In particular, B-Tree systems generally limit the data to “tables” where each item is stored as a row, with its elements stored in columns (the set of columns being the same for all items in the table). Each column is defined to contain a fixed-size number or a string (either of a fixed size or of variable size).
As computer CPUs increase in performance relative to disk systems, the relative cost of using variable-sized columns, and compression, has fallen. In general, compression algorithms remove redundancy in data, thus making the data smaller. This is generally desirable since storing the original version of the data on disk often takes longer than it takes to both compress the data and store the smaller, compressed version of the data on disk. A number of different types of compression have been implemented to remove redundancy in data to provide such storage efficiencies. For example, compression can shrink the column data before it is put into the columns (i.e., “intra-row” compression), provided that the table system supports the type of data that is being put in and is still able to sort the rows. In another exemplary system, compression is utilized to shrink the size of the resulting pages (i.e., “inter-row” compression). Unfortunately, neither of these compression systems is ideal in conventional databases.
By way of explanation, “intra-row” compression involves applying compression before the values are entered into columns, since the compression works within a single row. Unfortunately, the storage savings from intra-row compression are minimal since most of the data redundancy in a page-based database is between the rows in a table, not within the rows. By contrast, “inter-row” compression, or compression of several rows in a table, results in much better compression, but results in chunks of data that are of different sizes (i.e., each page started out the same size but the compression works differently on each one, resulting in different sizes). Since “inter-row” compression results in chunks of data that vary in size, inter-row compression is generally used with a sub-storage that supports storage and retrieval of variable-sized data. Unfortunately, compressing to variable-sized data chunks, such as with inter-row compression, can result in significant performance degradation. In particular, much of the space savings offered by inter-row compression is wasted by the sub-storage system as the sub-storage tries to compensate for having to support variable-sized chunks.
One example of a conventional inter-row compression system in a database is a database that uses a “symbol table”. In particular, a database system such as this looks for common values for each column, and stores only one copy of each such value in the symbol table, which is itself stored in the same page. Whenever the value occurs again in columns stored in the same page, the column refers back to the single copy in the symbol table rather than storing the value again. As such, this type of compression is an example of inter-row compression, since the compression works by looking at values common to more than one row in the table. The problem of variable-sized chunks of data is solved by applying the compression as the items are placed into the pages.
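By way of illustration only, the symbol-table approach described above might be sketched as follows. The function names compress_page and decompress_page are hypothetical, and the sketch assumes a single symbol table shared by all columns of a page, whereas an actual system may keep one table per column and would also account for page-size limits.

```python
def compress_page(rows):
    """Replace each column value with an index into a per-page symbol table.

    Assumption: one table is shared by every column of the page.
    """
    symbols = []    # one stored copy of each distinct value in the page
    index_of = {}   # value -> position of that value in `symbols`
    encoded = []
    for row in rows:
        encoded_row = []
        for value in row:
            if value not in index_of:
                index_of[value] = len(symbols)
                symbols.append(value)
            encoded_row.append(index_of[value])
        encoded.append(encoded_row)
    return symbols, encoded


def decompress_page(symbols, encoded):
    """Rebuild the original rows by looking each index up in the table."""
    return [[symbols[i] for i in row] for row in encoded]


page = [("Smith", "Seattle"), ("Jones", "Seattle"), ("Smith", "Portland")]
symbols, encoded = compress_page(page)
```

In this sketch the repeated values “Smith” and “Seattle” are each stored once, and every later occurrence costs only a small integer index.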
An example of intra-row compression, called “gamma” encoding, is used in one type of full-text indexing system. For example, a full-text indexing system that uses gamma encoding may assume that smaller numbers are used much more frequently than large ones. The system then stores numbers with a variable number of bytes, where small numbers take only a small number of bytes, and large numbers take more bytes (in some cases even more than their corresponding normal fixed-width representation). Where the smaller numbers occur more frequently in the indexing system, the gamma encoding can provide measurable space savings.
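By way of illustration only, a variable-length byte encoding of the kind described above might be sketched as follows. The sketch assumes a base-128 scheme in which each byte carries seven bits of the number plus a continuation flag; this is a swapped-in stand-in chosen for brevity, not necessarily the encoding used by any particular indexing system (the name “gamma” is also commonly applied to a bit-level code). The function names are hypothetical.

```python
def encode_varint(n):
    """Encode a non-negative integer, 7 bits per byte, low-order bits first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)   # high bit set: more bytes follow
        else:
            out.append(low7)          # high bit clear: last byte
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one integer starting at `pos`; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return value, pos
        shift += 7
```

Under this assumed scheme, any value below 128 occupies a single byte, while a value near the top of the 32-bit range occupies five bytes, one more than its fixed-width form.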
Another example of an inter-row compression algorithm is “delta” (i.e., difference) compression, which uses the delta, or difference, between rows to identify data. This type of compression is sometimes used to store databases, such as one used to store a dictionary in as small a space as possible, where many of the data terms have at least some similarity. In particular, a delta compression algorithm takes advantage of the fact that words in the dictionary, when stored in order, frequently start with a sequence of letters identical to the previous word in the list. For example, after “rabbi”, the next word in the dictionary might be “rabbit”. The word “rabbit” could be stored as “5t”, indicating that the first 5 letters of this word are the same as in the previous word, followed by the added letter “t”.
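By way of illustration only, the dictionary example above might be sketched as follows. The function names front_encode and front_decode are hypothetical, and the sketch represents each entry as a (shared-prefix length, suffix) pair rather than as a packed string such as “5t”.

```python
def front_encode(words):
    """Store each word as (letters shared with the previous word, new suffix)."""
    encoded = []
    prev = ""
    for word in words:
        shared = 0
        limit = min(len(prev), len(word))
        while shared < limit and prev[shared] == word[shared]:
            shared += 1
        encoded.append((shared, word[shared:]))
        prev = word
    return encoded


def front_decode(encoded):
    """Rebuild each word from the previous word's prefix plus the suffix."""
    words = []
    prev = ""
    for shared, suffix in encoded:
        word = prev[:shared] + suffix
        words.append(word)
        prev = word
    return words


entries = front_encode(["rabbi", "rabbit", "rabble"])
```

Note that decoding any one entry requires the word before it, which is one reason data compressed this way is difficult to read or modify in isolation.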
Unfortunately, conventional databases do not take advantage of the different types of compression algorithms, and tend to use only one type of compression or another. One possible reason is that compression algorithms that result in smaller data also result in data that cannot typically be read or modified without decompressing and recompressing the entire data set. Since large sets of data are frequently subject to change due to the addition of new data, such a system of multiple compression algorithms does not work well. In particular, the accessibility and modifiability of the data are important, but are nevertheless subject to the need for smaller data sizes to accommodate very large databases.
For example, a database index using a B-Tree system might store the word “zoo”, but then use a separate (non-page-based) data stream to store the corresponding list of rows in which the word “zoo” appears, using delta compression (storing only the difference between numbers in an increasing sequence) and gamma compression (storing smaller numbers using fewer bytes). This could provide a size-savings advantage, but would nevertheless require a separate system for storing the data stream containing the list of rows, since the data could not be stored in the B-Tree. For example, a database utilizing this type of design might only be able to update the index (for example, when removing a document from the system) by completely rewriting the index stream. As such, this type of database makes modifications to the database very cumbersome as a trade-off for the space gains from compression.
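By way of illustration only, the combination described above might be sketched as follows: the row list is reduced to gaps between successive row numbers, and each gap is then written with a variable number of bytes. The base-128 byte encoding, the function names, and the sample row numbers are all assumptions made for the sketch. The remove_row function shows why updates are cumbersome in such a design: the whole stream is decoded and rewritten.

```python
def encode_row_list(rows):
    """Encode increasing row numbers as variable-length gaps."""
    out = bytearray()
    prev = 0
    for row in rows:
        gap = row - prev              # delta step: store only the difference
        if gap < 0:
            raise ValueError("row numbers must be in increasing order")
        prev = row
        while gap >= 0x80:            # variable-length step: 7 bits per byte
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_row_list(stream):
    """Recover the row numbers by decoding each gap and summing."""
    rows = []
    prev = 0
    gap = 0
    shift = 0
    for byte in stream:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                # continuation bit set: keep reading
        else:
            prev += gap
            rows.append(prev)
            gap = 0
            shift = 0
    return rows


def remove_row(stream, row):
    """Removing one row forces a full decode and re-encode of the stream."""
    remaining = [r for r in decode_row_list(stream) if r != row]
    return encode_row_list(remaining)


stream = encode_row_list([1000, 1003, 1010, 1500])
```

In this sketch the four row numbers occupy six bytes, compared with sixteen bytes for four fixed 32-bit values, but every gap after a removed row depends on its predecessor, so the stream cannot simply be patched in place.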
Accordingly, an advantage in the art can be realized with systems, methods, and computer program products that efficiently combine the benefits of several compression algorithms into a single database system, while retaining the system's ability to efficiently make incremental changes to the data.