The present invention relates generally to a method for compressing index pages in a database system as well as to an index compression converter.
Today, the amount of data stored and processed by database systems is growing at an accelerating pace. At the same time, demand is increasing for more efficient ways to store this data: users expect to store more data in the same amount of space, for example by using compression techniques.
An increase in data volume requires an increase in storage capacity, which drives up storage costs as well as operational costs due to higher power requirements. The cost of electricity may be one of the biggest expenses in a data center. The cost of electric power rises with the number of hard disks used, and hard disk prices also rise with access speed. Companies therefore struggle with rising data center costs.
In relational databases, indexing tables with keys, including primary and secondary keys, is a common technique for improving data access time. Indexes may also have the advantage that complete table entries need not be read if the data searched for in a table is already contained in an index. However, additional indexes also require more storage space on hard disks. Whether or not to define an index on one or more columns of a table is a critical design criterion for a specific database. A user may find and define the best perceived compromise between access speed and additional required hard disk space.
One option for saving hard disk space may be index compression, in particular prefix compression. Common prefix compression is a technique that may be used to reduce the size of indexes: if multiple index entries of a compound index share the same prefix, this prefix may be stored only once, in a header of the index page.
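By way of illustration only, common prefix compression of a single index page may be sketched as follows. The function names, the representation of index entries as plain strings, and the sample keys are assumptions made for this example and are not taken from any particular database system.

```python
def common_prefix(keys):
    """Return the longest prefix shared by every key (empty if none)."""
    if not keys:
        return ""
    # The prefix shared by the smallest and largest key is shared by all keys.
    lo, hi = min(keys), max(keys)
    n = 0
    while n < len(lo) and n < len(hi) and lo[n] == hi[n]:
        n += 1
    return lo[:n]


def compress_page(keys):
    """Store the shared prefix once (the page header) and keep only suffixes."""
    prefix = common_prefix(keys)
    return prefix, [key[len(prefix):] for key in keys]


def decompress_page(prefix, suffixes):
    """Rebuild the original index entries from header prefix and suffixes."""
    return [prefix + suffix for suffix in suffixes]


# Hypothetical index entries that happen to share a long prefix.
page_keys = ["DE-Berlin-001", "DE-Berlin-002", "DE-Berlin-017"]
header_prefix, page_suffixes = compress_page(page_keys)

original_size = sum(len(key) for key in page_keys)
compressed_size = len(header_prefix) + sum(len(s) for s in page_suffixes)
```

In this sketch the three entries occupy 39 characters uncompressed and 17 characters once the shared prefix is held in the header, and decompress_page restores the entries unchanged.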
On the other hand, for performance reasons, it is recommended to use the columns with the largest number of distinct values as the leading columns in an index.
One problem with this column order (high cardinality columns first) may be that it drastically reduces the potential space savings provided by common prefix compression.
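The effect of column order on the achievable savings may be illustrated with a small sketch. It assumes, purely for the example, that the shared prefix is determined at the granularity of whole columns and that each index entry is a tuple of string values; the sample rows and function names are hypothetical.

```python
def shared_leading_columns(rows):
    """Count the leading columns whose value is identical in every row."""
    count = 0
    for column_values in zip(*rows):
        if len(set(column_values)) != 1:
            break
        count += 1
    return count


def characters_saved(rows):
    """Characters saved when the shared leading columns are stored once per page."""
    if not rows:
        return 0
    n = shared_leading_columns(rows)
    prefix_size = sum(len(value) for value in rows[0][:n])
    # The prefix is kept once in the header and dropped from every other entry.
    return prefix_size * (len(rows) - 1)


# Assumed compound index over (country, city, customer id): low cardinality first.
low_cardinality_first = [
    ("DE", "Berlin", "1001"),
    ("DE", "Berlin", "1002"),
    ("DE", "Berlin", "1003"),
]
# The same entries with the high cardinality column leading.
high_cardinality_first = [tuple(reversed(row)) for row in low_cardinality_first]

saved_low_first = characters_saved(low_cardinality_first)
saved_high_first = characters_saved(high_cardinality_first)
```

With the low cardinality columns leading, two columns are shared and 16 characters are saved on this page. With the high cardinality column leading, no leading column is shared and, under the column-granularity assumption stated above, nothing is saved.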
Several approaches to index compression in relational databases are known. U.S. Patent Publication No. US2010/0082545 discloses a method, an information processing system, and a computer program storage product for compressing sorted values. At least a first prefix and a second prefix in a plurality of prefixes are compared. Each prefix comprises at least a portion of a plurality of sorted values. A respective prefix comprises a set of consecutive characters including at least a first character of a respective sorted value. The respective sorted value further comprises a respective suffix comprising consecutive characters of the respective sorted value following the respective prefix. At least a respective first character of the first prefix and a respective first character of the second prefix are determined to be substantially identical. The first prefix is merged with the second prefix into a single prefix comprising the first character. A set of suffixes associated with the first prefix is updated to reflect an association with the second prefix.
In U.S. Pat. No. 7,653,643, another compression method is disclosed. A configuration management system uses a data compression method to compress entries in a data set. An entry is selected as a prefix value and prefix compression of the data set is performed. The entry to serve as the prefix value is quickly selected using an iterative approach. In each iteration, subgroups of entries are formed from groups formed in prior iterations based on the values of characters at successive positions in the entries. The approach is readily implemented using data structures represented as lists.