The amount of data generated, collected and saved by businesses is increasing at an unprecedented rate. Businesses are retaining enormous amounts of detailed data, such as call detail records, transaction history, and web clickstreams, and then mining it to identify business value. Regulatory and legal retention requirements add to this burden by requiring businesses to maintain years of accessible historical data.
As businesses enter an era of petabyte-scale data warehouses, advanced technologies, such as data compression, are increasingly utilized to effectively maintain enormous data volumes in the warehouse. Data compression reduces storage cost by storing more logical data per unit of physical capacity. Performance is improved because there is less physical data to retrieve during database queries.
One current technique for compressing data, known as Value List Compression, may be applied to compress column data within a database table. A set of values are identified in a dictionary and any occurrence of a dictionary value in a row is compressed in the sense it is not recorded in the row, but a pointer to the dictionary value is recorded in the row header. An occurrence of a value that is not in the dictionary is stored as an uncompressed value.
Value List compression techniques can also be applied to a column partitioned table which stores multiple column values in a row. Such rows are called container rows in this document. Two sets of structures exist in a container row for this purpose: a fixed length compression dictionary, called static compression dictionary, SCD; and a list of uncompressed values, called the uncompressed column value list.
After a container row has been auto-compressed and stored, values that arrive later for insertion into the table that are not available in the SCD are no longer added to the SCD, but are appended to the uncompressed column value list in the container row. The uncompressed value list is in the order of data arrival and therefore in the order of row-id. The uncompressed column value list is not sorted and built dynamically as values are inserted. In a container row there are some bits for each base table row value that is stored in the container row. These bits include information such as where the value is recorded in the row, whether the value exists or is NULL, whether the value exists in the dictionary or is uncompressed and so on. These bits are collectively called “auto compression bits” in this document.
The above structure and method for compressing container row data have the following drawbacks:                1. Access to an uncompressed value in a container row is expensive since two passes are needed to process the auto compression bits—a first pass to determine if a value is compressed; and if not, a second pass to determine the value's positional number within the uncompressed value list.        2. Inefficient compression. When a new value is stored as an uncompressed value it can only be added to the uncompressed list since the storage structure is a positional list.        3. Access to an uncompressed value degrades linearly with the number of values in the uncompressed list.        4. If a compressed value in a row is modified to an uncompressed value, the modification cannot occur in place. Instead the old row must be deleted and a new row must be inserted with a different row-id.        
Described below is an improved database compression scheme that overcomes the disadvantaged discussed above.