This relates to data compression.
Modern information systems routinely generate and store massive amounts of data in data warehouses to manage ongoing operations. Data are often produced and stored in a common format called the “relational table”. Each such table consists of a set, or a sequence, of records, and each record contains fields that store data values. It is not unusual for a relational table to have several million records, with each record containing thousands of bytes. As a result, many terabytes of data are often kept on-line at a storage cost that is measured in the tens to hundreds of millions of dollars. In addition, large transmission costs are frequently incurred in electronically transporting data between systems. Thus, good compression of relational tables can have a significant financial impact on the management and operation of these large information systems.
The internal structure of a relational table varies depending on specific needs. In some cases, a data file may be kept in textual form, with records being text lines and fields being strings separated by some field delimiter. In other cases, a record may consist of fields that have a fixed length; i.e., a fixed number of bytes. Yet other types of records might have variable representations, with some extra encoded data indicating which representation is in use.
Certain data come in a “flat table” form which consists of a two-dimensional array of bytes with known numbers of columns and rows. A relational table whose records all have the same length can be thought of as a flat table if we ignore the field structure within records and treat each record as a row.
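The flat-table view described above can be illustrated with a minimal Python sketch. The function names `to_flat_table` and `column`, and the sample records, are hypothetical and serve only to show how fixed-length records can be treated as rows of a two-dimensional byte array.

```python
def to_flat_table(data: bytes, record_len: int) -> list[bytes]:
    """Split a byte string of back-to-back fixed-length records into rows."""
    if record_len <= 0 or len(data) % record_len != 0:
        raise ValueError("data length must be a positive multiple of record_len")
    return [data[i:i + record_len] for i in range(0, len(data), record_len)]


def column(rows: list[bytes], j: int) -> bytes:
    """Return column j of the table as a byte string, one byte per row."""
    return bytes(row[j] for row in rows)


# Three hypothetical 4-byte records stored back to back; field boundaries are ignored.
table = to_flat_table(b"AB01AB02CD03", record_len=4)
first_col = column(table, 0)
last_col = column(table, 3)
```

Here each record becomes one row, and a column is simply the sequence of bytes found at the same offset in every record.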
In “Engineering the Compression of Massive Tables: An Experimental Approach,” Proc. 11th ACM-SIAM Symp. on Disc. Alg., pp. 175-184, 2000, Buchsbaum et al. considered the problem of compressing flat tables and developed what they called the Pzip algorithm. This algorithm assumes some external conventional compressor as a basic primitive and defines the compressive entropy of a data set as its size after being compressed by this compressor. Columns are then grouped so that compressing them in groups improves overall compressive entropy. Since computing an optimum column grouping is NP-hard, a two-step solution is employed. Columns are first reordered by a traveling salesman tour that keeps pairs of columns that compress well together close in the ordering. Then, the ordered columns are segmented by a dynamic program to reduce overall compressive entropy. If n is the number of columns, the dynamic program alone would require O(n³) steps, each compressing some segment of columns.
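The segmentation step can be illustrated with a simplified Python sketch. It is not the Pzip implementation: it assumes `zlib` as the external conventional compressor, assumes the columns are already in their final order (the traveling salesman reordering is omitted), and uses hypothetical function names and sample data.

```python
import zlib


def compressive_entropy(rows: list[bytes], lo: int, hi: int) -> int:
    """Compressed size in bytes of columns lo..hi-1, laid out column by column."""
    payload = b"".join(bytes(row[j] for row in rows) for j in range(lo, hi))
    return len(zlib.compress(payload, 9))


def segment_columns(rows: list[bytes]) -> tuple[int, list[tuple[int, int]]]:
    """Split ordered columns into contiguous groups with minimum total compressed size."""
    n = len(rows[0])
    best = [0] + [float("inf")] * n  # best[k]: minimum cost of columns 0..k-1
    cut = [0] * (n + 1)              # cut[k]: start of the last group in that optimum
    for k in range(1, n + 1):
        for i in range(k):
            cost = best[i] + compressive_entropy(rows, i, k)
            if cost < best[k]:
                best[k], cut[k] = cost, i
    # Walk the cut points backwards to recover the groups as (start, end) pairs.
    groups, k = [], n
    while k > 0:
        groups.append((cut[k], k))
        k = cut[k]
    return best[n], groups[::-1]


# Hypothetical table: columns 0 and 1 are identical, columns 2 and 3 are unrelated.
rows = [bytes([i % 7, i % 7, (i * 37) % 251, i % 3]) for i in range(200)]
total, groups = segment_columns(rows)
```

In this sketch the nested loops examine every contiguous segment of columns and compress each one, which is why the cost grows quickly with the number of columns and why a training-based shortcut becomes attractive for wide tables.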
The process of column grouping in Pzip can be quite slow, sometimes taking hours for tables with just a few hundred columns. Therefore, per class of tables, Pzip typically first restricts itself to a small amount of training data to do column grouping, and then uses the results for all tables in the class. This approach works fine as long as table characteristics are consistent, but poor compression performance can result when they are not.
Columns in a flat table may be dependent on one another in the sense that the content of a column may be closely predictable from that of another column or a group of other columns. Predictability among columns implies information redundancy, which could be exploited to enhance compression. In a paper titled “Compressing Table Data with Column Dependency”, Theoretical Computer Science, vol. 387, Issue 3, pp. 273-283 (November 2007), Binh Dao Vo and Kiem Phong Vo formalize the notion of column dependency as a way to capture this information redundancy across columns and discuss how to automatically compute and use it to substantially improve table compression.
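The idea that one column may predict another can be illustrated with a small Python sketch. The measure used here is an assumption chosen for simplicity and is not the formal definition of column dependency given by Vo and Vo: it reports the fraction of rows in which the byte in a destination column equals the byte most often seen alongside the same source-column byte. The function name `predictability` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def predictability(rows: list[bytes], src: int, dst: int) -> float:
    """Fraction of rows whose dst byte is the most common dst byte for their src byte."""
    by_src: dict[int, Counter] = defaultdict(Counter)
    for row in rows:
        by_src[row[src]][row[dst]] += 1
    # For each source value, count the rows that the best single guess gets right.
    hits = sum(counts.most_common(1)[0][1] for counts in by_src.values())
    return hits / len(rows)


# Hypothetical 3-column table: column 1 is fully determined by column 0, column 2 is not.
rows = [b"a1x", b"a1y", b"b2x", b"b2z", b"a1z"]
score_0_to_1 = predictability(rows, 0, 1)  # 1.0: column 0 always determines column 1
score_0_to_2 = predictability(rows, 0, 2)  # 0.4: column 0 says little about column 2
```

A score near 1 indicates that the destination column carries little information beyond what the source column already provides, which is the kind of redundancy a dependency-aware compressor could exploit.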
The problem of compressing relational tables has been studied in the literature. A white paper, “Oracle Advanced Compression,” by Oracle describes a method that constructs a dictionary of the unique field values in a relational table and then replaces each occurrence of a value by its dictionary index. Since field values can be long, removing duplication in this way reduces the amount of storage space required. In a different paper, “How to Wring a Table Dry: Entropy Compression of Relations and Querying of Compressed Relations” (Very Large DataBase Conference, 2006), V. Raman and G. Swart discuss how to take advantage of the skewed distribution of values in a field, correlation across fields within a record, and the unordered nature of records in a database to compress data. By treating records as being unordered, this method does not preserve the given order of records in a file. Hence, the method is not lossless.
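The dictionary approach mentioned above can be sketched in a few lines of Python. This is a generic illustration of dictionary encoding under assumed names (`dictionary_encode`, `dictionary_decode`) and sample values; it does not reproduce the implementation described in the Oracle white paper.

```python
def dictionary_encode(values: list[str]) -> tuple[list[str], list[int]]:
    """Replace each field value by the index of its first occurrence in a dictionary."""
    dictionary: list[str] = []
    index: dict[str, int] = {}
    codes: list[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def dictionary_decode(dictionary: list[str], codes: list[int]) -> list[str]:
    """Rebuild the original field values, in their original order."""
    return [dictionary[code] for code in codes]


# Hypothetical field with repeated long values.
field = ["New York", "Chicago", "New York", "New York", "Chicago"]
dictionary, codes = dictionary_encode(field)
restored = dictionary_decode(dictionary, codes)
```

Because the codes are kept in record order, decoding returns the values in their original sequence, in contrast to methods that treat records as unordered.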
There remains a need for a method of compressing relational tables that is both lossless and effective.