Relational data files store data as records and fields. Examples of such data include transaction tables, event logs, and business reports. Massive volumes of relational data are produced daily in large business and information systems, from gigabytes in banking and telephone services to terabytes in IP network monitoring and management systems. Good compression is therefore an important component for managing costs in these systems.
Compression research has evolved over the years from studying only general information models to finding ways to exploit specific structures in data. In H. Liefke and D. Suciu. Xmill: An Efficient Compressor for XML Data, In Proceedings of SIGMOD, pages 153-164 (2000), the contents of which are hereby incorporated by reference herein in their entirety, the authors discussed how XML files could be compressed by grouping data with the same tree paths together. Their work was inspired by the Pzip compressor described in A. Buchsbaum, G. S. Fowler, and R. Giancarlo, Improving Table Compression with Combinatorial Optimization, J. of the ACM, 50(6):825-51 (2003) (hereinafter “Buchsbaum et al.”), the contents of which are hereby incorporated by reference herein in their entirety, for a special type of relational data, namely, tables or two-dimensional arrays of bytes. Pzip introduced the idea of fixing some general purpose compressor, then grouping together columns that compress well with that compressor. A different approach to table compression was later introduced in B. D. Vo and K.-P. Vo, Compressing Table Data with Column Dependency, Theoretical Computer Science, v. 387, pp. 273-283 (2007) (hereinafter “Vo and Vo”), the contents of which are hereby incorporated by reference herein in their entirety, which automatically discovers certain dependency relations among table columns and uses those relations to reorder data to enhance compressibility.
The use of compression to improve database storage and access has been widely studied, especially along with field-oriented storage schemes. The authors of J. Goldstein, R. Ramakrishnan, and U. Shaft, Compressing Relations and Indexes, ICDE (1998), the contents of which are hereby incorporated by reference herein in their entirety, observed that field data are often sparse within their much larger ranges and developed a frame of reference approach to compactly code such data. In M. Poess and D. Potapov, Data Compression in Oracle, VLDB (2003), the contents of which are hereby incorporated by reference herein in their entirety, the authors discussed how the Oracle DBMS saved space by replacing commonly occurring field attributes with pointers to distinct instances stored in a dictionary. V. Raman and G. Swart, How to Wring a Table Dry: Entropy Compression of Relations and Querying of Compressed Relations, VLDB (2006) (hereinafter “Raman and Swart”), the contents of which are hereby incorporated by reference herein in their entirety, proposed a more comprehensive approach to compressing database tables that exploits value sparsity, field correlation, and lack of record order.
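By way of illustration only, the two field-level ideas mentioned above can be sketched as follows. This is a minimal sketch of the general techniques and not the implementation of any cited work or product. The function names (for_encode, for_decode, dict_encode, dict_decode) are hypothetical, the frame of reference base is assumed to be the minimum value of a non-empty block of integers, and the dictionary "pointers" are assumed to be simple integer indices.

```python
from typing import Dict, List, Tuple


def for_encode(values: List[int]) -> Tuple[int, int, List[int]]:
    """Frame-of-reference sketch: keep one base value and small offsets.

    Assumes `values` is non-empty. Returns (base, bits_per_offset, offsets).
    """
    base = min(values)                      # assumed frame of reference
    offsets = [v - base for v in values]    # small non-negative numbers
    width = max(1, max(offsets).bit_length())  # bits needed per offset
    return base, width, offsets


def for_decode(base: int, offsets: List[int]) -> List[int]:
    """Invert for_encode by adding the base back to each offset."""
    return [base + o for o in offsets]


def dict_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Dictionary sketch: replace each value by an index into a table
    of distinct values, assigned in order of first appearance."""
    table: Dict[str, int] = {}
    codes = [table.setdefault(v, len(table)) for v in values]
    return list(table), codes


def dict_decode(entries: List[str], codes: List[int]) -> List[str]:
    """Invert dict_encode by looking each index up in the table."""
    return [entries[c] for c in codes]


if __name__ == "__main__":
    ids = [100003, 100007, 100001, 100012]
    base, width, offsets = for_encode(ids)
    print(base, width, offsets)
    assert for_decode(base, offsets) == ids

    states = ["NJ", "NY", "NJ", "NJ", "CA", "NY"]
    entries, codes = dict_encode(states)
    print(entries, codes)
    assert dict_decode(entries, codes) == states
```

In this sketch, values clustered in a narrow band of a wide range need only a few bits per offset, and a field with few distinct values is reduced to short indices. Both transforms preserve the order of the values, so decoding restores the original sequence exactly.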
The present disclosure focuses on the problem of compressing relational data files. Despite the apparent similarity, there are notable differences between compressing a database table and compressing a relational data file:
    Unordered vs. Ordered: In a database table, record ordering is immaterial as queries can return retrieved records in any order. By contrast, the order of records in a relational data file is often meaningful due to implicit but often unknown factors such as time series data or categories in a presentation or report. As such, a compressed relational data file should always decompress into its exact original state.
    Typed vs. Typeless: Schemas in a database specify precisely the type of each field and the association of such fields in their relations. However, such meta-data are often unavailable with a relational data file. That is, little can be assumed beyond being able to partition such a file into sequences of bytes representing records and fields. Any further structures must be automatically deduced.