Effective analysis and compression of massive, high-dimensional tables of alphanumeric data is a ubiquitous requirement in a variety of application environments. For instance, massive tables of network-traffic data and telecommunications "Call-Detail Records" (CDRs) are continuously explored and analyzed to support key network-management tasks, including application and user profiling, proactive and reactive resource management, traffic engineering, and capacity planning, as well as providing and verifying Quality-of-Service guarantees for end users. Effective querying of these massive data tables helps ensure efficient and reliable service.
Data compression issues arise naturally in applications dealing with massive data sets, and effective solutions are crucial for optimizing the usage of critical system resources, such as storage space and I/O bandwidth (for storing and accessing the data) and network bandwidth (for transferring the data across sites). In mobile-computing applications, for instance, clients are frequently disconnected and therefore often need to download data for offline use.
Thus, for efficient data transfer and client-side resource conservation, the relevant data needs to be compressed. Several statistical and dictionary-based compression methods have been proposed for text corpora and multimedia data, some of which (e.g., Lempel-Ziv or Huffman) yield provably optimal asymptotic performance in terms of certain ergodic properties of the data source. These methods, however, fail to provide adequate solutions for compressing a massive data table, as they view the table as a large byte string and do not account for the complex dependency patterns in the table.
Existing compression techniques are "syntactic" in the sense that they operate at the level of consecutive bytes of data. As explained above, such syntactic methods typically fail to provide adequate solutions for table-data compression, since they essentially view the data as a large byte string and do not exploit the complex dependency patterns in the data structure. Popular compression programs (e.g., gzip, compress) employ the Lempel-Ziv algorithm, which treats the input data as a byte string and performs lossless compression on the input. Thus, these compression routines, when applied to massive tables, neither exploit data semantics nor permit guaranteed-error lossy compression of the data.
Attributes (i.e., "columns") with a discrete, unordered value domain are referred to as "categorical," whereas those with ordered value domains are referred to as "numeric." Lossless compression schemes are primarily used on numeric attributes and do not exploit correlations between attributes. For instance, in certain page-level compression schemes, the minimum value of each numeric attribute occurring in the tuples (i.e., the "rows") of a page is stored separately, once for the entire page. Further, instead of storing the original attribute value in each tuple, the difference between the original value and the page minimum is stored. Since storing the difference consumes fewer bits, the storage-space overhead of the table is reduced. Tuple Differential Coding (TDC) is a compression method that also achieves space savings by storing differences instead of actual values for attributes. In TDC, however, the difference stored for each attribute value in a tuple is relative to the attribute's value in the preceding tuple.
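As a rough illustration (function names and data layout are hypothetical, not drawn from any particular implementation), the two difference-based schemes described above might be sketched as follows:

```python
def page_min_encode(page):
    """Page-level scheme: store each numeric attribute's page-wide
    minimum once, then store per-tuple offsets from that minimum."""
    mins = [min(col) for col in zip(*page)]
    offsets = [[v - m for v, m in zip(row, mins)] for row in page]
    return mins, offsets


def tdc_encode(rows):
    """Tuple Differential Coding sketch: the first tuple is stored
    as-is; each later tuple stores differences relative to the
    attribute values of the preceding tuple."""
    if not rows:
        return []
    encoded = [list(rows[0])]
    for prev, cur in zip(rows, rows[1:]):
        encoded.append([c - p for c, p in zip(cur, prev)])
    return encoded
```

In both sketches the stored offsets are small integers that can be packed into fewer bits than the original values, which is the source of the space savings.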
Other lossless compression schemes essentially partition the set of attributes of a table T into groups of correlated attributes that compress well (determined by examining a small amount of training material) and then simply use gzip to compress the projection of T on each group. Another approach to lossless compression first constructs a Bayesian network on the attributes of the table and then rearranges the table's attributes in an order consistent with a topological sort of the Bayesian network graph. The key intuition is that reordering the data (using the Bayesian network) results in correlated attributes being stored in close proximity; consequently, tools like gzip yield better compression ratios for the reordered table.
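A minimal sketch of the attribute-grouping idea, assuming the correlated groups have already been chosen (the grouping step itself, and the function name, are hypothetical) and using zlib in place of the gzip tool:

```python
import zlib


def compress_by_groups(table, groups):
    """Compress the projection of the table on each attribute group
    separately; columns that are correlated compress better when
    their values are serialized together."""
    compressed = {}
    for gid, cols in enumerate(groups):
        projection = "\n".join(
            ",".join(str(row[c]) for c in cols) for row in table
        ).encode()
        compressed[gid] = zlib.compress(projection)
    return compressed
```

Decompressing each group and re-joining the projections on row position reconstructs the original table exactly, so the scheme remains lossless.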
Another instance of a lossless compression algorithm for categorical attributes is one that uses data mining techniques (e.g., classification trees, frequent item sets) to find sets of categorical attribute values that occur frequently in the table. The frequent sets are stored separately (as rules) and occurrences of each frequent set in the table are replaced by the rule identifier for the set.
However, compared to these conventional compression methods for text or multimedia data, effectively compressing massive data tables presents a host of novel challenges due to several distinct characteristics of table data sets and their analysis. Due to the exploratory nature of many data-analysis applications, there are several scenarios in which an exact answer may not be required, and analysts may in fact prefer a fast, approximate answer, as long as the system can guarantee an upper bound on the error of the approximation. For example, during a drill-down query sequence in ad-hoc data mining, initial queries in the sequence frequently have the sole purpose of identifying the truly interesting queries and regions of the data table. Providing (reasonably accurate) approximate answers to these initial queries gives analysts the ability to focus their explorations quickly and effectively, without consuming inordinate amounts of valuable system resources.
Thus, in contrast to traditional lossless data compression, the compression of massive tables can often afford to be lossy, as long as some (user-defined or application-defined) upper bounds on the compression error are guaranteed by the compression algorithm. This is an important distinction, as even small error tolerances can help achieve significantly better compression ratios.
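As a simple sketch of guaranteed-error lossy compression (a generic quantization scheme chosen for illustration, not any particular algorithm from the discussion above), a numeric attribute can be snapped to a coarse grid whose spacing is derived from the user-specified error tolerance:

```python
def quantize(values, tolerance):
    """Replace each value with the center of a bucket of width
    2 * tolerance, so every reconstructed value is guaranteed to
    differ from the original by at most the given tolerance."""
    width = 2 * tolerance
    return [round(v / width) * width for v in values]
```

The quantized column contains far fewer distinct values than the original, so it compresses much better downstream, while the per-value error bound still holds for every tuple.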
Effective table compression mandates, therefore, using compression procedures and techniques that are semantic in nature, in the sense that they account for and exploit both (1) the meanings and dynamic ranges of individual attributes (e.g., by taking advantage of the specified error tolerances); and, (2) existing data dependencies and correlations among attributes in the table.
Accordingly, what is needed in the art is a system and method that takes advantage of attribute semantics and data-mining models to perform guaranteed-error lossy compression of massive data tables.