This specification relates to encoding and compression of statistical data, such as log records.
Statistics for monitoring the health of a system can require several thousand variables, ranging from hardware, peripheral, network, operating system, and memory variables to process variables that indicate errors, exceptions, and events such as virus detection counts and content classification counts. These variables are sampled over short sampling periods for logging purposes. Logging such statistical data can create an extremely large data collection. For example, a system that logs millions of transactions per second generates log record data that are too large to store or transmit efficiently. Assuming four bytes of information per variable and 5,000 variables sampled every second, the bandwidth requirement for statistics transmission alone is 160 Kbps.
Data compression can be used to increase storage and transmission efficiency. However, existing compression and encoding processes do not achieve compression ratios high enough to efficiently manage large amounts of statistics data, because they are generic techniques applied uniformly to the data instances in the statistics data. For example, using Run Length Encoding (RLE), a sequence of zeros can be compressed into a single count as long as the zeros are consecutive; the compression ratio is therefore dependent on the arrangement of the variables and their non-zero occurrences within the sampling interval. Compression ratios of approximately 1:30 are typical under this scheme.
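The run-length behavior described above can be illustrated with a minimal Python sketch (not part of the specification; the function names and the `("Z", run)` token format are illustrative assumptions). Note that only consecutive zeros collapse into a count, so the achievable ratio depends on how the non-zero counts are interleaved:

```python
def rle_zeros(values):
    """Run-length encode runs of zeros in a list of integer counts.

    Each maximal run of consecutive zeros becomes a ("Z", run_length)
    token; non-zero values pass through unchanged.
    """
    encoded = []
    run = 0
    for v in values:
        if v == 0:
            run += 1
        else:
            if run:
                encoded.append(("Z", run))
                run = 0
            encoded.append(v)
    if run:
        encoded.append(("Z", run))
    return encoded


def rle_decode(encoded):
    """Expand ("Z", n) tokens back into n zeros."""
    out = []
    for item in encoded:
        if isinstance(item, tuple):
            out.extend([0] * item[1])
        else:
            out.append(item)
    return out
```

For instance, `rle_zeros([0, 0, 0, 5, 0, 0, 1])` yields `[("Z", 3), 5, ("Z", 2), 1]`; the same seven counts arranged as `[5, 0, 1, 0, 0, 0, 0]` compress less compactly because the zeros form different runs.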
Another compression process is LZ77, which identifies the longest common prefixes in the data so that later occurrences can be encoded as references to earlier ones. Compression ratios of about 1:4 are typical under this scheme. Another compression process is dictionary-based compression, in which a short index is used to represent a word that is longer than the index. Compression ratios of approximately 1:4 are typical under this scheme.
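The LZ77 back-reference idea can be sketched as follows (a simplified illustration, not the specification's method; real implementations add entropy coding and efficient match search). Each repeated substring is replaced by a `(distance, length)` pair pointing back into the already-seen data:

```python
def lz77_encode(data, window=255):
    """Greedy LZ77: emit literals or (distance, length) back-references."""
    i, out = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # Matches may overlap the current position, as in classic LZ77.
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len > 2:  # only worth a reference if the match is long enough
            out.append((best_dist, best_len))
            i += best_len
        else:
            out.append(data[i])
            i += 1
    return out


def lz77_decode(tokens):
    """Replay literals and copy back-referenced spans."""
    out = []
    for t in tokens:
        if isinstance(t, tuple):
            dist, length = t
            for _ in range(length):
                out.append(out[-dist])
        else:
            out.append(t)
    return out
```

On highly repetitive input such as `"abcabcabcabcd"`, the encoder emits three literals followed by a single `(3, 9)` reference, which is where the typical 1:4 ratios on general data come from.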
The compression processes above do not attribute any significance to individual data instances, e.g., statistical counts, to identify the causes of redundancy, and thus all data are compressed by the same compression process. A compression process that does attribute significance to individual data is Robust Header Compression (ROHC). In ROHC, a flow of data instances, such as a sequence of packets between two fixed endpoints, is compressed by eliminating protocol header information that is known to be constant, or known to vary by a constant amount, between consecutive packets of the flow. Because the scheme does not assume ordered delivery of data, periodic uncompressed packets are sent to synchronize the state between the sender and receiver. This compression process, however, does not identify redundancies across the fields of the protocol header; each header field is considered in isolation.
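The field-by-field elimination of constant values can be sketched as a simple delta scheme (an illustrative analogy to ROHC's behavior, not its actual packet format; the function names are assumptions). Fields unchanged since the previous record are omitted, and only changed fields are transmitted as index/value pairs:

```python
def delta_compress(prev, cur):
    """Encode cur relative to prev: emit (index, value) only for fields
    that differ from the previous record; constant fields are elided."""
    return [(i, v) for i, (p, v) in enumerate(zip(prev, cur)) if v != p]


def delta_decompress(prev, diffs):
    """Reconstruct the current record from the previous one plus diffs."""
    out = list(prev)
    for i, v in diffs:
        out[i] = v
    return out
```

Because each field is compared only against its own value in the previous record, redundancy across different fields of the same record goes unexploited, which is the limitation noted above; periodically sending `cur` uncompressed restores synchronization if records are lost or reordered.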
A different approach is the predictive encoding used in video compression schemes, in which the movements of objects across scenes are predicted to compress the video data. These techniques, however, are not lossless, and thus achieve high compression ratios at the expense of data quality.