1. Field
The present invention relates to computers. More particularly, the present invention relates to a method and apparatus for data compression in computer networks.
2. Description of Related Art
Network traffic data collection is important for network measurement. Efficient measurements are critical to further network research, such as traffic engineering, intrusion detection, and network topology modeling. However, network data collection is difficult on high-speed links because of the large data volumes involved. Data compression is an effective way to overcome this difficulty: not only does it reduce storage space, it also increases link efficiency in transmission and improves the performance of on-line traffic monitors. More details can be found in Gianluca Iannaccone, Christophe Diot, Ian Graham, and Nick McKeown, Monitoring very high speed links, In Proc. of the 1st ACM SIGCOMM Workshop on Internet Measurement, 2001; V. Jacobson, Compressing TCP/IP headers for low-speed serial links, RFC 1144, Network Information Center, SRI International, Menlo Park, Calif., February 1990, http://rfc.dotsrc.org/rfc/rfc1144.html; and M. Degermark, B. Nordgren, and S. Pink, IP Header Compression, RFC 2507, http://rfc.dotsrc.org/rfc/rfc2507.html, 1999. There are two major goals in designing a good compressor for network traffic data.
(1) High compression ratio: The compression method has to deal with very large data sets; network traffic data is huge. For example, on a backbone, one hour of TCP/IP header collection on a 10 Gb/s link can easily reach 3 Tb, so disk and memory space become a major concern.
(2) Low compression time: Some compressors are installed on network monitors, and the high arrival rate of packets leaves very little time for the compressor to work. A great deal of work has been done on compressing traffic data. The most popular approach is RFC TCP/IP header compression; more detail can be found in V. Jacobson, Compressing TCP/IP headers for low-speed serial links, RFC 1144, Network Information Center, SRI International, Menlo Park, Calif., February 1990, http://rfc.dotsrc.org/rfc/rfc1144.html, and M. Degermark, B. Nordgren, and S. Pink, IP Header Compression, RFC 2507, http://rfc.dotsrc.org/rfc/rfc2507.html, 1999. On low-speed serial links, because most packets are small, line efficiency is very poor. Line efficiency is defined as the ratio of the data to the header plus data in a packet; for example, 1 byte of data still requires a 40-byte TCP/IP header to transmit it. Prior work tries to reduce transmission bandwidth or latency by improving line efficiency, for example by replacing the header with the index of the connection to which the packet belongs, which can shrink the header to as little as 3 bytes. Another goal of traffic data compression is to reduce storage space. Because there are many similarities between consecutive packets belonging to the same flow, flow-based TCP/IP header compression has become popular in traffic data compression. More details can be found in R. Holanda and J. Garcia, A New Methodology for Packet Trace Classification and Compression Based on Semantic Traffic Characterization, ITC19, 2005; Raimir Holanda, Javier Verdu, Jorge Garcia, and Mateo Valero, Performance Analysis of a New Packet Trace Compressor based on TCP Flow Clustering, ISPASS 05, 2005; Gianluca Iannaccone, Christophe Diot, Ian Graham, and Nick McKeown, Monitoring very high speed links, In Proc.
of the 1st ACM SIGCOMM Workshop on Internet Measurement, 2001; and Yong Liu, Don Towsley, Jing Weng, and Dennis Goeckel, An Information Theoretic Approach to Network Trace Compression, UMass CMPSCI Technical Report 05-03. Although these methods fully studied the information redundancy between traffic records at the header level, the flow level, and even the spatial level (records on different monitors), they ignore the structure inside the record. Group compression focuses on this structure and can be applied to compression at any level. Group compression treats the traffic records as a table. Table compression was first introduced by Buchsbaum et al., who find the best partition of the columns, over either the original column sequence or a re-ordered column sequence, and compress each partition separately. More details can be found in Glenn Fowler, Adam Buchsbaum, Don Caldwell, Ken Church, and S. Muthukrishnan, Engineering the Compression of Massive Tables: An Experimental Approach, In Proc. 11th ACM-SIAM Symp. on Discrete Algorithms, pp. 175-184, 2000, and Glenn Fowler, Adam Buchsbaum, and Raffaele Giancarlo, Improving Table Compression with Combinatorial Optimization, In Proc. 13th ACM-SIAM Symp. on Discrete Algorithms, pp. 213-222, 2002. They use heuristics to find the order of the columns; however, these heuristics consider only the pair-wise relationships between columns, which is not accurate for determining the total ordering of columns in a partition. Spartan is another approach, which divides the table into predictive and predicted columns. More information regarding Spartan can be found in S. Babu, M. N. Garofalakis, and R. Rastogi, Spartan: A model-based semantic compression system for massive data tables, In Proc. of ACM SIGMOD Int'l Conference on Management of Data, 2001, and M. Garofalakis and R. Rastogi, Data Mining Meets Network Management: The Nemesis Project, ACM SIGMOD Int'l Workshop on Research Issues in Data Mining and Knowledge Discovery, May 2001.
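The column-partition idea behind table compression can be sketched as follows. This is a minimal illustration only, not the algorithm of Buchsbaum et al.: the toy records, the two-byte field encoding, and the use of zlib as the per-partition compressor are all assumptions made for the example.

```python
import zlib

# Toy "table" of traffic records: each row is (src_port, dst_port, length).
# Real traces would have many more fields; these values are hypothetical.
rows = [(5001, 80, 1500), (5001, 80, 1500), (5002, 80, 40),
        (5001, 80, 1500), (5002, 80, 40)] * 200

def compress_partitions(rows, partitions):
    """Serialize each column partition separately and compress it on its own.

    `partitions` is a list of column-index tuples.  Each partition's values
    are laid out column-by-column, so repeated values within a column sit
    next to each other before compression.
    """
    total = 0
    for part in partitions:
        blob = b"".join(
            row[col].to_bytes(2, "big") for col in part for row in rows
        )
        total += len(zlib.compress(blob))
    return total

# For comparison: compress the same table serialized in row order.
row_blob = b"".join(v.to_bytes(2, "big") for row in rows for v in row)
row_size = len(zlib.compress(row_blob))

# One singleton partition per column; a real table compressor searches
# for the partition (and column ordering) that minimizes this total.
col_size = compress_partitions(rows, [(0,), (1,), (2,)])
```

The point of the partition search is that grouping columns whose values are correlated (or splitting uncorrelated ones apart) changes `col_size` substantially, which is why the choice of partition, and not just the back-end compressor, drives the compression ratio.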
Spartan compresses the predictive columns only, while the predicted columns are derived from the predictive columns. The method is good at detecting and exploiting functional dependencies in database tables for lossy compression, because some error tolerance increases the number of columns that can be predicted and therefore yields better compression. For the lossless compression of IP traffic data, however, the number of columns that can be predicted is very limited. Moreover, it does not consider the combined relationship among a set of columns as the present invention does. Fascicles and ItCompress explore the relationships between rows as well; more details can be found in H. V. Jagadish, J. Madar, and Raymond T. Ng, Semantic Compression and Pattern Extraction with Fascicles, Proc. of VLDB, pp. 186-198, 1999, and H. V. Jagadish, Raymond T. Ng, Beng Chin Ooi, and Anthony K. H. Tung, ItCompress: An Iterative Semantic Compression Algorithm, ICDE, 2004. These approaches are not suitable for on-line compression, since the algorithms must scan all the rows before deciding how to compress and therefore have to wait for enough data to be collected. Accordingly, there is a need for a compression method and apparatus for compressing network data that takes advantage of the structural similarity in various aspects of computer network data to achieve a high compression ratio and fast performance.
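The predictive/predicted-column idea discussed above for Spartan can be illustrated with a minimal sketch. The column names and the functional dependency here are hypothetical, chosen only for the example; Spartan itself mines such dependencies from the data rather than being given them.

```python
# Each record: (header_len, payload_len, total_len).  The total_len column
# is functionally determined by the other two, so a Spartan-style compressor
# stores only the predictive columns plus the rule, and rebuilds the
# predicted column on decompression.
records = [(20, 100, 120), (20, 1460, 1480), (24, 512, 536)]

def split_columns(records):
    """Keep only the predictive columns; drop the predicted one."""
    return [(h, p) for h, p, _ in records]

def predict(header_len, payload_len):
    # The mined dependency: total_len = header_len + payload_len.
    return header_len + payload_len

def restore(predictive):
    """Rebuild the full records losslessly from the predictive columns."""
    return [(h, p, predict(h, p)) for h, p in predictive]

predictive = split_columns(records)
assert restore(predictive) == records  # exact dependency -> lossless
```

With an exact dependency the round trip is lossless; Spartan's gain for database tables comes from additionally tolerating small prediction errors, which enlarges the set of columns that can be dropped, whereas lossless IP trace compression cannot use that slack.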