§1.1 Field of the Invention
The present invention concerns network security. More specifically, the present invention concerns storing network traffic information for later analysis, and searching such stored information.
§1.2 Background Information
§1.2.1 The Need for Network Forensics
In networking, “attribution” is the problem of determining the source and/or the destination of some instance of traffic. For Internet Protocol (“IP”) networks, this problem arises both at the level of individual packets and at the higher level of payloads. (Payload information may be contained in a single packet, or may span two or more packets.) At the level of packets, attribution of source is difficult because the source IP address can be spoofed. At the level of payloads, determining which host sent or received a particular sequence of data is difficult because of the lack of reliable logging mechanisms in networks. Attackers often use “zombie” or “stepping stone” hosts as intermediaries. Therefore, being able to attribute traffic to a particular source host is not a network security panacea. However, identifying intermediaries is useful because it brings us a step closer to the attacker, making it a valuable start in tracing attackers.
Several packet marking schemes have been proposed to meet the first hurdle (See, e.g., S. M. Bellovin, M. Leech, and T. Taylor, “ICMP traceback messages,” Internet Draft draft-ietf-itrace-01.txt (Work in progress), IETF, October 2001; D. Dean, M. Franklin, and A. Stubblefield, “An algebraic approach to IP traceback,” Proceedings of NDSS, February 2001; I. Hamadeh and G. Kesidis, “Packet marking for traceback of illegal content distribution,” Proceedings of International Conference on Cross-Media Service Delivery (CMSD), Santorini, Greece, May 2003; I. Hamadeh and G. Kesidis, “Performance of IP address fragmentation strategies for DDoS traceback,” Proceedings of IEEE IPCOM, Kansas City, October 2003; S. Savage, D. Wetherall, A. Karlin, and T. Anderson, “Practical network support for IP traceback,” Proceedings of the 2000 ACM SIGCOMM Conference, pages 295-306, Stockholm, Sweden, August 2000; and D. Song and A. Perrig, “Advanced and authenticated marking schemes for IP traceback,” IEEE INFOCOM, 2001.). These tools, however, require a set of network packets to perform traceback and are not useful when one has only an excerpt of the payload rather than the packets themselves.
In order to meet this additional hurdle, tools have been developed that can record all network traffic and archive it for short periods of time (See, e.g., Infinistream at http://www.networkgeneral.com/; C. J. Antonelli, M. Undy, and P. Honeyman, “The packet vault: Secure storage of network data,” Proc. USENIX Workshop on Intrusion Detection and Network Monitoring, Santa Clara, April 1999; Sandstorm Enterprises, NetIntercept at http://www.sandstorm.com; and A. Mitchell and G. Vigna, “MNEMOSYNE: Designing and implementing network short-term memory,” International Conference on Engineering of Complex Computer Systems, IEEE, December 2002.). However, since the means of cybercrimes (for example, the signature of a new worm) are usually not known in advance, it would be useful to store the network traffic for weeks or even months so that one can go back in time to investigate earlier incidents. Unfortunately, recording raw network traffic not only makes it infeasible to archive the traffic for prolonged periods of time, but also raises serious privacy issues. These drawbacks limit the usefulness of these tools.
§1.2.2 Traceback Techniques and their Limitations
In A. C. Snoeren, C. Partridge, L. A. Sanchez, C. E. Jones, F. Tchakountio, S. T. Kent, and W. T. Strayer, “Hash-based IP traceback,” ACM SIGCOMM, San Diego, Calif., USA, August 2001, Bloom filters have been used in a Source Path Isolation Engine (“SPIE”) to trace IP packets over networks. SPIE creates hash-digests of packets and stores them in a router using Bloom filters. With wide deployment, SPIE can track a single packet to its source on the Internet. A SPIE-equipped router creates a packet digest for every packet it processes using the packet's non-mutable header fields and a prefix of the payload. These digests are then kept at the network component for a preset amount of time. When a security component, such as a firewall or an intrusion detection system, detects an attack, it can present the offending packet to SPIE to trace the packet's path. The originator of the packet can then be identified by launching enough queries along the packet's network path. With SPIE, any single packet can be traced back to its origin as long as none of the nodes on the packet's path has yet discarded the packet digest. Unfortunately, this makes the system unsuitable for forensics applications where information may be needed from a significantly distant past. Further, SPIE requires that a query present the whole packet, or at least the non-mutable headers and prefix of the payload, to do traceback. However, in most cases the exact packet that carried a certain payload might not be known.
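As a rough illustration of the digest-and-store idea described above, the following Python sketch keeps packet digests in a Bloom filter and answers membership queries against it. It is not SPIE's actual implementation: the class and function names, the filter size, the number of hash functions, the use of SHA-256, and the 8-byte payload prefix are all assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; the sizes are illustrative assumptions, not tuned."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            h = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: bytes) -> bool:
        # True means "possibly seen" (false positives are possible);
        # False means "definitely not seen".
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def packet_digest_input(invariant_header: bytes, payload: bytes,
                        prefix_len: int = 8) -> bytes:
    """Bytes that get digested: non-mutable header fields plus a payload prefix.

    The caller is assumed to have already removed mutable fields (for
    example TTL and checksum) from invariant_header. prefix_len is a
    hypothetical default chosen for this sketch.
    """
    return invariant_header + payload[:prefix_len]


# A router-side filter records each forwarded packet.
router_filter = BloomFilter()
header = b"\x0a\x00\x00\x01\x0a\x00\x00\x02\x06"  # made-up invariant fields
router_filter.add(packet_digest_input(header, b"GET /index.html"))

# Later, a query presents the offending packet to ask whether it passed through.
seen = packet_digest_input(header, b"GET /index.html") in router_filter
```

Because the filter never stores the packets themselves, a query must supply the same invariant header fields and payload prefix that were digested, which is the limitation noted above.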
More recently, Abhishek Kumar, Jun Xu, Jia Wang, Oliver Spatscheck, and Li Li, “Space-code bloom filter for efficient per-flow traffic measurement,” Proceedings of IEEE INFOCOM, Hong Kong, China, March 2004, proposes a technique for approximate measurement of traffic flows. Unlike previous measurement schemes, its novel Bloom filter based data structure, known as the Space-Code Bloom Filter, enables the method to track all traffic flows instead of just “heavy hitters.”
Finally, S. Dharmapurikar, M. Attig, and J. Lockwood, “Design and implementation of a string matching system for network intrusion detection using FPGA-based bloom filters,” Technical Report, CSE Dept, Washington University, Saint Louis, Mo., 2004, proposes using Bloom filters for intrusion detection by matching known signatures to the payload. The problem that they tackle, often referred to as deep packet inspection, involves detection of predefined signature strings or keywords starting at an arbitrary location in the payload. The Bloom filter contains the signatures and is used for intrusion detection, not forensics with respect to a past security breach. Hence, archiving and storing the Bloom filters is not a concern in that work. Rather, the challenge is performing queries at line speed over all possible alignments and various block lengths.
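The alignment problem described above can be pictured with a small software sketch: one Bloom filter per signature length, queried with every window of that length at every offset in the payload. This is only an illustration and not the cited FPGA design, which performs such lookups in hardware at line speed. The function names, filter parameters, and the exact-match verification step are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; the sizes are illustrative assumptions, not tuned."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        for i in range(self.num_hashes):
            h = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def build_filters(signatures):
    """Group signatures by length: one Bloom filter per block length."""
    filters = {}
    for sig in signatures:
        bf = filters.get(len(sig))
        if bf is None:
            bf = filters[len(sig)] = BloomFilter()
        bf.add(sig)
    return filters


def scan_payload(payload: bytes, filters, signatures):
    """Return (offset, signature) pairs found at any alignment in payload."""
    exact = set(signatures)  # used to discard Bloom filter false positives
    hits = []
    for length in sorted(filters):
        bf = filters[length]
        # Slide a window of this block length over every possible offset.
        for offset in range(len(payload) - length + 1):
            window = payload[offset:offset + length]
            if window in bf and window in exact:
                hits.append((offset, window))
    return hits


signatures = [b"/bin/sh", b"evil"]  # hypothetical signature set
filters = build_filters(signatures)
hits = scan_payload(b"GET /x?cmd=/bin/sh HTTP/1.0", filters, signatures)
```

The nested loops make the cost visible: every offset must be tested against every block length, which is why the cited work concentrates on doing these lookups in parallel rather than on long-term storage of the filters.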
Over the past few years, extensive research has been done in developing feasible solutions to trace back network traffic to its source on the Internet. Traceback systems can generally be grouped into three broad categories: (1) Traceback of single packets (e.g., SPIE); (2) Traceback of network floods; and (3) Traceback of connection chains. SPIE was introduced above. Work in the other two areas is briefly summarized below.
Distributed denial of service attacks create large uncorrelated network flows towards a particular host or a set of hosts. Since source IP addresses are usually spoofed, tracing the attack back to its source is a non-trivial task. Researchers have proposed some clever solutions to the problem of tracing IP packets back to their source (IP traceback). Most of this work can be grouped into two main categories: one in which no extra network packets are generated (See, e.g., H. Burch and B. Cheswick, “Tracing anonymous packets to their approximate source,” Proc. USENIX LISA, December 2000; D. Dean, M. Franklin, and A. Stubblefield, “An algebraic approach to IP traceback,” Proceedings of NDSS, February 2001; S. Savage, D. Wetherall, A. Karlin, and T. Anderson, “Practical network support for IP traceback,” Proceedings of the 2000 ACM SIGCOMM Conference, pages 295-306, Stockholm, Sweden, August 2000; A. C. Snoeren, C. Partridge, L. A. Sanchez, C. E. Jones, F. Tchakountio, S. T. Kent, and W. T. Strayer, “Hash-based IP traceback,” ACM SIGCOMM, San Diego, Calif., USA, August 2001; and D. Song and A. Perrig, “Advanced and authenticated marking schemes for IP traceback,” IEEE INFOCOM, 2001.), and the other in which a few extra network packets are generated (See, e.g., S. M. Bellovin, M. Leech, and T. Taylor, “ICMP traceback messages,” Internet Draft draft-ietf-itrace-01.txt (Work in progress), IETF, October 2001; and A. Mankin, D. Massey, C. L. Wu, S. F. Wu, and L. Zhang, “On design and evaluation of ‘intention-driven’ ICMP traceback,” Proc. IEEE International Conference on Computer Communications and Networks, October 2001.). These mechanisms, however, are not effective against attacks that require a relatively small number of packets. An encoding strategy proposed in I. Hamadeh and G. Kesidis, “Packet marking for traceback of illegal content distribution,” Proceedings of International Conference on Cross-Media Service Delivery (CMSD), Santorini, Greece, May 2003, and I. Hamadeh and G. Kesidis, “Performance of IP address fragmentation strategies for DDoS traceback,” Proceedings of IEEE IPCOM, Kansas City, October 2003, requires that border routers establish a trusted region and encode the router's IP address in all egress traffic. A destination host can then decode, from the IP fragment-id field, the IP address of the border router closest to the source of a packet. None of these methods can be used against malicious network events that can only be defined by a packet's payload, such as uploading a company's confidential information to a public FTP site.
A related problem, albeit at a different level of abstraction, is tracing connection chains. Attackers often obscure their identity and location by forming a connection chain, logging into a set of compromised systems (“stepping stones”) before attacking a target. Tracing the attack from the victim takes us only to the last link in the chain but not to the location of the attacker. Methods have been proposed to trace intruders through stepping stones. The method proposed in S. Staniford-Chen and L. T. Heberlein, “Holding intruders accountable on the internet,” Proceedings of the 1995 IEEE Symposium on Security and Privacy, Oakland, 1995, creates “thumb-prints” of connections using packet content. These thumb-prints can be compared to determine whether two connections contain the same text and are therefore likely to be part of the same connection chain. However, the method fails when the connections are encrypted. To address this problem, Y. Zhang and V. Paxson, “Detecting stepping stones,” Proceedings of the 9th USENIX Security Symposium, Denver, Colo., USA, August 2000, proposes an algorithm that does not rely on traffic content, but instead relies on packet sizes, packet intervals, etc., to identify stepping stones.