§1.1 Field of the Invention
The present invention concerns matching an arbitrary-length bit string with one of a number of known arbitrary-length bit strings. The present invention may be used for network intrusion detection and prevention. In particular, the present invention concerns a novel data structure (namely, a trie bitmap content analyzer operating with a boundary hashing method) that provides minimal perfect hashing functionality while supporting low-cost set membership queries. By using such a data structure, whether an arbitrary-length bit string matches a particular one of a number of known bit strings can be determined at high speed.
§1.2 Background Information
Network Intrusion Detection and Prevention Systems (“NIDPSs”) have a vital role in current state-of-the-art network security solutions. (See, e.g., Sourcefire 3d. [Online]. Available: http://www.sourcefire.com, and Fortinet. [Online]. Available: http://www.xilinx.com.) Deep Packet Inspection (“DPI”) is at the heart of these NIDPSs. DPI is the detection of malicious packets by comparing the packet payloads against excerpts from known intrusion packets (that is, against the intrusion signatures database). DPI consumes a large portion of the processing power and memory of the NIDPS. Achieving high-speed DPI at a low cost remains a challenge as line rates and the number of intrusions continue to increase.
One way to address this challenge is to use Minimal Perfect Hash Functions (“MPHFs”) to search the signature database. (See, e.g., N. S. Artan and H. J. Chao, “TriBiCa: Trie Bitmap Content Analyzer for High-Speed Network Intrusion Detection,” in 26th Annual IEEE Conference on Computer Communications (INFOCOM), 2007, pp. 125-133.) An MPHF is a hash function that maps a set S of n keys into exactly n integer values (0 . . . n−1) without any collisions. (See, e.g., P. E. Black, “Minimal Perfect Hashing,” in Dictionary of Algorithms and Data Structures. U.S. National Institute of Standards and Technology, July 2006. [Online]. Available: http://www.nist.gov/dads/HTML/minimalPerfectHash.html.) MPHFs provide constant worst-case query time and minimal space. Thus, they are very suitable for DPI.
§1.2.1 Previous Approaches and Perceived Limitations of Such Approaches
For DPI in an NIDPS, the data structure used to store the intrusion signatures database should balance the requirements of high speed, low cost, and easy updates. DPI approaches in software NIDPSs such as Snort (See [Online]. Available: http://www.snort.org.) and Bro (See V. Paxson, “Bro: A System for Detecting Network Intruders in Real-Time,” Computer Networks, vol. 31, pp. 2435-2463, 1999.) are very flexible and support detection of sophisticated intrusions. However, they do not scale to high speeds since they run on general-purpose hardware, which is intrinsically slow and has limited parallelism. Hence, hardware approaches are preferred for certain applications.
DPI approaches on hardware can broadly be classified into two architectures based on their signature storage media: (1) off-chip memory (See, e.g., F. Yu, T. Lakshman, and R. Katz, “Gigabit Rate Pattern-Matching using TCAM,” in Int. Conf. on Network Protocols (ICNP), Berlin, Germany, October 2004 and H. Song and J. Lockwood, “Multi-pattern Signature Matching for Hardware Network Intrusion Detection Systems,” in 48th Annual IEEE Global Communications Conference, GLOBECOM 2005, St Louis, Mo., November-December 2005.) and (2) on-chip memory and/or logic blocks (See, e.g., C. Clark and D. Schimmel, “Scalable Pattern Matching for High-Speed Networks,” in IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Napa, Calif., 2004, pp. 249-257, Y. H. Cho and W. H. Mangione-Smith, “Fast Reconfiguring Deep Packet Filter for 1+Gigabit Network,” in FCCM, 2005, pp. 215-224, Z. K. Baker and V. K. Prasanna, “High-Throughput Linked-Pattern Matching for Intrusion Detection Systems,” in Proc. of the First Annual ACM Symposium on Architectures for Networking and Communications Systems, Princeton, N.J., 2005, pp. 193-202, J. Moscola, J. Lockwood, R. P. Loui, and M. Pachos, “Implementation of a Content-Scanning Module for an Internet Firewall,” in FCCM, 2003, pp. 31-38, I. Sourdis, D. Pnevmatikatos, S. Wong, and S. Vassiliadis, “A Reconfigurable Perfect-Hashing Scheme for Packet Inspection,” in Proc. 15th International Conference on Field Programmable Logic and Applications (FPL 2005), August 2005, pp. 644-647, L. Tan and T. Sherwood, “Architectures for Bit-Split String Scanning in Intrusion Detection,” IEEE Micro, vol. 26, no. 1, pp. 110-117, January-February 2006, G. Papadopoulos and D. N. Pnevmatikatos, “Hashing+Memory=Low Cost, Exact Pattern Matching,” in Proc. 15th International Conference on Field Programmable Logic and Applications (FPL), August 2005, pp. 39-44, Y. Lu, B. Prabhakar, and F. Bonomi, “Perfect Hashing for Network Applications,” in IEEE Symposium on Information Theory, Seattle, Wash., 2006, pp. 2774-2778). Architectures using off-chip memory for signature storage are fundamentally limited by the off-chip memory throughput and additional cost of memory chips. As a result of these limitations of the off-chip storage, on-chip storage has gained attention.
Additionally, due to the high parallelism available on-chip, it is desirable to multiply the signature detection throughput by replicating the DPI data structures on a single chip to allow parallel detection. Unfortunately, available on-chip storage is limited. This limitation forces highly space-optimized data structures, and the desired parallelism further constrains storage. Finally, the data structure should be simple enough to allow rapid detection to achieve high throughput to support tomorrow's line rates and large signature databases. Hence, considering the strict storage constraint and high-speed requirements, low-cost DPI is a continuing challenge.
One reason that DPI consumes a large portion of NIDPS processing power is that the intrusion signatures can appear anywhere in a packet payload. To address this issue, in DPI, each packet payload is searched using a sliding window that slides one byte at a time. The window content is compared against the intrusion signatures to detect malicious activity. In the worst case, this requires a comparison of the window content against all the signatures for each and every byte offset of the packet payload. Using hash functions, the number of possible matches to the window content can be reduced to a few possible signatures. The window content is then compared to these signatures to verify whether there is a match.
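For purposes of illustration only, the sliding-window search with a hash-based pre-filter described above may be sketched in Python as follows. The sketch assumes that all signatures have the same length as the window and uses CRC-32 as an ordinary (non-perfect) hash function. The names NUM_BUCKETS, bucket_of, build_table and scan, as well as the example signatures, are hypothetical and are not taken from any of the referenced works.

```python
import zlib

NUM_BUCKETS = 16  # hypothetical table size, chosen only for this example


def bucket_of(content: bytes) -> int:
    """Ordinary hash: several different contents may share one bucket."""
    return zlib.crc32(content) % NUM_BUCKETS


def build_table(signatures, window):
    """Place every signature in the bucket selected by its hash."""
    table = [[] for _ in range(NUM_BUCKETS)]
    for sig in signatures:
        if len(sig) != window:
            raise ValueError("this sketch assumes window-length signatures")
        table[bucket_of(sig)].append(sig)
    return table


def scan(payload, table, window):
    """Slide the window one byte at a time over the payload."""
    matches = []
    for offset in range(len(payload) - window + 1):
        content = payload[offset:offset + window]
        # The hash narrows the search to the few signatures in one bucket.
        for candidate in table[bucket_of(content)]:
            # An exact comparison verifies each remaining candidate.
            if candidate == content:
                matches.append((offset, candidate))
    return matches


signatures = [b"EVIL", b"WORM", b"HACK"]
table = build_table(signatures, window=4)
hits = scan(b"xxWORMyyEVIL", table, window=4)
print(hits)  # [(2, b'WORM'), (8, b'EVIL')]
```

Because an ordinary hash is used, a bucket may hold more than one signature, so the number of exact comparisons per window position is not bounded in this sketch.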
Although ordinary hash functions have a good average-case speed, especially when the memory utilization is low, they cannot bound the number of possible matches in the worst case because of hash collisions. The problem becomes worse when the memory utilization is high.
A Perfect Hash Function (“PHF”) is a special type of hash function that eliminates all collisions. To optimize both speed and storage, a minimal PHF (“MPHF”), which maps a given set S of n keys into exactly m=n memory slots without any collisions, can be used. Note that “keys” and “items” are used interchangeably and are meant to have the same meaning in the following paragraphs for purposes of explanation and illustration. In addition to the memory needed to store the keys, a hash function needs additional storage for its own representation. The information-theoretic lower bound to represent an MPHF is approximately 1.4427n bits. (See, e.g., F. C. Botelho, R. Pagh, and N. Ziviani, “Simple and space-efficient minimal perfect hash functions,” in WADS, 2007, pp. 139-150.) In the paper Y. Lu, B. Prabhakar, and F. Bonomi, “Perfect Hashing for Network Applications,” in IEEE Symposium on Information Theory, Seattle, Wash., 2006, pp. 2774-2778, an efficient MPHF construction is given based on Bloom Filters (See, e.g., B. Bloom, “Space/Time Trade-offs in Hash Coding with Allowable Errors,” Communications of the ACM, vol. 13, no. 7, 1970), which requires 8.6n bits for representing the MPHF in practice. This approach, however, requires a complex addressing scheme for queries, where additional logic is required to calculate the address in the hash table.
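As a minimal sketch of what the MPHF property means, the following Python example searches by brute force for a seed under which a seeded hash maps n example keys onto the n slots 0 . . . n−1 without collisions. This brute-force search is an assumption made only for illustration; it is not the Bloom Filter based construction of Lu et al. and is practical only for very small n. The names seeded_hash and find_mphf_seed and the example keys are hypothetical. The sketch also evaluates the two storage figures quoted above, using the fact that 1.4427 is approximately log2(e).

```python
import hashlib
import math


def seeded_hash(key: bytes, seed: int, modulus: int) -> int:
    """Hypothetical seeded hash built from SHA-256 (illustration only)."""
    digest = hashlib.sha256(seed.to_bytes(4, "little") + key).digest()
    return int.from_bytes(digest[:8], "little") % modulus


def find_mphf_seed(keys, max_seed=1_000_000):
    """Try seeds until the n keys occupy n distinct slots in 0..n-1."""
    n = len(keys)
    for seed in range(max_seed):
        if len({seeded_hash(k, seed, n) for k in keys}) == n:
            return seed
    raise RuntimeError("no collision-free seed found within max_seed")


keys = [b"EVIL", b"WORM", b"HACK", b"ROOT"]
n = len(keys)
seed = find_mphf_seed(keys)
slots = sorted(seeded_hash(k, seed, n) for k in keys)
print(slots)  # [0, 1, 2, 3]: every slot is used exactly once

# Representation cost for a hypothetical database of 10,000 keys.
db_size = 10_000
lower_bound_bits = math.log2(math.e) * db_size  # about 1.4427 bits per key
bloom_based_bits = 8.6 * db_size                # figure quoted for Lu et al.
print(round(lower_bound_bits), round(bloom_based_bits))
```

The hash function's own representation here is only the seed, but the expected number of seeds that must be tried grows rapidly with n, which is why practical constructions use more structured representations.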
U.S. patent application Ser. No. 11/978,216 (referred to as the “Generating a Hierarchical Data Structure Associated With a Plurality of Known Arbitrary-Length BIT Strings Used For Detecting Whether An Arbitrary-Length Bit String Input Matches One of a Plurality Of Known Arbitrary-Length Bit Strings application” and incorporated herein by reference) describes a trie-based framework called TriBiCa (Trie Bitmap Content Analyzer). (See also, N. S. Artan and H. J. Chao, “TriBiCa: Trie Bitmap Content Analyzer for High-Speed Network Intrusion Detection,” in 26th Annual IEEE Conference on Computer Communications (INFOCOM), 2007, pp. 125-133). In this framework, the algorithm gradually decides which key, among a set of keys, is to be compared against the packet payload. For each query, the algorithm provides at most one match candidate. The keys can be whole or partial signatures or some other information regarding signatures. The algorithm starts with n keys at the root node of a trie and partitions the keys into two equal-sized groups (each group with n/2 keys). Then, each of these new groups is placed into one of the child nodes of the root node and each new group is partitioned into two equal-sized groups (each group with n/4 keys). This partitioning is repeated recursively until there are n nodes each with one key.
To query for a key, the algorithm traverses the trie until a single candidate key is pointed to at a leaf node. When the single key is found, only one comparison (that is, a comparison of the queried key and the candidate key) is needed to decide whether the queried key is actually the same as the candidate key. Based on the TriBiCa framework, a low-cost high-speed DPI architecture that requires a single commodity FPGA to do inspection at 10-Gbps has been proposed. (See, e.g., N. S. Artan, R. Ghosh, Y. Guo, and H. J. Chao, “A 10-Gbps High-Speed Single-Chip Network Intrusion Detection and Prevention System,” in 50th Annual IEEE Global Communications Conference, GLOBECOM 2007, Washington, D.C., November 2007.) However, it would be useful to make the trie data structures described in the Generating a Hierarchical Data Structure Associated With a Plurality of Known Arbitrary-Length BIT Strings Used For Detecting Whether An Arbitrary-Length Bit String Input Matches One of a Plurality Of Known Arbitrary-Length Bit Strings application more space efficient.
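For purposes of illustration only, the recursive equal-sized partitioning and the single-candidate query described above may be sketched in Python as follows. The sketch assumes that n is a power of two and that the keys are distinct. The source does not state here how a node decides which child a key belongs to, so the per-node branching rule used below (one bit of a seeded SHA-256 hash, with the seed found by trial) is an assumption of this sketch and is not the bitmap mechanism of the referenced application. The names branch_bit, build and query are hypothetical.

```python
import hashlib


def branch_bit(key: bytes, seed: int) -> int:
    """Assumed branching rule: one bit of a seeded hash (0 = left, 1 = right)."""
    digest = hashlib.sha256(seed.to_bytes(4, "little") + key).digest()
    return digest[0] & 1


def build(keys, max_seed=100_000):
    """Split the keys into equal halves recursively, down to one key per leaf."""
    n = len(keys)
    if n == 0 or n & (n - 1) or len(set(keys)) != n:
        raise ValueError("sketch assumes a power-of-two number of distinct keys")
    if n == 1:
        return {"key": keys[0]}
    for seed in range(max_seed):
        left = [k for k in keys if branch_bit(k, seed) == 0]
        if len(left) == n // 2:  # this seed yields two equal-sized groups
            right = [k for k in keys if branch_bit(k, seed) == 1]
            return {"seed": seed, "children": (build(left), build(right))}
    raise RuntimeError("no balancing seed found within max_seed")


def query(root, item: bytes):
    """Walk to one leaf, then make the single verifying comparison."""
    node, index = root, 0
    while "children" in node:
        bit = branch_bit(item, node["seed"])
        index = 2 * index + bit  # the path taken doubles as a slot number
        node = node["children"][bit]
    return index if node["key"] == item else None


keys = [b"EVIL", b"WORM", b"HACK", b"ROOT", b"SCAN", b"DDOS", b"BOTS", b"SPAM"]
root = build(keys)
print(sorted(query(root, k) for k in keys))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(query(root, b"SAFE"))                  # None: the one candidate differs
```

In this sketch, the sequence of branch decisions maps the n keys onto the n leaf positions 0 . . . n−1 without collisions, and a query for a non-member still reaches exactly one leaf and is rejected by the single comparison there.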
Providing a low-cost and space-efficient MPHF that is simple to construct and suitable for high-speed hardware implementation is desired.