1. Field of the Invention
This invention is directed to finite state automata and using such finite state automata to recognize patterns in data.
2. Related Art
Deep packet inspection is becoming prevalent for modern networking devices as they inspect packet payloads for a variety of reasons. Payloads are matched against signatures of vulnerabilities to detect or block intrusions. Application recognition, used for accounting and traffic policing, also increasingly relies on inspecting packet contents. Often, load balancing is done based on application layer protocol fields. Implementing such deep packet inspection at the high speeds of today's networks is one of the hardest challenges facing the manufacturers of network equipment. One type of deep packet inspection operation which is critical for intrusion detection and/or prevention systems (IPSes) is signature matching.
The move from simple string-based signatures to more complex regular expressions has turned signature matching into one of the most daunting deep packet inspection challenges. To achieve good matching throughput, IPSes typically represent the signatures as deterministic finite automata (DFAs) or nondeterministic finite automata (NFAs). Both types of automata allow multiple signatures to be matched in a single pass. However, DFAs and NFAs present different memory vs. performance tradeoffs. DFAs are fast and large, whereas NFAs are smaller but slower. Measurements by the inventors confirm that DFAs recognizing signature sets currently in use exceed feasible memory sizes, while NFAs are typically hundreds of times slower than DFAs.
Combining the DFAs for each of the individual signatures into a single DFA usable to recognize a signature set often results in a state space explosion. For example, the size of the minimal DFA that recognizes multiple signatures of the type “.*a_i.*b_i”, where “a_i” and “b_i” are distinct strings, is exponential in the number of signatures. Less memory suffices when using multiple DFAs because the extent of the state space explosion is smaller. NFAs avoid state space explosion entirely. However, they are less efficient because, for every input byte, the NFA has to track a set of automaton states.
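The growth described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each signature uses single characters in place of the strings, so that each individual DFA has three states. The function names and the example signatures are hypothetical. The sketch counts the reachable states of the combined (product) automaton, which is not necessarily the minimal DFA, and prints that count next to the total number of states of the separate DFAs.

```python
def step(progress, a, b, ch):
    """Advance one signature's 3-state DFA.

    progress 0 = nothing seen, 1 = a seen, 2 = a followed later by b seen.
    """
    if progress == 0 and ch == a:
        return 1
    if progress == 1 and ch == b:
        return 2
    return progress


def count_combined_states(signatures, alphabet):
    """Count reachable states of the product of the per-signature DFAs."""
    start = (0,) * len(signatures)
    seen = {start}
    work = [start]
    while work:
        state = work.pop()
        for ch in alphabet:
            nxt = tuple(
                step(p, a, b, ch) for p, (a, b) in zip(state, signatures)
            )
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return len(seen)


# Hypothetical single-character signatures; "x" stands for any other byte.
signatures = [("a", "b"), ("c", "d"), ("e", "f")]
alphabet = "abcdefx"

for n in range(1, len(signatures) + 1):
    separate = 3 * n  # total states when each signature keeps its own DFA
    combined = count_combined_states(signatures[:n], alphabet)
    print(n, separate, combined)
```

In this simplified setting the separate DFAs grow linearly with the number of signatures, while the combined automaton must remember the progress of every signature at once, so its state count multiplies with each added signature.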
String matching was important for early network intrusion detection systems, as the signatures consisted of simple strings. The Aho-Corasick algorithm, which is disclosed in Alfred V. Aho et al., “Efficient string matching: An aid to bibliographic search”, Communications of the ACM, June 1975, builds a concise automaton that recognizes multiple such signatures in a single pass. The size of this automaton is linear in the total size of the strings. Other conventional software and hardware solutions to the string matching problem have also been proposed. Some of these solutions also support wildcard characters. However, evasion techniques made it necessary to use signatures that cover large classes of attacks but still make fine enough distinctions to eliminate false positives. Consequently, the last few years have seen signature languages evolve from exploit-based signatures that are expressed as strings to richer session signatures and vulnerability-based signatures. These complex modern signatures can no longer be expressed as strings or strings with wildcards. As a result, regular expressions are used instead.
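For illustration, a compact sketch of a multi-string matcher in the style of the Aho-Corasick algorithm is given below. It is a textbook-style rendering, not code from the cited paper, and the function names and example strings are assumptions chosen for the sketch. The automaton is a trie over the strings plus failure links, so its size is linear in the total length of the strings, and the input is scanned once.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output sets."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link of each state
    out = [set()]    # patterns recognized on entering each state
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)

    # Breadth-first pass: states at depth 1 keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]
    return goto, fail, out


def search(text, automaton):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    s = 0
    hits = []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            hits.append((i - len(pat) + 1, pat))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
hits = sorted(search("ushers", automaton))
print(hits)
```

The hits are sorted before printing because the output sets are unordered.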
NFAs represent regular expressions compactly but may require large amounts of matching time, since the matching operation needs to explore multiple paths in the automaton to determine whether the input matches any signatures. In software, this is usually performed via backtracking, which renders the NFA vulnerable to algorithmic complexity attacks, or by maintaining and updating a “frontier” of states. Unfortunately, both of these solutions can be computationally expensive. While hardware-implemented NFAs can parallelize the processing required and thus achieve high speeds, software implementations of NFAs have significant processing costs. One known NFA hardware architecture efficiently updates the set of states during matching.
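The “frontier” approach mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the NFA has no epsilon transitions, it is given as a hypothetical dictionary `delta` mapping (state, symbol) pairs to sets of next states, and the example automaton recognizes inputs containing “ab” over a three-letter alphabet. The per-byte cost is proportional to the size of the frontier, which is the source of the slowdown relative to a DFA.

```python
def nfa_match(delta, start, accepting, data):
    """Return True once any state in the frontier is accepting."""
    frontier = {start}
    if frontier & accepting:
        return True
    for ch in data:
        # Every active state is advanced for every input symbol.
        frontier = {t for s in frontier for t in delta.get((s, ch), ())}
        if frontier & accepting:
            return True
    return False


# Hypothetical NFA for ".*ab" over the alphabet {a, b, c}.
# State 0 loops on every symbol; 0 -a-> 1; 1 -b-> 2 (accepting).
delta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (0, "c"): {0},
    (1, "b"): {2},
}
accepting = {2}

print(nfa_match(delta, 0, accepting, "ccab"))
print(nfa_match(delta, 0, accepting, "ccba"))
```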
DFAs can be efficiently implemented in software. However, DFAs have a state space explosion problem that makes it infeasible to build a DFA that matches all signatures of a complex signature set. On-the-fly determinization has been proposed for matching multiple signatures. This approach maintains a cache of recently visited states and computes transitions to new states as necessary during inspection. This approach has good performance in the average case, but can be subverted by an adversary who can repeatedly invoke the expensive determinization operation.
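A minimal sketch of on-the-fly determinization is given below. It assumes an NFA without epsilon transitions, given as a hypothetical dictionary `delta`, and it uses an unbounded dictionary as the cache, whereas a practical system would bound the cache and evict entries. The class name `LazyDFA` and the `misses` counter are introduced only for this illustration. Each cache miss triggers the expensive subset computation, which is the operation an adversary would try to invoke repeatedly.

```python
class LazyDFA:
    """Determinize an NFA lazily, caching each computed DFA transition."""

    def __init__(self, delta, start, accepting):
        self.delta = delta
        self.start = frozenset([start])
        self.accepting = frozenset(accepting)
        self.cache = {}   # (DFA state, symbol) -> DFA state
        self.misses = 0   # number of transitions computed on demand

    def step(self, state, ch):
        key = (state, ch)
        nxt = self.cache.get(key)
        if nxt is None:
            # Cache miss: compute the subset-construction transition now.
            self.misses += 1
            nxt = frozenset(
                t for s in state for t in self.delta.get((s, ch), ())
            )
            self.cache[key] = nxt
        return nxt

    def matches(self, data):
        state = self.start
        for ch in data:
            state = self.step(state, ch)
            if state & self.accepting:
                return True
        return False


# Hypothetical NFA for ".*ab" over the alphabet {a, b, c}.
delta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (0, "c"): {0},
    (1, "b"): {2},
}
lazy = LazyDFA(delta, 0, {2})
first = lazy.matches("ccab")
misses_after_first = lazy.misses
second = lazy.matches("ccab")
misses_after_second = lazy.misses
print(first, misses_after_first, second, misses_after_second)
```

Scanning the same input a second time is served entirely from the cache, which reflects the good average-case behavior; inputs crafted to reach previously unseen state sets force a miss on each new transition.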
Fang Yu, et al., “Fast and memory-efficient regular expression matching for deep packet inspection”, Technical Report EECS-2005-8, U.C. Berkeley, 2005 (hereafter “Yu”), discloses a solution that offers intermediary tradeoffs by using multiple DFAs to match a signature set. Each of these DFAs recognizes only a subset of the signatures and all DFAs need to be matched against the input. Yu proposes combining signatures into multiple DFAs instead of one DFA, using simple heuristics to determine which signatures should be grouped together. This approach reduces the total memory footprint, but for complex signature sets, the number of resulting DFAs can itself be large. Additionally, this approach results in increased inspection time, since payloads must now be scanned by multiple DFAs.
The D²FA technique disclosed in S. Kumar, et al., “Algorithms to accelerate multiple regular expressions matching for deep packet inspection”, Proceedings of the ACM SIGCOMM, September 2006, addresses memory consumption by reducing the memory footprint of individual states. It stores only the difference between transitions from similar states, and relies on some hardware support for fast matching. The technique does not address the state space explosion problem.
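The idea of storing only the differences between similar states can be sketched as follows. This is a simplified illustration, not the construction from the cited paper: the mapping `default_of`, which assigns each state a similar state to fall back on, is assumed to be given, acyclic, and rooted at a state that keeps its full transition row. How that mapping is chosen is outside the scope of the sketch, and all names are hypothetical.

```python
def compress(full, default_of):
    """Keep, for each state, only the transitions that differ from its default."""
    sparse = {}
    for s, row in full.items():
        parent = default_of.get(s)
        if parent is None:
            sparse[s] = dict(row)  # root state keeps every transition
        else:
            sparse[s] = {
                ch: t for ch, t in row.items() if full[parent][ch] != t
            }
    return sparse


def next_state(sparse, default_of, s, ch):
    """Follow default pointers until a stored transition on ch is found.

    Assumes the root of every default chain stores a transition for ch.
    """
    while ch not in sparse[s]:
        s = default_of[s]
    return sparse[s][ch]


# Hypothetical full DFA for ".*ab" over {a, b, c}; state 2 is accepting.
full = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 2, "c": 0},
    2: {"a": 1, "b": 0, "c": 0},
}
default_of = {0: None, 1: 0, 2: 0}

sparse = compress(full, default_of)
stored = sum(len(row) for row in sparse.values())
total = sum(len(row) for row in full.values())
print(stored, total)
```

The lookup may traverse several default pointers per input symbol, which is the cost traded for the smaller per-state footprint.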
Other extensions to automata have been proposed. In the context of XML parsing, Michael Sperberg-McQueen, “Notes on finite state automata with counter”, www.w3.org/XML/2004/05/msm-cfa.html, (available at least as early as May 2004), discloses counter-extended finite automata (CFAs). CFAs are nondeterministic and have the same performance problems as NFAs. Extended Finite State Automata (EFSAs), which extend a traditional finite state automaton with the ability to assign and examine the values of a finite set of variables, have been used in the context of information security. EFSAs have been used to monitor a sequence of system calls, and to detect deviations from expected protocol state for VoIP applications. Extensions such as EFSAs fundamentally broaden the language recognized by finite state automata; EFSAs correspond to regular expressions for events (REEs).