1. Field of the Invention
The present invention relates to non-deterministic finite state machines. More particularly, the present invention is related to hardware implementations for non-deterministic finite state machines for simulating processes, such as regular expression matching.
2. Description of the Related Art
Regular expression matching is used in network intrusion detection systems and in information extraction systems. Regular expression matching is computationally challenging and requires high computing power.
A typical approach to regular expression matching is to apply the input to a finite state machine representation of the regular expression. A regular expression can be converted into a non-deterministic finite state machine or a deterministic finite state machine using well-established techniques, as e.g. known from J. E. Hopcroft, R. Motwani and J. D. Ullman, “INTRODUCTION TO AUTOMATA THEORY, LANGUAGES AND COMPUTATION”, Addison-Wesley, 2000.
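As a purely illustrative software sketch of applying an input to a non-deterministic finite state machine, the following keeps a set of active states and advances it one character at a time. The regular expression "ab*c", the hand-built transition table and the names TRANSITIONS and match_end_offsets are assumptions chosen for this example and are not taken from the cited references.

```python
# Hypothetical hand-built NFA for the regular expression "ab*c".
# States: 0 = start, 1 = after 'a' (loops on 'b'), 2 = accepting.
TRANSITIONS = {
    (0, "a"): {1},
    (1, "b"): {1},
    (1, "c"): {2},
}
START, ACCEPT = 0, 2


def match_end_offsets(text):
    """Return the end offsets (index of the last matched character) of all matches."""
    active = set()
    ends = []
    for offset, ch in enumerate(text):
        # Re-arm the start state at every position so a match may begin anywhere.
        active.add(START)
        nxt = set()
        for state in active:
            nxt |= TRANSITIONS.get((state, ch), set())
        active = nxt
        if ACCEPT in active:
            ends.append(offset)
    return ends


print(match_end_offsets("xabbcac"))
```

Note that the sketch, like a simple match flag, yields only the end offset of each match and carries no information about where the match began.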
Furthermore, efficient hardware architectures for programmable deterministic finite state machines are available, as e.g. known from J. van Lunteren, “HIGH-PERFORMANCE PATTERN-MATCHING FOR INTRUSION DETECTION”, Proceedings INFOCOM 2006, pages 1 to 13, 2006, or F. Yu, Z. Chen, Y. Diao, T. V. Lakshman and R. H. Katz, “FAST AND MEMORY-EFFICIENT REGULAR EXPRESSION MATCHING FOR DEEP PACKET INSPECTION”, ANCS 2006, pages 93 to 102, ACM 2006.
Hardware architectures based on reconfigurable non-deterministic finite state machines are also available, as e.g. known from R. Sidhu, V. K. Prasanna, “FAST REGULAR EXPRESSION MATCHING USING FPGAS”, Proceedings FCCM 2001, pages 227 to 238.
However, the above exemplary approaches are severely limited because they usually cannot support start offset reporting and capturing groups, both of which are essential in information extraction systems. Existing hardware architectures simply set a flag in case of a regular expression match, which only reveals the end offset of the match in the input stream. The start offset can then be calculated from the end offset only if the regular expression has a fixed length. Regular expressions, however, often include one or more placeholders for none, one or a plurality of characters, so that the overall length of a match is not known in advance.
A naïve approach for start offset reporting is to record the start offset each time the first character or a prefix of a regular expression is matched in the input stream. However, this is problematic if the first character or the prefix appears multiple times in the regular expression, creating overlaps: multiple start offset values must then be stored at any time by the regular expression matcher, and the stored start offset values must eventually be associated with different end offsets.
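The overlap problem can be illustrated with a small software sketch. The regular expression "a[ab]*c", the input "aabac" and both function names are assumptions chosen for this example; Python's re module stands in for the matcher only to keep the sketch short. The first function overwrites a single start register each time the first character 'a' is seen, while the second enumerates every (start offset, end offset) pair that actually matches.

```python
import re

# Illustrative regex: its first character 'a' can recur inside the match body.
PATTERN = re.compile(r"a[ab]*c")


def naive_single_register(text):
    """Keep one start register, overwritten on every 'a'; report (start, end) pairs."""
    start_reg = None
    results = []
    for offset, ch in enumerate(text):
        if ch == "a":
            start_reg = offset  # any earlier start offset is lost here
        if start_reg is not None and PATTERN.fullmatch(text, start_reg, offset + 1):
            results.append((start_reg, offset))
    return results


def all_start_end_pairs(text):
    """Enumerate every (start, end) pair whose substring matches the pattern."""
    results = []
    for end in range(len(text)):
        for start in range(end + 1):
            if PATTERN.fullmatch(text, start, end + 1):
                results.append((start, end))
    return results


print(naive_single_register("aabac"))  # only the most recent start survives
print(all_start_end_pairs("aabac"))    # three start offsets share one end offset
```

With a single register only the pair (3, 4) is reported, whereas the start offsets 0, 1 and 3 all belong to matches ending at offset 4, so several start offsets would have to be held simultaneously.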
U.S. Pat. No. 8,190,738 B2 discloses a system and a method for hardware processing of regular expressions. State information associated with one or more states of a state machine is stored in respective memory locations of the memory, wherein the state machine is configured to detect patterns in an input data stream. State information, such as transitions and spin counts, is updated as characters of an input data stream are processed. A crossbar is used to interconnect the states stored in the register bank. However, such a crossbar can be very expensive to implement because the number of states in a non-deterministic finite state machine grows linearly with the number of characters in the associated regular expression.
U.S. Pat. No. 8,051,085 B1 discloses a method and an apparatus for determining the length of one or more substrings of an input string that matches a regular expression. The input string is searched for the regular expression using a non-deterministic finite state machine and, upon detecting a match state, a selected portion of the input string is marked as a match string. The non-deterministic finite state machine is inverted, so that it embodies the inverse of the regular expression. The match string is also reversed and searched for the inverted regular expression using the inverted non-deterministic finite state machine. A counter is incremented for each character processed during the reverse search operation. The current value of the counter each time the match state in the inverted non-deterministic finite state machine is reached indicates the character length of a corresponding substring that matches the regular expression. A disadvantage of such an approach is that the input string has to be scanned twice for each regular expression match, which can significantly reduce the processing rate.
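The two-pass idea described above can be sketched in software as follows. This is only an approximation under stated assumptions: the regular expression "ab*c", its hand-written inverse "cb*a" and the function names are hypothetical, and Python's re module replaces the inverted non-deterministic finite state machine of the cited patent. A counter advances once per character of the reversed match string, and its value is recorded each time the inverted expression matches.

```python
import re

# Hand-written inverse of the illustrative forward expression "ab*c".
REVERSED = re.compile(r"cb*a")


def match_lengths(text, end):
    """Scan text[0..end] backwards and return the lengths of all matches ending at end."""
    reversed_text = text[end::-1]  # characters from offset end down to offset 0
    lengths = []
    for counter in range(1, len(reversed_text) + 1):
        # counter = number of characters processed so far in the reverse scan
        if REVERSED.fullmatch(reversed_text, 0, counter):
            lengths.append(counter)
    return lengths


def start_offsets(text, end):
    """Convert each match length into a start offset in the original input."""
    return [end - length + 1 for length in match_lengths(text, end)]


print(match_lengths("xabbc", 4))
print(start_offsets("xabbc", 4))
```

The second scan revisits characters that the forward pass has already consumed, which is the source of the reduced processing rate noted above.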
United States Patent Application Publication Number US 2011/0093496 A1 discloses a method for determining whether an input string matches at least one regular expression. Each of the at least one regular expressions is checked for a match between the input string accepted and the given regular expression using the configured nodes of the state machine corresponding to the given regular expression. This includes checking detection events from a simple string detector, submitting queries to identified modules of a variable string detector, and receiving detection events from the identified modules of the variable string detector.
Document A. Majumder, R. Rastogi, S. Vanama, “SCALABLE REGULAR EXPRESSION MATCHING ON DATA STREAMS”, SIGMOD '08, Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Pages 161-172, discloses a regular expression matching system. The system combines the processing efficiency of deterministic finite state machines with the space efficiency of non-deterministic finite state machines to scale to hundreds of regular expressions. This is achieved by caching only the frequent core of each deterministic finite state machine in memory, as opposed to the entire deterministic finite state machine. The regular expressions are clustered such that regular expressions whose interactions cause an exponential increase in the number of states are assigned to separate groups.
Document H. Nakahara et al., “A REGULAR EXPRESSION MATCHING USING NON-DETERMINISTIC FINITE AUTOMATON”, International Conference on Formal Methods and Models for Codesign (MEMOCODE), 2010, 8th IEEE/ACM, 26-28 Jul. 2010, Page(s): 73-76, discloses an implementation of CANSCID (Combined Architecture for Stream Categorization and Intrusion Detection). To satisfy the required system throughput, the packet assembler and the regular expression matching are implemented in hardware, while the counting of matching results and the system control are implemented by a microprocessor. A regular expression matching circuit is obtained by converting the given regular expressions into a non-deterministic finite state machine and by reducing the number of states. Finally, a finite-input memory machine to detect p-characters is generated, and the matching elements realizing the states are generated.