1. Field of the Invention
The invention relates generally to methods and systems for performing pattern matching on digital data. More particularly, the invention involves systems and methods for processing regular expressions.
2. Description of the Related Art
With the maturation of computer and networking technology, the volume and types of data transmitted on the various networks have grown considerably. For example, symbols in various formats may be used to represent data. These symbols may be in textual forms, such as ASCII (American Standard Code for Information Interchange), EBCDIC (Extended Binary Coded Decimal Interchange Code), the fifteen ISO 8859, 8 bit character sets, or Unicode multi-byte character encodings such as, UTF-8 or UTF-16, for example. Data may also be stored and transmitted in specialized binary formats representing executable code, sound, images, and video, for example.
Along with the growth in the volume and types of data used in network communications, a need to process, understand, and transform the data has also increased. For example, the World Wide Web and the Internet comprise thousands of gateways, routers, switches, bridges, and hubs that interconnect millions of computers. Information is exchanged using numerous high level protocols like SMTP (Simple Mail Transfer Protocol), MIME (Multipurpose Internet Mail Extensions), HTTP (Hyper Text Transfer protocol), and FTP (File Transfer Protocol) on top of low level protocols like TCP (Transport Control Protocol), UDP (User Datagram Protocol), IP (Internet Protocol), MAP (Manufacturing Automation Protocol), and TOP (Technical and Office Protocol). The documents transported are represented using standards like RTF (Rich Text Format), HTML (Hyper Text Markup Language), XML (eXtensible Markup Language), and SGML (Standard Generalized Markup Language). These standards may further include instructions in other programming languages. For example, HTML may include the use of scripting languages like Java and Visual Basic.
As information is transported across a network, there are many points at which some of the information may be interpreted to make routing decisions. To reduce the complexity of making routing decisions, many protocols organize the information to be sent into a protocol specific header and an unrestricted payload. At the lowest level, it is common to subdivide the payload into packets and provide each packet with a header. In such a case (e.g., TCP/IP), the routing information required is at fixed locations, where relatively simple hardware can quickly find and interpret it. Because these routing operations are expected to occur at wire speeds, simplicity in determining the routing information is preferred. However, as discussed further below, a number of factors have increased the need to look more deeply inside packets to assess the contents of the payload in determining characteristics of the data, such as routing information.
Today's Internet is rife with security threats that take the form of viruses and denial of service attacks, for example. Furthermore, there is much unwanted incoming information sent in the form of SPAM and undesired outgoing information containing corporate secrets. There is undesired access to pornographic and sports web sites from inside companies and other organizations. In large web server installations, there is the need to load balance traffic based on content of the individual communications. These trends, and others, drive demand for more sophisticated processing at various points in the network and at server front ends at wire speeds and near wire speeds. These demands have given rise to anti-virus, intrusion detection and prevention, and content filtering technologies. At their core, these technologies depend on pattern matching. For example, anti-virus applications look for fragments of executable code and Java and Visual Basic scripts that correspond uniquely to previously captured viruses. Similarly, content filtering applications look for a threshold number of words that match keywords on lists representative of the type of content (e.g., SPAM) to be identified. In like manner, enforcement of restricted access to web sites is accomplished by checking the URL (Universal Resource Locator) identified in the HTTP header against a forbidden list.
Once the information arrives at a server, having survived all the routing, processing, and filtering that may have occurred in the network, it is typically further processed. This further processing may occur all at once when the information arrives, as in the case of a web server. Alternatively, this further processing may occur at stages, with a first one or more stages removing some layers of protocol with one or more intermediate forms being stored on disk, for example. Later stages may also process the information when the original payload is retrieved, as with an e-mail server, for example.
In the information processing examples cited above, the need for high speed processing becomes increasingly important due to the need to complete the processing in a network and also because of the volume of information that must be processed within a given time.
Regular expressions are well known in the prior art and have been in use for some time for pattern matching and lexical analysis. An early example of their use is disclosed by K. L. Thompson in U.S. Pat. No. 3,568,156, issued Mar. 2, 1971, which is hereby incorporated by reference in its entirety.
Regular expressions typically comprise terms and operators. A term may include a single symbol or multiple symbols combined with operators. Terms may also be recursive, so a single term may include multiple terms combined by operators. In dealing with regular expressions, three operations are defined, namely, juxtaposition, disjunction, and closure. In more modern terms, these operations are referred to as concatenation, selection, and repetition, respectively. Concatenation is implicit, one term is followed by another. Selection is represented by the logical OR operator which may be signified by a symbol, such as ‘|’. When using the selection operator, either term to which the operator applies will satisfy the expression. Repetition is represented by ‘*’ which is often referred to as a Kleene star. The Kleene star, or other repetition operator, specifies zero or more occurrences of the term upon which it operates. Parentheses may also be used with regular expressions to group terms.
In order to implement a regular expression in software or hardware, a regular expression may be converted to a Deterministic Finite-state Automata (“DFA”), which is a finite state machine that defines state transitions for processing a regular expression, where for each pair of current state and input symbol there is one and only one transition to a next state. Because a DFA includes only one possible transition for any combination of current state and input symbol, DFAs are desirable for implementation of regular expression functionality in software or hardware. However, as described in further detail below, when certain regular expressions are converted to corresponding DFAs, the size of the DFAs are very large, to the point that some regular expression functions produce DFAs that are too large to be executed on existing computing systems. Accordingly, improved systems and methods for processing regular expressions that include such functions are desired.
In order to convert a regular expression to a DFA, the regular expression is typically first converted to a nondeterministic finite-state automata (“NFA”), which is a finite state machine where for each pair of current state and input symbol there may be several possible next states. Thus, NFAs involve a searching process in which many trials fail, and backtracking is required for each failure. Accordingly, NFAs are typically not suitable for high speed applications. There are, however, regular expression parsers that work using NFAs, but they are generally slower than those that use DFAs due to the fact that a DFA has a unique path for each accepting string. Backtracking in DFAs is minimal and only occurs when nothing matches or as a result of trying to match the longest string possible. In the latter case, if a longer match fails, the input text that is not included in a shorter match is reprocessed. As those of skill in the art will recognize, after converting a regular expression to a NFA, the NFA may be converted to a DFA and implemented in software or hardware. Thus, the process of converting a regular expression to a DFA typically involves a two step process including converting the regular expression to a NFA and then converting the NFA to a DFA.