With the continued proliferation of networked and distributed computer systems, and applications that run on those systems, comes an ever increasing flow and variety of message traffic between and among computer devices. As an example, the Internet and World Wide Web (the “Web”) provide a global open access means for exchanging message traffic. Networked and/or distributed systems include a wide variety of communication links, network and application servers, sub-networks, and internetworking elements, such as repeaters, switches, bridges, routers and gateways.
Communications between and among devices occur in accordance with defined communication protocols understood by the communicating devices. Such protocols may be proprietary or non-proprietary. Examples of non-proprietary protocols include X.25 for packet switched data networks (PSDNs), TCP/IP for the Internet, a manufacturing automation protocol (MAP), and a technical & office protocol (TOP). Proprietary protocols may be defined as well. For the most part, messages are composed of packets, each containing a certain number of bytes of information. The most common example is Internet Protocol (IP) packets, used among various Web and Internet enabled devices.
A primary function of many network servers and other network devices (or nodes), such as switches, gateways, routers, load balancers and so on, is to direct or process messages as a function of content within the messages' packets. In a simple, rigid form, a receiving node (e.g., a switch) knows exactly where in the message (or its packets) to find a predetermined type of contents (e.g., IP address), as a function of the protocol used. Typically, hardware devices such as switches and routers are only able to perform their functions based on fixed-position headers, such as TCP or IP headers. No deep packet examination is done.
Software that is not capable of operating at wire speed is sometimes used for packet payload examination. This software does not typically allow great flexibility in specification of pattern matching and operates at speeds orders of magnitude slower than wire rate. It is highly desirable to allow examination and recognition of patterns, described by regular expressions, in both the packet header and the packet payload. For example, such packet content may include address information or file type information, either of which may be useful in determining how to direct or process the message and/or its contents. The content may be described by a “regular expression”, i.e., a sequence of characters that often conforms to certain expression paradigms. As used herein, the term “regular expression” is not limited to any particular language or operating system and it is used in a broad sense. A regular expression may be written in any of a variety of codes or languages known in the art, e.g., Perl, Python, Tcl, grep, awk, sed, egrep or POSIX expressions. Regular expressions may be better understood with reference to Mastering Regular Expressions, J. E. F. Friedl, O'Reilly, Cambridge, 1997.
The ability to match regular expressions would be useful for content based routing. For matching regular expressions, a deterministic finite automaton (DFA) or non-deterministic finite automaton (NFA) could be used. The approach used by the present invention follows a DFA approach. A conventional DFA requires creation of a state machine prior to its use on a data (or character) stream.
Generally, a DFA processes an input character stream sequentially and makes a state transition based on the current character and current state. This is a brute-force, single byte at a time, conventional approach. By definition, a DFA transition to a next state is unique, based on current state and input character. For example, in prior art FIG. 1A, a DFA state machine 100 is shown that implements a regular expression “binky.*\.jpg”. DFA state machine 100 includes states 0 through 9, wherein the occurrence of the characters 110 of the regular expression effect the iterative transition from state to state through DFA state machine 100. The start state of the DFA state machine is denoted by the double line circle having the state number “0”. An ‘accepting’ state indicating a successful match is denoted by the double line circle having the state number “9”. As an example, to transition from state 0 to state 1, the character “b” must be found in the character stream. Given “b”, to transition from state 1 to state 2, the next character must be “i”.
Not shown explicitly in FIG. 1A are transitions when the input character does not match the character needed to transition to the next state. For example, if the DFA gets to state 1 and the next character is an “x”, then failure has occurred and transition to a failure state occurs. FIG. 1B shows part 150 of FIG. 1A drawn with failure state transitions, wherein the failure state is indicated by the “Fail” state. In FIG. 1B, the tilde indicates “not”. For example, the symbol “~b” means the current character is “not b”. Once in the failure state, all characters cause a transition which returns to the failure state.
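By way of illustration only, the state machine of FIGS. 1A and 1B may be sketched in software as follows. The sketch is not taken from the figures: the names FAIL, ACCEPT, step and matches are hypothetical, the match is assumed to be anchored at the first character of the stream, the accepting state is assumed to be retained once reached, and the fallback transitions out of states 6 through 8 (return to state 5, or to state 6 on a repeated “.”) are an assumption about how the “.*” portion is resolved.

```python
FAIL = -1          # failure state ("Fail" in FIG. 1B)
ACCEPT = 9         # accepting state
PREFIX = "binky"   # characters consumed by states 0..4
SUFFIX = ".jpg"    # characters consumed by states 5..8


def step(state, ch):
    """Return the unique next state for the current state and one character."""
    if state in (FAIL, ACCEPT):
        # Assumption: both the failure and accepting states absorb all input.
        return state
    if state < len(PREFIX):
        # States 0..4: the literal prefix must match exactly, else fail.
        return state + 1 if ch == PREFIX[state] else FAIL
    # States 5..8: inside ".*\.jpg".
    if ch == SUFFIX[state - len(PREFIX)]:
        return state + 1
    # Assumed fallback: a "." may begin a new suffix; anything else is
    # absorbed by ".*" and returns to state 5.
    return 6 if ch == "." else 5


def matches(stream):
    """Process the stream one character at a time, as a conventional DFA does."""
    state = 0
    for ch in stream:
        state = step(state, ch)
    return state == ACCEPT


print(matches("binky_photo.jpg"))
print(matches("bxnky.jpg"))
```

Each call to step consumes exactly one character, which is the single-byte-at-a-time behavior described above.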
Once in the accepting state, i.e., the character stream matches “binky.*\.jpg”, the receiving node takes the next predetermined action. In this example, where the character stream indicates a certain file type (e.g., “.jpg”), the next predetermined action may be to send the corresponding file to a certain server, processor or system.
While such DFAs are useful, they are limited with respect to speed. The speed of a conventional DFA is limited by the cycle time of memory used in its implementation. For example, a device capable of processing the data stream from an OC-192 source must handle 10 billion bits/second (i.e., 10 gigabits per second, Gbps). This speed implies a byte must be processed every 0.8 nanosecond (ns), which exceeds the limit of current state of the art memory. For comparison, current high speed SDRAM chips implementing a conventional DFA operate with a 7.5 ns cycle time, which is roughly ten times slower than required for OC-192. In addition, more than a single memory reference is typically needed, making these estimates optimistic. As a result, messages or packets must be queued for processing, causing unavoidable delays.
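The timing figures above follow from simple arithmetic, which may be checked with the short sketch below. The function and constant names are hypothetical, and the sketch assumes one memory reference per byte, which the text notes is optimistic.

```python
def byte_time_ns(line_rate_bps):
    """Time available to process one 8-bit byte at the given line rate, in ns."""
    return 8 / line_rate_bps * 1e9


OC192_BPS = 10e9        # 10 Gbps, the figure used in the text
SDRAM_CYCLE_NS = 7.5    # SDRAM cycle time quoted in the text

budget_ns = byte_time_ns(OC192_BPS)
slowdown = SDRAM_CYCLE_NS / budget_ns

print(f"per-byte budget: {budget_ns:.1f} ns")
print(f"memory is about {slowdown:.1f} times too slow")
```

The per-byte budget comes to about 0.8 ns and the ratio to about 9.4, consistent with the order-of-magnitude gap stated above.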
Co-pending application Ser. No. 10/005462 filed Dec. 3, 2001 describes a real time high speed parallel byte pattern recognition system which has relatively low memory storage requirements. The system shown in co-pending application Ser. No. 10/005462 filed Dec. 3, 2001 can be termed a Real-time Deterministic Finite Automaton (hereinafter RDFA). The RDFA is capable of regular expression matching at high speed on characters presented in parallel. The characters may be supplied to the RDFA in serial or parallel; however, the RDFA operates on the characters in parallel. For example, four characters at a time may arrive simultaneously or the four characters may be streamed into a register in the RDFA serially; however, in either case, the RDFA operates on the characters in parallel. In the interest of completeness, the RDFA described in co-pending application Ser. No. 10/005462 filed Dec. 3, 2001 is also described herein.
An RDFA system includes an RDFA compiler subsystem and an RDFA evaluator subsystem. The RDFA compiler generates a set of tables which are used by the RDFA evaluator to perform regular expression matching on an incoming data stream. The present invention is directed to the compiler subsystem which generates the sets of tables.
In the following description the term “n-closure list” means a list of states reachable in n-transitions from the current state. The term “alphabet transition list” means a list of the transitions out of a particular state for each of the characters in an alphabet.
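The two terms defined above may be illustrated with a small sketch. It is not the compiler's actual data structure: the transition table delta, the three-state example machine, and the function names are hypothetical, and “reachable in n-transitions” is assumed here to mean reachable by exactly n transitions.

```python
def alphabet_transition_list(delta, state, alphabet):
    """Next state out of `state` for each character, in alphabet order."""
    return [delta[state][ch] for ch in alphabet]


def n_closure_list(delta, state, alphabet, n):
    """States reachable from `state` by exactly n transitions (assumed reading)."""
    current = {state}
    for _ in range(n):
        # Advance every currently reachable state by one transition
        # on every character of the alphabet.
        current = {delta[s][ch] for s in current for ch in alphabet}
    return sorted(current)


# Hypothetical three-state DFA over the alphabet "ab".
ALPHABET = "ab"
delta = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 2, "b": 2},
}

print(alphabet_transition_list(delta, 0, ALPHABET))
print(n_closure_list(delta, 0, ALPHABET, 2))
```

In this example, state 0 has the alphabet transition list [1, 0], and all three states appear in its 2-closure list.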