Continuing trends in computing and communications lead to the emergence of environments that abound in content analytics and processing. Exemplary fields that typically require such high performance content analytics and processing include content-aware networking, content-based security systems, surveillance, distributed computing, wireless communication, information storage and retrieval systems, and many others.
The computer and communications environments used in such fields will require high levels of content analysis and processing. Such environments will need efficient and programmable solutions for stateful and contextual inspection, searching, lexical analysis, parsing, characterization, interpretation, filtering, and transformation of content in messages, documents, or packets. Central to these content processing functions is the ability to efficiently evaluate state machines against an input data stream.
State machines, which are central to the theory of computation, are formal models that, in their simplest formulation, consist of states, transitions among states, and an input representation. In the 1950s, the regular expression (RE) was developed by Kleene as a formal notation to describe and characterize sets of strings. The finite state automaton was developed as a state machine model that was found to be equivalent to the RE. Non-deterministic finite automata (NFA) were developed and shown to be equivalent to deterministic finite automata (DFA). Subsequent work by Thompson and others led to a body of algorithms for constructing finite state automata to evaluate an RE. A large number of references are available for descriptions of Regular Expressions and Finite State Automata. For a reference text on the material, see “Speech and Language Processing” (by Daniel Jurafsky and James H. Martin, Prentice-Hall Inc, 2000). The RE has evolved into a powerful tool for pattern matching and recognition, and the finite state automaton has become the standard technique to implement a machine to evaluate it.
State machine and finite state automata processing is typically performed in one of three ways. First, such processing has been performed by implementing a fixed and chosen state machine that is known a priori. This may be effected using a fixed application specific integrated circuit (ASIC) solution. This approach can increase performance, but lacks programmability. Moreover, the expense of such implementation is often prohibitive.
Second, state machines may be realized in a programmable manner using Field Programmable Gate Arrays (FPGA). The FPGA architecture provides generalized programmable logic that can be configured for a broad range of applications. However, this approach can only accommodate a small number of state machines on a chip and the rate at which the evaluation can progress is limited. Therefore, this approach is inadequate for the broad range of emerging applications.
Third, a variety of state machines may be implemented using conventional general-purpose microprocessors. Because microprocessors are fully programmable, this approach is able to address evolving requirements, but microprocessors have several limitations in regard to evaluating state machines.
FIG. 1(a) illustrates the limitations of the microprocessor-based approach when implementing a Finite State Automaton (FSA). Two implementation options exist: the Deterministic Finite State Automaton (DFA) approach and the Non-deterministic Finite State Automaton (NFA) approach. The two approaches are compared on their ability to implement an R-character RE and evaluate it against N bytes of an input data stream. In either approach, the RE is mapped into a state machine or finite state automaton with a certain number of states. The amount of storage required to accommodate these states is one metric used to evaluate a microprocessor-based solution. A second metric is the total time needed to evaluate the N-byte input data stream.
For the DFA approach, the bound on the storage required for an R-character RE is 2^R. Hence, a very large amount of storage may be required to accommodate the states. A DFA is typically implemented by building a state transition table in memory, and having the microprocessor sequence through the table as it progressively evaluates the input data. The large size of the state transition table renders the cache subsystem in typical commercial microprocessors ineffective and requires that the microprocessor access external memory to look up the table on every fresh byte of the input data in order to determine the next state. Thus, the rate at which the state machine can evaluate input data is limited by the memory access loop. This is illustrated in FIG. 1(b). For N bytes of input stream, the time taken to evaluate the state machine is proportional to N accesses of memory. Typical systems have memory access latencies of approximately 100 nanoseconds (ns). At one byte per access, this limits the data rate that can be evaluated against the state machine to approximately 100 Mbps.
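As a rough illustration of the table-driven loop described above, the following Python sketch steps a small DFA through an input one byte at a time. The pattern (inputs containing the substring "ab"), the dense table layout, and the names build_table, run_dfa, and data_rate_mbps are assumptions chosen for illustration; they are not taken from the figures. The last helper reproduces the throughput arithmetic under the assumption of one table lookup per input byte.

```python
# Hypothetical example: a DFA that accepts inputs containing the bytes b"ab".
# States: 0 = nothing matched, 1 = just saw 'a', 2 = saw "ab" (accepting).

def build_table():
    """Return a dense transition table: table[state][byte] -> next state."""
    table = [[0] * 256 for _ in range(3)]
    for byte in range(256):
        table[0][byte] = 1 if byte == ord("a") else 0
        if byte == ord("b"):
            table[1][byte] = 2
        elif byte == ord("a"):
            table[1][byte] = 1
        else:
            table[1][byte] = 0
        table[2][byte] = 2  # once matched, stay in the accepting state
    return table


def run_dfa(table, data, accepting=frozenset({2})):
    """Evaluate the DFA with exactly one table lookup per input byte."""
    state = 0
    for byte in data:
        state = table[state][byte]
    return state in accepting


def data_rate_mbps(access_ns):
    """Sustained rate in Mbps when each input byte costs one memory access."""
    return 8 * 1000 / access_ns  # 8 bits per access_ns nanoseconds


if __name__ == "__main__":
    table = build_table()
    print(run_dfa(table, b"xxabyy"))
    print(data_rate_mbps(100))
```

With a 100 ns access time, data_rate_mbps(100) evaluates to 80 Mbps, which is consistent with the approximate 100 Mbps figure given above.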
To evaluate multiple REs in parallel, one option is to implement the REs in distinct tables in memory, with the microprocessor sequentially evaluating them one after the other. For K parallel REs, the evaluation time would be approximately K*N*100 ns, while the bound on storage would grow to K*2^R. Another alternative is to compile all of the REs into a single DFA and have the microprocessor sequence through the table in a single pass. For K parallel REs, the bound on storage would grow to 2^(K*R), while the evaluation time would remain N*100 ns. The storage needed for such an approach could be prohibitive. To implement a few thousand REs, the storage needed could exceed the physical limits of memory for typical commercial systems.
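The trade-off between the two multi-RE options can be made concrete with a short calculation. The sketch below simply evaluates the bounds stated above; the function names and the sample values of K, R, and N are illustrative assumptions, and the storage figures are upper bounds on state count, not measured table sizes.

```python
ACCESS_NS = 100  # memory access latency per table lookup, as assumed in the text


def separate_dfas(k, r, n):
    """K distinct tables, evaluated one after another over N input bytes."""
    storage_bound = k * 2 ** r      # bound on states: K * 2^R
    time_ns = k * n * ACCESS_NS     # K passes over the input
    return storage_bound, time_ns


def combined_dfa(k, r, n):
    """All K REs compiled into a single DFA, evaluated in one pass."""
    storage_bound = 2 ** (k * r)    # bound on states: 2^(K*R)
    time_ns = n * ACCESS_NS         # a single pass over the input
    return storage_bound, time_ns


if __name__ == "__main__":
    # Hypothetical sample: 4 REs of 8 characters each, 1000 input bytes.
    print(separate_dfas(4, 8, 1000))
    print(combined_dfa(4, 8, 1000))
```

For this sample, the separate tables are bounded by 1,024 states and take 400,000 ns, while the combined DFA takes 100,000 ns but is bounded by 4,294,967,296 states.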
For the NFA approach, the bound on the storage required for an R-character RE is proportional to R. Hence, storage is not a concern. However, in an NFA, multiple nodes could make independent state transitions simultaneously, each based on independent evaluation criteria. Given that the microprocessor is a scalar engine, which can execute a single thread of control in sequential order, the multiple state transitions of an NFA require that the microprocessor iterate through the evaluation of each state sequentially. Hence, for every input byte of data, the evaluation has to be repeated R times. Given that the storage requirements for the scheme are modest, all the processing could be localized to on-chip resources, thus remaining free of the memory bottleneck. Each state transition computation is accomplished with on-chip evaluation whose performance is limited by the latency of accessing data from the cache and the latency of branching. Since typical microprocessors are highly pipelined, the performance penalty incurred due to branching is significant. For example, assuming a 16-cycle loop for a typical commercial microprocessor running at 4 GHz, the evaluation of a single state transition could take on the order of 4 ns. Thus, evaluating an N-byte input stream against an R-state NFA for an R-character RE would require N*R*4 ns. For K parallel REs, the microprocessor would sequence through each, taking K*N*R*4 ns. So, for just 4 parallel REs with 8 states each, the data rate would again be limited to approximately 100 Mbps. These examples indicate that typical conventional microprocessors can deliver programmable state machine evaluation on input data rates of approximately 100 Mbps. However, in the short term, data rates of between 1 Gbps and 10 Gbps will not be uncommon in enterprise networks and environments.
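The sequential per-state iteration described above can be sketched as follows. To keep the example short, it assumes the RE is a plain literal of R bytes, so the NFA has one state per pattern byte; this simplification, and the names run_nfa and nfa_rate_mbps, are illustrative assumptions and not a general RE-to-NFA construction. The step counter shows the R evaluations performed for every input byte.

```python
def run_nfa(pattern, data):
    """Search `data` for the literal `pattern` with an R-state NFA.

    Returns (matched, steps), where steps counts state evaluations.
    """
    r = len(pattern)
    if r == 0:
        return True, 0
    # active[i] is True when the first i+1 pattern bytes match, ending here.
    active = [False] * r
    steps = 0
    for byte in data:
        nxt = [False] * r
        for i in range(r):  # a scalar CPU visits each state in turn
            steps += 1
            prev_active = True if i == 0 else active[i - 1]
            nxt[i] = prev_active and byte == pattern[i]
        active = nxt
        if active[r - 1]:
            return True, steps
    return False, steps


def nfa_rate_mbps(k, r, step_ns=4):
    """Sustained rate in Mbps for K REs of R states at step_ns per transition."""
    ns_per_byte = k * r * step_ns
    return 8 * 1000 / ns_per_byte


if __name__ == "__main__":
    print(run_nfa(b"abc", b"xxabcx"))
    print(nfa_rate_mbps(4, 8))
```

For 4 REs of 8 states each at 4 ns per transition, nfa_rate_mbps(4, 8) evaluates to 62.5 Mbps, the same order as the approximately 100 Mbps cited above.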
While it may be possible to employ multiple parallel microprocessors to execute some of the desired functions at such rates, such an approach would greatly increase system costs.
Taking the conventional microprocessor of 2003 or 2004 as the reference point, there is a severe mismatch of one to two orders of magnitude between the roughly 100 Mbps of programmable state machine evaluation that such a microprocessor can deliver and the 1 Gbps to 10 Gbps demanded by the environment. There is clearly a need for a more efficient solution for these target functions.