With the emergence of All-IP, Fixed and Mobile Convergence (FMC) and Triple-play concepts, the traditional IP network is changing towards a uniform bearer network for all services including data, voice and video.
However, the inherent data transfer mode and open nature of IP networks do not meet the needs of carrier-class services well. Improvements are still needed in network security, manageability, and assurance of Quality of Service (QoS) and Quality of Experience (QoE) for key services. For precise identification and control of certain key services, in addition to the traditional analysis of the quintuplet (source address, destination address, source port, destination port, and protocol) in the packet header, it is also necessary to inspect the packet payload. For example, some Peer-to-Peer (P2P) traffic uses unknown ports, so the traffic class cannot be identified by analyzing the quintuplet alone. Deep Packet Inspection (DPI) is a flexible and effective service identification technology and is widely applied to firewalls, Intrusion Detection Systems/Intrusion Prevention Systems (IDSs/IPSs), and service control gateways to implement application layer load balancing and feature-based security filtering. Unlike previous packet inspection, DPI analyzes not only the protocol headers at layer 4 and below in the TCP/IP model but also the information above layer 4. Therefore, DPI provides richer information, but its processing is also more complex.
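The limitation of header-only classification can be pictured with a minimal Python sketch. The port table, the signature table, and all function names are hypothetical and chosen only for illustration; the one external fact relied on is that a BitTorrent handshake begins with the byte 19 followed by the string "BitTorrent protocol".

```python
from typing import NamedTuple, Optional


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


# Hypothetical port-to-service table used by header-only classification.
WELL_KNOWN_PORTS = {25: "smtp", 53: "dns", 80: "http"}

# Hypothetical payload signature table (byte 0x13 == 19).
PAYLOAD_SIGNATURES = {b"\x13BitTorrent protocol": "bittorrent"}


def classify_by_header(flow: FiveTuple) -> Optional[str]:
    """Classify using the quintuplet only; fails for unknown ports."""
    return WELL_KNOWN_PORTS.get(flow.dst_port)


def classify_by_payload(payload: bytes) -> Optional[str]:
    """Classify by searching the payload for a known signature."""
    for signature, service in PAYLOAD_SIGNATURES.items():
        if signature in payload:
            return service
    return None


def classify(flow: FiveTuple, payload: bytes) -> Optional[str]:
    """Try the cheap header check first, then fall back to the payload."""
    return classify_by_header(flow) or classify_by_payload(payload)
```

A flow on an arbitrary port is invisible to the header check but is still recognized once the payload is inspected.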
A traditional DPI system compares the packet payload with a preset set of character strings to judge whether the packet has specific features. In recent years, more and more systems have used regular expressions instead of character strings to describe packet features. Compared with character strings, regular expressions describe features flexibly, easily and effectively, because one expression can cover a whole family of feature strings. For example, the features described by the character strings b, ab, aab, aaab, and aaaab can be expressed with one simple regular expression a*b. Different regular expression language specifications have different descriptive capabilities. Popular regular expression specifications in the industry include Portable Operating System Interface (POSIX) and Perl Compatible Regular Expressions (PCRE). PCRE includes some extensions that POSIX does not support and therefore its descriptive capability is more powerful. For example, Snort uses PCRE to describe some of its rules. At present, POSIX is used by most devices. Some devices are claimed to support PCRE, but their product information shows that they support only a PCRE subset that is compatible with POSIX; they do not truly support the complex PCRE syntaxes.
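The a*b example can be checked directly. The sketch below uses Python's standard re module, which implements a Perl-style syntax but is not the PCRE library itself; the variable names are illustrative.

```python
import re

# One regular expression replaces the whole list of literal feature strings.
pattern = re.compile(r"a*b")
feature_strings = ["b", "ab", "aab", "aaab", "aaaab"]

# fullmatch requires the entire string to match the expression.
matches = [pattern.fullmatch(s) is not None for s in feature_strings]

# A string outside the family is rejected.
non_member_rejected = pattern.fullmatch("aba") is None
```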
Checking whether the packet content matches rules described by regular expressions is called regular expression matching. The following describes some popular regular expression matching methods.
One method is based on a Deterministic Finite Automaton (DFA), where regular expressions are converted in advance into transition tables described in a certain form. During a matching process, each symbol in a packet is used as an input for querying the transition table so as to determine the next state. The merit of this method is the simple “table query-transition” operation mode, which is convenient to implement in hardware and makes matching quick. The weakness of this method is that it supports only simple regular expression specifications; many extensions in PCRE, including conditional expressions, ^, $ and other location-related symbols, and matching options, are not well supported. In addition, when regular expressions such as .*AB.{j}CD are expressed as a DFA, the number of states increases exponentially with the length of the regular expression, which puts great pressure on storage.
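Both the table lookup and the state growth can be illustrated with a short Python sketch. It is not the construction used by any particular device: it uses a simplified pattern .*A.{j} (anchored at the end of the input) rather than .*AB.{j}CD, a two-symbol alphabet in which x stands for any symbol other than A, and function names chosen for illustration.

```python
ALPHABET = "Ax"  # assumption: 'x' stands for every symbol other than 'A'


def build_dfa(j):
    """Build a DFA transition table for .*A.{j} by subset construction."""

    def nfa_step(states, sym):
        nxt = {0}  # NFA state 0 loops on every symbol (the leading .*)
        for s in states:
            if s == 0 and sym == "A":
                nxt.add(1)       # 'A' seen: start counting j symbols
            elif 1 <= s <= j:
                nxt.add(s + 1)   # consume one of the j arbitrary symbols
        return frozenset(nxt)

    start = frozenset({0})
    ids = {start: 0}             # NFA state subset -> DFA state number
    table = {}                   # (DFA state, symbol) -> next DFA state
    work = [start]
    while work:
        cur = work.pop()
        for sym in ALPHABET:
            nxt = nfa_step(cur, sym)
            if nxt not in ids:
                ids[nxt] = len(ids)
                work.append(nxt)
            table[(ids[cur], sym)] = ids[nxt]
    accepting = {i for subset, i in ids.items() if j + 1 in subset}
    return table, accepting, len(ids)


def dfa_match(table, accepting, text):
    """Matching is one table query per input symbol."""
    state = 0
    for ch in text:
        state = table[(state, "A" if ch == "A" else "x")]
    return state in accepting


# Number of DFA states for j = 0..4 under these assumptions.
state_counts = [build_dfa(j)[2] for j in range(5)]
```

Under these assumptions the state count doubles with each increment of j, because the DFA has to remember which of the last j+1 symbols were A.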
Another method is based on a Non-deterministic Finite Automaton (NFA). Its basic principle is similar to that of a DFA; the difference is that the NFA allows empty symbols (it supports a transition when no symbol is input) and that the input of one symbol may activate multiple next states. This non-deterministic nature makes the NFA difficult to implement in hardware. Some methods do implement the NFA in hardware, but they implement the NFA transition tables directly in programmable logic devices. In this case, whenever the regular expressions are updated, the programmable logic devices need to be reprogrammed, so the scalability is poor.
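A software simulation makes the non-determinism concrete: instead of one current state, the matcher carries a set of active states. The following Python sketch uses a hand-built NFA for a*b with one empty transition; the table layout and names are assumptions made for illustration.

```python
EPSILON = None  # marker for an empty-symbol transition

# Hand-built NFA for a*b:
#   state 0 --a--> 0, state 0 --empty--> 1, state 1 --b--> 2 (accepting)
NFA = {
    (0, "a"): {0},
    (0, EPSILON): {1},
    (1, "b"): {2},
}
ACCEPTING = {2}


def epsilon_closure(states):
    """Add every state reachable through empty transitions alone."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPSILON), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def nfa_match(text):
    """Track the full set of active states while consuming the input."""
    active = epsilon_closure({0})
    for ch in text:
        moved = set()
        for s in active:
            moved |= NFA.get((s, ch), set())
        active = epsilon_closure(moved)
    return bool(active & ACCEPTING)
```

Even before any input is read, two states (0 and 1) are active at once, which is the property that a single-state table lookup cannot express directly.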
A third method is program parsing. This method normally does not generate transition tables. Taking the PCRE source code library as an example, regular expressions are parsed into minimum segments understandable to a program. In a matching process, the program tests the symbols in a packet against the segments according to their locations. If a symbol matches one segment, the program waits for the next symbol in the segment or enters the next segment. Some software solutions may adopt other processing modes but can still be categorized as program parsing. This method normally has good scalability and supports complex regular expression syntaxes. Compared with a state machine, however, its matching speed is lower, so it is liable to become a bottleneck of the entire system.
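As a rough picture of program parsing, the Python sketch below follows the style of the small backtracking matcher in Kernighan and Pike's The Practice of Programming. It is not how the PCRE library works internally and supports only literals, ".", "*", "^" and "$"; it simply shows a program walking the expression segment by segment and backtracking when a choice fails.

```python
def match(regex, text):
    """Search for regex anywhere in text."""
    if regex.startswith("^"):
        return match_here(regex[1:], text)
    # Unanchored: try every starting position, including the empty tail.
    for i in range(len(text) + 1):
        if match_here(regex, text[i:]):
            return True
    return False


def match_here(regex, text):
    """Match regex at the beginning of text."""
    if not regex:
        return True
    if len(regex) >= 2 and regex[1] == "*":
        return match_star(regex[0], regex[2:], text)
    if regex == "$":
        return text == ""
    if text and regex[0] in (".", text[0]):
        return match_here(regex[1:], text[1:])
    return False


def match_star(c, regex, text):
    """Match zero or more occurrences of c, then the rest of regex."""
    i = 0
    while True:
        if match_here(regex, text[i:]):
            return True
        if i < len(text) and c in (".", text[i]):
            i += 1
        else:
            return False
```

Because match_star retries the remainder of the expression at every candidate position, the running time depends on the expression and the input rather than being one lookup per symbol, which is consistent with the lower matching speed noted above.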
A multi-layer packet filtering architecture is provided in a first embodiment of the conventional art, where different filtering standards may be defined for filters at different layers. Packets after layer-1 filtering are sent to the layer-2 filter and so on until a filtering result is concluded. The first embodiment of the conventional art is applicable to scenarios where different layers adopt different filtering standards, and normally packets are filtered by a preceding layer to reduce the data processing of the subsequent layer. Because the layers are in a strict sequence, filters with more complex processing should be placed later in the sequence to improve the total efficiency. The first embodiment of the conventional art has the following weaknesses. Because filters at different layers are in a strict sequence, the flexibility is low and the delay in packet processing is longer. For example, some packets do not need layer-1 filtering at all, but under the multi-layer filtering architecture these packets must pass through the layer-1 filter before entering the layer-2 filter for matching, which on the one hand wastes the processing capability of the layer-1 filter and on the other hand prolongs the time for packet matching. In addition, the first embodiment of the conventional art gives only general definitions of filtering standards without specifying a solution specific to regular expressions.
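The strict ordering can be sketched in a few lines of Python. The two filters, the packet layout, and the verdict strings are hypothetical; the point is only that every packet visits layer 1 before layer 2, whether or not layer 1 has anything to say about it.

```python
def layer1_port_filter(packet):
    """Cheap header check (hypothetical rule: drop telnet)."""
    return "drop" if packet["dst_port"] == 23 else None


def layer2_payload_filter(packet):
    """More expensive payload check (hypothetical signature)."""
    return "drop" if b"malware" in packet["payload"] else None


def run_layers(packet, layers):
    """Pass the packet through the filters in strict order.

    Returns the verdict and the number of layers the packet visited.
    """
    visited = 0
    for layer in layers:
        visited += 1
        verdict = layer(packet)
        if verdict is not None:
            return verdict, visited
    return "accept", visited


LAYERS = [layer1_port_filter, layer2_payload_filter]
```

A packet that only the payload filter can judge still costs one pass through the port filter, which is the wasted work and added delay described above.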
FIG. 1 shows a structure of a regular expression matching system where a Ternary Content Addressable Memory (TCAM) 304 is configured to store DFAs according to a second embodiment of the conventional art. A method for implementing regular expression matching by storing DFAs in TCAM according to the second embodiment of the conventional art includes:
1. A regular expression is divided into multiple sub-expressions at the meta characters (like .*) in the regular expression, and the regular expression is stored in the TCAM 304 (in fact, the DFA transition table is stored). The TCAM 304 is able to compare features of multiple characters at a time and works as a state machine that advances multiple characters per step.
2. Packet processing actions are stored in a second memory structure 320 which may be a Random Access Memory (RAM).
3. A pre-parser, the front-end analyzing program 334 in the figure, extracts fields to process from a packet cache 258 and stores the fields in a message cache 306.
4. The message cache 306 is configured to store packet messages that need regular expression matching.
5. A decoding circuit 302 decodes the packets and executes commands related to the packets.
6. Under control of the decoding circuit 302, a barrel shifter 308 extracts the relevant part of the packets and sends it, together with the current state stored in a tag space 318, to the TCAM 304 for comparison.
7. Once the TCAM 304 matches a regular expression, the corresponding action stored in the second memory structure 320 is executed by the decoding circuit 302, which generates a signal and sends the signal to a traffic controller 352.
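The lookup loop in steps 6 and 7 can be pictured with a loose software model in Python. Everything concrete here is an assumption made for illustration and is not taken from the second embodiment: the two-character window, the "?" don't-care symbol, the per-entry advance that stands in for the barrel shifter, the example expression ab.*cd, and all names.

```python
WILDCARD = "?"   # ternary "don't care" position in a TCAM entry
PAD = "\0"       # padding when fewer than two characters remain
MATCH_STATE = 2

# (state, 2-char ternary pattern, next state, advance, action index).
# Entries are searched in priority order; the first match wins.
# The table encodes ab.*cd split at .* into the sub-expressions ab and cd.
TCAM_ENTRIES = [
    (0, "ab", 1, 2, None),
    (0, "?a", 0, 1, None),   # realign so that 'a' leads the next window
    (0, "??", 0, 2, None),
    (1, "cd", 2, 2, 0),
    (1, "?c", 1, 1, None),
    (1, "??", 1, 2, None),
]

# Stand-in for the second memory structure holding packet actions.
ACTION_RAM = ["notify_traffic_controller"]


def ternary_equal(pattern, window):
    return all(p == WILDCARD or p == w for p, w in zip(pattern, window))


def tcam_lookup(state, window):
    """Compare the state tag and the character window against each entry."""
    for entry_state, pattern, next_state, advance, action in TCAM_ENTRIES:
        if entry_state == state and ternary_equal(pattern, window):
            return next_state, advance, action
    raise LookupError("no TCAM entry matched")


def scan(message):
    """Shift through the message and collect the actions that fire."""
    state, pos, actions = 0, 0, []
    while pos < len(message):
        window = message[pos:pos + 2].ljust(2, PAD)
        state, advance, action = tcam_lookup(state, window)
        if action is not None:
            actions.append(ACTION_RAM[action])
        if state == MATCH_STATE:
            break
        pos += advance
    return actions
```

Even in this toy model, the entry table depends on the window width and has to be regenerated whenever the expression changes.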
The second embodiment of the conventional art has the following weaknesses: the implementation of the technical solution under the second embodiment of the conventional art is closely related to the fact that the TCAM can simultaneously access multiple characters and therefore the solution is not universally applicable; the high price and high power consumption of the TCAM also limit its use; to reduce the number of sub-expressions so as to reduce occupied TCAM entries, multiple DFAs need to be combined into one DFA so that it is necessary to recompile the entire DFA once the regular expression is updated; the second embodiment of the conventional art does not provide good support for many extensions in PCRE.
In a third embodiment of the conventional art, a dedicated regular expression inspection module is adopted in the inspection system. When it is necessary to inspect regular expressions, traffic is injected into the module for processing. The module uses Application Specific Integrated Circuit (ASIC) chips to implement the DFA and currently supports POSIX regular expressions. The implementation details are trade secrets and are not available. The third embodiment of the conventional art has the following weaknesses: it does not support PCRE regular expressions; according to analysis, if PCRE regular expressions need to be supported, the user must provide an implementation outside the ASIC chip, and no specific implementation solution has been disclosed.