In the past, most network devices were content-unaware; such devices extracted only transportation information contained in Layer 3 (L3) and Layer 4 (L4) headers such as source IP address and destination port number instead of Layer 7 (L7) packet payload content to manage network traffic and implement network security. The main reason for using content-unaware networking devices is that it is much cheaper and easier to extract L3 and L4 packet header information than it is to extract L7 packet payload content.
However, modern network management now requires networking devices that can extract specific content from within packet payloads. A typical application will require these content-aware devices to extract particular L7 fields. For example, data loss prevention (DLP) tools often extract HTTP fields to detect covert data channels. Intrusion detection systems rely on L7 field extraction as a primitive operation. Load balancing devices may extract method names and parameters from flows carrying SOAP and XML-RPC traffic and then route each request to the server that is best able to respond to it. Finally, existing network monitoring tools such as SNORT and BRO extract L7 fields for behavioral analysis.
The problem of online L7 field extraction that occurs within content-aware networking devices is addressed. To do this well, support is needed for automatic translation from grammar representations to automata implementations and for automated optimization of the resulting automata implementations. Unfortunately, such automated translation and optimization are difficult because network protocols include features that are not easily represented using standard parsing models such as context-free grammars (CFGs) or regular expressions (REs). For example, the HTTP header field, “Content-Length”, specifies the length of the HTTP body. An unaugmented CFG would require a new rule for each legitimate field length, which makes unaugmented CFGs impractical for L7 parsing.
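The Content-Length difficulty can be illustrated with a minimal Python sketch. The function name `split_http_message` is hypothetical and is not taken from the present disclosure; the sketch operates on a fully buffered message only for brevity. It shows that a single integer counter, which a plain CFG or RE lacks, is sufficient to delimit a body of any declared length.

```python
def split_http_message(data: bytes):
    """Split one buffered HTTP message into (headers, body).

    Illustrative only: the body boundary is found with a counter taken
    from Content-Length, not with one grammar rule per possible length.
    """
    head, sep, rest = data.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("incomplete header block")

    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the request/status line
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()

    # The counter: how many body bytes follow the blank line.
    length = int(headers.get(b"content-length", b"0"))
    if len(rest) < length:
        raise ValueError("incomplete body")
    return headers, rest[:length]
```

For instance, given a message that declares `Content-Length: 5`, the sketch returns exactly the five bytes that follow the header block and leaves any trailing bytes for the next message.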
Online L7 field extraction in a content-aware networking device is fundamentally different from end-host protocol parsing because content-aware network devices must handle millions of concurrent multiplexed network flows. This difference has several technical implications. First, buffering a flow before parsing should be avoided; thus parsing and field extraction should occur incrementally. Second, online L7 field extraction must support efficient context-switching; this requires the parsing state of each flow to be minimized. Third, online L7 field extraction must occur at line-speed.
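The first two implications can be illustrated with a minimal Python sketch of incremental extraction. The names `FlowState`, `feed`, and `TARGET` are hypothetical, and the sketch is hand-written for illustration rather than generated from a grammar. It simplifies heavily: it treats every line of the flow as a candidate header line and does not distinguish the request line or the body. Each flow keeps only a mode, a match index, and the partially collected value, so a device can switch between flows after any packet.

```python
from dataclasses import dataclass, field

TARGET = b"host:"  # hypothetical field of interest, in lowercase


@dataclass
class FlowState:
    """Per-flow parsing state kept between packets (deliberately small)."""
    mode: str = "name"   # "name": matching field name, "skip", or "value"
    matched: int = 0     # bytes of TARGET matched so far on this line
    value: bytearray = field(default_factory=bytearray)


def feed(state: FlowState, chunk: bytes) -> list:
    """Consume one packet payload; return field values completed in it."""
    found = []
    for b in chunk:
        if b == 0x0A:  # LF ends the current line
            if state.mode == "value":
                found.append(bytes(state.value).strip())
                state.value.clear()
            state.mode, state.matched = "name", 0
        elif state.mode == "name":
            c = bytes([b]).lower()[0]  # case-insensitive name match
            if c == TARGET[state.matched]:
                state.matched += 1
                if state.matched == len(TARGET):
                    state.mode = "value"
            else:
                state.mode = "skip"
        elif state.mode == "value":
            state.value.append(b)
        # mode "skip": ignore bytes until the end of the line
    return found
```

A usage example with two interleaved flows, where the field of one flow is split across two packets:

```python
a, b = FlowState(), FlowState()
feed(a, b"GET / HTTP/1.1\r\nHo")                      # nothing complete yet
feed(b, b"GET /x HTTP/1.1\r\nHost: b.example\r\n")    # completes flow b's field
feed(a, b"st: a.example\r\n\r\n")                     # completes flow a's field
```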
Prior online L7 field extraction solutions suffer from one of two drawbacks. They are either hand-optimized for better performance, or they are derived from an unoptimizable parsing model: recursive descent parsing with code execution. Hand-optimized solutions suffer from a high production cost and are prone to errors. The recursive descent solutions offer an excessively rich parsing model that cannot be automatically optimized.
Thus, there is no existing solution for online L7 field extraction that supports automated translation from a grammar-based extraction specification to an automata implementation with automated optimization. To illustrate one dimension where previous solutions struggle, the conflict between automated translation and optimization on the one hand and line-speed extraction on the other is highlighted. One technique that can be exploited to achieve line-speed extraction is to ignore (not parse) unnecessary data, which is referred to as selective parsing. Previous selective parsing work does achieve high throughput, but these solutions are achieved through hand pruning rather than automated translation and optimization.
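Selective parsing can be illustrated with a length-delimited body: once the body length is known, the body bytes can be skipped by arithmetic rather than examined one at a time. The following minimal Python sketch uses a hypothetical function name, `skip_body`, and is an illustration of the general idea only, not of any prior solution.

```python
def skip_body(remaining: int, chunk: bytes):
    """Skip up to `remaining` body bytes of `chunk` without inspecting them.

    Returns the number of body bytes still to skip in later packets and
    the unskipped tail of this chunk (the start of the next message).
    """
    n = min(remaining, len(chunk))
    return remaining - n, chunk[n:]
```

Because only the integer `remaining` is carried between packets, the skip works across packet boundaries without buffering the flow.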
This section provides background information related to the present disclosure which is not necessarily prior art.