It is highly desirable to possess the ability to monitor the content of packets transmitted over computer networks. Whether the motivation is to identify the transmission of data files containing material such as copyrighted audio, film/video, software, published articles and book content, to secure confidential data within a company's internal computer system, to detect and eliminate computer viruses, to identify and locate packet transmissions that may be part of a criminal conspiracy (such as e-mail traffic between two persons planning a crime), or to monitor data transmissions of targeted entities, the ability to search packet payloads for strings that match a specified data pattern is a powerful tool in today's electronic information age. Further, the ability to modify the data stream permits the system to, among other things, filter data, reformat data, translate between languages, extract information, insert data, or notify others regarding the content.
String matching and pattern matching have been the subject of extensive studies. In the past, software-based string matching techniques have been employed to determine whether a packet payload includes a data pattern. However, such software-based techniques are impractical for widespread use in computer networks because of the inherently slow packet processing speeds that result from software execution.
For example, U.S. Pat. No. 5,319,776 issued to Hile et al. (the disclosure of which is hereby incorporated by reference) discloses a system wherein data in transit between a source medium and a destination medium is tested using a finite state machine capable of determining whether the data includes any strings that represent the signatures of known computer viruses. However, because the finite state machine of Hile is implemented in software, the Hile system is slow. As such, the Hile system is impractical for use as a network device capable of handling high-speed line rates such as OC-48, where the data rate approaches 2.5 gigabits per second. Furthermore, software-based techniques are traditionally and inherently orders of magnitude slower than hardware-based techniques.
Another software-based string matching technique is found in U.S. Pat. No. 5,101,424 issued to Clayton et al. (the disclosure of which is hereby incorporated by reference). Clayton discloses a software-based AWK processor for monitoring text streams from a telephone switch. In Clayton, a data stream passing through a telephone switch is loaded into a text file. The Clayton system then (1) processes the content of the text file to determine if particular strings are found therein, and (2) takes a specified action upon finding a match. As with the Hile system described above, this software-based technique is too slow to be practical for use as a high-speed network device.
Furthermore, a software tool known in the art called SNORT was developed to scan Internet packets for combinations of headers and payloads that indicate whether a computer on a network has been compromised. This software program is an Open Source Network Intrusion Detection System that scans packets that arrive on a network interface. Usually, the packets arrive on a medium such as Ethernet. The program compares each packet with the data specified in a list of rules. If the fields in the header or parts of the payload match a rule, the program performs responsive tasks such as printing a message on a console, sending a notification message, or logging an event to a database. As with the above-described systems, SNORT, by virtue of being implemented in software, suffers from slow processing speed with respect to both its matching tasks and its responsive tasks.
In an effort to improve the speed at which packet payloads are processed, systems have been designed with dedicated application specific integrated circuits (ASICs) that scan packet payloads for a particular string. While the implementation of payload scanning on an ASIC represented a great speed improvement over software-based techniques, such ASIC-based systems suffered from a tremendous flexibility problem. That is, ASIC-based payload processing devices are not able to change the search string against which packets are compared, because a change in the search string necessitates the design of a new ASIC tailored for the new search string (and the replacement of the previous ASIC with the new ASIC). In other words, the chip performing the string matching would have to be replaced every time the search string is changed. Such redesign and replacement efforts are tremendously time-consuming and costly, especially when such ASIC-based systems are in widespread use.
To avoid the slow processing speed of software-based pattern matching and the inflexibility of ASIC-based pattern matching, reprogrammable hardware, such as field programmable gate arrays (FPGAs), has been employed to carry out pattern matching. Such an FPGA-based technique is disclosed in Sidhu, R. and Prasanna, V., “Fast Regular Expression Matching using FPGAs”, IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2001), April 2001 and Sidhu, R. et al., “String Matching on Multicontext FPGAs Using Self-Reconfiguration”, FPGA '99: Proceedings of the 1999 ACM/SIGDA Seventh International Symposium on Field Programmable Gate Arrays, pp. 217–226, February 1999, the entire disclosures of which are hereby incorporated by reference.
The Sidhu papers disclose a technique for processing a user-specified data pattern to generate a non-deterministic finite automaton (NFA) operable, upon being programmed into an FPGA, to determine whether data applied thereto includes a string that matches a data pattern. However, Sidhu fails to address how such a device can also be programmed to carry out a specified action, such as data modification, in the event a matching string is found in the data. Thus, while the Sidhu technique, in using an FPGA to perform pattern matching for a redefinable data pattern, provides high speed through hardware implementation and flexibility in redefining a data pattern through the reprogrammable aspects of the FPGA, the Sidhu technique fails to satisfy a need in the art for a device which not only detects a matching string, but also carries out a specified action upon the detection of a matching string.
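By way of illustration only, the manner in which an NFA scans a data stream can be sketched in software. The sketch below is not the construction disclosed in the Sidhu papers; the transition table is written by hand for the pattern “ab*c”, and the names `nfa_scan`, `AB_STAR_C`, `start`, and `accept` are hypothetical. All active states are advanced together on each input symbol, which loosely mirrors how parallel hardware could evaluate every state in one step.

```python
def nfa_scan(transitions, start, accept, data):
    """Return the index of the symbol that completes the first match, or None.

    transitions maps (state, symbol) to a set of next states. The start
    state is re-activated on every symbol so a match may begin anywhere.
    """
    active = {start}
    for position, symbol in enumerate(data):
        next_active = {start}  # a new match attempt can start at any position
        for state in active:
            next_active |= transitions.get((state, symbol), set())
        active = next_active
        if accept in active:
            return position
    return None


# Hand-written NFA for the pattern "ab*c" (illustrative assumption):
# state 0 = start, state 1 = seen "a" and any number of "b", state 2 = accept.
AB_STAR_C = {
    (0, "a"): {1},
    (1, "b"): {1},
    (1, "c"): {2},
}

print(nfa_scan(AB_STAR_C, 0, 2, "xxabbbc"))  # 6
print(nfa_scan(AB_STAR_C, 0, 2, "abx"))      # None
```

Note that this sketch reports only that a match occurred and where it ended; it takes no action on the data.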
Moreover, while the Sidhu technique is capable of scanning a data stream for the presence of any of a plurality of data patterns (where a match is found if P1 or P2 or . . . or Pn is found in the data stream, wherein Pi is the i-th data pattern), the Sidhu technique is not capable of identifying either which data pattern(s) matched a string in the data stream or which string(s) in the data stream matched any of the data patterns.
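The distinction drawn above can be illustrated with a short software sketch. It is offered only as an aid to understanding and assumes Python's standard `re` module; the function names `any_match` and `which_matched` and the sample patterns are hypothetical. The first function answers only whether P1 or P2 or . . . or Pn is present, while the second also reports which pattern matched and which string in the data matched it.

```python
import re


def any_match(patterns, data):
    """Report only whether at least one pattern occurs somewhere in data."""
    return any(re.search(p, data) for p in patterns)


def which_matched(patterns, data):
    """Report (pattern number, matching string, start offset) for every match."""
    hits = []
    for index, pattern in enumerate(patterns, start=1):
        for m in re.finditer(pattern, data):
            hits.append((index, m.group(0), m.start()))
    return hits


patterns = ["ca+t", "dog"]  # P1 and P2, chosen only for illustration
data = "a caat and a dog"

print(any_match(patterns, data))      # True
print(which_matched(patterns, data))  # [(1, 'caat', 2), (2, 'dog', 13)]
```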
Unsatisfied with the capabilities of the existing FPGA-based pattern matching techniques, the inventors herein have sought to design a packet processing system able to not only determine whether a packet's payload includes a string that matches a data pattern in a manner that is both high-speed and flexible, but also perform specified actions when a matching string is found in a packet's payload.
An early attempt by one of the inventors herein at designing such a system is referred to herein as the “Hello World Application”. See Lockwood, John and Lim, David, Hello, World: A Simple Application for the Field Programmable Port Extender (FPX), Washington University Tech Report WUCS-00-12, Jul. 11, 2000 (the disclosure of which is hereby incorporated by reference). In the Hello World Application, a platform using reprogrammable hardware for carrying out packet processing, known as the Washington University Field-Programmable Port Extender (FPX) (see FIG. 10), was programmed with a state machine and a word counter designed to (1) identify when a string comprised of the word “HELL” followed by the word “O***” (wherein each * represents white space) was present in the first two words of a packet payload, and (2) when that string is found as the first two words of a packet payload, replace the word “O***” with the word “O*WO” and append the words “RLD.” and “****” as the next two words of the packet payload. The reprogrammable hardware used by the FPX was a field programmable gate array (FPGA). The Hello World Application thus operated to modify a packet with “HELLO” in the payload by replacing “HELLO” with “HELLO WORLD”.
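The word-level behavior described above can be sketched in software as follows. This is a minimal illustration only, not the FPX circuit itself; it assumes four-byte words (as implied by “HELL” and “O***”), assumes that any remaining payload words are kept after the two inserted words, and uses the hypothetical name `hello_world`.

```python
WORD = 4  # bytes per word; an assumption based on the four-character words above


def hello_world(payload: bytes) -> bytes:
    """Rewrite a payload whose first two words are "HELL" and "O   "."""
    words = [payload[i:i + WORD] for i in range(0, len(payload), WORD)]
    if len(words) >= 2 and words[0] == b"HELL" and words[1] == b"O   ":
        # Replace "O   " with "O WO" and add "RLD." and "    " as the next two words.
        words[1:2] = [b"O WO", b"RLD.", b"    "]
    return b"".join(words)


print(hello_world(b"HELLO   "))  # b'HELLO WORLD.    '
print(hello_world(b"GOODBYE "))  # b'GOODBYE '
```

Because the search string and the replacement are fixed in the logic itself, scanning for any other string requires rewriting the function, which parallels the reprogramming limitation discussed below.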
While the successful operation of the Hello World Application illustrated to the inventors herein that the implementation of a circuit in reprogrammable hardware capable of carrying out exact matching and string replacement was feasible, the Hello World Application was not accompanied by any device capable of taking full advantage of the application's reprogrammable aspects. That is, while the FPGA programmed to carry out the Hello World Application was potentially reprogrammable, no technique had been developed which would allow the FPGA to be reprogrammed in an automated and efficient manner to scan packets for a search string other than “HELLO”, or to replace the matching string with a replacement string other than “HELLO WORLD”. The present invention addresses a streamlined process for reprogramming a packet processor to scan packets for different redefinable strings and carry out different redefinable actions upon packets that include a matching string. Toward this end, the present invention utilizes regular expressions and awk capabilities to create a reprogrammable hardware-based packet processor having expanded pattern matching abilities and the ability to take a specified action upon detection of a matching string.
Regular expressions are well-known tools for defining conditional strings. A regular expression may match several different strings. By incorporating various regular expression operators in a pattern definition, such a pattern definition may encompass a plurality of different strings. For example, the regular expression operator “.*” means “any number of any characters”. Thus, the regular expression “c.*t” defines a data pattern that encompasses strings such as “cat”, “coat”, “chevrolet”, and “cold is the opposite of hot”. Another example of a regular expression operator is “*” which means “zero or more of the preceding expression”. Thus, the regular expression “a*b” defines a data pattern that encompasses strings such as “ab”, “aab”, and “aaab”, but not “acb” or “aacb”. Further, the regular expression “(ab)*c” encompasses strings such as “abc”, “ababc”, “abababc”, but not “abac” or “abdc”. Further still, regular expression operators can be combined for additional flexibility in defining patterns. For example, the regular expression “(ab)*c.*z” would encompass strings such as the alphabet “abcdefghijklmnopqrstuvwxyz”, “ababcz”, “ababcqsrz”, and “abcz”, but not “abacz”, “ababc” or “ababacxvhgfjz”.
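The examples above can be checked with a short sketch, assuming Python's standard `re` module and assuming that a string is “encompassed” by a pattern only when the entire string matches (hence `re.fullmatch` rather than a substring search). The table name `EXAMPLES` and the function name `encompasses` are hypothetical.

```python
import re

# pattern -> (strings the pattern encompasses, strings it does not)
EXAMPLES = {
    "c.*t": (["cat", "coat", "chevrolet", "cold is the opposite of hot"], []),
    "a*b": (["ab", "aab", "aaab"], ["acb", "aacb"]),
    "(ab)*c": (["abc", "ababc", "abababc"], ["abac", "abdc"]),
    "(ab)*c.*z": (
        ["abcdefghijklmnopqrstuvwxyz", "ababcz", "ababcqsrz", "abcz"],
        ["abacz", "ababc", "ababacxvhgfjz"],
    ),
}


def encompasses(pattern: str, text: str) -> bool:
    """True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


for pattern, (inside, outside) in EXAMPLES.items():
    assert all(encompasses(pattern, s) for s in inside)
    assert not any(encompasses(pattern, s) for s in outside)
print("all examples behave as described")
```

The whole-string assumption matters: under a substring search, “acb” would contain a match for “a*b”, because the trailing “b” alone satisfies the pattern.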
As regular expressions are well-known in the art, it is unnecessary to list all possible regular expression operators (for example, there is also an OR operator “|” which for “(a|b)” means any string having “a” or “b”) and combinations of regular expression operators. What is to be understood from the background material described above is that regular expressions provide a powerful tool for defining a data pattern that encompasses strings of interest to a user of the invention.
Further, awk is a well-known pattern matching program. Awk is widely used to search data for a particular occurrence of a pattern and then perform a specified operation on the data. Regular expressions can be used to define the pattern against which the data is compared. Upon locating a string encompassed by the pattern defined by the regular expression, awk allows for a variety of specified operations to be performed on the data. Examples of specified operations include simple substitution (replacement), back substitution, guarded substitution, and record separation. These examples are illustrative only and do not encompass the full range of operations available in awk for processing data.
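A few of these operations can be approximated in a short sketch using Python's standard `re` module in place of awk. This is only an illustration: the function names are hypothetical, “back substitution” is interpreted here as a replacement that reuses the matched text, and guarded substitution is not shown.

```python
import re


def simple_substitution(pattern, replacement, record):
    """Replace every string matching pattern with a fixed replacement."""
    # A function is used so the replacement is inserted literally.
    return re.sub(pattern, lambda m: replacement, record)


def back_substitution(pattern, template, record):
    """Replace each match using a template; \\g<0> stands for the matched text."""
    return re.sub(pattern, template, record)


def split_records(data, separator="\n"):
    """Split data into records on a separator, dropping empty records."""
    return [r for r in data.split(separator) if r]


print(simple_substitution("c.t", "dog", "a cat sat"))            # a dog sat
print(back_substitution("c.t", r"[\g<0>]", "a cat and a cot"))   # a [cat] and a [cot]
print(split_records("a;b;", ";"))                                # ['a', 'b']
```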
As a further improvement to the Hello World Application, the present invention provides users with the ability to flexibly define a search pattern that encompasses a plurality of different search strings and perform a variety of awk-like modification operations on packets. These features are incorporated into the reprogrammable hardware of the present invention to produce a packet processor having a combination of flexibility and speed that was previously unknown.