Field
Embodiments of the present invention generally relate to micro-programmed parallel data processing systems and methods for determining the number of consecutive matches to a defined pattern that exist within a given data stream. More particularly, embodiments of the present invention relate to the use of parallel computation to determine the number of consecutive members of a particular class definition that exist within a data stream on which pattern matching, overflow pattern matching and/or regular expression matching is being performed.
Description of the Related Art
Many processing tasks require regular expression matching, pattern matching and/or other sequential comparisons of an input data stream to make important decisions, including, but not limited to, identification of undesired packets in the context of network security applications. Generally, computing devices performing such tasks, such as image processing devices, wireless signal processing devices and network security devices, count the number of consecutive matches found in a data stream to a particular pattern or patterns to make certain decisions. Counting the number of consecutive bits (e.g., zeros or ones) has been a subject of interest for quite some time; however, most of the existing solutions rely on sequential processing of the input data stream in order to determine the number of consecutive bits having a value of one or zero. With present day applications such as bioinformatics, medical diagnostics, molecular biology, traffic analysis, data compression, etc., where the associated data streams are typically very long, a sequential approach for determining the count of consecutive bits having a particular value or consecutive segments matching a particular pattern may take an inordinate amount of time.
A non-limiting example of an application in which a number of consecutive bits (e.g., representing Boolean results of a comparison of a data stream with a pattern) is used for decision-making is an intrusion detection system (IDS) and/or an intrusion prevention system (IPS), which typically operates by collating and analysing network traffic flowing through a network. Software/hardware filters of various kinds are currently used as tools for detecting intrusion in both network-based and host-based computer systems. Network intrusion or attempts to intrude into a network can be detected through monitoring and analysis of the network traffic through software/hardware filters, where these filters are designed to identify predetermined traffic patterns of interest (also referred to as signatures) and to generate alerts, block and/or quarantine suspicious traffic. For example, when a set of observed network activity is compared against a signature, a set of Boolean outputs may be produced in which a zero indicates a mismatch and a one indicates a match. Similarly, signatures can be generated to identify files or email communications containing certain known and/or unknown types of malware or malicious content. As such, in order to assess the degree of similarity or whether a particular data stream or set of observations matches a particular signature, network security systems, which may perform one or more of antivirus processing, spam detection, intrusion detection and the like, typically include a process that involves counting the number of consecutive bits (e.g., in the set of Boolean outputs) that have a particular value. This counting of consecutive bits having a particular value is common in a variety of types of pattern matching and regular expression matching.
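By way of illustration only, the Boolean outputs and the consecutive-bit count described above might be sketched in Python as follows. The function names `match_bits` and `longest_run`, the byte-by-byte comparison and the sample data are assumptions introduced for this sketch and are not drawn from any particular IDS/IPS implementation. The loop walks the stream one element at a time, i.e., it reflects the conventional sequential approach.

```python
def match_bits(observed, signature):
    """Compare two byte strings position by position.

    Returns a list of Boolean outputs: 1 for a match, 0 for a mismatch.
    (Hypothetical helper; positions beyond the shorter input are ignored.)
    """
    return [1 if o == s else 0 for o, s in zip(observed, signature)]


def longest_run(bits, value=1):
    """Sequentially count the longest run of consecutive `value` bits."""
    best = 0
    current = 0
    for b in bits:
        # Extend the current run on a hit, reset it on a miss.
        current = current + 1 if b == value else 0
        best = max(best, current)
    return best


# Illustrative data only: the first five bytes agree, the rest differ.
bits = match_bits(b"GET /admin", b"GET /index")
run = longest_run(bits)
```

Because each step depends on the running count from the previous element, the time taken by such a loop grows linearly with the length of the stream.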
Those skilled in the art will appreciate that sequential processing of a data stream to update consecutive class counters, overflow patterns and the like can create a performance bottleneck, thereby delaying downstream decision-making processes that rely on the results of such sequential processing.
Another example in which the count of consecutive bits is used is data compression. For example, a Run Length Encoding (RLE) process typically counts the number of consecutive bits (1 or 0) to store a data stream in compressed form. An input data stream is typically stored within and transmitted among computing devices in the form of zeros and ones. In the context of a simplified example, data representing a black and white image may be divided into numerous pixels, each of which can be designated as either ON (1) or OFF (0). Storing images in this manner can consume a large amount of memory due to the large number of pixels in each image. Therefore, it is often desirable to store a representation of the image in a more compact/compressed form. Because each horizontal line of pixels often includes long strings of consecutive pixels that are ON or OFF, a representation of the image can be stored in a more compact form by storing the number of consecutive zeros followed by the number of consecutive ones, and so on.
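As a minimal sketch of the run-length idea described above, the following Python fragment encodes one horizontal line of pixels as runs and then restores it. The names `run_lengths` and `expand` and the sample row are hypothetical, and the sketch stores (value, count) pairs rather than bare alternating counts purely to keep the example self-explanatory.

```python
from itertools import groupby


def run_lengths(pixels):
    """Encode a sequence of 0/1 pixels as (value, count) pairs."""
    # groupby yields one group per maximal run of equal adjacent values.
    return [(value, sum(1 for _ in group)) for value, group in groupby(pixels)]


def expand(runs):
    """Decode (value, count) pairs back into the original pixel sequence."""
    return [value for value, count in runs for _ in range(count)]


# Illustrative row of ten pixels: four OFF, three ON, two OFF, one ON.
row = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
encoded = run_lengths(row)
restored = expand(encoded)
```

Here the ten pixels are represented by four runs; the saving grows with the length of the runs, and a row with no repeated adjacent values would not shrink at all.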
There are numerous other applications, such as image processing, sequence alignment, string comparison, signature matching, lexicographic sequence analysis, phonetic sequence analysis, signal analysis, micro-array data analysis and the like, in which the count of consecutive bits or consecutive data segments having a particular property (e.g., meeting a particular class definition, matching a particular character set or consecutive alignment) plays an important role.
There is therefore a need for methods and systems for counting consecutive values using a parallel processing approach that balances hardware acceleration with efficient software logic and that is both scalable and fast.