Fast search techniques are needed in many computing and network applications, such as search engines and network addressing. Searching for a string in a dictionary of fixed-size strings is relatively simple, for example by binary search. With a dictionary of variable-size strings, the matching process becomes more intricate. A string of arbitrary size in which each character is uniquely defined in an alphabet is colloquially called an “exact string”. A string of arbitrary size in which at least one character may be replaced without changing the purpose of the string is colloquially called an “inexact string”. Searching for an inexact string is considerably more complicated. For example, searching for a name such as “John Winston Armstrong” in a dictionary of names is much simpler than searching for any name in the dictionary that contains a string such as “J . . . ton.Arm”, where ‘.’ may represent any of a subset of characters in the alphabet. In the latter case, each of a large number of strings, such as “Jane Clinton-Armbruster” and “Jack Newton Armstrong”, is considered a successful match.
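The contrast between the two kinds of search can be illustrated with a minimal Python sketch. The name list, the function names, and the reading of the inexact pattern as the regular expression J.*ton.Arm (an arbitrary run of characters after “J”, then “ton”, one arbitrary character, then “Arm”) are assumptions made for this example only.

```python
import bisect
import re

# Hypothetical dictionary of names, kept sorted so binary search applies.
NAMES = sorted([
    "Jack Newton Armstrong",
    "Jane Clinton-Armbruster",
    "John Winston Armstrong",
    "Mary Ann Smith",
])

def exact_lookup(sorted_names, key):
    """Exact string: one binary search, O(log n) comparisons."""
    i = bisect.bisect_left(sorted_names, key)
    return i < len(sorted_names) and sorted_names[i] == key

# Assumed interpretation of the inexact pattern from the text.
INEXACT = re.compile(r"J.*ton.Arm")

def inexact_lookup(names, pattern):
    """Inexact string: sorting does not help, every entry is examined."""
    return [name for name in names if pattern.search(name)]

if __name__ == "__main__":
    print(exact_lookup(NAMES, "John Winston Armstrong"))
    print(inexact_lookup(NAMES, INEXACT))
```

The exact lookup touches only a logarithmic number of entries, whereas the inexact lookup in this sketch scans the whole dictionary and may return many entries for a single query.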
Numerous software-based techniques, suitable for implementation in a general-purpose computer, are known for fast matching of exact strings, in which each character is uniquely defined and corresponds to a pre-defined alphabet. The Aho-Corasick algorithm, for example, is known to be computationally efficient and may be used in real-time applications; see, e.g., the paper by Alfred V. Aho and Margaret J. Corasick, “Efficient String Matching: An Aid to Bibliographic Search”, published in Communications of the ACM, June 1975, Volume 18, Number 6, p. 333-340. Software-based techniques for matching “inexact strings” are also known, but they are too slow for certain real-time applications, such as network security applications, which require fast execution; see, e.g., the paper by Ricardo A. Baeza-Yates and Gaston H. Gonnet, “A New Approach to Text Searching”, published in Communications of the ACM, Volume 35, Number 10, October 1992, p. 74-82.
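For orientation, a compact sketch of Aho-Corasick-style matching of several exact strings in a single pass is given below. It is an illustrative simplification written for this discussion, not the code of the cited paper; the function names and the list-of-dictionaries representation of the automaton are assumptions.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto, failure and output tables for a set of exact strings."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # patterns recognised on entering each state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass: failure links of shallow states are set first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            queue.append(child)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out

def search(text, automaton):
    """Return (start_index, pattern) for every occurrence, in one pass."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits

if __name__ == "__main__":
    automaton = build_automaton(["he", "she", "his", "hers"])
    print(search("ushers", automaton))
```

Each input character is consumed once and the number of failure-link steps is bounded overall by the length of the input, which is why the scanning time does not grow with the number of patterns, apart from the time needed to report matches.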
Regular expressions, as described, for example, in the paper by Ken Thompson, “Regular Expression Search Algorithm”, published in Communications of the ACM, Volume 11, Number 6, June 1968, p. 419-422, are commonly used for representing inexact strings. Regular expressions can be implemented efficiently using special-purpose hardware. However, methods for efficient implementation of regular expressions in a general-purpose computer are yet to be developed. Software implementations of regular expressions either require a memory of extremely large size or execute in an unbounded time, which is a function of the number of inexact strings to be considered, the complexity of the individual inexact strings, and the input data to be examined.
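The memory side of this trade-off can be made concrete with a standard textbook example that is not taken from the cited papers: the expression (a|b)*a(a|b){n}, meaning “the symbol n places before the last one is a”. A nondeterministic automaton for it needs only n + 2 states, but an equivalent deterministic table, which gives constant work per input character, needs 2^(n+1) states. The sketch below counts them by subset construction; the function names and the state numbering are assumptions.

```python
def nfa_step(states, ch, n):
    """Advance the set of active NFA states by one input character.

    State 0 loops on every character and also moves to state 1 on 'a'.
    States 1..n move forward on any character; state n + 1 is accepting.
    """
    nxt = set()
    for s in states:
        if s == 0:
            nxt.add(0)
            if ch == "a":
                nxt.add(1)
        elif s <= n:
            nxt.add(s + 1)
    return frozenset(nxt)

def count_dfa_states(n):
    """Count the deterministic states reachable by subset construction."""
    start = frozenset({0})
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for ch in "ab":
            nxt = nfa_step(current, ch, n)
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen)

def matches(text, n):
    """Simulate the NFA directly: small memory, more work per character."""
    states = frozenset({0})
    for ch in text:
        states = nfa_step(states, ch, n)
    return (n + 1) in states

if __name__ == "__main__":
    for n in (1, 3, 10):
        print(n, count_dfa_states(n))
```

Simulating the nondeterministic automaton keeps memory small, but the work per character then depends on how many states are active at once, which in turn depends on the pattern and on the data.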
One solution adopted in the prior art is a two-stage algorithm. In the first stage, an algorithm for simple search, such as the Aho-Corasick algorithm, efficiently finds the parts of packet data that contain some part of the patterns of interest. In the second stage, a slower regular-expression-based algorithm is applied to a potentially smaller number of patterns to detect inexact patterns. Such a solution can handle a large variety of inexact patterns but has significant drawbacks, including: (a) unpredictable computation effort to determine the existence, or otherwise, of a matching inexact string, the processing time being a function both of the data content and of the size and complexity of the patterns; (b) incomplete pattern identification, where only a part of a pattern may be found without readily defining the boundaries of the pattern in an examined data stream (verifying a match with regular expressions may require access to a large amount of preceding data up to the possible start point, and may require waiting for data that has not yet been received); and (c) a requirement for post-processing to detect patterns in order of occurrence, as neither the start nor the end points may be known in advance, forcing ensemble matching and sorting.
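The two-stage arrangement can be sketched as follows. This is a simplified illustration, not a description of any particular prior-art system: the signature list, its field names, and the use of Python's substring test as a stand-in for an Aho-Corasick pre-filter are all assumptions.

```python
import re

# Hypothetical signatures: an exact fragment for stage one and a
# regular expression for stage two.
SIGNATURES = [
    {"name": "name-pattern", "anchor": "ton",
     "regex": re.compile(r"J.*ton.Arm")},
    {"name": "card-pattern", "anchor": "card=",
     "regex": re.compile(r"card=4\d{3}-\d{4}")},
    {"name": "unused-pattern", "anchor": "passwd",
     "regex": re.compile(r"passwd=.{8,}")},
]

def two_stage_scan(data, signatures):
    # Stage one: cheap exact search keeps only the signatures whose
    # fragment occurs somewhere in the data.
    candidates = [sig for sig in signatures if sig["anchor"] in data]
    # Stage two: the slower regular-expression check runs on candidates
    # only; its cost depends on the data and on each expression.
    hits = []
    for sig in candidates:
        match = sig["regex"].search(data)
        if match:
            hits.append((match.start(), sig["name"]))
    # Post-processing: start points are known only after verification,
    # so hits must be sorted to report them in order of occurrence.
    hits.sort()
    return candidates, hits

if __name__ == "__main__":
    data = "GET /?user=Jack Newton Armstrong&card=1234"
    candidates, hits = two_stage_scan(data, SIGNATURES)
    print([sig["name"] for sig in candidates])
    print(hits)
```

In this sketch the second signature passes the first stage because its fragment is present, yet fails verification, which illustrates how a partial find does not by itself establish a match, and the final sort corresponds to the post-processing noted in drawback (c).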
Network intrusion detection and prevention is concerned with protecting computer systems from unintended or undesired network communications. A fundamental problem is determining whether packets in a data stream contain data strings of specific patterns (also called signatures) that are known to exploit software vulnerabilities in the computer systems. The number of such signatures of practical concern is very large, and their structure changes rapidly. Many of these signatures cannot practically be expressed as ordinary sequences of characters. For example, a credit-card number uniquely identifies a specific credit card, while a string comprising the digits common to the numbers of all credit cards issued by one bank does not uniquely identify a specific credit card.
A string inserted in a data stream may be harmful to a recipient of the data stream; hence there is a need to locate the string so that corrective action can be taken. Clearly, any means for detecting strings of special interest in a continuous data stream has to be sufficiently fast. One approach to fast detection is to devise special-purpose hardware circuitry with concurrent processing. However, considering the fast pace of network changes, a solution based on special-purpose hardware may be impractical.
A software solution is highly desirable because of its low cost, ease of deployment, and ease of adaptation to the changing communications environment. There is therefore a need for a software-based algorithm that can detect a large set of strings under execution-time constraints and memory limitations in order to enhance intrusion prevention systems (IPS) and intrusion detection systems (IDS).