String (pattern) match algorithms are widely used in network intrusion detection, business analytics, extensible markup language (XML) processing, search engines, and the front ends of compilers and interpreters. These algorithms account for a large fraction of the total processing time in representative benchmarks for network intrusion detection (around 75%) and business analytics (around 50%). Bit-parallel string match algorithms are among the most compute- and storage-efficient algorithms for pattern matching.
One of the most popular pattern matching algorithms is the backward non-deterministic DAWG matching (BNDM) algorithm. BNDM is a bit-parallel pattern match algorithm that simulates a backward NFA traversal within a shift window, beginning from the back of the pattern and following all transitions for the character being examined. Because it performs a backward (suffix) NFA traversal within the shift window, this algorithm can typically terminate earlier than a prefix approach, which simulates a forward NFA traversal and has to examine every character in the text. The NFA traversal is simulated using per-character bit-masks, and because these bit-masks are operated on in parallel when they fit within a computer register, this class of algorithms is called “bit-parallel” string matching.
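The backward scan described above can be illustrated with a minimal sketch, assuming the pattern fits in one 64-bit word. The function names `build_masks` and `bndm_search` are hypothetical and chosen only for illustration; the code follows the commonly published form of BNDM and is not taken from any particular implementation.

```python
def build_masks(pattern):
    """Offline stage: one bit-mask per character of the pattern.

    Pattern position i (0-based) is mapped to bit (m - 1 - i), so the
    first pattern character owns the most significant of the m bits.
    """
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - i))
    return masks


def bndm_search(pattern, text):
    """Online stage: return the start offsets of all matches in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > 64:
        raise ValueError("sketch assumes 1 <= len(pattern) <= 64")
    masks = build_masks(pattern)
    full = (1 << m) - 1      # m one-bits: every NFA state active
    top = 1 << (m - 1)       # bit that signals "a pattern prefix was seen"
    hits = []
    pos = 0
    while pos <= n - m:
        j = m                # characters of the window still unread
        last = m             # shift to apply once this window is done
        state = full
        while state != 0:
            # Read the window right to left, keeping only states that
            # survive the transition on the current character.
            state &= masks.get(text[pos + j - 1], 0)
            j -= 1
            if state & top:
                if j > 0:
                    last = j     # a pattern prefix starts at window offset j
                else:
                    hits.append(pos)   # the whole window matched
                    break
            state = (state << 1) & full
        pos += last
    return hits
```

For example, `bndm_search("abc", "xxabcxabc")` scans the first window `xxa` from the right, finds that only the prefix `a` survives, and shifts by two positions without ever reading the first two characters of that window.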
All bit-parallel pattern match algorithms are divided into two stages. In the first stage (offline), a set of bit-masks is constructed from the pattern to be searched for, and in the second stage (online), these bit-masks are used to search a text of arbitrary length. Bit-parallel algorithms are classified by where the search begins (prefix or suffix) and by how far the window is shifted (pure suffix or factor). For the purposes of this description, we assume that the length of the pattern being searched for does not exceed the maximum scalar width of the processor (i.e., the processor word, currently 64 bits), but this is not an inherent limitation of the algorithm itself. Patterns longer than 64 characters can be handled by using a byte array per character. For example, a 512-character pattern match can be simulated with a 64-byte array per character that holds the bit-mask for that character.
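The byte-array representation for long patterns can be sketched as follows. This is only an illustration under stated assumptions: the helper names `build_wide_masks` and `and_then_shift_left` are hypothetical, the byte order within each array (least significant byte first) is an arbitrary choice, and the bit layout mirrors the single-word case (pattern position i maps to bit m - 1 - i).

```python
def build_wide_masks(pattern):
    """Offline stage for long patterns: one byte array per character.

    Each array has ceil(m / 8) bytes, so a 512-character pattern
    yields a 64-byte mask per character.
    """
    m = len(pattern)
    nbytes = (m + 7) // 8
    masks = {}
    for i, ch in enumerate(pattern):
        bit = m - 1 - i
        arr = masks.setdefault(ch, bytearray(nbytes))
        arr[bit // 8] |= 1 << (bit % 8)
    return masks


def and_then_shift_left(state, mask, m):
    """One NFA step on byte arrays: (state & mask) << 1, kept to m bits.

    The carry out of each byte is propagated into the next byte, which
    is the work a single register operation does in the one-word case.
    """
    out = bytearray(len(state))
    carry = 0
    for k in range(len(state)):
        v = state[k] & mask[k]
        out[k] = ((v << 1) | carry) & 0xFF
        carry = v >> 7
    extra = len(state) * 8 - m   # unused high bits in the last byte
    if extra:
        out[-1] &= 0xFF >> extra
    return out
```

The byte-by-byte loop with explicit carry propagation shows why this fallback is slower than the single-register case: each simulated NFA step touches every byte of the mask instead of completing in one word-wide operation.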
In a conventional system, these algorithms fall back to byte arrays in memory when the pattern length exceeds the scalar register width. Currently, there is no hardware that can efficiently support such an algorithm in that case.