Micron's Automata Processor (AP) (Document 1) is an innovative hardware accelerator for parallel finite automata-based regular expression matching. It is a non-von Neumann processor, which simulates nondeterministic finite automata (NFAs) mixed with Boolean logic gates and counters. The AP achieves this end by utilizing a bit-parallel technique (Document 2) within DRAM (Dynamic Random Access Memory), allowing it to run many NFAs in parallel.
The memory-derived AP can match an input byte stream with a large number of homogeneous-NFAs (Document 1) on the AP in parallel. Homogeneous-NFAs are computationally equivalent to traditional NFAs, the only difference being that homogeneous-NFAs match characters in states instead of on transitions, as shown in FIG. 1. Each NFA state (called a state transition element, or STE) in the current design of the AP is realized as a 256-bit memory column to represent the subset of 8-bit characters which that state matches. An STE matches on an input character when the input belongs to that state's character subset. This action intuitively operates as a character-OR in regular expressions.
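The STE matching rule described above can be illustrated in software. The following is a minimal sketch, not the AP's actual implementation: it models each STE as a 256-bit integer mask and steps a homogeneous-NFA over an input byte stream. The names make_ste, ste_matches, and run_homogeneous_nfa are hypothetical and introduced only for illustration, and start states are assumed to be enabled on every input symbol.

```python
def make_ste(chars):
    """Build a 256-bit mask with one bit set per accepted 8-bit symbol."""
    mask = 0
    for c in chars:
        mask |= 1 << c
    return mask


def ste_matches(mask, symbol):
    """An STE matches when the input symbol belongs to its character subset."""
    return ((mask >> symbol) & 1) == 1


def run_homogeneous_nfa(stes, edges, start, reporting, data):
    """Return the input offsets at which a reporting STE is active.

    stes:      list of 256-bit masks, one per state
    edges:     dict mapping a state index to its successor state indices
    start:     set of states enabled on every input symbol (assumption)
    reporting: set of states that report a match when active
    """
    active = set()
    reports = []
    for offset, symbol in enumerate(data):
        # A state is enabled if it is a start state or a successor of an
        # active state; it becomes active only if its subset has the symbol.
        enabled = set(start)
        for s in active:
            enabled.update(edges.get(s, ()))
        active = {s for s in enabled if ste_matches(stes[s], symbol)}
        if active & reporting:
            reports.append(offset)
    return reports


# Pattern a[bc]d: the middle STE holds the character-OR {b, c}.
stes = [make_ste(b"a"), make_ste(b"bc"), make_ste(b"d")]
edges = {0: [1], 1: [2]}
hits = run_homogeneous_nfa(stes, edges, {0}, {2}, b"xacd abd")
```

In this sketch the character-OR is simply the set of bits in one mask, which is why a single STE can accept either b or c without additional states.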
Many automata designs on the AP have been developed for accelerating real-world applications, including: network intrusion prevention (Document 1), DNA motif searching (Document 3), Brill tagging in natural language processing (Document 4), association rule mining (Document 5), entity resolution (Document 6), and DNA alignment (Document 7). These prior works report that the AP achieves more than 100× speedup over CPU implementations.
While working toward these results, as well as others that were unsuccessful, it has been found that many automata designs on the AP face a few common problems. One problem is the number of patterns that can be represented on the AP at one time and thus checked concurrently. This pattern density is constrained by hardware resources, such as limited STE capacity or inefficient routing. Hardware limitations on the number of states that an automaton may have can either limit the amount of parallelism that is available on the machine (making the application less efficient), or prevent an automaton from fitting on the AP altogether (making the application impossible to accelerate with the AP). Additionally, even some small automata may have a complicated transition topology, making them difficult to route on the architecture. This can cause a poor STE utilization rate. As a consequence, for large problems that exceed the capacity of the AP, one may have to either use more AP hardware, or perform the computation using many passes through the input stream.
Another problem is that it is difficult to fully utilize the powerful character-OR ability of STEs on the AP. One reason is that if many regular expression patterns are combined together using character-ORs, one cannot later distinguish which character was matched. The other reason is that there is a mismatch between common regular expression structures and the matching procedure of the AP, as real-world regular expressions typically operate over unions of larger subexpressions rather than over single character-ORs.
It is observed that it is possible to overcome these limitations. Since each STE is a 256-bit memory column, it can represent any one of 2^256 possible subsets of 8-bit characters. This gives plenty of entropy to represent more complex matching behavior than the simple character-OR.
To address these problems, a subset encoding method is proposed, which can help the AP achieve better utilization of its hardware resources for various applications. The subset encoding method encodes both application data and patterns into subsets of characters, as illustrated in FIG. 2. For example, it can encode a 64-character DNA sequence into a subset of 8-bit characters, so that only one STE is needed to represent this DNA sequence. By matching encoded data with encoded automata, a richer design space can be accessed, addressing the problems mentioned above.
The subset encoding method can fully exploit the character-OR ability of STEs by encoding a sequence of data into a subset of characters, which is then placed in a single self-loop STE. Effectively, the subset encoding method encapsulates a short transition history using only one STE, allowing that STE to match on a sequence of characters rather than being limited to matching on just one character. From here, the method can be adapted to find solutions that best fit the characteristics of individual applications by analyzing tradeoffs among input stride, input rate, and alphabet size.
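As a rough illustration of how a 64-character DNA sequence could fit in one STE, the sketch below assumes a simple position-and-base code: each (position, base) pair maps to the 8-bit value 4 × position + base index, so 64 positions over the 4-letter alphabet use exactly the 256 available codes. This particular mapping is an assumption made for illustration and is not necessarily the encoding used by the method described here; the names encode_symbol, pattern_to_ste, and self_loop_matches are likewise hypothetical.

```python
BASES = "ACGT"        # assumed 4-letter DNA alphabet
MAX_POSITIONS = 64    # 64 positions x 4 bases = 256 eight-bit codes


def encode_symbol(position, base):
    """Map a (position, base) pair to one 8-bit code (assumed scheme)."""
    return 4 * position + BASES.index(base)


def encode_sequence(seq):
    """Encode a DNA string as the list of its position-tagged 8-bit codes."""
    if len(seq) > MAX_POSITIONS:
        raise ValueError("sequence does not fit in one 256-bit STE")
    return [encode_symbol(i, b) for i, b in enumerate(seq)]


def pattern_to_ste(seq):
    """Store the whole pattern as a character subset in one 256-bit mask."""
    mask = 0
    for code in encode_sequence(seq):
        mask |= 1 << code
    return mask


def self_loop_matches(ste_mask, seq):
    """Model a self-loop STE: it stays active only while every encoded
    input symbol belongs to its character subset."""
    return all((ste_mask >> code) & 1 for code in encode_sequence(seq))


pattern = "ACGT" * 16                 # a 64-character DNA sequence
ste = pattern_to_ste(pattern)
exact = self_loop_matches(ste, pattern)
mutated = self_loop_matches(ste, "C" + pattern[1:])
```

Under this assumed scheme the input stream must be encoded with the same position tags as the pattern, which is one instance of the tradeoff among input stride, input rate, and alphabet size noted above.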
Furthermore, the subset encoding method improves the space efficiency and the degree of parallelism. This results in smaller automata structures with fewer states and connections, enabling more automata to be placed on the AP by reducing routing complexity. For example, subset encoding is applied to traditional Hamming distance automata and Levenshtein automata. The large structures required for these automata make them very difficult, and sometimes impossible, to place and route on the AP. It is shown that after applying the subset encoding method, the automata structures become highly compressed and thus require significantly fewer routing resources on the AP.
Experimental results show that by applying the subset encoding method, the space efficiency of these automata can be improved by 3× to 192×, depending on the application scenario. Multiple-stride subset encoding can be used to increase the input rate. As a result, overall performance on the AP can be significantly improved. This encoding technique will impact future decisions in the design of the AP or other automata-based co-processors.