Field
This disclosure relates to the field of data processing systems. More particularly, this disclosure relates to querying input data.
Background
It is known to provide hardware accelerators for certain processing tasks. One target domain for such accelerators is natural language processing (NLP). The explosive growth in electronic text, such as tweets, logs, news articles, and web documents, has generated interest in systems that can process these data quickly and efficiently. The conventional approach to analysing vast text collections, namely scale-out processing on large clusters with frameworks such as Hadoop, incurs high costs in energy and hardware. A hardware accelerator that can support ad-hoc queries on large datasets would be useful.
The Aho-Corasick algorithm is one example algorithm for exact pattern matching. The running time of the algorithm is linear in the size of the input text. The algorithm makes use of a trie (prefix tree) to represent a state machine for the search terms being considered. FIG. 1 of the accompanying drawings shows an example Aho-Corasick pattern matching machine for the following search terms, added in order: ‘he’, ‘she’, ‘his’ and ‘hers’. Pattern matching commences at the root of the trie (state or node 0), and state transitions are based on the current state and the input character observed. For example, if the current state is 0, and the character ‘h’ is observed, the next state is 1.
The algorithm utilizes the following information during pattern matching:
    Outgoing edges, to enable a transition to a next state based on the input character observed.
    Failure edges, to handle situations where, even though a search term mismatches, a suffix of one search term may match a prefix of another. For example, in FIG. 1, failure in state 5 takes the pattern matching machine to state 2 and then to state 8 if an ‘r’ is observed.
    Patterns that end at the current node. For example, the output function of state 7 is the pattern ‘his’.
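By way of illustration only, the three kinds of information listed above can be sketched in software as follows. This is a minimal sketch and not a definitive implementation: the function names `build_machine` and `search` and the list-of-dictionaries layout are assumptions made for this example. States are numbered in the order in which they are created, so adding the search terms in the stated order is assumed to reproduce the state numbers referred to for FIG. 1.

```python
from collections import deque


def build_machine(terms):
    """Build outgoing edges, failure edges and outputs for the given terms."""
    goto = [{}]        # goto[s][ch] -> next state (outgoing edges)
    output = [set()]   # output[s]   -> search terms that end at state s
    for term in terms:
        state = 0
        for ch in term:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(term)

    # Failure edges are computed breadth-first, so the failure edge of a
    # shallower state is always known before it is needed.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())   # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A state also reports every term reported by its failure state.
            output[nxt] |= output[fail[nxt]]
    return goto, fail, output


def search(text, machine):
    """Return (start index, term) pairs for every occurrence in text."""
    goto, fail, output = machine
    state, matches = 0, []
    for i, ch in enumerate(text):
        # Follow failure edges until an outgoing edge exists or the root is hit.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for term in sorted(output[state]):
            matches.append((i - len(term) + 1, term))
    return matches


machine = build_machine(["he", "she", "his", "hers"])
print(search("ushers", machine))
```

In this sketch, the failure edge of state 5 leads to state 2, and state 2 has an outgoing edge on ‘r’ to state 8, which corresponds to the failure example given above.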
Typically, to ensure constant run time performance, each node in the pattern matching machine stores an outgoing edge for all the characters in the alphabet being considered. Therefore, each node has a branching factor of N, where N is the alphabet size. For example, for traditional ASCII, the branching factor is 128. However, storing all possible outgoing edges entails a high storage cost. A technique to reduce the required storage through bit-split state machines has been proposed by Tan and Sherwood (L. Tan and T. Sherwood. A High Throughput String Matching Architecture for Intrusion Detection and Prevention. In Computer Architecture, 2005. ISCA '05. Proceedings. 32nd International Symposium on, 2005). The authors propose the splitting of each byte state machine into n-bit state machines. Since the bit state machine only has two outgoing edges for each node, the storage requirement is reduced drastically. Each state in the bit state machine corresponds to one or more states in the byte state machine. If the intersection of all bit state machines maps to the same state in the byte state machine, a match has been found and is reported.
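The bit-split idea described above can be illustrated with the following simplified sketch. It is not the cited authors' implementation. It assumes eight machines, one per bit of an 8-bit input byte, each derived from the byte state machine by a subset construction; the names `build_byte_dfa`, `bit_split`, `build_bit_machines` and `search` are hypothetical and chosen for this example only. Each bit state records the set of search terms reported by any of its corresponding byte states, and a match is reported when a term is present in that set for all eight machines.

```python
from collections import deque


def build_byte_dfa(terms):
    """Byte state machine with a next state stored for all 256 byte values."""
    goto, out = [{}], [set()]
    for idx, term in enumerate(terms):
        s = 0
        for byte in term:
            if byte not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][byte] = len(goto) - 1
            s = goto[s][byte]
        out[s].add(idx)
    fail = [0] * len(goto)
    delta = [[0] * 256 for _ in goto]
    for byte, t in goto[0].items():
        delta[0][byte] = t
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        out[s] |= out[fail[s]]
        for byte in range(256):
            if byte in goto[s]:
                t = goto[s][byte]
                fail[t] = delta[fail[s]][byte]
                delta[s][byte] = t
                queue.append(t)
            else:
                delta[s][byte] = delta[fail[s]][byte]
    return delta, out


def bit_split(delta, out, bit):
    """Binary machine for one bit position; each bit state is a set of byte states."""
    start = frozenset([0])
    index = {start: 0}
    trans, partial = [], []
    queue = deque([start])
    while queue:
        group = queue.popleft()
        row = []
        for v in (0, 1):   # exactly two outgoing edges per bit state
            nxt = frozenset(delta[s][byte] for s in group
                            for byte in range(256)
                            if ((byte >> bit) & 1) == v)
            if nxt not in index:
                index[nxt] = len(index)
                queue.append(nxt)
            row.append(index[nxt])
        trans.append(row)
        partial.append(set().union(*(out[s] for s in group)))
    return trans, partial


def build_bit_machines(terms):
    delta, out = build_byte_dfa(terms)
    return [bit_split(delta, out, bit) for bit in range(8)]


def search(data, machines, terms):
    """Report (start index, term) when all eight bit machines agree on a term."""
    states = [0] * len(machines)
    matches = []
    for i, byte in enumerate(data):
        common = None
        for b, (trans, partial) in enumerate(machines):
            states[b] = trans[states[b]][(byte >> b) & 1]
            seen = partial[states[b]]
            common = set(seen) if common is None else common & seen
        for idx in sorted(common):
            matches.append((i - len(terms[idx]) + 1, terms[idx]))
    return matches


terms = [b"he", b"she", b"his", b"hers"]
machines = build_bit_machines(terms)
print(search(b"ushers", machines, terms))
```

In this sketch, each row of a bit machine holds two next-state entries, compared with 256 entries per row in the byte state machine from which it is derived.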
Since regular expression matching involves harder-to-encode state transitions, transition rules that offer greater degrees of flexibility may be used. Transition rules of the form <current state, input character, next state> can be used to represent state machine transitions for regular expression matching. Van Lunteren et al. (J. Lunteren, C. Hagleitner, T. Heil, G. Biran, U. Shvadron, and K. Atasu. Designing a programmable wire-speed regular-expression matching accelerator. In Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM International Symposium on, 2012) use rules stored using the technique of balanced routing tables; this technique provides a fast hash lookup to determine next states. In contrast, Bremler-Barr and co-authors (A. Bremler-Barr, D. Hay, and Y. Koral. CompactDFA: Generic state machine compression for scalable pattern matching. In INFOCOM, 2010 Proceedings IEEE, 2010) encode states such that all transitions to a specific state can be represented by a single prefix that defines a set of current states. Therefore, the pattern-matching problem is effectively reduced to a longest-prefix matching problem.
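As a toy illustration of transition rules and of longest-prefix selection, consider the sketch below. It does not reproduce the state encoding or the table organisation of either cited work. The state codes, the rule table and the names `CODES`, `RULES`, `DEFAULT`, `next_state` and `run` are all hypothetical, and the machine merely recognises the string "ab". Each rule pairs a prefix of the current state's code with an input character, and the rule with the longest matching prefix is selected.

```python
# Hypothetical binary codes for a toy machine that recognises the string "ab".
CODES = {"start": "00", "seen_a": "10", "seen_ab": "11"}

# Rules of the form (current-state code prefix, input character) -> next state.
# The empty prefix matches every current state.
RULES = {
    ("", "a"): "seen_a",       # from any state, 'a' leads to seen_a
    ("10", "b"): "seen_ab",    # only from seen_a does 'b' complete "ab"
}
DEFAULT = "start"              # used when no rule matches


def next_state(state, ch):
    """Select the rule whose prefix is the longest match for the state code."""
    code = CODES[state]
    for n in range(len(code), -1, -1):       # longest prefix first
        target = RULES.get((code[:n], ch))   # one hash lookup per prefix length
        if target is not None:
            return target
    return DEFAULT


def run(text):
    """Return the start index of every occurrence of "ab" in text."""
    state, hits = "start", []
    for i, ch in enumerate(text):
        state = next_state(state, ch)
        if state == "seen_ab":
            hits.append(i - 1)
    return hits


print(run("cabab"))
```

Here the single rule with the empty prefix stands in for three separate <current state, ‘a’, seen_a> rules, one per current state, which is the kind of saving that prefix-based encodings aim for.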