Many network security applications in today's networks are based on deep packet inspection, which examines not only the header portion but also the payload portion of a packet. Multi-pattern regex matching, in which packet payloads are matched against a large set of patterns, is an important algorithm in such network security applications. Processor vendors are steadily increasing the number of cores in a single chip, a trend observed not only in multi-core processors but also in many-core processors. Since deep packet inspection is often a bottleneck in packet processing, exploiting parallelism on multi-core and many-core architectures is key to improving overall performance. In the past, advances in process technology and frequency scaling allowed the majority of computer applications to gain performance without requiring structural changes or custom hardware acceleration. While such advances continue, their effect on modern applications is less dramatic as other obstacles, such as the memory wall and the power wall, come into play. Under these additional constraints, the primary method of extracting extra performance from a computing system is to introduce additional specialized resources, thus making the computing system heterogeneous. A platform combining a CPU and a GPU is a typical example of such a heterogeneous computing system.
Because of their massive parallelism and computational power, Graphics Processing Units (GPUs), the typical many-core devices, have become a viable platform for general-purpose parallel computing. Multi-pattern regex matching, a computationally intensive algorithm, is therefore well suited to being offloaded from the CPU to the GPU.
Prior art technology typically uses a DFA (Deterministic Finite Automaton), mDFA (multiple DFA), or their edge compression algorithm D2FA (Delayed Input DFA) to perform multi-pattern regex matching. State compression algorithms such as HFA (Hybrid Finite Automata) and XFA (Extended Finite Automata) are not suitable for the GPU because such algorithms contain many logic branches.
FIG. 1 illustrates a multi-core and many-core processor system 100 including a CPU 110 and a GPU 120. A pattern compiler 111 on the CPU 110 includes a unit for creating and optimizing DFA 112 and a unit for DFA state encoding 113. The GPU 120 includes a global memory 121 on which a memory 123 is implemented that may exchange data with a host packet buffer 114 on the CPU 110 by DMA (direct memory access). The global memory 121 further includes a results buffer 122 whose matching results 115 are accessible from the CPU 110 for output. The global memory 121 further includes a state machine memory 124 that is coupled to the unit for DFA state encoding 113 on the CPU 110. The GPU 120 further comprises a set of computing units 125a, 125b, 125c, each including a DFA kernel 127 with a set of streams 126a, 126b, 126c, and 126d.
The multi-core and many-core processor system 100 can perform multi-pattern regex matching on GPU 120.
A flow diagram of a compiling process performed on the CPU 110 is illustrated in FIG. 2. The compiling process compiles all regex patterns into a state machine data structure 112 and encodes it into a DFA state table 113. The compiling process further uploads the DFA state table 113 into the GPU global memory 120, 124.
The stages of the compiling process are illustrated in FIG. 2. After starting 201 the compiling process, the multi NFA is created 202. For each pattern in the pattern set 203, the pattern is compiled to an NFA 204 and the NFA is linked with the multi NFA 205. This is repeated until the last pattern 206. Then the multi NFA is compiled to a DFA and optimized 207, the DFA state is encoded 208, the DFA state table is uploaded to the GPU global memory 209, and the compiling process is finished 210.
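The compile stages above (per-pattern automata linked into one combined automaton, then flattened into a dense state table) can be sketched for the simplified case of literal patterns. This is only an illustrative substitute for the regex compiler of FIG. 2: it uses the Aho-Corasick construction, in which the trie plus failure links play the role of the linked multi NFA, and the completed transition table corresponds to the encoded DFA state table. All function names are hypothetical.

```python
from collections import deque

def build_dfa(patterns, alphabet):
    """Compile literal patterns into a dense DFA transition table
    (Aho-Corasick): build a trie, link every pattern into it, then
    complete each state's transitions using failure links."""
    goto, output = [dict()], [set()]       # state 0 is the root
    for pat in patterns:                   # link each pattern into the trie
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append(dict())
                output.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        output[s].add(pat)
    fail = [0] * len(goto)
    queue = deque()
    for ch in alphabet:                    # depth-1 states fail to the root
        nxt = goto[0].get(ch)
        if nxt is None:
            goto[0][ch] = 0                # missing root edges loop back
        else:
            queue.append(nxt)
    while queue:                           # BFS completes the DFA
        s = queue.popleft()
        output[s] |= output[fail[s]]       # inherit matches via failure link
        for ch in alphabet:
            nxt = goto[s].get(ch)
            if nxt is None:
                goto[s][ch] = goto[fail[s]][ch]
            else:
                fail[nxt] = goto[fail[s]][ch]
                queue.append(nxt)
    sym = {ch: i for i, ch in enumerate(alphabet)}
    table = [[goto[s][ch] for ch in alphabet] for s in range(len(goto))]
    return table, output, sym

def match(table, output, sym, data):
    """Single state-table walk: exactly one transition per input symbol."""
    s, hits = 0, []
    for i, ch in enumerate(data):
        s = table[s][sym[ch]]
        for pat in output[s]:
            hits.append((i - len(pat) + 1, pat))   # (start offset, pattern)
    return hits
```

Once the table is built, matching never branches on pattern structure, which is the property that makes table-driven DFAs attractive for GPU execution.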
FIG. 3 shows the overall architecture of Gnort 300, a network intrusion detection system (NIDS). Gnort uses the GPU 320 for multi-pattern matching in Snort, the network intrusion detection system on which Gnort is founded. The CPU 310 collects packets 311, decodes 312, and preprocesses 313 them. A separate buffer is used for temporarily storing the packets of each group. After a packet has been classified 314 to a specific group 315, it is copied to the corresponding buffer 321. Whenever the buffer gets full, all packets are transferred to the GPU 320 in one operation. The GPU 320 performs multi-pattern matching and puts matches into a result buffer 326, from which the CPU 310 gets the matches 327 and then continues processing. Once the packets have been transferred to the GPU 320, the pattern matching operation is performed on a plurality of multiprocessors 325 by using a packet texture 323 and a state table texture 324. The algorithm iterates through all the bytes of the input stream and moves the current state to the next correct state using a state machine that has been previously constructed during the initialization phase.
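The classify-then-buffer stage of the Gnort pipeline can be sketched as follows. This is a minimal illustration, not Gnort's actual implementation: the classifier, the grouping key, and the `transfer` callback (a stand-in for the batched host-to-GPU copy) are all assumptions.

```python
def classify(packet):
    """Hypothetical classifier: group packets by destination port so that
    each group can be matched against its own rule subset."""
    return packet["dst_port"]

class GroupBuffers:
    """Per-group staging buffers: a packet is classified to a group,
    appended to that group's buffer, and the whole buffer is handed to
    the GPU in one transfer once it fills up."""

    def __init__(self, capacity, transfer):
        self.capacity = capacity
        self.transfer = transfer   # stand-in for the one-shot copy to the GPU
        self.buffers = {}          # group id -> list of pending packets

    def add(self, packet):
        group = classify(packet)
        buf = self.buffers.setdefault(group, [])
        buf.append(packet)
        if len(buf) >= self.capacity:      # buffer full: flush in one operation
            self.buffers[group] = []
            self.transfer(group, buf)
            return group                   # report which group was flushed
        return None
```

Batching whole buffers amortizes the per-transfer overhead of the host-to-GPU copy, which is the reason Gnort does not ship packets to the GPU one at a time.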
FIG. 4 illustrates the flow diagram for CPU side processing 400. After start 401, a packet is received from a network 402 and the packet is preprocessed 403. If fast path 404 is enabled, the next packet is received from the network. If fast path 404 is not enabled, the packet is put into a packet buffer 405. If the buffer is full or a timeout occurs 406, the packet buffer is transferred to the GPU by using direct memory access (DMA) 407 and matches are received from the GPU 408, which are used for the next process 409. If the buffer is neither full nor a timeout occurs 406, the next packet is received from the network.
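The buffer-until-full-or-timeout step of the CPU side loop can be sketched as a small accumulator. This is an illustrative model only; the class name and the `flush_to_gpu` callback (standing in for the DMA transfer and the subsequent receipt of matches) are assumptions, and timestamps are injectable so the logic is testable without real delays.

```python
import time

class PacketBatcher:
    """Accumulate packets in a host buffer; flush the whole buffer to the
    GPU when it fills up or when a timeout has elapsed since the first
    packet arrived, mirroring decision 406 of FIG. 4."""

    def __init__(self, capacity, timeout_s, flush_to_gpu):
        self.capacity = capacity
        self.timeout_s = timeout_s
        self.flush_to_gpu = flush_to_gpu   # stand-in for DMA + receive matches
        self.buffer = []
        self.first_arrival = None

    def put(self, packet, now=None):
        """Add one packet; return the flush result if a transfer fired,
        else None (i.e. keep receiving packets from the network)."""
        now = time.monotonic() if now is None else now
        if not self.buffer:
            self.first_arrival = now       # timeout counts from first packet
        self.buffer.append(packet)
        if (len(self.buffer) >= self.capacity
                or now - self.first_arrival >= self.timeout_s):
            batch, self.buffer = self.buffer, []
            return self.flush_to_gpu(batch)
        return None
```

The timeout bounds the latency added by batching: a trickle of traffic still reaches the GPU within `timeout_s` even though the buffer never fills.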
FIG. 5 illustrates a schematic diagram 500 for GPU side processing and FIG. 6 shows a corresponding flow diagram 600 for GPU side processing. Input data 510 includes N packets 511, 512, 513, 514. Each packet is processed by a respective thread 521, 522, 523, 524 and forwarded in an ordered sequence to the state transitions table 520. After start 601, the GPU packet buffer is checked 602. If the buffer is not empty, the kernel starts pattern matching 603 and N threads 604, 605, 606, 607 match respective packets 1 to N by using the state transitions table 520. The result is written to the result buffer 608.
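The thread-per-packet mapping of FIGS. 5 and 6 can be sketched as follows, with CPU worker threads standing in for GPU threads (an actual implementation would be a CUDA or OpenCL kernel). The shared transition table is indexed by the current state and the next payload byte, exactly one lookup per byte; the toy table used in testing and the function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def dfa_match_packet(table, accepting, payload):
    """Walk the shared state-transition table over one packet's bytes,
    as each thread of FIG. 5 does for its assigned packet. Returns the
    payload offsets at which an accepting state was reached."""
    state, hits = 0, []
    for offset, byte in enumerate(payload):   # bytes iterate as ints
        state = table[state][byte]            # one table lookup per byte
        if state in accepting:
            hits.append(offset)
    return hits

def match_batch(table, accepting, packets):
    """One worker per packet mirrors the thread-per-packet mapping of
    FIG. 6; the returned list is the per-packet result buffer."""
    with ThreadPoolExecutor(max_workers=len(packets)) as pool:
        return list(pool.map(
            lambda p: dfa_match_packet(table, accepting, p), packets))
```

Note that every worker reads the same `table`; on a GPU this corresponds to all threads contending for the state table in global memory, which is precisely the access pattern criticized below.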
In order to fully utilize the massive parallelism, the prior art treats each thread as a regex engine that accesses the state table in global memory and searches one packet at a time.
Since the DFA algorithms described above have excessive space complexity, the DFA state table is very large, usually extending to tens or even hundreds of megabytes (MB). Each thread needs to visit the whole DFA state table, so threads access the GPU global memory very frequently during pattern matching, which dramatically decreases performance. Furthermore, the waiting time for a thread to terminate is long: packet sizes in a network can differ considerably, so the workload of each thread is unequal, and the first thread to finish may have to wait until the last thread has finished. This deteriorates the overall parallelism of the GPU.