The detection of a particular pattern in a data stream is used in many computing environments. For example, in fields such as virus detection, the data stream that is being received by a computer needs to be monitored for the presence of viruses. The virus checker will be able to recognise specific viruses and also viruses of generic types. The virus checker will have access to a data structure that includes a large number of different patterns, probably over a thousand in number. The patterns can comprise simple character sequences (strings) such as “password” or can be specified in a more flexible way, for example, using regular expressions that can include generic references to character classes and the number of occurrences of certain characters and character sequences.
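The two kinds of pattern mentioned above can be illustrated with a minimal sketch. The pattern set, the names LITERAL_PATTERNS and REGEX_PATTERNS, and the function find_patterns are hypothetical and chosen only for illustration; a real virus checker would hold a far larger set and would not scan pattern by pattern in this way.

```python
import re

# Hypothetical pattern set: one simple string and one regular expression.
LITERAL_PATTERNS = [b"password"]
# Four or more digits followed by one or more characters from the class A-F.
REGEX_PATTERNS = [re.compile(rb"[0-9]{4,}[A-F]+")]


def find_patterns(data: bytes) -> list:
    """Return (pattern text, start offset) for every match found in data."""
    hits = []
    # Simple character sequences: repeated substring search.
    for literal in LITERAL_PATTERNS:
        start = data.find(literal)
        while start != -1:
            hits.append((literal.decode(), start))
            start = data.find(literal, start + 1)
    # Regular expressions: character classes and occurrence counts.
    for regex in REGEX_PATTERNS:
        for match in regex.finditer(data):
            hits.append((regex.pattern.decode(), match.start()))
    return hits
```

Scanning each pattern separately, as this sketch does, costs time proportional to the number of patterns, which is one reason a shared data structure covering all patterns is preferred when the set runs to a thousand or more.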
A data stream that is received by a computer, and which needs to be analysed, will be formed of a series of bytes, and in common protocols such as TCP/IP (used for Internet communication) these bytes will be received in the form of data packets. The data packets that form the data stream are scanned for the presence of the stored patterns as the stream is received. This scanning can be executed by software, or in some environments a dedicated ASIC or an FPGA can be used to carry out the pattern matching. If a pattern is detected, an output signal is generated and, depending upon the application, an action such as deleting the pattern from the data packet is executed.
All known pattern matching systems have one or more weaknesses. These include a large storage requirement for the data structure, a high consumption of processing resources, the difficulty of performing the pattern matching in real time on streamed data, and the difficulty of updating the data structure storing the patterns when new patterns for new viruses are to be added.
A. V. Aho and M. J. Corasick, “Efficient string matching: An aid to bibliographic search,” Communications of the ACM, vol. 18, no. 6, pp. 333-340, 1975, describes an algorithm for performing pattern matching by constructing a conventional state transition diagram. The algorithm consists of constructing a finite state pattern matching machine from the keywords and then using the machine to process the text string in a single pass. The approach combines the ideas of the Knuth-Morris-Pratt algorithm with those of finite state machines. The storage efficiency, pattern-matching performance, and update performance of this method are, however, rather limited.
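The general shape of such a finite state pattern matching machine can be sketched as follows. This is a simplified illustration of the Aho-Corasick technique, not a reproduction of the cited paper's pseudocode; the class name AhoCorasick, the method feed, and the choice to keep the machine state between calls (so that a keyword split across two data packets is still found) are assumptions made for this example.

```python
from collections import deque


class AhoCorasick:
    """Keyword-matching state machine that is fed a byte stream in chunks."""

    def __init__(self, keywords):
        self.goto = [{}]   # goto[state][byte] -> next state
        self.fail = [0]    # fail[state] -> fallback state on a mismatch
        self.out = [[]]    # out[state] -> keywords ending at this state
        for keyword in keywords:
            self._add(keyword)
        self._build_failure_links()
        self.state = 0     # current state, preserved between feed() calls
        self.offset = 0    # number of stream bytes consumed so far

    def _add(self, keyword):
        state = 0
        for byte in keyword:
            if byte not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][byte] = len(self.goto) - 1
            state = self.goto[state][byte]
        self.out[state].append(keyword)

    def _build_failure_links(self):
        # Breadth-first traversal: shallower states are completed first.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for byte, target in self.goto[state].items():
                queue.append(target)
                fallback = self.fail[state]
                while fallback and byte not in self.goto[fallback]:
                    fallback = self.fail[fallback]
                self.fail[target] = self.goto[fallback].get(byte, 0)
                # Inherit keywords that end at the fallback state.
                self.out[target] = self.out[target] + self.out[self.fail[target]]

    def feed(self, chunk):
        """Consume one chunk; return (end offset in stream, keyword) pairs."""
        hits = []
        for byte in chunk:
            while self.state and byte not in self.goto[self.state]:
                self.state = self.fail[self.state]
            self.state = self.goto[self.state].get(byte, 0)
            for keyword in self.out[self.state]:
                hits.append((self.offset, keyword))
            self.offset += 1
        return hits
```

As a usage example, a machine built from the keywords b"he", b"she", b"his" and b"hers" and fed b"us" followed by b"hers" reports b"she" and b"he" ending at stream offset 3 and b"hers" ending at offset 5, even though the keywords straddle the two chunks. The sketch also hints at the limitations noted above: each state carries its own transition table, and adding a keyword requires the failure links to be rebuilt.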