Many data mining and knowledge discovery frameworks rely on sequential scanning of data streams for keyword scanning and pattern matching, for example in intrusion detection systems and in document indexing, such as the Lucene index writer by The Apache Software Foundation.
Scanning is the process of partitioning a stream of fine-grained data (referred to as characters in this specification) into tokens, a token being a sequence of characters from the original stream. The component that performs the operation of scanning is called a scanner. The output of a stream scanner is a sequence of tokens, which is input to a downstream component in a stream processing system, e.g., for the purpose of document indexing, data mining, filtering, and other uses. The operation of such a downstream component is orthogonal to the operation of the scanner component and is therefore not elaborated further in this disclosure.
A scanner is typically part of a stream processing system as illustrated in FIG. 6, which consists of a scanner 002 and a downstream processing component 004. The scanner 002 receives as input a character or binary data stream 001. The output of the scanner 002, which is also the input to the downstream processing component 004, is a stream of tokens 003.
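The arrangement of FIG. 6 can be sketched in a few lines of Python. This is only an illustrative sketch, not the disclosed system: the function names `scanner` and `downstream`, the choice of whitespace as the token delimiter, and the use of token counting as the downstream operation are all assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, Iterator


def scanner(chars: Iterable[str]) -> Iterator[str]:
    """Hypothetical scanner (002): turns a character stream (001) into tokens (003).

    Assumption for this sketch: tokens are maximal runs of non-whitespace characters.
    """
    buf = []
    for ch in chars:              # characters are consumed strictly in stream order
        if ch.isspace():
            if buf:               # whitespace ends the token being built
                yield "".join(buf)
                buf = []
        else:
            buf.append(ch)
    if buf:                       # flush the last token at end of stream
        yield "".join(buf)


def downstream(tokens: Iterable[str]) -> Counter:
    """Hypothetical downstream component (004): counts occurrences of each token."""
    return Counter(tokens)


# The token stream (003) connects the two components.
counts = downstream(scanner(iter("to be or not to be")))
```

Because `scanner` is a generator, the downstream component pulls tokens one at a time and therefore cannot proceed faster than the scanner produces them.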
The process of scanning is commonly done sequentially: each token is recognized by a finite state machine that processes characters in the order delivered by the stream; moreover, individual tokens are recognized in sequence, i.e., the character following the end of one token is the start of the next token. This sequential method of token recognition can limit the rate at which a stream is processed if the time required to recognize a token exceeds the time in which the characters of that token can be obtained from the input stream. In other words, the sequential computation required for token recognition can be the limiting factor in the throughput that the overall stream processing system can achieve.
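The sequential dependency described above can be made concrete with a minimal table-driven scanner. The sketch below is illustrative only; the state names, the two token classes (words and numbers), the transition table, and the decision to skip characters that start no token are assumptions of the example and are not taken from this disclosure.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical states of the finite state machine.
START, IN_WORD, IN_NUMBER = "start", "word", "number"


def classify(ch: str) -> str:
    """Map a character to the input class used by the transition table."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"


# (current state, input class) -> next state; a missing entry ends the token.
TRANSITIONS = {
    (START, "alpha"): IN_WORD,
    (START, "digit"): IN_NUMBER,
    (IN_WORD, "alpha"): IN_WORD,
    (IN_WORD, "digit"): IN_WORD,
    (IN_NUMBER, "digit"): IN_NUMBER,
}


def scan(chars: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (token kind, token text) pairs, one character at a time."""
    state, lexeme = START, []
    for ch in chars:
        cls = classify(ch)
        nxt = TRANSITIONS.get((state, cls))
        if nxt is None:
            if state != START:
                yield state, "".join(lexeme)       # the current token ends here
            # The same character is then tried as the start of the next token.
            state = TRANSITIONS.get((START, cls), START)
            lexeme = [ch] if state != START else []
        else:
            state = nxt
            lexeme.append(ch)
    if state != START:
        yield state, "".join(lexeme)               # flush the final token


tokens = list(scan("abc 123 x9"))
```

Each loop iteration needs the state left behind by the previous one, and the start of a token is known only after the preceding token has ended. The work therefore cannot be divided among several threads without a different algorithm.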
The theory of finite state machines and the construction of scanners for character streams are well understood and have been extensively studied and documented in the literature. A significant shortcoming of known scanning algorithms is that they operate serially on the stream and are not parallelized.
Recent technology trends have led to steady increases in network stream bandwidth on the one hand and stagnating single-thread computational performance of microprocessors on the other. These trends create a dilemma and call for methods to accelerate the sequential paradigm of stream scanning.