Some Information Retrieval (IR) systems use a class of extraction techniques that rely on finding patterns of tokens within a document. To search for a pattern within specific language constructs, these systems first convert the document, which is essentially a character stream, into a form more usable for downstream processing, called a token stream. Most IR systems stop at that point, however, and treat the tokens in the stream as the base primitives representing the document. Because those tokens are essentially strings, the systems tend to spend many Central Processing Unit (CPU) cycles searching for them within the document, incurring the cost of character-level primitives (e.g., string comparisons) for most of the downstream processing.
Such processing techniques are not well suited to Graphics Processing Unit (GPU) architectures, which require data movement between a CPU host and a target GPU to be kept to a minimum. Processing tokens at the character level and searching for them is also expensive within the constraints of GPU Single Instruction, Multiple Data (SIMD) processing.
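As an illustration only, the string-based token processing described above might look like the following minimal sketch. The function names `tokenize` and `find_pattern`, the regular-expression tokenization rule, and the sample document are all assumptions made for this example and are not taken from any particular IR system. The sketch shows a character stream being converted into a token stream, after which every pattern lookup still falls back on string comparisons between tokens.

```python
import re


def tokenize(text):
    """Convert a character stream into a token stream.

    Assumption: a token is a run of word characters, lowercased.
    """
    return re.findall(r"\w+", text.lower())


def find_pattern(tokens, pattern):
    """Return the start indices where `pattern` occurs in `tokens`.

    Each element comparison is a string comparison, so the cost of
    every candidate position depends on the lengths of the tokens.
    """
    n, m = len(tokens), len(pattern)
    return [i for i in range(n - m + 1) if tokens[i:i + m] == pattern]


# Hypothetical sample document, used only for illustration.
doc = "The quick brown fox jumps over the lazy dog. The quick brown cat sleeps."
tokens = tokenize(doc)
matches = find_pattern(tokens, ["quick", "brown"])
print(matches)  # start positions of the two-token pattern
```

Because the tokens remain variable-length strings, each comparison in `find_pattern` can take a different number of character-level steps. This kind of irregular, data-dependent work is the sort that tends to map poorly onto SIMD lanes, which execute the same instruction in lockstep.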