Text processing applications deal with textual data encoded as strings or streams of characters following conventions of a particular character encoding scheme. Historically, many text processing applications have been developed that are based on fixed-width, single-byte, character encoding schemes such as ASCII and EBCDIC. Further, text processing applications involving textual data in various European languages or non-Roman alphabets may use one of the 8-bit extended ASCII schemes of ISO 8859. Still further, a number of alternative variable-length encoding schemes have been used for Chinese, Japanese or Korean applications.
Increasingly, Unicode is being used as a basis for text processing applications that may need to accommodate, and perhaps combine, text arising from different sources. The Unicode character set is designed to include characters of all the world's languages, as well as many additional characters arising from formal notation systems used in mathematics, music and other application areas. As is well known, UTF-8, UTF-16 and UTF-32 are the three basic encoding schemes of Unicode, based on 8-bit, 16-bit and 32-bit code units, respectively. In particular, UTF-8 is a variable-length encoding scheme that requires one to four 8-bit code units per character; UTF-16 is a variable-length encoding scheme that requires a single 16-bit code unit for most characters and two code units (a surrogate pair) for the less frequently used characters outside the Basic Multilingual Plane; and UTF-32 is a fixed-length encoding scheme that requires a single 32-bit code unit for each character. UTF-16 and UTF-32 have variations known as UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE, depending on byte-ordering conventions within code units.
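The code-unit counts described above can be illustrated with a minimal C sketch. The function names (is_scalar_value, utf8_units, utf16_units, utf32_units, code_unit_demo) are hypothetical and chosen only for this illustration. The sketch computes how many code units each encoding scheme needs for a given code point; it does not perform any actual encoding or decoding.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if cp is an encodable Unicode code point: at most U+10FFFF
   and not in the surrogate range U+D800..U+DFFF. */
int is_scalar_value(uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Number of 8-bit code units UTF-8 needs for cp (0 if cp is not encodable). */
int utf8_units(uint32_t cp) {
    if (!is_scalar_value(cp)) return 0;
    if (cp < 0x80) return 1;     /* 7-bit ASCII range */
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Number of 16-bit code units UTF-16 needs for cp (0 if not encodable). */
int utf16_units(uint32_t cp) {
    if (!is_scalar_value(cp)) return 0;
    return cp < 0x10000 ? 1 : 2; /* 2 means a surrogate pair */
}

/* UTF-32 always uses exactly one 32-bit code unit per character. */
int utf32_units(uint32_t cp) {
    return is_scalar_value(cp) ? 1 : 0;
}

/* Hypothetical usage example with a few well-known code points. */
void code_unit_demo(void) {
    assert(utf8_units(0x41) == 1);     /* U+0041 LATIN CAPITAL LETTER A */
    assert(utf8_units(0xE9) == 2);     /* U+00E9 e with acute accent */
    assert(utf8_units(0x20AC) == 3);   /* U+20AC EURO SIGN */
    assert(utf8_units(0x1D11E) == 4);  /* U+1D11E MUSICAL SYMBOL G CLEF */
    assert(utf16_units(0x20AC) == 1);
    assert(utf16_units(0x1D11E) == 2);
    assert(utf32_units(0x1D11E) == 1);
}
```

The musical symbol in the demo shows the trade-off in one case: it takes four code units in UTF-8, two in UTF-16 and one in UTF-32, which is 4 bytes of storage under each of the three schemes.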
While Unicode allows interoperation between applications and character streams from many different sources, it comes at some cost in processing efficiency when compared with legacy applications based on 8-bit character encoding schemes. This cost may become manifest in the form of additional hardware required to achieve desired throughput, additional energy consumption in carrying out an application on a particular character stream, and/or additional execution time for an application to complete processing.
Applications may further require that the content of data streams be structured according to lexical and/or syntactic conventions of a text-based notation system. Many such conventions exist, ranging from simple line-oriented structuring conventions used by various operating systems to formal programming language grammars used for representing computer programs as source language texts. Of special importance is the growing use of XML as a standard, text-based, markup language for encoding documents and data of all kinds. In each case, the imposition of structuring information may add considerably to resource requirements of relevant text processing applications.
In general, high-speed text processing in the prior art uses sequential, character-at-a-time (or byte-at-a-time) processing, often written in the C programming language. For example, much prior art for XML and Unicode string processing teaches use of the sequential character processing approach. This is also true of standard computing science textbooks dealing with parsing, lexical analysis, and text processing applications.
There are three basic techniques used in the prior art for implementing text processing applications. The first basic technique is a hand-coded implementation using iterative looping (for example, while loops) and branching instructions (for example, if-statements) to perform conditional actions based on particular characters or character classes. The second basic technique is a variation of the first in which decomposition of separate logic for different characters or character classes is handled through jump tables (for example, case statements). The third basic technique systematizes the use of tables in the form of finite state machines. Finite state machine implementations derive from standard theoretical techniques for string processing; namely, representing character and lexical syntax by regular expression grammars and recognizing character strings matching these grammars using finite automata. Finite state machine techniques can give efficient implementations when the number of states and the number of potential character transitions per state are reasonably small; for example, applications involving 7-bit ASCII processing require at most 128 entries per state. However, a straightforward implementation of finite state machines based on 16-bit representations of UTF-16 would require 65,536 entries per state. Thus, for state spaces of any complexity, the size of the transition tables quickly becomes prohibitive.
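As a minimal sketch of the first and third techniques, the following C code recognizes strings of one or more ASCII decimal digits, once with hand-coded loops and branches and once with a table-driven finite state machine. The recognizer, the state names and the function names are hypothetical and chosen only for illustration; the table has 128 entries per state, as described above for 7-bit ASCII.

```c
#include <assert.h>

enum { ASCII_SIZE = 128 };                      /* entries per state */
enum { S_START, S_DIGITS, S_REJECT, N_STATES }; /* three states */

/* Transition table: next state = decimal_fsm[state][character]. */
static unsigned char decimal_fsm[N_STATES][ASCII_SIZE];

/* Fill the table: digits lead to S_DIGITS, everything else rejects. */
void build_decimal_table(void) {
    for (int s = 0; s < N_STATES; s++)
        for (int c = 0; c < ASCII_SIZE; c++)
            decimal_fsm[s][c] = S_REJECT;
    for (int c = '0'; c <= '9'; c++) {
        decimal_fsm[S_START][c] = S_DIGITS;
        decimal_fsm[S_DIGITS][c] = S_DIGITS;
    }
}

/* Technique 1: hand-coded loop with a branch on every character. */
int is_decimal_branching(const char *s) {
    int seen_digit = 0;
    while (*s != '\0') {
        if (*s >= '0' && *s <= '9')
            seen_digit = 1;
        else
            return 0;
        s++;
    }
    return seen_digit;
}

/* Technique 3: table-driven finite state machine.
   build_decimal_table() must have been called first. */
int is_decimal_table(const char *s) {
    unsigned char state = S_START;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (c >= ASCII_SIZE)
            return 0;            /* outside the 7-bit ASCII table */
        state = decimal_fsm[state][c];
    }
    return state == S_DIGITS;
}

/* Hypothetical usage example: both recognizers agree. */
void fsm_demo(void) {
    build_decimal_table();
    assert(is_decimal_branching("12345") && is_decimal_table("12345"));
    assert(!is_decimal_branching("12a45") && !is_decimal_table("12a45"));
    assert(!is_decimal_branching("") && !is_decimal_table(""));
    assert(sizeof decimal_fsm == N_STATES * ASCII_SIZE); /* 384 bytes */
}
```

With 128 columns this three-state table occupies 384 one-byte entries. Indexing the same three states directly by 16-bit UTF-16 code units would need 3 × 65,536 = 196,608 entries, which illustrates how table size grows with the code unit width.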
Industry standard processors have evolved through 8-bit, 16-bit and 32-bit architectures. In addition, character encoding schemes have evolved from the 8-bit representations of extended ASCII through the 16-bit and 32-bit representations of UTF-16 and UTF-32. Through this period of evolution of processor architectures and character encoding schemes, there has been a rough match between processor capabilities and the requirements of character-at-a-time processing.
Although the evolution of character encoding has now likely reached a point of long-term stability through the Unicode standard, processor architectures are continuing to evolve. In particular, recent years have seen an increasing mismatch between processor capabilities and character-at-a-time processing requirements. Specifically, industry standard processor architectures now routinely include capabilities for single-instruction, multiple-data processing based on 128-bit registers, while processors with 64-bit general purpose registers are being increasingly deployed. These registers are potentially capable of dealing with a number of characters or code units at a time; for example, up to 16 UTF-8 code units could be processed at once using a 128-bit register. In addition, processors have developed sophisticated instruction and data caching facilities for increasing throughput. With respect to instruction caching, in particular, throughput advantages provided by pipelining are largely negated by sequential character processing software that is heavily laden with branch instructions for conditional character logic. Data cache behavior may also be a problem, particularly for finite-state machine and other table-based implementations that may use large transition or translation tables.
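The idea that a wide register can hold several code units at once can be sketched in portable C using a 64-bit general purpose register, which holds 8 UTF-8 code units; a 128-bit register would hold 16 in the same way, but that would require processor-specific instructions not shown here. The function names all_ascii_wide and wide_register_demo are hypothetical. The sketch is an illustration only, not a description of any particular prior art implementation: it tests whether a buffer contains only 7-bit ASCII code units by checking the high bit of 8 code units with a single AND operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if every code unit in buf[0..len) has its high bit clear,
   that is, if the buffer is pure 7-bit ASCII. */
int all_ascii_wide(const unsigned char *buf, size_t len) {
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    size_t i = 0;

    /* Main loop: one load and one AND cover 8 code units. */
    for (; i + 8 <= len; i += 8) {
        uint64_t word;
        memcpy(&word, buf + i, sizeof word); /* alignment-safe load */
        if (word & high_bits)
            return 0;
    }

    /* Tail: fewer than 8 code units remain, handled one at a time. */
    for (; i < len; i++)
        if (buf[i] & 0x80)
            return 0;
    return 1;
}

/* Hypothetical usage example. */
void wide_register_demo(void) {
    const char *plain = "hello, world 12345";
    const char *early = "caf\xC3\xA9 au lait";  /* non-ASCII in first word */
    const char *late  = "0123456789\xC3\xA9";   /* non-ASCII in the tail */

    assert(all_ascii_wide((const unsigned char *)plain, strlen(plain)) == 1);
    assert(all_ascii_wide((const unsigned char *)early, strlen(early)) == 0);
    assert(all_ascii_wide((const unsigned char *)late, strlen(late)) == 0);
}
```

The main loop contains one conditional branch per 8 code units rather than one per code unit, which is the kind of reduction in branch density that wider registers make possible.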