1. Field of the Invention
The invention relates to a system and method of processing structured documents. In particular, the invention relates to a system and method of lexical analysis of a structured document to produce a set of lexical tokens.
2. Description of the Related Art
The need for more robust and capable forms of data exchange on the Internet has resulted in a movement away from easily processed binary-formatted or line-based text documents toward structured documents in standardized formats such as, for example, extensible markup language (XML), hypertext markup language (HTML), or standard generalized markup language (SGML). These structured documents typically are composed of human-readable text containing markup symbols that define the logical structure and interrelationship of data in the document. Processing of a structured document typically begins with two steps, lexical analysis and parsing. Lexical analysis, or tokenizing, generally refers to the process of receiving a string of data bytes for a document, segmenting those bytes into one or more “lexemes,” and assigning a “token” to each lexeme. A token is an identifier that labels the lexeme as belonging to the class of strings associated with that token type. A token type may represent strings that contain only alphanumeric characters, numbers, a punctuation symbol, or any other string of data bytes that has a particular logical relevance in a document. Parsing generally refers to a subsequent stage of syntactic analysis that uses the tokens as input to derive a desired data structure representing the document. The tokens may include information about a document's structure. The process of tokenizing a document augments the raw information of the document by grouping character sequences into meaningful, higher-order, labeled objects that form the document's structure in order to simplify subsequent parsing steps. Some token values may correspond to a keyword or fixed literal string, so that only the token value needs to be reported to the parser. In other cases, the token value indicates only the class of an associated lexeme, so the parser also needs the actual characters that comprise the lexeme.
For example, XML documents contain named attributes, so an XML lexical analyzer may produce a token for attributes. Each attribute token output from the tokenizer to the parser also carries with it a corresponding lexeme, which in this case is the attribute's name. The token type may signal to the parser that it needs to add an entry to an attribute table and the lexeme is the value to add. In general, the parser uses the token type to direct its activity and the lexeme, if so indicated by the token type, is the object of the activity.
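By way of illustration only, the relationship between token types and lexemes described above may be sketched as follows. The sketch assumes a greatly simplified XML-like start tag and uses regular expressions for brevity; the token type names (TAG_OPEN, ATTR_NAME, EQ, ATTR_VALUE, TAG_CLOSE), the CARRIES_LEXEME set, and the tokenize function are hypothetical and are not taken from any particular tokenizer.

```python
import re

# Hypothetical token types for a simplified XML-like start tag.
# Types in CARRIES_LEXEME name a class of strings, so the lexeme text
# must accompany the token. The remaining types are fixed literals,
# so only the token type is reported.
TOKEN_SPEC = [
    ("TAG_OPEN",   r"<[A-Za-z_][\w.-]*"),
    ("ATTR_NAME",  r"[A-Za-z_][\w.-]*"),
    ("EQ",         r"="),
    ("ATTR_VALUE", r'"[^"<]*"'),
    ("TAG_CLOSE",  r">"),
    ("SKIP",       r"\s+"),
]
CARRIES_LEXEME = {"TAG_OPEN", "ATTR_NAME", "ATTR_VALUE"}
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Segment text into (token_type, lexeme_or_None) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"no token matches at offset {pos}")
        kind = m.lastgroup
        if kind != "SKIP":  # whitespace is examined but yields no token
            lexeme = m.group() if kind in CARRIES_LEXEME else None
            tokens.append((kind, lexeme))
        pos = m.end()
    return tokens


for token in tokenize('<item id="42" lang="en">'):
    print(token)
```

In this sketch, a parser receiving an ATTR_NAME token would use the accompanying lexeme (for example, id) as the entry to add to its attribute table, whereas an EQ or TAG_CLOSE token directs the parser's activity without any lexeme.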
Lexical analyzers have typically been used in applications such as computer software compilers where processing performance is not at a premium. A variety of methods of tokenizing exist that are well known to those of skill in the art. In particular, state machines, such as deterministic finite automata (DFA), are typically used in tokenizers that run as software on a general purpose computer processor. However, in high-volume applications, such as email or other server applications, software implementations may not be adequate. Performing lexical analysis is a computationally expensive step, because each byte or symbol of the information being analyzed must be processed. While not every symbol may be assigned to a token, every symbol is typically examined to make that determination. The number of tokens of output is typically significantly less than the number of symbols of input. For example, if the average number of symbols per token in a particular application is 10, then the token output rate is 1/10th the symbol input rate. In some applications, ignoring some symbols may not affect later parsing. Thus, ignoring these symbols leads to a further reduction in the number of tokens that are output. Generally, in languages such as HTML and XML, virtually every symbol maps to a token.
When a DFA is used to perform the tokenizing process, a state machine engine is used to execute a representation of a state machine designed to recognize the lexemes that comprise the language to be parsed. A state machine has an initial state, intermediate states, and one or more terminal states. Execution always begins with the initial state. The initial state has only out-transitions to other states, or possibly one or more transitions back to itself. Intermediate states have at least one in-transition and at least one out-transition. Terminal states have only in-transitions. Associated with each transition is a character from the symbol set the machine recognizes. As each character of input is processed, it is matched to a transition out of the current state, causing the state machine to change states. The process is repeated until a terminal state is reached. The terminal state indicates which lexeme has been identified or that there was no match, which may indicate an error.
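By way of illustration only, the execution of such a state machine may be sketched as follows. The sketch assumes a hypothetical DFA that recognizes three XML markup openers, namely the strings </, <?, and <!--. The state names, the TRANSITIONS mapping, and the run_dfa function are assumptions introduced for this example, and a missing transition or exhausted input is treated as a transition to a NO_MATCH terminal state.

```python
# Hypothetical DFA: START is the initial state, LT, BANG and DASH are
# intermediate states, and the states in TERMINAL have no out-transitions.
TRANSITIONS = {
    ("START", "<"): "LT",
    ("LT", "/"): "END_TAG_OPEN",
    ("LT", "?"): "PI_OPEN",
    ("LT", "!"): "BANG",
    ("BANG", "-"): "DASH",
    ("DASH", "-"): "COMMENT_OPEN",
}
TERMINAL = {"END_TAG_OPEN", "PI_OPEN", "COMMENT_OPEN", "NO_MATCH"}


def run_dfa(text, pos=0):
    """Run from text[pos]; return (terminal_state, characters_consumed)."""
    state = "START"
    start = pos
    while state not in TERMINAL:
        if pos >= len(text):
            # Input ended before a terminal state was reached.
            return "NO_MATCH", pos - start
        # Match the current character to a transition out of the current
        # state; any character without a transition leads to NO_MATCH.
        state = TRANSITIONS.get((state, text[pos]), "NO_MATCH")
        pos += 1
    return state, pos - start


print(run_dfa("<!-- note -->"))
print(run_dfa("</item>"))
print(run_dfa("<item>"))
```

The dictionary used here stores only the transitions that exist, which keeps the example short. A dense state transition table would instead reserve one entry for every combination of state and symbol.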
In an implementation of a lexical analyzer using the DFA approach, a state machine is generally translated into a state transition table representation that is executed by a state machine engine. In any given state machine, each non-terminal state may have an out-transition for each possible character or symbol. Therefore, the state transition table representation must be sized accordingly. Hence, the amount of memory required by a state transition table is proportional to the product of the number of states and the total number of possible characters the machine recognizes. ASCII (American Standard Code for Information Interchange) can be represented using 7 bits, so the worst case size of the symbol set is 128. Other character sets, such as EBCDIC (Extended Binary Coded Decimal Interchange Code) and the fifteen 8-bit ISO 8859 character sets used for European languages, ISO 8859-1 (Latin-1) for example, are represented using 8 bits, so there can be at most 256 symbols. The Unicode standard has support for hundreds of languages with code points for thousands of characters. The UTF-16 representation uses two bytes for most characters with provisions to use four bytes for extended character sets. Just the two-byte characters require support for up to 65,536 symbols. Typical state machines have hundreds of states, so the memory requirements for supporting two-byte characters can rapidly become prohibitive, especially for hardware implementations. Thus, tokenizers typically support only a one-byte representation of input symbols. When Unicode is supported, UTF-8, which represents most of the non-ASCII characters using multibyte sequences of two to six bytes, is typically employed, and the data is processed one byte at a time. Because both HTML and XML support Unicode, support for high performance processing of Unicode symbols is desirable for many applications.
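By way of illustration only, the growth of a dense state transition table with symbol set size may be computed as follows. The figure of 500 states and the width of 2 bytes per table entry are assumptions chosen for this example, and the function names are hypothetical.

```python
def table_entries(num_states, num_symbols):
    """Entries in a dense table: one per (state, symbol) pair."""
    return num_states * num_symbols


def table_bytes(num_states, num_symbols, bytes_per_entry=2):
    """Memory for the dense table, assuming a fixed entry width."""
    return table_entries(num_states, num_symbols) * bytes_per_entry


NUM_STATES = 500  # assumed; stands in for "hundreds of states"
SYMBOL_SETS = [
    ("7-bit ASCII", 128),
    ("8-bit character set", 256),
    ("two-byte characters", 65536),
]

for label, num_symbols in SYMBOL_SETS:
    entries = table_entries(NUM_STATES, num_symbols)
    size = table_bytes(NUM_STATES, num_symbols)
    print(f"{label}: {entries:,} entries, {size:,} bytes")
```

Under these assumptions, moving from an 8-bit symbol set to two-byte characters multiplies the table size by 256, from 128,000 entries to 32,768,000 entries.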
However, a drawback to processing one byte at a time is lower performance compared with an implementation that can process two bytes at a time. Thus, a need exists for tokenizers that support a multi-byte representation of symbols without the impractically large state machines that would be required with a DFA.
One potential way to improve the throughput of document processing on a general purpose computer processor system is to offload portions of the processing to special purpose content processors. Content processors typically comprise dedicated electronic hardware adapted to perform portions of document processing in a server. Thus, one way of increasing the throughput of lexical analysis is to perform this task using specialized content processor hardware. However, the large size of the state machines generated for a typical high level language such as, for example, XML, has limited the application of hardware solutions such as, for example, field programmable gate arrays (FPGA) that might be employed in a content processor. Thus, a need exists for improved systems and methods of tokenizing documents.