A lexical analyzer breaks an input stream of characters into programming language tokens (or simply “tokens”). A token is the basic component of source code. In accordance with the rules of the programming language, tokens are typically categorized into one of five classes that describe their function: constants, identifiers, operators, reserved words, and separators. For example, a lexical analyzer may take source code as input and break it into tokens, producing output that a parser may use to generate byte code.
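The five token classes named above can be illustrated with a minimal sketch. It assumes a small C-like toy language; the reserved-word list, the regular expressions, and the names `RESERVED_WORDS`, `TOKEN_PATTERN`, and `tokenize` are hypothetical choices made for illustration and are not taken from any particular lexical analyzer.

```python
import re

# Hypothetical reserved words for a toy C-like language (an assumption).
RESERVED_WORDS = {"if", "else", "while", "return", "int"}

# One named alternative per token class; "skip" matches white space.
TOKEN_PATTERN = re.compile(r"""
    (?P<constant>\d+(?:\.\d+)?|"[^"\n]*")
  | (?P<identifier>[A-Za-z_]\w*)
  | (?P<operator>==|!=|<=|>=|[+\-*/=<>])
  | (?P<separator>[(){};,])
  | (?P<skip>\s+)
""", re.VERBOSE)


def tokenize(source):
    """Break a character stream into (token_class, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_PATTERN.match(source, pos)
        if match is None:
            raise ValueError(
                f"unexpected character {source[pos]!r} at offset {pos}")
        kind = match.lastgroup  # name of the alternative that matched
        text = match.group()
        pos = match.end()
        if kind == "skip":
            continue  # white space separates tokens but is not a token here
        if kind == "identifier" and text in RESERVED_WORDS:
            kind = "reserved_word"  # reserved words look like identifiers
        tokens.append((kind, text))
    return tokens
```

Calling `tokenize("if (x >= 10) return x;")` yields a list of pairs in which `if` and `return` are classed as reserved words, `x` as an identifier, `>=` as an operator, `10` as a constant, and the parentheses and semicolon as separators. A parser would consume such a list rather than the raw characters.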
Traditional lexical analyzers were typically designed to perform a single specific task, such as pre-processing source code, compiling source code, or pretty printing. These lexical analyzers were constructed by hand or with a generic lexical analyzer generator. Each such lexical analyzer incorporated certain assumptions about what constituted a token. The assumptions usually consisted of generic rules that did not address the specific implementation details of a particular programming language. Additionally, traditional lexical analyzers allowed the user to manipulate the actions executed when a particular token, as defined by the assumptions incorporated into the lexical analyzer, was encountered.
FIG. 1 illustrates a set of prior art lexical analyzers. Inputs A, B, and C (2, 4, and 6, respectively, in FIG. 1) correspond to source code, or any other type of input stream that requires lexical analysis. Lexical Analyzers A, B, and C (8, 10, and 12, respectively, in FIG. 1) are each designed for a specific function. For example, the function of Lexical Analyzer A (8) may be to compile source code. Thus, Lexical Analyzer A (8) would include lexical rules to ignore and discard comments and white space present in Input A (2) to produce Output A (14). Similarly, the function of Lexical Analyzer B (10) may be pretty printing. Thus, Lexical Analyzer B (10) would include rules to ignore white space and preserve comments in Input B (4) to produce Output B (16). Further, Lexical Analyzer C (12) may be a pre-processor that preserves both comments and white space.
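The differing treatment of comments and white space by the three analyzers can be sketched as follows. This is only an illustration under stated assumptions: the toy language uses `//` line comments, all non-comment, non-white-space text is lumped into a single `code` class, and the names `_RULES`, `_scan`, `lexical_analyzer_a`, `lexical_analyzer_b`, and `lexical_analyzer_c` are hypothetical. For brevity the three functions share one scanning helper, whereas each analyzer in the arrangement described above would carry its own fixed rules.

```python
import re

# Hypothetical lexical rules for a toy language with // line comments.
# The comment rule comes first so that "//" is not read as two "/" tokens.
_RULES = re.compile(
    r"(?P<comment>//[^\n]*)"
    r"|(?P<whitespace>\s+)"
    r"|(?P<code>[A-Za-z_]\w*|\d+|[^\sA-Za-z_0-9])"
)


def _scan(source, keep_comments, keep_whitespace):
    """Return (kind, text) pairs, dropping comments or white space on request."""
    output = []
    for match in _RULES.finditer(source):
        kind = match.lastgroup
        if kind == "comment" and not keep_comments:
            continue
        if kind == "whitespace" and not keep_whitespace:
            continue
        output.append((kind, match.group()))
    return output


def lexical_analyzer_a(source):
    """Compiler front end: discards both comments and white space."""
    return _scan(source, keep_comments=False, keep_whitespace=False)


def lexical_analyzer_b(source):
    """Pretty printer: ignores white space but preserves comments."""
    return _scan(source, keep_comments=True, keep_whitespace=False)


def lexical_analyzer_c(source):
    """Pre-processor: preserves both comments and white space."""
    return _scan(source, keep_comments=True, keep_whitespace=True)
```

Given the same input, for instance `"x = 1 // set x\n"`, the first function returns only the three code tokens, the second additionally returns the comment, and the third returns every piece of the input, so that concatenating its token texts reproduces the original string.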