1. Field of the Invention
The present invention relates to the field of computer software, and in particular to a lexical analyzer that can be configured at runtime to accept multiple languages.
Sun, Sun Microsystems, the Sun logo, Solaris and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks are used under license and are trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
2. Background Art
Computer software, which comprises one or more computer instructions, must be processed by a system known as a “compiler” before it can be executed by an intended computing environment. More specifically, the software steps by which a human is able to give instructions to a computer must be transformed by the compiler into a machine readable form for execution by processing hardware units. Thus, the function of a compiler is to transform computer instructions existing in a first representation (i.e., one understandable by a human) to computer instructions existing in a second representation (i.e., one understandable by a machine).
One component of a compiler is called a lexical analyzer. The lexical analyzer scans the characters of the source code and divides them into tokens for use in later compilation steps. Current lexical analyzers are static, meaning they will only scan for tokens known at the time the lexical analyzer was made. Thus, each lexical analyzer is bound to a certain token set which cannot easily be changed. Before discussing this problem, an overview of a compiler is provided.
Compiler
FIG. 1 shows the steps taken by an ordinary compiler. As illustrated in FIG. 1, the compiler comprises a parser 101, a translator 103, and a code generator 105. The parser 101 receives input in the form of source files 100 (e.g., C++ .cpp and .hpp files) and generates a high-level representation 102 of the source code. This high-level representation 102 may include, for example, a tokenized version of the source code file. The translator 103 receives the high-level representation 102 and translates the operations into an intermediate form 104 that describes the operations. The intermediate form 104 is transformed by the code generator 105 into executable code 106 configured to run on a specific platform.
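By way of illustration only, the three stages of FIG. 1 can be sketched as a chain of functions. The function names, the postfix input format and the imaginary stack-machine output are all assumptions made for this sketch and are not part of any actual compiler.

```python
# Hypothetical three-stage pipeline mirroring FIG. 1; every name here is illustrative.

def parse(source):
    """Parser 101: produce a high-level representation (here, a list of tokens)."""
    return source.split()


def translate(tokens):
    """Translator 103: map tokens to an intermediate form of (operation, operand) pairs."""
    intermediate = []
    for tok in tokens:
        if tok.isdigit():
            intermediate.append(("PUSH", tok))
        elif tok == "+":
            intermediate.append(("ADD", ""))
        else:
            raise ValueError(f"unexpected token: {tok!r}")
    return intermediate


def generate(intermediate):
    """Code generator 105: emit text for an imaginary stack machine."""
    return "\n".join(f"{op} {arg}".strip() for op, arg in intermediate)


# Source files 100 -> representation 102 -> intermediate form 104 -> executable code 106
executable = generate(translate(parse("1 2 +")))
print(executable)
```

Each stage consumes only the output of the stage before it, which is why the lexical analysis performed inside the parser can be considered separately from translation and code generation.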
Compilers must parse source code to be able to translate it into object code. Parsing is often divided into lexical analysis and semantic parsing.
Tokens
Lexical analysis concentrates on dividing strings into components, called tokens, based on punctuation and other keys. Semantic parsing then attempts to determine the meaning of the string. A token is a sequence of characters that is treated as a unit in the grammar for a programming language. Tokens are grouped into types. Each token type is described by a pattern. A lexeme is the specific sequence of characters from a source file that matches a pattern. Each language has its own token types, patterns and lexemes.
Token types include numbers, string literals, identifiers, character constants, reserved words (or keywords) and operators. Keywords are sequences of letters and possibly other characters that are reserved by the language. Common examples are “while”, “if” and “return”. Each keyword is a token. Operators are character sequences consisting of non-alphanumeric characters and are used by the language to represent operations. An operator may have one or more characters and must be unique. Examples are “+”, “>=” and “(”. Like the keyword token type, each operator is a token.
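The relationship between token types, patterns and lexemes may be sketched as follows. The pattern table below is an assumption chosen for illustration (it covers only a few of the token types named above and does not describe any particular language), and the function name tokenize is hypothetical.

```python
import re

# Illustrative token types and their patterns. Order matters: KEYWORD is listed
# before IDENTIFIER so that reserved words are not classified as identifiers.
TOKEN_PATTERNS = [
    ("NUMBER", r"[0-9]+"),
    ("STRING", r'"[^"\n]*"'),
    ("KEYWORD", r"\b(?:while|if|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR", r">=|\+|\(|\)"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))


def tokenize(source):
    """Return (token type, lexeme) pairs; spaces between tokens are skipped."""
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        # lastgroup names the token type; group() is the lexeme that matched it
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for token_type, lexeme in tokenize("if (count >= 10) return count + 1"):
    print(token_type, lexeme)
```

In this sketch the lexeme "count" and the lexeme "10" are distinct character sequences from the input, while IDENTIFIER and NUMBER are the token types whose patterns they match.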
Each token pattern defines a language. For example, the language for numbers is the set of all strings consisting only of the digits 0 through 9. The language for the reserved word “if” consists of the single string “if”.
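The notion that a pattern defines a language can be illustrated as a membership test: a string belongs to the language if the whole string matches the pattern. The sketch below uses regular expressions as the pattern notation, which is an assumption of this example, as is the choice to require at least one digit in a number. The helper name in_language is hypothetical.

```python
import re

# Patterns for two of the languages described above (regular expressions are
# an assumed notation; a number is assumed to contain at least one digit).
NUMBER = re.compile(r"[0-9]+")
IF_KEYWORD = re.compile(r"if")


def in_language(pattern, text):
    """True when the entire string is a member of the language the pattern defines."""
    return pattern.fullmatch(text) is not None


print(in_language(NUMBER, "2024"))     # digits only
print(in_language(NUMBER, "20x4"))     # contains a non-digit
print(in_language(IF_KEYWORD, "if"))   # the single string in the language
print(in_language(IF_KEYWORD, "iff"))  # not that string
```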
Certain source code structures do not constitute tokens. For example, comments, pre-processor directives, and spaces do not constitute tokens.
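As a rough illustration, the non-token structures listed above can be discarded before or during scanning. The sketch below assumes C-style comment and directive syntax and deliberately ignores complications such as comment markers inside string literals. The name strip_non_tokens is hypothetical.

```python
import re

# Assumed C-style syntax. The whitespace alternative matches one character at a
# time so that an indented directive on the following line is still recognized.
NON_TOKENS = re.compile(
    r"""
      //[^\n]*           # line comment
    | /\*.*?\*/          # block comment (may span lines)
    | ^[ \t]*\#[^\n]*    # pre-processor directive at the start of a line
    | \s                 # a single space, tab or newline
    """,
    re.VERBOSE | re.DOTALL | re.MULTILINE,
)


def strip_non_tokens(source):
    """Replace comments, directives and spaces with single separators."""
    return " ".join(NON_TOKENS.sub(" ", source).split())


sample = '#include <stdio.h>\n  #define N 1\n// entry point\nint x = N; /* set\n x */\n'
print(strip_non_tokens(sample))
```

Only the characters that can form tokens remain in the result; the comments, directives and extra spaces contribute nothing to the token stream.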
The token set is critical because it defines the operations comprising a computer program. Each programming language has a unique set of tokens. As such, each programming language requires a unique lexical analyzer.
Lexical Analysis
Lexical analyzers are typically subroutines of parsers. The parser invokes the lexical analyzer when it needs to examine the next token in a sequence. When the lexical analyzer is invoked, it reads input characters until it reaches the next token.
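The calling relationship described above may be sketched as a parser that pulls one token at a time from the lexical analyzer. The class Lexer, its next_token method and the toy grammar (numbers separated by “+”) are assumptions made for this sketch only.

```python
class Lexer:
    """Hypothetical on-demand lexical analyzer: scans only as far as the next token."""

    def __init__(self, source):
        self.source = source
        self.pos = 0  # index of the next unread input character

    def next_token(self):
        """Read input characters until one complete token is found, then return it."""
        src = self.source
        while self.pos < len(src) and src[self.pos].isspace():
            self.pos += 1  # spaces separate tokens but are not tokens
        if self.pos >= len(src):
            return ("EOF", "")
        start = self.pos
        if "0" <= src[start] <= "9":
            while self.pos < len(src) and "0" <= src[self.pos] <= "9":
                self.pos += 1
            return ("NUMBER", src[start:self.pos])
        self.pos += 1  # any other character is treated as a one-character operator
        return ("OPERATOR", src[start])


def parse_sum(lexer):
    """Toy parser for NUMBER ('+' NUMBER)*; it asks the lexer for each token as needed."""
    kind, text = lexer.next_token()
    if kind != "NUMBER":
        raise SyntaxError("expected a number")
    total = int(text)
    kind, text = lexer.next_token()
    while (kind, text) == ("OPERATOR", "+"):
        kind, text = lexer.next_token()
        if kind != "NUMBER":
            raise SyntaxError("expected a number after '+'")
        total += int(text)
        kind, text = lexer.next_token()
    if kind != "EOF":
        raise SyntaxError(f"unexpected token {text!r}")
    return total


print(parse_sum(Lexer("12 + 30 + 4")))
```

Because the analyzer advances only when next_token is called, the parser controls the pace of scanning, and no token is produced before the parser asks for it.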
One example of a tool for producing lexical analyzers is Lex. To use Lex, a separate file containing definitions, analyzer rules and user subroutines must be written before source code can be analyzed.
Thus, a lexical analyzer is a static program that is either generated by a tool such as Lex to understand certain tokens or is programmed by hand. There is no way to instruct such a lexical analyzer at runtime to understand new or added tokens in different languages. This approach is problematic because tokens can only be added by modifying the source code for the analyzer, or the file from which the analyzer is generated, and then rebuilding it. This process is slow, prone to error and expensive.
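The limitation can be illustrated with a deliberately small sketch in which the token set is fixed when the analyzer is written. The keyword set and the function name classify are hypothetical and do not represent the output of Lex or of any other tool.

```python
# Hypothetical static analyzer: the keyword set is baked into the program text.
KEYWORDS = frozenset({"while", "if", "return"})


def classify(word):
    """Classify a word using only the token set known when the analyzer was written."""
    return "KEYWORD" if word in KEYWORDS else "IDENTIFIER"


print(classify("if"))     # known when the analyzer was written
print(classify("until"))  # a keyword in some other language, unknown here
```

In this sketch the only way to have “until” recognized as a keyword is to edit KEYWORDS in the analyzer's source and redeploy the analyzer, which is the kind of modification the preceding paragraph describes.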