1. Field of the Invention
The present invention relates generally to the field of compiler design, and more particularly to the construction of lexical analyzers that can efficiently accept multibyte character sets.
2. Discussion of Related Art
Lexical analyzers are one of the cornerstones of compiler construction and are used in many areas of computer science for a multitude of applications. Identifying words and delimiters is a necessary step in any language processing task. The main task of a lexical analyzer is to read input characters from a source program and produce as output a sequence of tokens. This process is also called "tokenization" because it generates word and punctuation tokens.
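By way of illustration only, the tokenization described above can be sketched as follows. The token classes WORD and PUNCT, the Token type, and the tokenize function are hypothetical names chosen for this sketch; they are not taken from any particular prior art analyzer, and a regular expression stands in for the table-driven recognizer that a production lexical analyzer would typically use.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str  # "WORD" or "PUNCT" (hypothetical token classes)
    text: str  # the characters that make up the token


# A word is a run of letters, digits, or underscores; punctuation is any
# single character that is neither a word character nor whitespace.
_TOKEN_RE = re.compile(r"(?P<WORD>\w+)|(?P<PUNCT>[^\w\s])")


def tokenize(source: str) -> Iterator[Token]:
    """Read input characters and yield a sequence of tokens.

    Whitespace matches neither alternative, so it is skipped and acts
    only as a delimiter between tokens.
    """
    for match in _TOKEN_RE.finditer(source):
        yield Token(match.lastgroup, match.group())


if __name__ == "__main__":
    for token in tokenize("x = y + 1;"):
        print(token.kind, token.text)
```

In this sketch the input "x = y + 1;" yields six tokens: three word tokens (x, y, 1) and three punctuation tokens (=, +, ;).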
In the past, lexical analyzers have been built to recognize only characters that fall within single byte character sets. Single byte character sets are sufficient for representing English and most European languages. However, the character sets of languages from the Pacific Rim, such as the Kanji character set used in Japan, require two bytes to represent their multitude of characters.
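The difference in storage per character can be seen in a short sketch. It assumes the Shift_JIS encoding purely as one example of an encoding that mixes single byte and two byte characters; the encoding choice and the encoded_length helper are assumptions made for illustration and are not named in the discussion above.

```python
def encoded_length(ch: str, encoding: str = "shift_jis") -> int:
    """Return the number of bytes used to encode one character.

    The default encoding is an assumption chosen for illustration.
    """
    return len(ch.encode(encoding))


if __name__ == "__main__":
    print(encoded_length("A"))   # a Latin letter: 1 byte
    print(encoded_length("漢"))  # a Kanji character: 2 bytes
```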
Building a lexical analyzer for a character set generally requires constructing tables with a number of entries approximately equal to the number of different characters that are to be recognized. For a single byte character set, this requires on the order of 256 entries in its tables (two raised to the power of eight). Following this logic, a lexical analyzer for a two byte character set would require on the order of 65,536 entries in its tables (two raised to the power of sixteen), a 256-fold increase. This is a prohibitively large amount of space for building and running the lexical analyzer. Therefore, it is readily apparent that a more efficient system and method are needed in order to recognize character sets that can only be represented with two bytes of data.
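The table size arithmetic above can be checked with a minimal sketch. It assumes, as the discussion does, one table entry per possible character code; the table_entries function is a hypothetical helper introduced only for this illustration.

```python
def table_entries(bytes_per_char: int) -> int:
    """Entries needed when a table holds one slot per possible character code.

    Each byte contributes 8 bits, so there are 2 ** (8 * bytes_per_char)
    distinct codes. This one-slot-per-code layout is an assumption of the
    sketch, matching the estimate given in the text.
    """
    return 2 ** (8 * bytes_per_char)


if __name__ == "__main__":
    single = table_entries(1)  # 256
    double = table_entries(2)  # 65536
    print(single, double, double // single)  # prints "256 65536 256"
```

Moving from one byte to two bytes per character therefore multiplies the size of each such table by 256, which is the growth the discussion above identifies as prohibitive.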