Lexical analyzers are generally used to scan sequentially through a sequence or “stream” of characters received as input and to return a series of language tokens to the parser. A token is simply one of a small number of values that tell the parser what kind of language element was encountered next in the input stream. Some tokens have associated semantic values, such as the name of an identifier or the value of an integer. For example, if the input stream were:
dst = src + dst->moveFrom
After passing through the lexical analyzer, the stream of tokens presented to the parser might be:
(tok=1, string=“dst”) -- i.e., 1 is the token for identifier
(tok=100, string=“=”)
(tok=1, string=“src”)
(tok=101, string=“+”)
(tok=1, string=“dst”)
(tok=102, string=“->”)
(tok=1, string=“moveFrom”)
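The token stream above can be reproduced with a short sketch. The token codes 1, 100, 101, and 102 are taken from the example; the constant names, the `TOKEN_SPEC` table, and the `tokenize` function are assumptions introduced only for illustration, and the arrow operator is assumed to be the ASCII pair `->`. The sketch relies on a regular-expression engine rather than a hand-built automaton, so it shows the input and output of lexical analysis, not the construction discussed below.

```python
import re

# Token codes follow the example in the text; the names are illustrative.
IDENT, ASSIGN, PLUS, ARROW = 1, 100, 101, 102

# Each entry pairs a token code with the pattern for its lexemes.
TOKEN_SPEC = [
    (IDENT, r"[A-Za-z_][A-Za-z0-9_]*"),
    (ARROW, r"->"),
    (ASSIGN, r"="),
    (PLUS, r"\+"),
]

# One alternation of named groups; the group name encodes the token code.
MASTER = re.compile("|".join(f"(?P<t{code}>{pat})" for code, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token_code, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        # lastgroup is the name of the alternative that matched, e.g. "t102".
        tokens.append((int(m.lastgroup[1:]), m.group()))
        pos = m.end()
    return tokens


if __name__ == "__main__":
    for tok, lexeme in tokenize("dst = src + dst->moveFrom"):
        print(f"(tok={tok}, string={lexeme!r})")
```

The semantic value of each token (here, the lexeme itself) travels alongside the token code, which is what allows the parser to distinguish the three identifiers even though all share token 1.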
To implement a lexical analyzer, one must first construct a Deterministic Finite Automaton (DFA) from the set of tokens to be recognized in the language. The DFA is a kind of state machine that tells the lexical analyzer, given its current state and the current input character in the stream, what new state to move to. A finite state automaton is deterministic if it has no transitions on input ε (epsilon) and, for each state S and symbol A, there is at most one edge labeled A leaving S. In the present art, a DFA is constructed by first constructing a Non-deterministic Finite Automaton (NFA). Following construction of the NFA, the NFA is converted into a corresponding DFA. This process is covered in more detail in most books on compiler theory.
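The definition of determinism given above can be expressed as a simple check over an automaton's edges. The following is a minimal sketch under stated assumptions: an automaton is represented as a collection of (state, symbol, target) triples, `None` stands for an epsilon edge, and the names `is_deterministic` and `run` are hypothetical.

```python
from collections import Counter

EPSILON = None  # assumed marker for an epsilon (empty-input) transition


def is_deterministic(edges):
    """Check the two conditions from the text on (state, symbol, target) triples."""
    edges = set(edges)  # exact duplicate edges do not add a second choice
    # Condition 1: no transitions on epsilon.
    if any(symbol is EPSILON for _, symbol, _ in edges):
        return False
    # Condition 2: at most one edge per (state, symbol) pair.
    counts = Counter((state, symbol) for state, symbol, _ in edges)
    return all(n == 1 for n in counts.values())


def run(table, start, text):
    """Follow a DFA table {(state, char): next_state}; return None if stuck."""
    state = start
    for ch in text:
        state = table.get((state, ch))
        if state is None:
            return None
    return state


if __name__ == "__main__":
    dfa_edges = [(0, "a", 1), (1, "b", 2)]
    nfa_edges = [(0, "a", 1), (0, "a", 2)]  # two choices for (0, "a")
    print(is_deterministic(dfa_edges), is_deterministic(nfa_edges))
    table = {(s, c): t for s, c, t in dfa_edges}
    print(run(table, 0, "ab"))
```

Because a DFA offers at most one edge per state and character, `run` needs only a single table lookup per input character, which is the property that makes the deterministic form attractive for scanning.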
FIG. 1 shows a state machine that has been programmed to scan all incoming text for any occurrence of the keywords “dog”, “cat”, and “camel” while passing all other words through unchanged. The NFA begins at the initial state (0). If the next character in the stream is ‘d’, the state moves to 7, which is a non-accepting state. A non-accepting state is one in which only part of the token has been recognized, while an accepting state represents the situation in which a complete token has been recognized. In FIG. 1, accepting states are denoted by the double border. From state 7, if the next character is ‘o’, the state moves to 8. This process then repeats for the next character in the stream. If the lexical analyzer is in an accepting state when either the next character in the stream does not match or the input stream terminates, then the token for that accepting state is returned. Note that since “cat” and “camel” both start with “ca”, the analyzer state is “shared” for both possible “lexemes”. By sharing the state in this manner, the lexical analyzer does not need to examine each complete string for a match against all possible tokens, thereby reducing the search space by roughly a factor of 26 (the number of letters in the alphabet) as each character of the input is processed. If at any point the next input character does not match any of the possible transitions from a given state, the analyzer should revert to state 10, which will accept any other word (represented by the dotted lines in FIG. 1). For example, if the input word were “doctor”, the state would get to 8 and then there would be no valid transition for the ‘c’ character, resulting in taking the dotted line path (i.e., any other character) to state 10. As will be noted from the definition above, this state machine is an NFA, not a DFA. This is because from state 0, for the characters ‘c’ and ‘d’, there are two possible paths: one directly to state 10, and the other to the beginning of “dog” or of “cat” and “camel”. This violates the requirement that there be at most one transition for each state-character pair in a DFA.
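The prefix sharing described for “cat” and “camel” can be illustrated with a small sketch. Several assumptions apply: states are numbered in order of creation and therefore do not match the numbering of FIG. 1; the catch-all state 10 is modeled as a default result (`OTHER`) taken whenever no edge matches, rather than as explicit dotted-line edges; and the names `build_keyword_trie` and `classify` are hypothetical. The sketch classifies one complete word at a time and does not attempt to scan a continuous stream.

```python
def build_keyword_trie(keywords):
    """Return (transitions, accepting) for a prefix-sharing state machine."""
    transitions = {}  # (state, char) -> next state
    accepting = {}    # state -> keyword recognized in that state
    next_state = 1    # state 0 is the initial state
    for word in keywords:
        state = 0
        for ch in word:
            key = (state, ch)
            if key not in transitions:
                # No existing edge: allocate a fresh state for this prefix.
                transitions[key] = next_state
                next_state += 1
            # Existing edges are reused, so common prefixes share states.
            state = transitions[key]
        accepting[state] = word
    return transitions, accepting


OTHER = "other"  # stands in for the catch-all "any other word" state


def classify(word, transitions, accepting):
    """Return the keyword matched by `word`, or OTHER if none matches."""
    state = 0
    for ch in word:
        state = transitions.get((state, ch))
        if state is None:
            return OTHER  # no matching edge: take the fallback path
    # A word that stops in a non-accepting state is only a keyword prefix.
    return accepting.get(state, OTHER)


if __name__ == "__main__":
    transitions, accepting = build_keyword_trie(["dog", "cat", "camel"])
    for w in ["dog", "cat", "camel", "doctor", "ca"]:
        print(w, "->", classify(w, transitions, accepting))
```

Treating the fallback as a default rather than as stored edges keeps the table at nine transitions for the three keywords, since “camel” adds only the three edges beyond the shared “ca” prefix.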
Implementation of the state diagram set forth in FIG. 1 in software would be very inefficient. This is in part because, for any non-trivial language, the analyzer table will need to be very large in order to accommodate all the “dotted line transitions”. A standard algorithm, often called ‘subset construction’, is used to convert an NFA to a corresponding DFA. One of the problems with this algorithm is that, in the worst-case scenario, the number of states in the resulting DFA can be exponential in the number of NFA states. For these reasons, the ability to construct languages and parsers for complex languages on the fly is needed. Additionally, because lexical analysis occurs so pervasively and so often on many systems, lexical analyzer generation and operation need to be more efficient.
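The subset construction mentioned above, and its potential for state growth, can be sketched as follows. This is a textbook-style illustration under stated assumptions, not the method of any particular tool: the NFA is a dictionary mapping (state, symbol) to a set of target states, `None` marks an epsilon edge, each DFA state is a frozen set of NFA states, and all function names are hypothetical. The helper `kth_from_end_nfa` builds a commonly used example NFA that accepts strings over {a, b} whose k-th symbol from the end is ‘a’.

```python
from collections import deque

EPSILON = None  # assumed marker for epsilon transitions


def epsilon_closure(states, nfa):
    """All NFA states reachable from `states` using only epsilon edges."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPSILON), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(nfa, start, alphabet):
    """Return (dfa, dfa_start); dfa maps (dfa_state, symbol) -> dfa_state."""
    dfa_start = epsilon_closure({start}, nfa)
    dfa, seen, queue = {}, {dfa_start}, deque([dfa_start])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # Union of the targets of every NFA state in the current subset.
            moved = set()
            for s in current:
                moved |= set(nfa.get((s, symbol), ()))
            if not moved:
                continue  # no edge on this symbol from the current subset
            target = epsilon_closure(moved, nfa)
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa, dfa_start


def kth_from_end_nfa(k):
    """NFA with k + 1 states: the k-th symbol from the end is 'a'."""
    nfa = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, k):
        nfa[(i, "a")] = {i + 1}
        nfa[(i, "b")] = {i + 1}
    return nfa


if __name__ == "__main__":
    for k in (2, 3, 4):
        dfa, start = subset_construction(kth_from_end_nfa(k), 0, "ab")
        states = {start} | set(dfa.values())
        print(k + 1, "NFA states ->", len(states), "DFA states")
```

For k = 3, the four-state NFA yields eight reachable DFA states, and each increment of k doubles that count, which illustrates the worst-case growth noted above.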