Parsers and parser generators are well-known software tools. In general terms, a parser is a program which accepts data comporting with a source language. The input data generally includes words and symbols which comport with the source language and which are known collectively as tokens. The parser reads the tokens of the input data and matches sequences of tokens, generally referred to as constructs, with a set of rules known by the parser. As each construct is matched with a particular rule, the parser calls one or more actions associated with the matched rule. The process of matching constructs with rules and calling actions continues until the parser reaches the end of the input, or until a nonrecoverable error occurs in the input.
An example of a simple parser would be a parser which accepts, as an input, a series of algebraic equations which comports with a particular language. The parser would have a set of rules that represent various constructs of algebraic equations. For example, the parser would have a rule that would match constructs of the form "number+number" and another rule that would match constructs of the form "number-number." As each construct is matched, the parser would perform the actions required by the particular construct matched. In the case of the construct "4+4", for example, the parser would call the actions required to generate the sum of eight.
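By way of illustration only, the matching of constructs with rules and the calling of associated actions might be sketched as follows. The names `tokenize`, `classify`, `RULES` and `parse` are hypothetical and are not taken from any particular parser; the sketch assumes non-negative integer operands and handles only the two constructs described above.

```python
import re

def tokenize(text):
    """Split the input into number tokens and operator tokens."""
    return re.findall(r"\d+|[+\-]", text)

def classify(token):
    """Map a token to its kind: NUM for digits, else the symbol itself."""
    return "NUM" if token.isdigit() else token

# Each rule pairs a sequence of token kinds with the action to call on a match.
RULES = [
    (("NUM", "+", "NUM"), lambda a, b: a + b),
    (("NUM", "-", "NUM"), lambda a, b: a - b),
]

def parse(text):
    """Match the whole input against each rule and call the matching action."""
    tokens = tokenize(text)
    kinds = tuple(classify(t) for t in tokens)
    for pattern, action in RULES:
        if kinds == pattern:
            return action(int(tokens[0]), int(tokens[2]))
    raise SyntaxError(f"no rule matches {text!r}")
```

Under these assumptions, `parse("4+4")` calls the addition action and `parse("9-5")` calls the subtraction action, while an input matching neither rule raises an error.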
Collectively, the set of rules implemented by a parser is known as a "grammar". Grammars can be organized using a tree-like hierarchy. At the top of the hierarchy are the most general rules. Conversely, the bottom of the hierarchy is composed of the most specific rules. For example, in the case of the parser for algebraic equations already discussed, the top of the hierarchy might be composed of a rule that indicates that the source data is composed of a series of separate algebraic equations. The next level of the hierarchy would include rules describing which components are required for a single equation. Still further down the hierarchy would be rules for the sub-components that are required for each component of an equation.
Typically, parsers use one of two related but different techniques. These techniques are known as "bottom-up parsing" and "top-down parsing." In the case of bottom-up parsers, rules in the grammar are matched by first finding those constructs in the parser input which match the rules at the lowest level of the hierarchy. These matching constructs are then combined to match rules at the next higher level of the hierarchy, and so on, until an entire rule is matched. In contrast, top-down parsers attempt to match the highest level rules in the hierarchy first. These parsers then work their way down the hierarchy, matching rules at successively lower levels.
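The contrast between the two techniques might be sketched as follows for a small hypothetical grammar in which an expression is a number followed by any count of "+ number" or "- number" pairs. Both functions, `top_down` and `bottom_up`, are illustrative assumptions rather than the output of any parser generator. The first starts at the highest-level rule and descends; the second shifts tokens onto a stack and reduces the lowest-level constructs first.

```python
def top_down(tokens):
    """Begin with the top rule (expr) and work down to individual numbers."""
    pos = 0

    def number():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise SyntaxError("expected a number")
        value = int(tokens[pos])
        pos += 1
        return value

    def expr():
        nonlocal pos
        total = number()
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            pos += 1
            rhs = number()
            total = total + rhs if op == "+" else total - rhs
        return total

    result = expr()
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing tokens")
    return result

def bottom_up(tokens):
    """Shift tokens onto a stack; reduce whenever the top matches a rule."""
    stack = []
    for tok in tokens:
        # Lowest-level rule: a digit string reduces to a number value.
        stack.append(int(tok) if tok.isdigit() else tok)
        # Next level up: number, operator, number reduces to one number.
        if (len(stack) >= 3 and isinstance(stack[-1], int)
                and stack[-2] in ("+", "-") and isinstance(stack[-3], int)):
            rhs, op, lhs = stack.pop(), stack.pop(), stack.pop()
            stack.append(lhs + rhs if op == "+" else lhs - rhs)
    if len(stack) != 1 or not isinstance(stack[0], int):
        raise SyntaxError("input does not reduce to a single expression")
    return stack[0]
```

Both functions accept the same token lists and yield the same values; they differ only in the order in which the rules of the hierarchy are matched.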
In practice, parsers are used for far more than evaluating algebraic equations. In fact, parsers can be used for a wide range of source languages to perform a wide range of functions. For example, parsers are commonly used to direct the actions of compilers and interpreters. Parsers are also used to process configuration files, to process the headers of electronic mail, and for a variety of other purposes. Generally, these uses require source languages and actions which are considerably more complex than the source language and actions described for the simple example of a parser which evaluates algebraic equations. The complexity of these source languages and their associated actions means that the parsers themselves are often quite complex and, as a result, often difficult to design and implement.
The complexity of parsers has led to the development of a number of different parser generators. Parser generators are tools which allow a programmer to define a grammar which corresponds to a particular source language. The programmer adds actions which correspond to the rules in the grammar, and the parser generator uses the grammar and actions to construct a new parser. When complete, the new parser functions as a parser for the source language which includes both the rules and the actions which the programmer has defined.
Traditionally, most parser generators have produced bottom-up parsers. For example, the "Yet Another Compiler Compiler (yacc)" parser generator used in the UNIX operating system programming environment is a well-known parser generator which produces bottom-up parsers. More recently, however, parser generators which produce top-down parsers (such as the parser generator supplied with the Purdue Compiler Construction Tool Set (PCCTS)) have increased in popularity. This increase in popularity is due partially to the fact that the parsers produced by top-down parser generators are more intuitive to human engineers who attempt to locate and correct errors in such parsers. In addition, many programmers find the task of defining actions to be easier when the actions are designed to be used in a top-down parser.
Unfortunately, there are still a number of disadvantages associated with the use of parser generators. One such disadvantage is the potential for ambiguity within the parser grammar. More specifically, when a grammar is constructed for a source language, there will generally be rules that have some degree of ambiguity. Ambiguity, in this case, means that a given sequence of tokens may match more than one rule within the grammar. This is problematic for the parser, since the parser will not know which rule to choose when it encounters a token sequence which matches more than a single rule.
Fortunately, most ambiguities may be resolved by increasing the number of tokens considered by the parser at a given time. Thus, when the parser encounters a sequence of tokens which matches more than one rule, the parser examines the tokens subsequent to the ambiguous token sequence. If the subsequent tokens are inconsistent with a rule which matches the ambiguous token sequence, that rule can be removed from consideration by the parser. The subsequent tokens used by the parser are known as lookahead tokens, and the consideration of lookahead tokens is generally referred to as parser lookahead.
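The removal of rules from consideration by lookahead might be sketched as follows. The two rules shown, "assignment" and "call", and the function `candidates_after` are hypothetical and chosen only because both rules begin with the same token kind, so that a single token cannot distinguish them.

```python
# Hypothetical rules that share the same first token kind.
RULES = {
    "assignment": ("NAME", "=", "NUM"),
    "call": ("NAME", "(", ")"),
}

def candidates_after(kinds, count):
    """Return the rules still consistent with the first `count` token kinds."""
    prefix = tuple(kinds[:count])
    return [name for name, pattern in RULES.items()
            if pattern[:count] == prefix]

kinds = ["NAME", "=", "NUM"]
ambiguous = candidates_after(kinds, 1)   # both rules match "NAME"
resolved = candidates_after(kinds, 2)    # the lookahead token "=" removes "call"
```

Here the second token acts as the lookahead token: it is inconsistent with the "call" rule, so only "assignment" remains under consideration.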
Traditionally, parser generators have provided a fixed number of lookahead tokens. As a result, the use of additional lookahead tokens to disambiguate grammars has not generally been possible. More recently, parser generators have allowed programmers to specify an increased number of lookahead tokens using two different techniques. One technique is to allow the programmer defining the grammar to specify the number of lookahead tokens that must be maintained by the parser. This method causes the parser generator to construct a parser which always maintains the number of lookahead tokens specified by the programmer. Unfortunately, it is often the case that the number of lookahead tokens specified by the programmer will only be required for resolving ambiguities in a very small portion of the parser's rules. The remaining rules generally require few, if any, additional lookahead tokens. Since maintaining additional lookahead tokens is relatively time-consuming, parsers which maintain a specific number of lookahead tokens can be relatively inefficient.
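A parser which always maintains a programmer-specified number of tokens might be sketched as follows. The class `LookaheadStream` and its parameter `k` are hypothetical; in this sketch `k` counts the current token together with its lookahead tokens. The buffer is refilled on every call to `advance`, whether or not the rule being matched needs the extra tokens, which is the source of the overhead described above.

```python
from collections import deque

class LookaheadStream:
    """Token stream that always keeps up to k tokens buffered."""

    def __init__(self, tokens, k):
        self._source = iter(tokens)
        self._buffer = deque()
        self.k = k
        self._fill()

    def _fill(self):
        # Top the buffer back up to k tokens, stopping at end of input.
        while len(self._buffer) < self.k:
            try:
                self._buffer.append(next(self._source))
            except StopIteration:
                break

    def peek(self, i=0):
        """Return the token i positions ahead, or None past end of input."""
        return self._buffer[i] if i < len(self._buffer) else None

    def advance(self):
        """Consume the current token and refill the lookahead buffer."""
        tok = self._buffer.popleft()
        self._fill()
        return tok
```

For instance, a stream built with `k=3` over four tokens exposes the first three through `peek(0)` to `peek(2)`; after one `advance`, the fourth token becomes visible at `peek(2)`.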
The second technique employed by parser generators to deal with the problem of token lookahead is to employ what is generally known as infinite lookahead. Infinite lookahead is a technique which causes the parser to start accumulating lookahead tokens whenever it encounters a sequence of input tokens which may be matched by two or more rules. The process of accumulating input tokens continues until the parser is able to resolve the ambiguity. Unfortunately, the process of iteratively fetching lookahead tokens is also inefficient.
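The accumulation of lookahead tokens until an ambiguity is resolved might be sketched as follows. The function `resolve` and the two example rules are hypothetical; the sketch assumes that every candidate rule is written as a tuple of token kinds and that the rules differ at some position within the input.

```python
def resolve(tokens, rules):
    """Fetch tokens one at a time until exactly one rule remains.

    Returns the name of the surviving rule and the number of tokens fetched.
    """
    seen = []
    live = dict(rules)
    for tok in tokens:
        seen.append(tok)
        n = len(seen)
        # Keep only the rules whose first n kinds agree with the tokens seen.
        live = {name: pat for name, pat in live.items()
                if pat[:n] == tuple(seen)}
        if len(live) <= 1:
            break
    if len(live) != 1:
        raise SyntaxError("could not select a single rule")
    (name,) = live
    return name, len(seen)

# Hypothetical rules sharing a three-token prefix.
EXAMPLE_RULES = {
    "tuple_assign": ("ID", ",", "ID", "="),
    "expr_list": ("ID", ",", "ID", ";"),
}
```

With these rules, the input kinds `ID , ID = NUM` require four tokens to be fetched before "expr_list" can be discarded, which illustrates the token-by-token cost of this approach.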