1. Field
The invention relates to parsing technology and, in particular, to techniques for decomposing a complex parser, such as a parser for a computer programming language, into successive passes of comparatively simple miniparsers, each of which operates on the output of its predecessor miniparser.
2. Description of the Related Art
In computer technology, a parser is a program, usually part of a compiler, that receives input in the form of sequential source program instructions, interactive online commands, markup tags, or some other defined interface and breaks that input up into parts (for example, the nouns (objects), verbs (methods), and their attributes or options) that can then be managed by other programming (for example, other components in a compiler). A parser may also check that all necessary input has been provided.
Parsers typically translate an input information encoding such as source code text into abstract syntax trees in two steps: first, a lexical analyzer or lexer transforms source code text into a series of tokens or word-like pieces; then a parser converts the tokens into a parse tree. Abstract syntax is a representation of data (e.g., a program being compiled) which is independent of machine-oriented structures and encodings and also of the physical representation of the data. In the case of compilation, the syntax of the source language is called concrete syntax and includes all the features visible in the source program such as parentheses and delimiters. The concrete syntax is used when parsing the program or other input, during which it is usually converted into some kind of abstract syntax tree. An abstract syntax tree (AST) is a data structure representing something which has been parsed, often used as a compiler's or interpreter's internal representation of a program while it is being optimized and from which code generation is performed. The range of all possible such structures is described by the abstract syntax. A compiler's internal representation of a program will typically be specified by an abstract syntax in terms of categories such as “statement”, “expression” and “identifier”. This is independent of the source syntax (concrete syntax) of the language being compiled (though it may be similar). A parse tree is similar to an abstract syntax tree, but it will typically also contain features such as parentheses which are syntactically significant but which are implicit in the structure of the abstract syntax tree.
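By way of illustration only, the two-step arrangement described above can be sketched for a small arithmetic language. The function names (lex, parse, parse_expr, parse_term, parse_atom), the tuple-based tree representation, and the toy grammar are assumptions made for this example and are not drawn from any particular compiler.

```python
import re

def lex(text):
    """Step 1: turn source text into a flat list of word-like tokens.
    Characters outside the toy alphabet are silently skipped (a simplification)."""
    return re.findall(r"\d+|[-+*/()]", text)

def parse_atom(tokens, pos):
    """atom := NUMBER | '(' expr ')'"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        node, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        # The parentheses are consumed but leave no node behind:
        # grouping is implicit in the shape of the tree.
        return node, pos + 1
    if tok.isdigit():
        return int(tok), pos + 1
    raise SyntaxError("unexpected token " + repr(tok))

def parse_term(tokens, pos):
    """term := atom (('*' | '/') atom)*"""
    node, pos = parse_atom(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = parse_atom(tokens, pos + 1)
        node = (op, node, right)
    return node, pos

def parse_expr(tokens, pos):
    """expr := term (('+' | '-') term)*"""
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)
    return node, pos

def parse(text):
    """Step 2: convert the token list into a tree of (operator, left, right) tuples."""
    tokens = lex(text)
    node, pos = parse_expr(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return node
```

In this sketch the token list for "(1 + 2) * 3" still contains both parentheses, whereas the resulting tree records the grouping only through its nesting, which is the distinction drawn above between concrete and abstract syntax.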
Although interactive programming environments have found widespread acceptance, most classic parser architectures hail from an era when computation was a scarce resource. Bottom-up parser generators have succeeded in two important goals: languages such as BNF (“Backus Normal Form” or “Backus-Naur Form”) provide a concise and elegant notation for the expression of a language's syntax, and parsing algorithms such as LR(1) parsing minimize the time and space required to actually perform a parse. The strengths of bottom-up parser generators were critical in the environments that prevailed in the last millennium, but they make little difference in many projects today and will make little difference in the future. For example, when virtual machines feature an interactive programming environment in which only individual methods are expected to be compiled at a time, and when software is developed on machines with fractional-gigahertz processors and hundreds of megabytes of main memory, a parser can afford decreased performance. Thus, parsing efficiency is not always as important as development time.
Moreover, the strengths of traditional parser architectures come with weaknesses. First, the grammar itself requires a parser. A grammar can provide a formal definition of the syntactic structure of a language, which is often given in terms of production rules that specify the order of constituents and their sub-constituents in a sentence or string. Of course, the grammar parser can be generated from a grammar, but some time is needed to get over the bootstrap hump. Next, bottom-up, table-driven parsers can be difficult to modify. This problem is merely a specialized case of a challenge that dogs the heels of all nonprocedural languages: a change to the specification, in this case the grammar, frequently creates unanticipated consequences. With LR parser generators, this issue typically surfaces when a programmer makes a change to the grammar only to discover that she has created unforeseen ambiguities. The hard-won description of the target language's grammar is concise but not malleable. In short, grammar-driven parsers require implementation effort for the grammar, and bottom-up grammar-driven parsers can be brittle.
Even when a traditional, grammar-driven parser has been tweaked to accept the desired grammar, the parser's output leaves much to be desired. A grammar-driven parse produces a concrete syntax tree, whose topology results from the hierarchical relationships between the grammar's productions. But the grammar is as much a function of what is parsable by a particular algorithm as it is of the target syntax. For example, left- or right-recursion in the grammar can induce a tall, skinny subtree when a short, flat one would be better. Or, if the grammar is incompletely factored, as happens all too often, different kinds of tree nodes may redundantly implement the same semantic construct, reflecting its contextual syntactic legality instead of its meaning. Therefore, the choice to employ a grammar-driven parser frequently implies a commitment to write a post-processing system to clean up and reshape the parse tree. Not only does this system add implementation effort to a parser, but it also further impedes malleability: whenever the grammar is perturbed, the tree postprocessor must also change.
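The effect of recursion on tree shape, and the post-processing it tends to require, can be illustrated with a minimal sketch. The right-recursive list production, the tuple representation, and the helper names (parse_list_right_recursive, flatten, depth) are hypothetical and chosen only for this example.

```python
def parse_list_right_recursive(items):
    """Mirror a right-recursive production: list := item | item ',' list.
    Each recursive step adds one more level of nesting to the tree."""
    if len(items) == 1:
        return ("list", items[0])
    return ("list", items[0], parse_list_right_recursive(items[1:]))

def flatten(tree):
    """Post-processing pass: collapse the nested chain into one flat node."""
    items = []
    while True:
        items.append(tree[1])
        if len(tree) == 2:      # innermost node holds only the last item
            break
        tree = tree[2]          # descend into the nested remainder
    return ("list", *items)

def depth(tree):
    """Number of nested tuple levels; a bare item counts as depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(child) for child in tree[1:])
```

Under these assumptions, a four-item list yields a tree four levels deep before the flattening pass and one level deep after it. The flatten function depends on the exact shape the production induces, so a change to the production would require a matching change to the postprocessor.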
Traditional parsers typically use a top-down recursive-descent parsing algorithm. Ideally, such a parser would recursively descend the grammar, traversing each token once (modulo look-ahead), in order to build the final parse tree. In a typical implementation of a recursive-descent parser, however, every possibility must be tried at each juncture. Because an attempt may fail, each token may actually be examined many times. Each stage of the parse must correctly choose among all possible results that could start with what has already been parsed. But the only data available to make this decision are the tokens lying ahead in the input stream. Recursive descent optimizes performance but makes it harder to generate a correct parse.
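The repeated examination of tokens can be made concrete with a minimal sketch of a backtracking recursive-descent parser. The class name BacktrackingParser, the three toy statement forms, and the examined counter are assumptions introduced for illustration; they do not describe any particular existing parser.

```python
class BacktrackingParser:
    """Tries each alternative in turn and rewinds the input on failure."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.examined = 0  # total number of token inspections performed

    def peek(self):
        # Every look at a token is counted, including repeat looks.
        if self.pos < len(self.tokens):
            self.examined += 1
            return self.tokens[self.pos]
        return None

    def expect(self, wanted):
        if self.peek() == wanted:
            self.pos += 1
            return True
        return False

    def name(self):
        tok = self.peek()
        if tok is not None and tok.isalpha():
            self.pos += 1
            return tok
        return None

    def assignment(self):  # name '=' name
        target = self.name()
        if target is None or not self.expect("="):
            return None
        value = self.name()
        return None if value is None else ("assign", target, value)

    def call(self):  # name '(' ')'
        fn = self.name()
        if fn is None or not self.expect("(") or not self.expect(")"):
            return None
        return ("call", fn)

    def reference(self):  # name
        ident = self.name()
        return None if ident is None else ("ref", ident)

    def statement(self):
        # At this juncture every possibility is tried in order.
        for alternative in (self.assignment, self.call, self.reference):
            start = self.pos
            result = alternative()
            if result is not None and self.pos == len(self.tokens):
                return result
            self.pos = start  # failed attempt: rewind and try the next one
        raise SyntaxError("no alternative matched")
```

In this sketch, parsing the three tokens f, (, and ) first attempts the assignment form, fails at the second token, rewinds, and then succeeds with the call form, so the tokens are inspected more times than there are tokens in the input.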