Each programming language has its own syntax and semantics; for example, the syntax of Fortran differs from the syntax of C. Programs written in a programming language have to be compiled, and during that process their syntax and semantics are verified. Syntax is the structure of a language as specified by the rules established for it, i.e., its grammar. The semantics of a language is the meaning conveyed by and associated with the syntax of that language. In compiling computer programs, parsing is the analysis of a stream of program expressions (sentences) to determine whether or not the program expressions are syntactically correct. Once a stream of program expressions is determined to be syntactically correct, it can be compiled into executable modules. Parsing is performed automatically in a computer by a computer program.
In parsing a computer program input stream written, for example, in Fortran, a scanner uses a set of rules to group predetermined characters in the input stream into tokens. Scanners are programs constructed to recognize different types of tokens, such as identifiers, decimal constants, floating point constants, and the like. In recognizing or identifying a token, a parser may look ahead in the input stream for additional predetermined characters for finding additional tokens.
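The grouping of characters into tokens can be sketched as follows. This is a minimal illustration in Python, not the scanner of any particular compiler; the token class names (FLOAT, DECIMAL, IDENT, OP), the regular expressions and the function name scan are assumptions made for the example.

```python
import re

# Hypothetical token classes. FLOAT is listed before DECIMAL so that "3.5"
# is recognized as one floating point constant rather than "3" followed by ".5".
TOKEN_SPEC = [
    ("FLOAT", r"\d+\.\d+"),
    ("DECIMAL", r"\d+"),
    ("IDENT", r"[A-Za-z][A-Za-z0-9]*"),
    ("OP", r"[+*=()]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(text):
    """Group the characters of text into (token_class, characters) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":  # white space separates tokens but is not one
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Calling scan("X = 2 + 3.5*Y1") yields an identifier, an operator, a decimal constant, an operator, a floating point constant, an operator and an identifier, in that order.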
The parser imposes a structure on the sequence of tokens using a set of rules appropriate for the language. Such rules are referred to as a context-free grammar and are often specified in the well-known Backus-Naur form. Such a grammar specification for a program expression consisting of decimal digits and the operations "+" and "*" may be represented as follows:
    E : E "+" T
    E : T
    T : T "*" F
    T : F
    F : decimal_digits
Each of the five grammar rules above, one on each line, is referred to as a "production". In the above grammar specification the tokens detected by the scanner are "+", "*" and decimal_digits. Such tokens are passed to the parser program. Each string in the input stream that is parsed as having correct syntax is said to be "accepted". For example, the string 2+3*5 is "accepted" while the string 2++5 will be rejected as syntactically incorrect.
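To make the notion of a production concrete, the five rules can be held as data and applied one at a time. The sketch below is an illustration only: the names PRODUCTIONS, NONTERMINALS and derive are assumptions, the token is written as decimal_digits, and productions are numbered from 0 in list order. Each step rewrites the right-most non-terminal, so a successful run shows that the token sequence for 2+3*5 can be derived from E. No sequence of steps produces the tokens of 2++5, which is why that string is rejected.

```python
# The five productions as (left-hand side, right-hand side) pairs.
PRODUCTIONS = [
    ("E", ["E", "+", "T"]),
    ("E", ["T"]),
    ("T", ["T", "*", "F"]),
    ("T", ["F"]),
    ("F", ["decimal_digits"]),
]
NONTERMINALS = {"E", "T", "F"}


def derive(steps, start="E"):
    """Apply the productions listed in steps, each to the right-most non-terminal."""
    form = [start]
    for index in steps:
        lhs, rhs = PRODUCTIONS[index]
        positions = [i for i, symbol in enumerate(form) if symbol in NONTERMINALS]
        if not positions or form[positions[-1]] != lhs:
            raise ValueError(f"production {index} does not apply to {form}")
        form[positions[-1]:positions[-1] + 1] = rhs
    return form


# Right-most derivation of decimal_digits + decimal_digits * decimal_digits.
tokens_for_2_plus_3_times_5 = derive([0, 2, 4, 3, 4, 1, 3, 4])
```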
A left-to-right, right-most derivation (LR) parser accepts a subset of the context-free grammars. Each LR parser has an input, an output, a push-down stack, a driver program and a parsing table. The parsing table is created from the grammar of the language to be parsed and is unique to such language and its grammar. The driver program serially reads tokens one at a time from the input stream. The input stream is typically stored in computer storage, and the driver program scans the stored input stream to fetch the tokens. Based upon the information in the parsing table that corresponds to the token being analyzed (input token) and the current program state, the driver program may shift the input token onto the stack, reduce by one of the productions, accept a string of such tokens, or reject the string of such tokens as being syntactically wrong. Reduction means that the right-hand side of a production is replaced by the left-hand side. An LR parser may also fetch a next token from the input stream to determine whether to shift or to reduce. Such a token is termed a "lookahead" token and is referred to herein as a lookahead portion of the input stream. The lookahead portion may include more than one token. When an LR parser performs a reduction, additional semantic checks (also termed semantic actions) are performed.
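The interplay of driver program, parsing table and push-down stack can be sketched as follows. This is a minimal illustration under stated assumptions, not the parser described in this document: the table was built by hand with the textbook SLR construction for the expression grammar of decimal digits, "+" and "*", the token "d" stands for a decimal-digits token, "$" is an assumed end-of-input marker, and the names ACTION, GOTO and parse are hypothetical. The input is assumed to contain only "d", "+" and "*".

```python
# Productions: number -> (left-hand side, number of right-hand-side symbols).
# 1: E : E "+" T   2: E : T   3: T : T "*" F   4: T : F   5: F : d
PRODUCTIONS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1), 5: ("F", 1)}

# Parsing table, part 1: (state, input token) -> action. Missing entry = reject.
ACTION = {
    (0, "d"): ("shift", 4),
    (1, "+"): ("shift", 5), (1, "$"): ("accept", None),
    (2, "+"): ("reduce", 2), (2, "*"): ("shift", 6), (2, "$"): ("reduce", 2),
    (3, "+"): ("reduce", 4), (3, "*"): ("reduce", 4), (3, "$"): ("reduce", 4),
    (4, "+"): ("reduce", 5), (4, "*"): ("reduce", 5), (4, "$"): ("reduce", 5),
    (5, "d"): ("shift", 4),
    (6, "d"): ("shift", 4),
    (7, "+"): ("reduce", 1), (7, "*"): ("shift", 6), (7, "$"): ("reduce", 1),
    (8, "+"): ("reduce", 3), (8, "*"): ("reduce", 3), (8, "$"): ("reduce", 3),
}
# Parsing table, part 2: (exposed state, non-terminal) -> next state.
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3, (5, "T"): 7, (5, "F"): 3, (6, "F"): 8}


def parse(tokens):
    """Return the list of production numbers used, or None if the string is rejected."""
    stream = list(tokens) + ["$"]
    stack = [0]          # push-down stack of states
    pos = 0              # index of the input token being analyzed
    output = []
    while True:
        entry = ACTION.get((stack[-1], stream[pos]))
        if entry is None:
            return None                      # reject: syntactically wrong
        kind, arg = entry
        if kind == "shift":
            stack.append(arg)                # shift the input token
            pos += 1
        elif kind == "reduce":
            lhs, length = PRODUCTIONS[arg]
            del stack[len(stack) - length:]  # pop the right-hand side
            stack.append(GOTO[(stack[-1], lhs)])
            output.append(arg)               # a semantic action would run here
        else:
            return output                    # accept
```

With this table, the token sequence for 2+3*5 is accepted and the sequence for 2++5 is rejected at the second "+", because state 5 has no entry for that token.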
Parsers use tables in the parsing process. It is desired to enhance the parsing process, particularly in an LR(k) parser, wherein k is the lookahead limit in the parsing. As indicated above, such parsers are well known as taking a tokenized sentence from a computer language to produce an output which is a canonical parse of the sentence. While the actual parsing procedure is performed by a known parser interpreter, the parser table itself is in the form of data structures or tables. The tables are generated by a so-called LR analyzer, as a series of data-loaded variable declarations, from a context-free grammar for each language being parsed. Each language is then parsed by interpreting the parsing tables established for that language.
As mentioned above, each LR parser consists of a known modified finite automaton with an attached push-down stack. At each discrete instant during a parsing operation, parser control resides in one of the parser's machine states, the stack being filled with the most recent past parser states. The parser looks ahead in the input stream (the computer program to be parsed and compiled) for a next token. Each parser state offers an automatic choice between two types of actions: reductions and read transitions. Each parser state may contain any number of defined reductions or read transitions to be utilized in parsing.
Reductions, as mentioned above, consist of a production number P and a collection of terminal symbols R, taken as a pair, and are always considered first in each state of the parser. If lookahead symbol L is in set R for production P, then the reduction is to be performed (there can never be more than one candidate pair). As output of the reduction, the number P is given to a semantic synthesizer. Then, as many states as there are symbols on the right-hand side of production P are popped off the stack; the non-terminal on the left-hand side of production P is put in place as the next lookahead (the original lookahead L is pushed back into the input stream), and the state exposed at the top of the push-down stack takes control of the parser action.
Read transitions consist of pairs of parser states S and vocabulary symbols X. When the lookahead symbol L matches the read symbol X (by construction there can be at most one such match), lookahead symbol L is stripped from the input stream, state S is pushed onto the stack, and state S controls the parsing operation. LR parsers always begin in state 0 (zero) with the push-down stack empty and finish with production 0, which is the production:

    <system goal symbol> ::= _|_ <sentence> _|_

The term <sentence> represents the programming language goal symbol, and the symbol _|_ is a terminal symbol reserved for this production.
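The reduction and read-transition mechanics described above can be sketched as a small loop. This is an illustration under stated assumptions, not the parser interpreter itself: the state tables were built by hand for a small expression grammar (E : E "+" T, E : T, T : T "*" F, T : F, F : d), the reserved terminal is written as the string "_|_", the input is assumed to be bracketed by that terminal, state 0 is kept at the bottom of the stack for convenience, the end of input is modeled as None, and all names (STATES, RHS, parse) are hypothetical.

```python
from collections import deque

END = "_|_"   # terminal reserved for production 0
# Production number -> (left-hand side, number of right-hand-side symbols).
RHS = {0: ("G", 3), 1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1), 5: ("F", 1)}
ALL = {"+", "*", END}
# State -> (reductions as (P, R) pairs, read transitions as {X: S}).
STATES = {
    0: ([], {END: 1}),
    1: ([], {"E": 2, "T": 3, "F": 4, "d": 5}),
    2: ([], {END: 6, "+": 7}),
    3: ([(2, {"+", END})], {"*": 8}),
    4: ([(4, ALL)], {}),
    5: ([(5, ALL)], {}),
    6: ([(0, {None})], {}),            # production 0 fires at end of input
    7: ([], {"T": 9, "F": 4, "d": 5}),
    8: ([], {"F": 10, "d": 5}),
    9: ([(1, {"+", END})], {"*": 8}),
    10: ([(3, ALL)], {}),
}


def parse(tokens):
    """Return the production numbers in output order, or None on a syntax error."""
    pending = deque([END, *tokens, END])   # input stream, with pushed-back symbols
    stack = [0]                            # assumption: state 0 stays at the bottom
    output = []
    while True:
        lookahead = pending.popleft() if pending else None
        reductions, reads = STATES[stack[-1]]
        for number, lookahead_set in reductions:   # reductions are considered first
            if lookahead in lookahead_set:
                output.append(number)              # P goes to the semantic synthesizer
                if number == 0:
                    return output                  # finished with production 0
                lhs, length = RHS[number]
                del stack[len(stack) - length:]    # pop one state per right-hand symbol
                pending.appendleft(lookahead)      # original lookahead is pushed back
                pending.appendleft(lhs)            # non-terminal becomes next lookahead
                break
        else:
            if lookahead not in reads:
                return None                        # no reduction and no read applies
            stack.append(reads[lookahead])         # read transition: push state S
```

Note that the non-terminal produced by a reduction is consumed by an ordinary read transition of the exposed state, so the same table lookup serves terminals and non-terminals alike.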
The parser's basic program structure is a parser loop over the discrete time steps defined by parser state changes. Each cycle searches for and performs one reduction or one transition. While a parser need only maintain a state stack, the current parser state and the lookahead symbol, more information is maintained for tracing, semantic and error-correction purposes. The push-down stack can have several fields: one field holding the token just read from the input stream when the state was stacked, another field holding the actual character string read from the input stream, and another field holding a serial number for the token. Extra fields can be used for maintenance by a semantic synthesizer.
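One possible layout for such a stack entry is sketched below. The field names, the StackEntry class and the trace helper are assumptions made for illustration; the text above fixes only what kind of information each field carries.

```python
from dataclasses import dataclass, field


@dataclass
class StackEntry:
    state: int       # parser state that was stacked
    token: str       # token just read from the input stream when the state was stacked
    text: str        # actual character string read from the input stream
    serial: int      # serial number of the token
    # Extra fields maintained by a semantic synthesizer (hypothetical contents).
    semantic: dict = field(default_factory=dict)


def trace(stack):
    """Render the stack for tracing as 'state:text' pairs, bottom entry first."""
    return " ".join(f"{entry.state}:{entry.text}" for entry in stack)


# Example: two entries stacked while reading "2 +" (state numbers are made up).
stack = [
    StackEntry(state=4, token="decimal_digits", text="2", serial=1),
    StackEntry(state=5, token="+", text="+", serial=2),
]
stack[0].semantic["value"] = 2
```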