1. Field of the Invention
The present invention relates to the field of computer code parsers, and in particular to a method and apparatus for statement boundary detection.
Sun, Sun Microsystems, the Sun logo, Solaris and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks are used under license and are trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.
2. Background Art
In programming languages, a program is divided into a series of statements, each of which typically executes sequentially. A language parser determines where one statement ends and another begins. Typically, a programmer must insert a special token at the end of each statement, which is inefficient. This problem can be better understood by a review of programming languages.
Programming Languages
Programming languages are used to express a set of detailed instructions for a digital computer. A programming language consists of characters and rules for combining them into symbols and words.
Many kinds of programming languages have been developed over the years. Initially programmers wrote instructions in machine language. This coded language, which can be understood and executed directly by the computer without conversion or translation, consists of binary digits representing operation codes and memory addresses. Because it is made up of strings of 1s and 0s, machine language is difficult for humans to understand or write. Assembly language was devised for greater convenience. It enabled programmers to express instructions in alphabetic symbols (e.g., AD for add and SUB for subtract) rather than in numbers.
Although assembly language with its mnemonic code was easier to use than machine language, it was clearly desirable to develop programming languages that more closely resembled human communication. The first so-called high-level language was FORTRAN (acronym for Formula Translation), invented in 1956. FORTRAN was well suited to scientists and mathematicians because it was similar to mathematical notations. It did, however, present some difficulty for those in nonmathematically oriented fields. As a result, a more practical programming language known as COBOL (Common Business-Oriented Language) was devised several years later (1960). COBOL employs words and syntax resembling those of ordinary English. Later, other languages even easier to learn and use were introduced. BASIC (Beginner's All-Purpose Symbolic Instruction Code), for example, can be readily mastered by the layperson and is used extensively in schools, businesses, and homes for microcomputer programming. C is a high-level language that can function as an assembly language; much commercial software is written in this flexible language. Another versatile language widely used for microcomputer as well as minicomputer applications is Pascal (probably named for the French scientist-philosopher Blaise Pascal).
Other high-level programming languages possess unique features that make each one suitable for a specific application. Some examples are APT (Automatically Programmed Tools), for numerical control of industrial machine tools, and GPSS (General-Purpose Simulation System), for constructing simulation models. LISP (List Processing) can be used to manipulate symbols and lists rather than numeric data; it is often used in artificial-intelligence applications. Fourth-generation languages (4GLs) are closer to human language than are high-level (or third-generation) languages. They are used primarily for database management or as query languages; examples include FOCUS, SQL (Structured Query Language), and dBASE. Object-oriented programming languages, such as C++ and Smalltalk, are used to write programs that incorporate self-contained collections of data structures and computational instructions (called “objects”). New programs can be written by reassembling and manipulating the objects.
Compiler
Typically, program source code is compiled before it can be executed. FIG. 1 illustrates a compiler which translates program source code into computer readable bytecode. The compiler 110 comprises a parser 101, a translator 103, and a code generator 105. The parser 101 receives input in the form of source code 100 and generates a high-level representation 102 of the program code. This high-level representation 102 may include, for example, a list of statements sorted by order of execution and a list of unique variable identifiers.
The translator 103 receives the high-level representation 102 and translates the operations into a sequential representation (or intermediate form) 104 that describes the program operations. The sequential representation 104 is transformed by the code generator 105 into executable code 106 for a target simulation system. The code generator may implement one or more optimization techniques (e.g., changing the sequence of executed statements).
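The data flow of FIG. 1 can be pictured with a minimal sketch. All function names, the dictionary layout, and the "EXEC" output format below are hypothetical and chosen only for illustration; a real parser, translator, and code generator are far more elaborate, and no optimization is attempted here.

```python
import re

def parse(source):
    """Parser 101 (toy): source code 100 -> high-level representation 102."""
    # List of statements in order of appearance, split on the ";" terminator.
    statements = [s.strip() for s in source.split(";") if s.strip()]
    # List of unique variable identifiers, in order of first use.
    variables = []
    for name in re.findall(r"[A-Za-z_]\w*", source):
        if name not in variables:
            variables.append(name)
    return {"statements": statements, "variables": variables}

def translate(representation):
    """Translator 103 (toy): representation 102 -> sequential representation 104."""
    # Number each statement so that its position in the sequence is explicit.
    return list(enumerate(representation["statements"]))

def generate(sequence):
    """Code generator 105 (toy): sequential representation 104 -> output 106."""
    # Emits made-up text lines; a real generator would emit executable code.
    return [f"{index:04d} EXEC {statement}" for index, statement in sequence]

def compile_source(source):
    """Chain the three stages in the order shown in FIG. 1."""
    return generate(translate(parse(source)))

print(parse("x++; calc=x+y;")["variables"])  # ['x', 'calc', 'y']
print(compile_source("x++; calc=x+y;"))      # ['0000 EXEC x++', '0001 EXEC calc=x+y']
```

Each stage consumes only the output of the previous stage, which is the property the figure is meant to convey.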
Statement Syntax
A program is divided into a series of statements, each of which typically executes sequentially. The structure of the statements is determined by the syntax of the programming language. When a program is compiled, a parser first goes through the text of the source code to associate individual characters or strings of characters in the source code with structural parts of the programming language according to the syntax of the language.
For example, a parser for the C programming language would parse the string “x++; calc=x+y;” as follows: “x” is a variable, “++” is an increment operator, “;” indicates the end of a statement, “ ” (a space) is ignored, “calc” is a variable, “=” is an assignment operator, “x” is a variable, “+” is an addition operator, “y” is a variable, and “;” indicates the end of a statement.
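The classification described above can be sketched as a small tokenizer. The token category names and the `tokenize` function are assumptions made for this illustration; they are not taken from any actual C parser, and only the handful of symbols appearing in the example string are supported.

```python
import re

# Hypothetical token categories covering only the example string.
# Order matters: "++" must be tried before "+".
TOKEN_SPEC = [
    ("WHITESPACE", r"\s+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("INCREMENT", r"\+\+"),
    ("ADD", r"\+"),
    ("ASSIGN", r"="),
    ("END_OF_STATEMENT", r";"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return (category, text) pairs for source, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "WHITESPACE":  # whitespace is ignored
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

for category, text in tokenize("x++; calc=x+y;"):
    print(category, text)
```

For the example string, the sketch yields nine tokens, two of which are END_OF_STATEMENT tokens marking the two statement boundaries.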
Statement Terminator Tokens
The parser must determine where one statement ends and the next statement begins in the input stream containing the source code for the program. This is traditionally accomplished by requiring the programmer to insert a special token at the end of each statement. For the C programming language, the statement end token is a “;”. Other programming languages use different tokens, including a line feed or carriage return. In some programming languages (e.g., BASIC), the end of a statement is signified by either a carriage return or a special character between two statements on the same line.
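Terminator-based boundary detection of this traditional kind can be sketched as follows. The `split_statements` helper is hypothetical, and the use of “:” as the same-line separator in the BASIC-style example is an assumption for illustration (the text above does not name the character). The sketch deliberately ignores string literals and comments, which a real parser would have to handle.

```python
def split_statements(source, terminators):
    """Split source into statements at any character found in terminators.

    Simplification: a terminator inside a string literal or comment would
    also split the statement, and a trailing statement with no terminator
    is accepted rather than reported as an error.
    """
    statements = []
    current = []
    for ch in source:
        if ch in terminators:
            text = "".join(current).strip()
            if text:  # skip empty statements such as blank lines
                statements.append(text)
            current = []
        else:
            current.append(ch)
    trailing = "".join(current).strip()
    if trailing:
        statements.append(trailing)
    return statements

# C-style: every statement ends with ";".
print(split_statements("x++; calc=x+y;", ";"))
# ['x++', 'calc=x+y']

# BASIC-style (assumed): a line break or ":" separates statements.
print(split_statements("A = 1 : B = 2\nPRINT A + B\n", "\n:"))
# ['A = 1', 'B = 2', 'PRINT A + B']
```

In both cases the parser relies entirely on a token supplied by the programmer to locate each boundary.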