Understanding and maintaining source code is largely a manual process. A human must study the code to understand what it is doing and must make desired changes manually in a text editor. To help automate some of the more common and onerous tasks, a variety of text editors have been constructed to analyze and manipulate text files containing source code. Many of these editors are based on tools such as grep, awk and their derivatives for matching regular expressions. Such editors help in detecting patterns and often make provision for automatically changing the code. Text-based tools work reasonably well with easily identified text groupings (e.g., text on one or more lines), but they fail to adequately address the structural characteristics of the code. It is not feasible, and in many cases not even possible, to detect complex structural patterns in the code with such tools, much less to make the desired changes.
The current state of the art is based on repository technology, in which code is typically reverse engineered into abstract syntax trees. Reverse engineering technologies are based on grammar-based parsers and associated lexical analyzers, which identify structural characteristics of the source code. These characteristics may range from the highest level of abstraction inherent in a file (e.g., the program level) down to the lowest-level tokens (the items which make up individual expressions). This information is stored in what are commonly called abstract syntax trees (ASTs), in which individual nodes may have various attributes. For example, a variable may have type and/or value attributes.
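As a purely illustrative sketch (the node shape and attribute names here are our own conventions, not drawn from any cited system), such attributed AST nodes might be modeled as follows:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                   # e.g., "program", "assign", "var", "literal"
    children: list = field(default_factory=list)
    attrs: dict = field(default_factory=dict)   # semantic attributes attached to the node

# A variable node may carry type and value attributes:
x = Node("var", attrs={"name": "x", "type": "int", "value": 0})

# Higher-level constructs are interior nodes whose children are lower-level nodes:
assign = Node("assign", children=[x, Node("literal", attrs={"value": 42})])
```

The same node shape serves every level of the tree, from atomic tokens up to whole-program constructs, which is what allows generic routines to walk and manipulate the tree uniformly.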
ASTs themselves are typically constructed by attaching small routines or code fragments (hereafter commands) to individual productions in the grammar, which work in conjunction with a language-independent parser. Specifically, a programmer or parser-generation program (e.g., YACC) attaches commands to individual productions in the associated grammar; these commands convert the code into an AST as it is being parsed.
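The effect of such production-attached commands can be sketched in miniature. In the following hypothetical Python fragment (a hand-written recursive-descent parser for arithmetic expressions, not any system from the cited literature), each parse_* function plays the role of one grammar production, and the code it runs plays the role of the attached command, building the corresponding AST node as the input is parsed:

```python
import re

TOKEN = re.compile(r"\s*(\d+|[+*()])")

def tokenize(src):
    """Minimal lexical analyzer: numbers and the operators + * ( )."""
    pos, out = 0, []
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(src[pos:])
        out.append(m.group(1))
        pos = m.end()
    return out

def parse_expr(toks):            # expr : expr '+' term | term
    node = parse_term(toks)
    while toks and toks[0] == "+":
        toks.pop(0)
        node = ("+", node, parse_term(toks))    # "command": build a sum node
    return node

def parse_term(toks):            # term : term '*' factor | factor
    node = parse_factor(toks)
    while toks and toks[0] == "*":
        toks.pop(0)
        node = ("*", node, parse_factor(toks))  # "command": build a product node
    return node

def parse_factor(toks):          # factor : NUMBER | '(' expr ')'
    tok = toks.pop(0)
    if tok == "(":
        node = parse_expr(toks)
        toks.pop(0)              # consume ')'
        return node
    return ("num", int(tok))     # "command": build a leaf token node

ast = parse_expr(tokenize("2 + 3 * 4"))
# ast == ('+', ('num', 2), ('*', ('num', 3), ('num', 4)))
```

Note how the command for one production depends on what the commands for other productions return; this is precisely the context dependency that makes attaching commands to a large grammar difficult.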
The process of constructing standard ASTs from given source code, using a grammar representing the language in which the source code is written, is well known (e.g., Aho, Sethi & Ullman, 1986). Parsing source code into trees plays a major role in analyzing and/or transforming that text. A variety of techniques have been used to improve the process of compiling source code. Carroll et al. (U.S. Pat. No. 5,812,853), for example, revealed an improved form of prefix analysis to speed the process. Parsing techniques have also been used to analyze and/or translate natural language text. Kuno et al. (U.S. Pat. No. 5,528,491), for example, propose a system that allows human operators to interact with automated translation processes in a way that preserves correct portions of the translation. Success has also been achieved in attaching semantic information to ASTs for reverse engineering purposes and/or to facilitate program understanding (e.g., Canfora, Cimitile & Munro, 1993; Welty, 1997).
Although the process itself is well known and widely used, creating reverse engineering systems that can automatically construct ASTs from source code takes a good deal of manual effort. The more constraints on, or special characteristics, an AST is to have, the more effort is required.
Knowing which commands (or routines) to attach to which productions to construct particular types of ASTs is not an easy task. It requires intimate familiarity with the grammar, the goal to be achieved (e.g., the type of AST to be constructed), and the programming language used. For example, it is relatively easy to construct ASTs from source code when all constructs in the language are handled in a uniform manner, that is, when everything from atomic tokens in the language to high level constructs (e.g., procedural refinements, relations between modules, units, etc.) is to be represented as nodes in ASTs in the same manner.
Attaching commands to individual productions in a grammar to achieve prescribed purposes is complicated. Adding semantic attributes to, or constraints on, the to-be-constructed ASTs further complicates the process. Achieving any specific goal, whether it is to construct a particular kind of AST from source code or to manipulate text using a grammar, requires attention to all such commands and the grammatical context in which they are executed. In particular, commands attached to any one production must take into account commands attached to other productions, which will also be executed when the source code is parsed. In addition, attributes derived by manipulating tokens associated with any one production may affect what manipulations are to be performed by other productions. Complications deriving from such context dependencies increase rapidly both with the number of dependencies and with the length of the grammar. Cyclic relationships in grammars further complicate the situation.
Given the large variety of programming languages and dialects, it is not practical to develop manual solutions for more than a small fraction of the many possible code variants. The present disclosure reveals a method that dramatically simplifies, and even automates, the creation of automated reverse engineering systems for constructing specified kinds of ASTs from source code, for example Flexform ASTs (originally called FLOWforms, Scandura, U.S. Pat. No. 5,262,761).
Editing, displaying, analyzing and manipulating ASTs also can be done in a relatively straightforward manner. Sotani (U.S. Pat. No. 5,481,711), for example, disclosed a method, motivated by YACC programs, for editing source text lexically and structurally by reference to its corresponding AST. Simonyi (U.S. Pat. No. 5,790,863) more recently received a patent for incrementally constructing, displaying and editing trees without parsing the text. His disclosure shows how trees representing computer programs can be constructed by a user as source code is written and/or edited, and how the underlying trees can be displayed in various programming languages. Scandura (1987, 1990, 1991, 1992, 1994) earlier showed other ways of accomplishing the same things. Once information is in an AST, it may be displayed, edited, manipulated and/or otherwise converted by writing code that applies directly to the AST. In general, ASTs make it easier to manipulate information associated with source code than when that information is in text form. For example, ASTs can be displayed by writing routines that operate directly on the ASTs. Displays also can be constructed by converting ASTs into a form that can be used by an existing display technology. Software systems are commonly available for displaying information in ASTs as simple tree views, data flow diagrams, structure charts, Nassi-Shneiderman charts and Flexforms (formerly known as FLOWforms, Scandura, U.S. Pat. No. 5,262,761), among others. The Simonyi disclosure also shows one way in which ASTs may be displayed in multiple languages.
The present disclosure is not concerned with editing ASTs or the text they may represent, nor is it concerned with displaying ASTs. Rather, it is concerned, as noted above, with simplifying the construction of ASTs, and with the automated analysis and manipulation of ASTs once they have been constructed.
In general, ASTs make it easier to automatically analyze and/or manipulate information associated with source code than when the information is in text form. Various scripting and other languages have been developed to facilitate the automated analysis, manipulation and/or conversion of ASTs. The goal of these languages typically is to make it possible to detect complex structural patterns and to manipulate those patterns.
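To illustrate the kind of automated analysis and manipulation such languages aim to support, the following hypothetical sketch (tuple-shaped AST nodes are our own convention, not any cited system's representation) detects a structural pattern, every reference to a given variable, and rewrites it:

```python
def rename(node, old, new):
    """Recursively rewrite every reference to variable `old` as `new`.

    Nodes are tuples of the form (kind, *parts); parts that are tuples
    are child nodes, other parts are leaf attributes such as names.
    """
    kind, rest = node[0], node[1:]
    if kind == "var":
        return ("var", new) if rest[0] == old else node
    return (kind,) + tuple(rename(c, old, new) if isinstance(c, tuple) else c
                           for c in rest)

# Pattern detection and manipulation in one pass over the tree:
ast = ("assign", ("var", "x"), ("+", ("var", "x"), ("num", 1)))
renamed = rename(ast, "x", "total")
```

A transformation like this, trivial on a tree, is unreliable on raw text, where a regular expression cannot distinguish an identifier from the same characters inside a string or a longer name.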
Although GUI interfaces and/or automated aids may facilitate the process, programmers still need to address the full complexity of a single AST representing an entire program (or set of programs). The high degree of human effort required limits the construction of automated analysis and/or conversion processes to common, general-purpose problems (e.g., converting from one language or operating system to another). Indeed, the cost of constructing automated technologies to satisfy application-specific conversion needs is prohibitive using current technologies.
Another major limitation is that programs are typically represented as single, comprehensive ASTs (trees) that are very difficult, if not impossible, for human beings to understand. Nodes in a traditional AST range from atomic level terminal tokens at the bottom of the tree to the highest level abstractions.
Furthermore, automatic processing of full uncompressed ASTs often takes too much time. It is well known that the size of an AST grows rapidly (exponentially) with the length of the source code being processed. Consequently, as the length of the source code increases, the time required to process an AST representing that source code quickly becomes too long for many practical purposes, whether for display, editing, analysis or manipulation. Experience with reverse engineering tools that use unitary repositories suggests that 10,000-12,000 lines of source code is near the upper range of what corresponding ASTs can represent while still being processed efficiently. At 100,000 lines of code, a figure at the lower range of industrial-strength systems, such systems become impractical (Doug Foley, personal communication).
Given a YACC/Bison style grammar and lexical analyzer, Scandura (e.g., 1987, 1992) has shown how the size of the ASTs necessary to represent a program can be dramatically reduced. Source code is represented in terms of: a) the textual statements or statement-level ASTs, b) module ASTs whose terminal nodes either contain those statements or reference the statement-level ASTs, c) ASTs representing relationships between the AST modules (e.g., call hierarchies) and d) unit-level ASTs representing relationships between the files (units) in a software system. In effect, individual statements, modules, call trees and unit hierarchies are represented as separate, though linked, ASTs. Partitioning AST representations of source code in this manner (hereafter referred to as partitioned or compressed ASTs) dramatically reduces the complexity of the individual ASTs needed to represent source code. The present disclosure reveals processes whereby source code may automatically be reverse engineered into partitioned ASTs, and whereby partitioned ASTs may automatically be analyzed and/or manipulated.
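The partitioning idea can be sketched as follows. In this hypothetical Python fragment (the module names and node shapes are ours, not Scandura's actual representation), statements remain as text in the terminal nodes of small per-module ASTs, while a separate call-hierarchy AST links modules by name rather than embedding one module's tree inside another's:

```python
# Each module is a small, self-contained AST; "call" nodes hold a
# reference (a name) instead of an embedded subtree.
modules = {
    "main": ("module", "main",
             ("seq",
              ("stmt", "total = 0"),       # statement kept as text
              ("call", "accumulate"))),    # link to another module AST
    "accumulate": ("module", "accumulate",
                   ("seq", ("stmt", "total = total + 1"))),
}

def call_tree(name):
    """Build the separate call-hierarchy AST by following 'call' references."""
    def calls(node):
        if node[0] == "call":
            yield node[1]
        children = node[2:] if node[0] == "module" else node[1:]
        for child in children:
            if isinstance(child, tuple):
                yield from calls(child)
    return (name, [call_tree(callee) for callee in calls(modules[name])])
```

Because each module AST stays small regardless of how large the overall system grows, a tool processing one module never touches the rest of the program, which is the source of the claimed efficiency gains.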
This reduction in size is especially important when automatically processing ASTs, especially in analysis and manipulation where reference is required to multiple parts of the ASTs. Instead of growing exponentially with the size of the ASTs, as when single ASTs are used to represent entire programs, experience shows that processing time increases essentially linearly.
Retaining full statements in a language as text in terminal elements of module ASTs makes it much easier for humans to understand and modify the associated code (e.g., Scandura, 1987). Representing statements as text in terminal nodes of module ASTs also facilitates human editing. Consequently, module ASTs, as well as the higher level ASTs linking them, may be referred to as compressed ASTs. (Full source code can be generated from such ASTs as desired.)
The present disclosure shows how automated processes for reverse engineering source code into ASTs may be constructed with automated support, requiring a minimum of human input. It also shows how, once such ASTs have been constructed, automated analysis and/or conversion processes may be constructed in a highly efficient manner, again with minimal human input. Special attention is given to constructing automated analysis and/or conversion processes by reference to smaller, more easily understood ASTs.
ASTs can be displayed by writing routines that operate directly on those trees. Displays also can be constructed by converting ASTs into a form that can be used by an existing display technology. Software systems are commonly available for displaying information in ASTs as simple tree views, data flow diagrams, structure charts, Nassi-Shneiderman charts and Flexforms (formerly known as FLOWforms, Scandura, U.S. Pat. No. 5,262,761), among others.
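As a minimal illustration of a routine that operates directly on an AST to produce a display, the following hypothetical sketch (the tuple node shape is our own assumption) renders a tree as an indented text outline, the simplest of the display forms mentioned above:

```python
def tree_view(node, depth=0):
    """Render an AST of (kind, *parts) tuples as indented lines of text."""
    kind, rest = node[0], node[1:]
    # Leaf attributes (non-tuple parts) are shown on the node's own line.
    label = kind + "".join(" " + repr(p) for p in rest if not isinstance(p, tuple))
    lines = ["  " * depth + label]
    for child in rest:
        if isinstance(child, tuple):
            lines.extend(tree_view(child, depth + 1))
    return lines

print("\n".join(tree_view(("assign", ("var", "x"), ("num", 1)))))
```

A converter targeting an existing display technology would follow the same traversal, emitting that technology's input format instead of plain text.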
Although GUI interfaces and/or automated aids may facilitate the process, programmers still need to address either the full complexity of a grammar or that of the corresponding uncompressed AST. To display information, for example, the programmer must either write commands for each production in a grammar or write a full display program, which analyzes and manipulates all of the information in the AST. (The latter code may either operate directly on ASTs or convert such trees into a form that can be used by existing code.)
In effect, the programmer must deal with the full complexity of either the grammar or the corresponding AST. Attributes derived by manipulating tokens associated with any one production may affect what manipulations are to be performed by another production. Complications deriving from such context dependencies increase rapidly both with the number of dependencies and with the length of the grammar. Cyclic relationships in grammars further complicate the situation. Similarly, levels in a traditional AST range from atomic-level terminal tokens at the bottom of the tree to the highest level abstractions. The programmer must be concerned both with assembling low-level tokens and with addressing higher level relationships.
Given the large variety of programming languages and dialects, not to mention the indefinitely large number of ways of manipulating code, it is not practical to develop manual solutions for more than a small fraction of the many possible code variants. The present disclosure reveals a method that dramatically simplifies the process of constructing automated processes for reverse engineering source code into compressed ASTs that can more easily be displayed, understood, analyzed, edited and manipulated.