The present disclosure relates to techniques for representing and handling a system, in particular for representing and handling a system that can be represented as a heterogeneous tree and/or graph structure. Such a system may, for example, be a language such as a computer-programming language, or a portion thereof.
The present disclosure also relates to techniques for generating an implementation of a data structure representative of such a system, and further to techniques for handling an instance of the system using such an implementation. In the example context of the system being a computer-programming language, such an instance may be a code portion expressed in that language.
Such a code portion may be considered to be a structured expression, comprising a set of interrelated symbols. Techniques embodying the present invention may be employed to perform analysis and/or manipulation in respect of the code portion. In this regard, the present disclosure relates to parsing and compiling techniques and to metaprogramming techniques.
Languages are one example of a type of system to which embodiments of the present disclosure may be applied. Languages are a form of code, and computer-programming languages are one example of code that is heavily used in modern-day technical systems. Although the present disclosure is hereinafter mainly presented in relation to computer programs, it will be appreciated that the disclosure may apply to other types of code equally (and indeed to systems other than languages).
In general, a language can be considered to be a system of arbitrary symbols and rules that define how these symbols may be manipulated. That is, languages are not just sets of symbols. They also contain a grammar, or system of rules, used to manipulate the symbols. Whilst a set of symbols may be used for expression or communication, the set of symbols alone is relatively inexpressive because there are no clear or regular relationships between the symbols. Because a language also has a grammar, one can manipulate its symbols to express clear and regular relationships between them. A programming language is an artificial language that can be used to control the behavior of a machine, particularly of a computer. Programming languages, like human languages, are defined through the use of syntactic and semantic rules, which determine structure and meaning respectively.
Evaluating code can be a time- and labor-intensive task. Such evaluation may, for example, include understanding the code, using the code to perform a task, and manipulating the code to alter it in some way.
One example of an activity that involves code evaluation is metaprogramming. On the broadest level, metaprogramming can be considered to be (for example) analyzing, manipulating, reconfiguring, improving or simplifying existing program code. Metaprogramming can also be considered to be the writing of computer programs that write or manipulate other programs (or themselves) as their data, or that do at compile time part of the work that would otherwise be done at run time. In many cases, this allows programmers to get more done in the same amount of time as they would take to write all of the code manually. Metaprogramming often involves the dynamic execution of string expressions that contain programming commands. Thus, programs can write programs.
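By way of illustration only, the notion of a program that writes another program through the dynamic execution of a string expression may be sketched as follows. The sketch is written in Python, which is not a language named in the present disclosure, and the function name make_power_function is hypothetical.

```python
def make_power_function(exponent):
    """Generate, execute and return source code for a function computing x ** exponent.

    The chain of multiplications is unrolled when the source text is written,
    so no loop runs when the generated function is later called.
    """
    body = " * ".join(["x"] * exponent) if exponent > 0 else "1"
    source = f"def power(x):\n    return {body}\n"
    namespace = {}
    exec(source, namespace)  # dynamic execution of a generated string expression
    return namespace["power"], source


# The generating program writes a small program (cube_source) and obtains
# a callable (cube) from it.
cube, cube_source = make_power_function(3)
```

Here the generating program performs, when it writes the source text, work (unrolling the multiplication) that a hand-written loop would otherwise perform each time the function is called.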
A metaprogramming procedure typically involves being able to look at, or represent, program code in a useful and informative way. The evaluation of computer programming code, for example for the purpose of metaprogramming, normally involves the task of parsing. Parsing (also known as “syntactic analysis”) is the process of analyzing a sequence of tokens (normally extracted from a portion of code) to determine its grammatical structure with respect to a given formal grammar for the language concerned. Parsing transforms input text into a data structure, usually a tree, which is suitable for later processing and which captures the implied hierarchy of the input code. Lexical analysis creates tokens from a sequence of input characters (i.e. from the input code) and it is these tokens that are processed by a parser to build a data structure such as a parse tree or abstract syntax tree for the instance concerned (i.e. for the code portion).
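Purely as a minimal sketch of the lexical-analysis step described above, the following Python fragment splits an input character sequence into tokens. The token classes NAME, NUMBER and OP, and the function name tokenize, are assumptions made for the purpose of illustration and are not taken from the present disclosure.

```python
import re

# Hypothetical token classes for a tiny algebraic language.
TOKEN_PATTERN = re.compile(
    r"(?P<NAME>[A-Za-z_]\w*)|(?P<NUMBER>\d+)|(?P<OP>[+\-*/()])"
)


def tokenize(text):
    """Return a list of (token_class, token_text) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():  # whitespace separates tokens but is not one
            pos += 1
            continue
        match = TOKEN_PATTERN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize("a+b*c")
```

The resulting list of tokens is the kind of input that a parser would then process to build a data structure such as a parse tree or abstract syntax tree.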
The most common use of a parser is as a component of a compiler. This parses the source code of a computer-programming language to create some form of internal representation. By way of example, FIG. 1 is a schematic diagram that demonstrates a simplified example of parsing input code written in a particular computer programming language. In this example, the computer programming language has two levels of grammar, namely lexical and syntactic.
As can be seen from FIG. 1, the first stage of parsing can involve token generation, or lexical analysis, by which the input character stream of the input code is split into meaningful symbols defined by a grammar of regular expressions. In the example of FIG. 1, the input code “a+b*c” is examined and split into the tokens a, +, b, * and c, each of which is a meaningful symbol in the context of an algebraic expression. The parser would contain rules to tell it that the characters * and + mark the start of a new token, so meaningless tokens like “a+” would not be generated. The next stage is syntactic parsing or syntactic analysis, which is checking that the tokens form an allowable expression. A result of this stage could be the building of a data structure, such as the abstract syntax tree shown in FIG. 1. A final phase (not shown in FIG. 1) could be semantic parsing or analysis, which involves working out the implications of the expression just validated and taking the appropriate action. In the case of a compiler, this could involve the generation of object code from the input source code.
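The syntactic-analysis stage described above may, by way of example only, be sketched in Python as follows. The sketch assumes that the tokens have already been generated, and it uses a precedence-climbing approach, which is merely one of several possible techniques and is not asserted to be the technique underlying FIG. 1. The precedence table, the nested-tuple form of the tree and the function names are all hypothetical.

```python
# Hypothetical precedence table: a higher number binds more tightly.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def parse(tokens):
    """Return a nested-tuple syntax tree for a flat list of token strings."""
    tree, pos = _parse_expression(tokens, 0, 1)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


def _parse_operand(tokens, pos):
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    token = tokens[pos]
    if token == "(":
        tree, pos = _parse_expression(tokens, pos + 1, 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return tree, pos + 1
    if token in PRECEDENCE or token == ")":
        raise SyntaxError(f"expected an operand, found {token!r}")
    return token, pos + 1


def _parse_expression(tokens, pos, min_precedence):
    left, pos = _parse_operand(tokens, pos)
    while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_precedence:
        operator = tokens[pos]
        # The right operand may only absorb operators that bind more tightly,
        # which makes operators of equal precedence left-associative.
        right, pos = _parse_expression(tokens, pos + 1, PRECEDENCE[operator] + 1)
        left = (operator, left, right)
    return left, pos


tree = parse(["a", "+", "b", "*", "c"])
```

In this sketch a token sequence that does not form an allowable expression, such as the sequence a, + with no second operand, causes a SyntaxError to be raised rather than a tree to be built.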
An abstract syntax tree is a data structure that emulates a tree structure with a set of linked nodes. An abstract syntax tree is commonly used to represent the “inherent” logical structure of a code portion. Such a logical structure is considered “inherent” because it is ultimately based upon the grammar for the language concerned. For example, the abstract syntax tree of FIG. 1 may be considered to represent the inherent logical structure of the input code stream of FIG. 1. However, this structure is only inherent because of the rules of precedence defined in the grammar for the language concerned. It is therefore assumed in FIG. 1 that the grammar for the language concerned states that the multiply sign * takes precedence over the addition sign +. Therefore, based on this grammar, the “inherent” logical structure associates the operand tokens “b” and “c” with the operator token “*”, and similarly associates the operand token “a” and the combined operand token “b*c” with the operator token “+”.
A node may contain a value or a condition, or represent a separate data structure or a tree of its own. Each node in a tree has zero or more child nodes, which are below it in the tree (by convention, trees of this sort grow downwards). A node that has a child is called the child's parent node. The uppermost node in a tree is called the root node. Being the uppermost node, the root node will generally not have any parents and is the node at which operations on the tree commonly begin (although some algorithms of course may begin at other nodes of the tree, for example at the leaf nodes). All other nodes in the tree can be reached from the root node by following the links between the nodes. Such links are commonly referred to as edges. Although a tree has only one root node, other nodes in a tree can be seen as the root node of a subtree rooted at that node. Nodes at the lowermost level of the tree are called leaf nodes, or terminal nodes, or simply terminals. Since they are at the lowermost level, they do not have any children. An internal node (or inner node, or branch node, or non-terminal node, or simply non-terminal) is any node of a tree (other than the root node) that has child nodes, and is thus not a leaf node. A subtree is a portion of a tree data structure that can be viewed as a complete tree in itself.
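The terminology set out above may be illustrated by the following minimal Python sketch, in which the class Node and the helper functions leaves and internal_nodes are hypothetical names introduced only for this example. The tree constructed at the end corresponds to the expression “a+b*c”.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)  # zero or more child nodes

    def is_leaf(self):
        return not self.children


def leaves(node):
    """Collect the labels of the leaf (terminal) nodes, left to right."""
    if node.is_leaf():
        return [node.label]
    result = []
    for child in node.children:
        result.extend(leaves(child))
    return result


def internal_nodes(root):
    """Collect labels of nodes, other than the root, that have child nodes."""
    found = []

    def visit(node, is_root):
        if node.children and not is_root:
            found.append(node.label)
        for child in node.children:
            visit(child, False)

    visit(root, True)
    return found


# Root node "+" with a leaf child "a" and a subtree rooted at "*".
root = Node("+", [Node("a"), Node("*", [Node("b"), Node("c")])])
```

Both helper functions begin at the root node and reach every other node by following the edges from parent to child.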
An abstract syntax tree (AST) may be defined as a finite, labelled, directed tree, where the internal nodes are labelled by operators, and the leaf nodes represent the operands of the operators (as is the case in FIG. 1). Therefore, the leaves are null operators and only represent variables or constants. An AST is normally used in a parser as an intermediate between a parse tree and a data structure, the latter of which is often used as a compiler or interpreter's internal representation of a computer program while it is being optimized and from which code generation is performed. An AST differs from a parse tree by omitting nodes and edges for syntax rules that do not affect the semantics of the program. For example, grouping parentheses are normally omitted from an AST, since the grouping of operands is explicit from the tree structure. This again can be appreciated from consideration of FIG. 1. In FIG. 1, the input code stream could for example have been “a+(b*c)”, the grouping parentheses making the rules of precedence clear. The abstract syntax tree for this code stream would however still be the same as that shown in FIG. 1, the intended grouping (or rules of precedence) being explicit from the tree structure.
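The omission of grouping parentheses from an abstract syntax tree can be observed, by way of example only, using the ast module of the Python standard library. Python and its grammar are used here merely as a stand-in for “the language concerned”; they are not the language of FIG. 1.

```python
import ast

# Parse three expressions into abstract syntax trees.
plain = ast.parse("a+b*c", mode="eval")
grouped = ast.parse("a+(b*c)", mode="eval")
regrouped = ast.parse("(a+b)*c", mode="eval")

# Redundant parentheses leave no trace in the tree ...
same_tree = ast.dump(plain) == ast.dump(grouped)

# ... whereas parentheses that change the grouping change the tree.
different_tree = ast.dump(plain) != ast.dump(regrouped)

# Name of the operator at the root of the first tree.
top_operator = type(plain.body.op).__name__
```

In this sketch same_tree and different_tree are both true, and the operator at the root of the tree for “a+b*c” is the addition operator, consistent with multiplication taking precedence over addition.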
It will be appreciated that a tree in the context of the present disclosure is an example of a logical structure or a data structure. Typically, each node other than the root node in such trees has at most one parent node. However, in the context of the present disclosure, it will become apparent that such “trees” are employed where some nodes have more than one parent, such that the structure becomes more like a Directed Acyclic Graph (DAG) than a tree. Accordingly, although the present disclosure is predominantly described with respect to logical structures and data structures taking the form of abstract syntax trees, it will be appreciated that structures other than traditional trees, e.g. graphs, are intended. Trees can be considered to be a special form of graph. In graph theory, a tree is a connected acyclic graph. DAGs can be considered to be a generalization of trees in which certain subtrees can be shared by different parts of the tree. In a tree with many identical subtrees, this can lead to a drastic decrease in space requirements to store the structure.
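The space saving obtainable by sharing identical subtrees may be sketched, under stated assumptions, as follows. The sketch stores each structurally distinct node once in a lookup table, a technique sometimes referred to as hash-consing, which is introduced here only for illustration and is not asserted to be the approach of the present disclosure. The class name DagBuilder and the other names are hypothetical.

```python
class DagBuilder:
    """Stores each structurally distinct node once, so equal subtrees are shared."""

    def __init__(self):
        self._interned = {}

    def node(self, label, *children):
        candidate = (label, children)
        # Return the stored node if an equal one exists, otherwise store this one.
        return self._interned.setdefault(candidate, candidate)

    def stored_nodes(self):
        return len(self._interned)


def tree_size(node):
    """Number of nodes if every shared subtree were copied out separately."""
    _label, children = node
    return 1 + sum(tree_size(child) for child in children)


dag = DagBuilder()


def product():
    return dag.node("*", dag.node("b"), dag.node("c"))


# (b*c) + (b*c): both operands of "+" are one and the same stored subtree.
expr = dag.node("+", product(), product())
```

For this expression, a conventional tree would hold seven nodes, whereas the shared structure stores four, the subtree for “b*c” having two parent edges from the node “+”.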
It is emphasized that the data structures (e.g. representing abstract syntax trees and/or graphs) discussed above with reference to FIG. 1 are considered to be “instance” data structures, because they represent instances of the system concerned. The input code of FIG. 1 is an instance of the language in which it is expressed, or, put another way, it is an expression written in that language.
Systems of the present disclosure, such as languages, may themselves be represented by data structures. Those “system” data structures may take the form of abstract syntax trees and/or graphs. For example, such a “system” data structure for a language may represent the organization of components of the language and its rules of grammar, such that expressions in that language are instances of that data structure (i.e. “instance” data structures).
Typically, a tool for handling an instance of a system employs an implementation of the system in order to carry out such handling. For example, such a tool may employ an implementation of the “system” data structure representative of the system in order to handle instances of the system. In the context of a computer-programming language, the tool may be a parser or compiler and may use an implementation of the abstract syntax tree representative of the language to handle code portions written in that language. For example, the tool may generate an “instance” data structure (which itself may be an abstract syntax tree) representative of the code portion based on the implementation of the “system” data structure.
Focusing on computer programs (code portions that are instances of a computer-programming language), there are a number of different types of AST that can be used to represent them. The most common type of AST is a heterogeneous tree structure, where each type of construct in the tree is represented by a specific data structure, since this provides a compact representation that can have construct-specific behaviors associated with it. This type of “instance” structure can be implemented by manually written code, but is more typically generated automatically by a software tool (for example, an AST code generator). Such a tool takes as its input a concise description of a desired logical structure (representative of an implementation of the system, in this case of the language). Given a specific candidate code portion also input to the tool, it generates, as its output, means for enabling an “instance” data structure to be generated based on the “system” logical structure in order to implement the candidate code portion. In this context, the term implementation may be considered to refer to the act of enabling an “instance” data structure to be generated based on the “system” logical structure in order to implement the candidate code portion concerned. There may also be provided an interface for using the generated data structure.
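A heterogeneous tree structure generated from a concise description may be sketched, by way of example only, as follows. The description format (a mapping from construct name to field names), the function generate_node_classes and the function evaluate are hypothetical and greatly simplified; they are not the format or output of any particular existing AST code generator.

```python
from dataclasses import fields, make_dataclass


class AstNode:
    """Common base class so that generated node classes can be recognized."""


# Hypothetical concise description: construct name -> ordered field names.
DESCRIPTION = {
    "Var": ["name"],
    "Add": ["left", "right"],
    "Mul": ["left", "right"],
}


def generate_node_classes(description):
    """Create one specific data structure (class) per construct described."""
    return {
        name: make_dataclass(name, field_names, bases=(AstNode,), frozen=True)
        for name, field_names in description.items()
    }


classes = generate_node_classes(DESCRIPTION)
Var, Add, Mul = classes["Var"], classes["Add"], classes["Mul"]


def evaluate(node, env):
    """Construct-specific behavior, selected according to the type of the node."""
    if isinstance(node, Var):
        return env[node.name]
    if isinstance(node, Add):
        return evaluate(node.left, env) + evaluate(node.right, env)
    if isinstance(node, Mul):
        return evaluate(node.left, env) * evaluate(node.right, env)
    raise TypeError(f"unsupported node type {type(node).__name__}")


# An "instance" structure for the code portion a+b*c, built from generated classes.
tree = Add(Var("a"), Mul(Var("b"), Var("c")))
```

For example, evaluating the tree with a, b and c bound to 1, 2 and 3 respectively yields 7. Each construct has its own class with its own fields, which is what makes the structure heterogeneous.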
It has been found that existing tools and techniques for implementing “system” data structures, and for using the implementation to generate and handle “instance” data structures representative of instances of the system concerned, suffer from a number of problems. In particular, it has been found that existing implementations in the field of computer programming are inflexible, and cause problems for techniques such as metaprogramming. Those existing tools and techniques have been found to be complicated to use, and to demand a large amount of time and effort from the programmer. The technical features of these existing tools and techniques responsible for these problems will become apparent later herein. Nevertheless, it is desirable to solve some or all of such problems.
It is desirable to provide an implementation of a system, such as a computer-programming language, which enables instances of that system, such as code portions, to be handled (manipulated, evaluated, analyzed, modified, transformed, etc.) in a flexible and efficient way.