1. Field of the Invention
The present invention relates to compilers and, more particularly, to methods and apparatus suitable for generation of improved compiler products.
2. Description of the Related Art
A computer program is typically written in a high-level programming language, such as Fortran, C, C++, Java, etc. Computer programs written in such high-level programming languages are also referred to as source programs. Source programs are composed of one or more source files. A compiler converts a source file into an object file. A linker combines one or more object files into an executable program. A computer that interprets (runs) the executable program will behave as specified in the source program.
Similar to textual writing, a source program in a high-level programming language is written as a sequence of characters. As in textual writing, these characters are grouped into “words”, which are called tokens. These tokens include identifiers, keywords, operators, and punctuation. FIG. 1 illustrates a portion 102 of a representative C++ program. The portion 102 includes: identifiers “foo”, “x”, and “y”; keywords “namespace”, “int”, and “long”; and punctuation “;”, “{”, and “}”, but does not include any operators.
Source programs typically use a sequence of tokens to name program symbols. The portion 102 illustrated in FIG. 1 declares three program symbols, a “namespace”, an “int” variable, and a “long” variable. The “namespace” is a container, and has the two variables as its members. Typically, programs use short context-dependent names for these symbols. The names are “foo”, “x”, and “y”, respectively. Occasionally, programs may use longer, and more complex, context-independent names. They are “::foo”, “::foo::x”, and “::foo::y”, respectively. These names include the names for the symbol's container as well as the identifier for the symbol itself.
Source programs may declare symbols with identical context-dependent names, but different context-independent names. FIG. 2 illustrates another portion 104 of a representative source program written in C++ and serves to declare a “namespace” and two “float” variables. These variables have context-dependent names “x” and “y”, which are identical to the “int” and “long” variables declared in FIG. 1. However, their context-independent names are “::babo::x” and “::babo::y”, which are different from the context-independent names for the other variables.
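By way of illustration only, the declarations described for FIGS. 1 and 2 might resemble the following C++ sketch. The figures themselves are not reproduced here, so the exact layout, the initial values, and the helper functions `read_x_from_inside` and `check_names` are assumptions introduced for this example.

```cpp
#include <cassert>

// Hypothetical reconstruction of the declarations described for FIG. 1.
namespace foo {
int x = 1;
long y = 2;
}

// Hypothetical reconstruction of the declarations described for FIG. 2.
namespace babo {
float x = 1.5f;
float y = 2.5f;
}

// Inside the container, the short context-dependent name is sufficient.
namespace foo {
inline int read_x_from_inside() { return x; }  // resolves to ::foo::x
}

// Outside the containers, the context-independent names tell the two
// symbols named "x" apart.
inline void check_names() {
    assert(::foo::x == 1);
    assert(::babo::x == 1.5f);
    // The two symbols are distinct objects despite sharing the name "x".
    assert(static_cast<void*>(&::foo::x) != static_cast<void*>(&::babo::x));
}
```

In this sketch, the unqualified name “x” is meaningful only within a particular namespace, whereas “::foo::x” and “::babo::x” are unambiguous anywhere in the program.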
In addition to the compiler, many other tools may need to refer to C++ program symbols. These other tools include a linker, a debugger, a performance analyzer, etc. Because the context for symbols' names is generally not accessible to other tools, they must refer to symbols with the longer, context-independent names. Moreover, these tools generally do not, and should not, understand the idiosyncratic syntax of the long names. Therefore, the compiler encodes a long, context-independent name into a single identifier. This encoded name is also called the linker name because the linker is the only required tool that must use the encoded name. For the purposes of most tools, the encoded name is the only name for a symbol, which avoids idiosyncratic processing. Thus, encoding symbol names is a useful mechanism and an important part of the compiling process.
An example encoded name for the symbol identified by “x” in FIG. 1 is “__1cDfooBx_”, where “__1” is a prefix identifying the particular encoding algorithm, “c” encodes the kind of symbol (function or, in this case, variable), “foo” is the context-dependent name for the container of the symbol, “D” encodes the length of the string “foo”, “x” is the context-dependent identifier for the symbol, “B” encodes the length of the string “x”, and “_” is the name terminator. Likewise, the symbol identified by “x” in FIG. 2 has the encoded name “__1cEbaboBx_”.
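The structure of such an encoded name can be sketched in C++ as follows. This is a minimal illustration and not the encoding algorithm of any particular compiler. The function names `length_letter` and `encode_name` are hypothetical, and the rule that a length n is written as the letter obtained by adding n to “A” is an assumption inferred from the examples above (“B” for 1, “D” for 3, “E” for 4).

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Assumption: a length n is written as the letter 'A' + n, which matches
// the examples in the text ("B" for 1, "D" for 3, "E" for 4). How lengths
// above 25 are written is not stated, so this sketch rejects them.
inline char length_letter(std::size_t n) {
    if (n > 25) {
        throw std::length_error("length not representable in this sketch");
    }
    return static_cast<char>('A' + n);
}

// Encodes a context-independent name, given as its components from the
// outermost container to the symbol itself, e.g. {"foo", "x"}.
// `prefix` identifies the encoding algorithm; `kind` is the symbol kind.
inline std::string encode_name(const std::string& prefix, char kind,
                               const std::vector<std::string>& components) {
    std::string out = prefix;
    out += kind;
    for (const std::string& part : components) {
        out += length_letter(part.size());  // length of the component
        out += part;                        // the component itself
    }
    out += '_';  // name terminator
    return out;
}
```

Under these assumptions, calling `encode_name` with the kind “c” and the components “foo” and “x” yields the chosen prefix followed by “cDfooBx_”, and the components “babo” and “x” yield the prefix followed by “cEbaboBx_”. Each additional level of container adds its full name plus a length code to the result.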
Unfortunately, however, the encoded symbol names are typically substantially longer than the original context-dependent names. As can be seen from the encoded names for the symbols of FIGS. 1 and 2, even in the simplest cases, the encoding may result in encoded identifiers that are ten times longer than the original context-dependent identifier. Moreover, in more practical applications, symbols can have many levels of containers, with many containers having very complex names. In such cases, the encoded identifiers for symbols can be very long. Encoded identifiers in excess of 5000 characters have been reported in some applications. The impact of long identifiers is further compounded when they are used as part of application-specific data for non-critical applications. For example, debuggers typically require more data than is necessary for strict interpretation of the program. This data will use encoded names, which makes the size of the debugging data sensitive to the size of the encoded identifiers.
In any event, the encoded identifiers are used in the generation of computer products, such as object programs, executable programs, debugging information, etc. As a result, the length of the encoded identifiers will have a substantial impact on the overall size of the compiler products. Therefore, having long encoded identifiers in computer products not only adversely affects compilation time, but also yields compiler products that require large amounts of storage.
Conventionally, some efforts have led to a reduction in the size of encoded function names in isolated circumstances. In programming languages with overloaded functions, such as C++, the encoded name must often include a description of function parameter types. Since such descriptions could become long and are often repeated, one conventional approach has avoided the need to fully specify repeated parameters and thus reduced the size of an encoded name. This conventional approach employs a special marker noting that a parameter is repeated some number of times, so that the complete description of a parameter type is needed only the first time it appears in a parameter list. Further, programming languages with very complex names, such as C++, often have duplicated types and symbols across the entire structure of the name, not just among parameters. Since these duplicated types and symbols could become long and are often repeated, avoiding the complete specification of duplicated types and symbols can also help reduce the size of an encoded name. Another conventional approach has been to assign a unique number to each type and symbol during encoding, and then emit a special marker and the type or symbol number for a duplicate type or symbol, rather than emitting the full encoding for the type or symbol. These conventional approaches have not satisfactorily reduced the size of encoded symbol names or the size of compiler products and thus the problems mentioned above remain.
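The two conventional approaches described above can be sketched in C++ as follows. This is an illustration under stated assumptions only: the marker formats (“R” followed by a count, and “#” followed by a number and “;”) and the function names are hypothetical and do not correspond to any particular compiler. For simplicity, the first function collapses only adjacent repeats.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Approach 1 (hypothetical format): collapse a run of identical adjacent
// parameter encodings into the first encoding followed by "R<count>",
// where <count> is the number of additional repeats.
inline std::string encode_with_repeat_marker(
        const std::vector<std::string>& params) {
    std::string out;
    for (std::size_t i = 0; i < params.size();) {
        std::size_t j = i + 1;
        while (j < params.size() && params[j] == params[i]) ++j;
        out += params[i];  // full description, first occurrence only
        if (j - i > 1) out += "R" + std::to_string(j - i - 1);
        i = j;
    }
    return out;
}

// Approach 2 (hypothetical format): number each distinct type or symbol
// in order of first appearance, starting at 0, and emit each later
// duplicate as "#<number>;" instead of its full encoding.
inline std::string encode_with_back_references(
        const std::vector<std::string>& parts) {
    std::unordered_map<std::string, std::size_t> seen;
    std::string out;
    for (const std::string& part : parts) {
        auto it = seen.find(part);
        if (it == seen.end()) {
            seen.emplace(part, seen.size());  // assign the next number
            out += part;
        } else {
            out += "#" + std::to_string(it->second) + ";";
        }
    }
    return out;
}
```

In this sketch, a reference to a duplicate saves space only when the full encoding is longer than the marker and number that replace it. Every distinct type or symbol is still emitted in full at least once in each encoded name.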
In view of the foregoing, there is a need for improved methods and apparatus for generating improved compiler products, specifically, reducing the storage impact of encoded names.