1. Field of the Invention
Embodiments of the present invention relate, in general, to systems and methods for managing duplicate code and, more particularly, to eliminating functionally equivalent code at link time.
2. Relevant Background
Computer systems carry out processes defined by a collection of instructions. These instructions are written in high-level programming languages such as Basic, Fortran, C, C++, and the like. Using compilers, assemblers, and linkers, these processes are converted into instructions that can be executed by a machine. While the high-level language (also referred to herein as source code) is readily understandable by humans, its conversion into machine language, also known as executable code, transforms the instructions into a series of zeros and ones that is essentially indecipherable by a human.
Software of any considerable size and function is typically written in modules. These modules are discrete functional portions of the overall software package that, when combined, interact to form the desired software product. Each of these modules is generally compiled separately into what is commonly known in the art as object code. The various modules of object code can be easily relocated and linked, forming a product that can be executed by a machine. There are considerable advantages to developing code in such a modular format, but along with such advantages also come disadvantages.
As is known to one skilled in the relevant art, a compiler attempts to optimize each module of source code by using relatively simple rules and functions. For example, a compiler can recognize a certain source code command as carrying out a series of steps such as a summation or a multiplication. Rather than generating machine code for each of the steps necessary to access and manipulate various registers for a simple function, the compiler identifies the process as one that is common and implements it in a standard fashion.
However, each module of a large software product is typically compiled individually. A compiler cannot examine the entire software product from a global perspective to view the behavior or the role that any one module may play. As a result, functionality developed within individual modules is often duplicated by other modules. It is well known in the art of computer science and software programming that source code is written with a great deal of duplication. While each module attempts to efficiently achieve its assigned task, each is written in relative isolation, and achieving that task is likely to involve the same functionality as another module within the software product. This form of duplication is compounded by compilers generating the same assembly code for different source code constructs.
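By way of illustration only, the following minimal C sketch shows two different source code constructs for which an optimizing compiler will commonly, though not necessarily, generate the same instructions. The function names `times_eight`, `shift_left_three`, and `check_equivalent` are hypothetical and are not part of any described embodiment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical function from one module: multiply by a constant. */
uint32_t times_eight(uint32_t x)
{
    return x * 8u;
}

/* Hypothetical function from another module: shift left by three bits.
 * For unsigned 32-bit values both operations wrap modulo 2^32, so the
 * two functions return the same result for every input. */
uint32_t shift_left_three(uint32_t x)
{
    return x << 3;
}

/* Spot-check the equivalence over a sample of inputs, including values
 * large enough to exercise the wrap-around case. */
void check_equivalent(void)
{
    const uint32_t samples[] = { 0u, 1u, 5u, 0x1FFFFFFFu, 0x20000000u, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        assert(times_eight(samples[i]) == shift_left_three(samples[i]));
    }
}
```

If these two functions were placed in separately compiled modules, each module's object code could carry its own copy of what is, in effect, the same routine.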
The result is that within the machine executable code of a particular process there exists a vast amount of duplication. This duplicate code increases the overall size of a project, requiring additional valuable storage capacity, and it can also slow overall product performance due to increased instruction cache (I-cache) access latency. It is estimated that common computer systems such as Linux, Windows, and Java are composed of as much as 20-30% duplicate code. Generally the culprit of such duplication is an overreliance on high-level programming language abstractions. Abstractions are difficult to conceive and use; thus, once formed, the tendency is to duplicate them rather than modify them for efficiency. Thus software machine code is littered with portions of code that are either exactly or functionally identical to other portions of code.
Linkers do possess a global view of all of the modules linked together to form a software product. Recall that linkers function to join these modules together into the overall software product. However, linkers possess limited functionality. Generally, linkers collate code and data and form a binary file for execution and, in some cases, identify and remove functions that are never used. This is referred to in the art as removing dead code. However, two separate blocks of code possessing the same functionality, each accessed by a separate portion of the product, would go unnoticed by the linker. Linkers lack the information necessary to reliably disassemble the code sections into functions and basic blocks from which a duplication of code could be detected. Linkers also lack the ability to manage such duplicate code once found.
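To make the detection problem concrete, the following C sketch assumes that the boundaries of each function are already known, which is precisely the information the preceding paragraph notes a conventional linker lacks. The `CodeBlock` type and the functions `find_earlier_duplicate` and `redundant_bytes` are hypothetical names introduced for illustration. The sketch finds only byte-for-byte identical blocks and does not address functionally equivalent code that differs in its encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one function's machine code. */
typedef struct {
    const char *name;            /* symbol name, for reporting only */
    const unsigned char *bytes;  /* start of the function's code */
    size_t size;                 /* length of the code in bytes */
} CodeBlock;

/* Return the index of the first earlier block whose bytes are identical
 * to blocks[idx], or -1 if no earlier block matches. */
long find_earlier_duplicate(const CodeBlock *blocks, size_t idx)
{
    assert(blocks != NULL);
    for (size_t i = 0; i < idx; ++i) {
        if (blocks[i].size == blocks[idx].size &&
            memcmp(blocks[i].bytes, blocks[idx].bytes, blocks[idx].size) == 0) {
            return (long)i;
        }
    }
    return -1;
}

/* Sum the sizes of all blocks that exactly repeat an earlier block,
 * i.e. the bytes that could in principle be removed. */
size_t redundant_bytes(const CodeBlock *blocks, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; ++i) {
        if (find_earlier_duplicate(blocks, i) >= 0) {
            total += blocks[i].size;
        }
    }
    return total;
}
```

The pairwise comparison is quadratic in the number of blocks and is chosen here for clarity only. Removing a duplicate would additionally require redirecting every reference to the surviving copy, which this sketch does not attempt.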