Field
The present disclosed embodiments relate generally to compiling programs written in high-level languages, and more specifically to compile-time operations on high-level languages.
Background
A large number of embedded processors are deployed in cost-sensitive but high-volume markets, where even modest savings in unit cost can lead to a substantial overall cost reduction. Embedded processors typically use a system-on-chip (SoC) architecture, where a plurality of processors are arranged with other components, such as memories and peripherals, on a single SoC. Memory typically occupies the largest portion of an SoC, and hence contributes most to the overall cost. The memory has to be large enough to store a full image of the executable code (the binary executable). As a consequence, any reduction in code size translates directly to equivalent savings in die area and, ultimately, unit cost.
Computer software is often written in coding languages that are interpretable by humans, but not by machines. These high-level languages include C, C++, Fortran, and Java, to name just a few. In order to run a program written in such a high-level language, a compiler translates the source code (e.g., c = a / b) into assembly code (e.g., LOAD A; DIV B; SAVE C; STOP), a low-level language whose commands have a 1:1 correspondence with the machine instructions understood by the computing device hardware, and then into binary machine code (e.g., 100100 100010101 100100111111), which is directly readable by the computing device hardware. When a user runs the software on the computing device, the operating system loads the machine code and executes it on the computing device hardware.
Large source codes often contain duplicate or near-duplicate blocks of code, which result in bloated machine code (increased code size) and poor instruction cache hit rates. Duplicate code can be caused by copying and pasting of code by software developers, by the use of certain programming techniques (e.g., C macros), or as an artifact of language implementations (e.g., C++ templates). Bloated machine code unnecessarily taxes memory resources, while poor cache hit rates degrade performance. Thus, various optimizations have been added to compilers to reduce duplicate or near-duplicate instances of code during compilation to assembly.
Many optimizations performed during compilation of an application expressed in a high-level programming language (e.g., C, C++, Fortran, Java) call for a quick similarity assessment between large code fragments. However, detecting near-duplicate blocks of code, as well as removing duplicated and near-duplicated code, is non-trivial. A full comparison is most often prohibitively complex and slow, and the semantic comparison problem has historically proven extremely hard to solve, especially in a reasonable amount of time. Simple string matching cannot detect near duplicates, so an algorithm would essentially have to compare every code block to every other code block, statement by statement, to identify duplicates or near-duplicates. De-duplication is equally non-trivial, since duplicated code must be extracted from locally modified code without changing the meaning of the program (program semantics).