1. Field
Automatic parallelizers within binary rewriters may be relevant to the field of computing. Specifically, such binary rewriters may improve both the functional structure of computer programs and the physical structure of their recording media in a variety of ways.
2. Description of the Related Art
Many infrastructures and tools have been built for binary rewriting, object-code rewriting, or disassembly without rewriting of the code. These include IDA, Objdump, Etch, Squeeze and Squeeze++, DynInst, OM, ATOM, ALTO, PLTO, Spike, and Diablo. Of these, IDA and Objdump are disassemblers only and do not attempt to rewrite code.
Binary rewriters are tools, often implemented using software running on hardware, that accept a binary executable program as input, and produce an improved executable as output. The output executable usually has the same functionality as the input, but is improved in one or more metrics, such as run-time, energy use, memory use, security, or reliability.
Binary rewriting can provide advantages even to highly optimized binaries produced by the best industrial-strength compilers. One reason is that separate compilation is an important practical requirement; hence, most compilers compile each procedure separately. In contrast, binary rewriters can perform inter-procedural optimizations that are missing even in optimized code. Additionally, it is more economically feasible to implement a transformation once in a binary rewriter, rather than repeatedly in each of the many compilers for an instruction set. Furthermore, unlike compiler-implemented technology, a code transformation implemented in a binary rewriter is applicable to code produced from any programming language, including assembly code, with no additional effort. Finally, binary rewriters can be used to enforce security rules on to-be-executed code. Although a compiler might, in theory, be able to enforce security, a developer may, maliciously or otherwise, simply not use a compiler with security enforcement; a binary rewriter can enforce security rules regardless of which compiler produced the code.
Binary rewriting has many applications including inter-procedural optimization, code compaction, security-policy enforcement, preventing control-flow attacks, cache optimization, software caching, and distributed virtual machines for networked computers.
The reason for the great research interest in binary rewriting is that it offers many features that are not conventionally available with compiler-produced optimized binaries. For example, binary rewriters can have the ability to do inter-procedural optimization. Many existing commercial and open-source compilers use separate compilation, i.e., they compile each procedure separately and independently from other procedures. The reason for this separate processing is that programs are typically distributed among several files, and to keep compile times low in the typical repeated debug-recompile cycle during development, it is important to recompile only the files that have changed since the last compile. Thus, files are compiled separately. Maintaining correctness for functions called across files usually requires that functions also be compiled separately. For example, this is the case with GCC, the most widely used open-source compiler in commercial use, even at its highest level of optimization.
In contrast, binary rewriters have access to the entire program, not just one procedure at a time. Hence, unlike in a separate compiler, inter-procedural optimizations become possible.
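The difference between per-procedure and whole-program visibility can be illustrated with a minimal sketch. The function names below (scale, total_separate, total_whole_program) are hypothetical and chosen only for illustration; the sketch uses Python rather than machine code, and it shows one possible inter-procedural transformation (inlining followed by constant propagation), not the behavior of any particular rewriter.

```python
def scale(x, factor):
    # Assumed to live in a separately compiled file: a compiler working on
    # the caller alone cannot see this body.
    return x * factor


def total_separate(values):
    # Per-procedure view: every element pays for a call to scale(),
    # and the constant argument 2 cannot be folded into the callee.
    return sum(scale(v, 2) for v in values)


def total_whole_program(values):
    # Whole-program view: with both bodies visible, scale() can be inlined
    # and the constant factor propagated, removing the call entirely.
    return sum(v * 2 for v in values)
```

Both versions compute the same result; only the second form is available to a tool that sees the caller and the callee together.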
Another difference between binary rewriters and compilers is increased economic feasibility. It is more economically feasible to implement a code transformation once for an instruction set in a binary rewriter, rather than repeatedly for each compiler for the instruction set. For example, the ARM instruction set has over thirty compilers available for it, and the x86 has a similarly large number of compilers from different vendors and for different source languages. The high expense of repeated compiler implementation often cannot be supported by the small fraction of the demand that each compiler serves.
Furthermore, a binary rewriter works for code produced from any source language by any compiler.
Additionally, binary rewriters can work for hand-coded assembly routines. Code transformations cannot be applied by a compiler to hand-coded assembly routines, since they are never compiled. In contrast, a binary rewriter can transform such routines.
Consequent to these advantages, a number of binary rewriters, disassemblers and object-code rewriters have been built, mostly in academia. These include IDA, Objdump, Etch, Squeeze and Squeeze++, DynInst, OM, ATOM, ALTO, PLTO, Spike, and Diablo.
Meanwhile, a more specific area of programming that has been underdeveloped is parallelization. Increasing transistor budgets have made multiple cores the industry norm for commodity processors. A cessation of clock speed improvements has made it imperative to gainfully use multiple cores to sustain continued improvements in execution times. One challenge is to improve the run-time of single programs. Programs can be rewritten in an explicitly parallel manner to take advantage of multiple cores. However, rewriting programs by hand is extraordinarily time-intensive and expensive, especially considering the vast repository of serial code worldwide, developed at enormous expense over the last several decades.
Extracting parallelism automatically from serial programs has been done using compilers. For example, compilers such as Polaris, SUIF, pHPF, PROMIS, and Parafrase-2 automatically parallelize affine-based loops. A compiler by Kathryn S. McKinley parallelizes loops with arbitrary control flow. Non-loop parallelism has been extracted in compilers such as OSCAR, PROMIS, and CASCH.
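As an illustration of what parallelizing an affine-based loop involves, the following minimal sketch shows a loop whose array subscripts are affine functions of the loop index (here simply i), so that no iteration reads a value written by another. The names serial_saxpy and parallel_saxpy and the workers parameter are assumptions introduced for this example; the chunking scheme is one simple possibility and is not the method of any of the compilers named above.

```python
from concurrent.futures import ThreadPoolExecutor


def serial_saxpy(a, x, y):
    # Affine loop: iteration i touches only x[i], y[i] and out[i],
    # so there are no dependences between iterations.
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out


def parallel_saxpy(a, x, y, workers=4):
    # Same loop, with the iteration space split into contiguous chunks
    # that are executed independently of one another.
    n = len(x)
    out = [0.0] * n

    def run_chunk(bounds):
        lo, hi = bounds
        for i in range(lo, hi):
            out[i] = a * x[i] + y[i]

    step = max(1, (n + workers - 1) // workers)  # ceiling division, at least 1
    chunks = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(run_chunk, chunks))  # wait for every chunk to finish
    return out
```

The sketch demonstrates only the partitioning of independent iterations. In standard CPython, threads running pure-Python code do not execute simultaneously, so no speedup should be expected from this particular form; a parallelizer operating on compiled code would map the same chunks onto separate cores.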
Automatic parallelization in a compiler is an alternative to rewriting code by hand. However, such an idealized automatic parallelizer has been elusive. The reality is that most commercial parallelizing compilers have not implemented the parallelization technologies developed in research, keeping their benefits out of reach. This lack of real-world adoption is because of practical difficulties, like the need to repeatedly implement complex parallelizing technologies in multiple compilers from various vendors, each further specialized to different source languages, for a given ISA. Since each compiler only has a small fraction of the total compiler market, compiler implementation of parallelization is not economically viable.
Despite all the advances in research in automatic parallelization, resulting in several prototype research compilers, commercial adoption of these technologies has been very limited. Indeed, very few commercial compilers available today for commodity processors use any automatic parallelization techniques. Possible reasons for this include the complexity of parallelization technologies. Automatic parallelization methods can be very complex and mathematical, and take significant effort to implement. Also, automatic parallelization methods for compilers must conventionally be re-implemented in every compiler. The total market for such a parallelizer is divided among the many compilers typically available for most instruction sets. For example, the ARM instruction set has over 30 compilers available for it, and the x86 has a similarly large number of compilers from different vendors and (sometimes) different source languages. The high expense of repeated compiler implementation often cannot be supported by the small fraction of the demand for just that compiler. Additionally, there is a widespread belief that non-scientific programs do not have much parallelism. Hence most non-scientific code developers are content with good serial compilers. This low demand has resulted in little incentive for compiler companies to pay for the significant investment needed to build a parallelizing compiler.
Parallelizing compilers in years past were often evaluated by their ability to exploit scalable parallelism as the number of processors was scaled to large numbers, such as 32 or 64 processors. Typically only some scientific codes met this test of success; for other (typically non-scientific) codes, automatic parallelization was deemed to have “failed,” since their speedups were low and did not scale.
There does not appear to be any prior existing method or apparatus that can rewrite a binary program while performing automatic parallelization of the input binary program, and writing the thus parallelized binary program as output. There also does not appear to be any automatic parallelizer inside of an object code program rewriter.