The present invention relates to an apparatus for optimizing a program including multiple instructions, by dynamically rewriting the program. The present invention also relates to an optimization method, as well as to a computer-readable optimizing program for implementing the method.
Programs that cause computers to perform predetermined operations are written in programming languages. The programming languages include: a machine language, which computers can directly interpret and execute; a low-level language such as an assembly language, which corresponds one-to-one to the machine language and is close to it; and a high-level language such as C, C++, or Java (registered trademark), which does not correspond one-to-one to the machine language and is written in a text format more easily understood by human beings.
Computers cannot directly interpret and execute a program written in an assembly language or a high-level language. For this reason, software called an assembler or a compiler translates the program into the machine language, which the computers can interpret directly, and the computers then execute the translated program.
Programs in the assembly language are used, for example, in cases where: an application with a constraint on execution speed or program size needs to be optimized beyond what the compiler can achieve; a programmer needs to control CPU operation; or a resource such as memory capacity or arithmetic execution speed is limited. The assembly language is also used for development of kernels, device drivers, and the like.
Programs in the high-level language are widely used because they can be written in a text format. Such programs are translated into the machine language by a compiler. In this process, the programs are first translated into the assembly language and output, and are then translated into the machine language by an assembler.
The assembler includes a type called an inline assembler, which allows a description in the assembly language to be included in a program written in the high-level language. The inline assembler optimizes a program by replacing a portion of the program that occupies most of the execution time with a description in the assembly language and thus achieves an execution speed close to that of the program written in the assembly language.
Heretofore, programs have been optimized to increase the execution speeds thereof by using the programs written in the assembly language described above or by using the inline assembler. As a program optimization technique, a technique to dynamically rewrite a binary sequence including multiple instructions is known.
There is a technique to dynamically rewrite a binary sequence by using a special CPU instruction such as a compare-and-swap (CAS) instruction, for example. With this technique, the value of a certain memory location is read and saved, a new value is calculated on the basis of that value, and the new value is then stored in the memory location. At the time of rewriting, the technique checks whether the memory location still holds the value used in the calculation; if it does not, the processing is performed again from the beginning. Accordingly, a conflict between processors can be prevented, and the memory location can be atomically checked and changed.
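The read, calculate, and conditionally store cycle described above can be sketched as follows. This is only an illustrative simulation, not the referenced technique itself: Python has no CAS instruction, so the hypothetical MemoryWord class uses a lock to stand in for the atomicity that a CPU provides in hardware, and the names MemoryWord, atomic_update, and run_demo are assumptions introduced for this example.

```python
import threading


class MemoryWord:
    """A single memory location with a simulated atomic compare-and-swap.

    Assumption: the lock stands in for the atomicity that a real CPU
    provides in hardware; Python has no native CAS instruction.
    """

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the word still holds `expected`."""
        with self._lock:
            if self._value != expected:
                return False  # another thread changed the word
            self._value = new
            return True


def atomic_update(word, compute):
    """Read, compute, and write back; start over if the word changed."""
    retries = 0
    while True:
        old = word.load()        # read and save the current value
        new = compute(old)       # calculate the new value from it
        if word.compare_and_swap(old, new):
            return retries       # the saved value was still current
        retries += 1             # conflict: redo from the beginning


def run_demo(threads=4, updates_per_thread=1000):
    """Increment one shared word from several threads and return the total."""
    word = MemoryWord(0)

    def worker():
        for _ in range(updates_per_thread):
            atomic_update(word, lambda v: v + 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return word.load()


if __name__ == "__main__":
    print(run_demo())
```

Because every failed comparison restarts the update from the read step, no increment is lost even when several threads update the same word. The sketch also makes the limitation visible: each update covers exactly one word.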
Furthermore, there is a technique to replace a load and store instruction with a branch instruction to branch to the patch area (refer to Bowen Alpern, Mark Charney, Jong-Deok Choi, Anthony Cocchi, Derek Lieber, “Dynamic Linking on a Shared-Memory Multiprocessor,” [online], 1999, the Internet <URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23.8877&rep=rep1&type=pdf>, for example). This technique is for backpatching instructions in a symmetric multi-processing environment. With this technique, only the first word of the original code is changed, and a synchronization protocol is used to ensure that all processors eventually see the backpatched code.
To put it more specifically, “nop,” which is the first word shown in FIG. 1A, is changed to “jmp Label3” as shown in FIG. 1B. Here, “nop” is an instruction that means no operation and is placed as a dummy in a location where an instruction is planned to be added later. In addition, “jmp” is a branch instruction. Branch instructions include a conditional branch, which branches only if a certain condition is true, and an unconditional branch, which always branches. FIG. 1B shows an unconditional branch to “Label3.” An instruction sequence in the machine language is executed sequentially. Thus, before the change, the instructions are executed in the order of “Label1,” “Label2,” “Label3” and then back to “Label2.” With the aforementioned change, however, “Label3” is executed first while “Label1” and “Label2” are skipped, and then “Label2” is executed. Thus, optimization of the program can be achieved.
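The effect of changing only the first word from “nop” to an unconditional branch can be illustrated with a toy interpreter. The sketch below is a simplified assumption and does not reproduce the layout of FIG. 1A or FIG. 1B: the instruction tuples, the “work” and “ret” operations, and the functions make_program, backpatch_first_word, and run are all hypothetical names chosen for illustration.

```python
def make_program():
    """Build a hypothetical instruction sequence whose first word is a nop."""
    return [
        ("nop",),               # word 0: placeholder to be patched later
        ("label", "Label1"),
        ("work",),
        ("label", "Label2"),
        ("work",),
        ("ret",),
        ("label", "Label3"),
        ("work",),
        ("jmp", "Label2"),      # unconditional branch back to Label2
    ]


def backpatch_first_word(program, target):
    """Change only the first word of the code into an unconditional branch."""
    program[0] = ("jmp", target)


def run(program, max_steps=1000):
    """Execute the toy program and return the labels visited, in order."""
    labels = {ins[1]: i for i, ins in enumerate(program) if ins[0] == "label"}
    visited = []
    pc = 0
    for _ in range(max_steps):
        if pc >= len(program):
            break
        op = program[pc]
        if op[0] == "label":
            visited.append(op[1])
        elif op[0] == "jmp":
            pc = labels[op[1]]  # branch to the named label
            continue
        elif op[0] == "ret":
            break
        pc += 1                 # "nop" and "work" fall through
    return visited


if __name__ == "__main__":
    program = make_program()
    print(run(program))         # execution order before the patch
    backpatch_first_word(program, "Label3")
    print(run(program))         # execution order after the patch
```

In this assumed layout, the unpatched program visits Label1 and then Label2, whereas the patched program visits Label3 first and then Label2. Only a single word is modified, which is why the technique cannot rewrite a sequence of multiple instructions.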
Furthermore, there is a technique to dynamically rewrite a binary sequence, which is an instruction sequence, by using a hardware transactional memory (HTM) (refer to United States (US) Patent Application Publication No. 2009/0006750 Description, for example). In multi-core processors, a shared-memory parallel program in which threads executed in parallel share and handle data is often used. Here, the techniques to prevent an access conflict include a technique that uses a lock operation and a technique called HTM, which uses no lock operation.
The HTM is a transactional memory implemented as hardware. With the HTM, instead of locking a shared resource in advance, each thread in execution holds its own copy locally. Then, after confirming at the end of processing that the value of the shared resource has not changed from the value of the copy held locally, the thread writes the results of the processing at once. Here, if another thread has rewritten and changed the value in the shared resource, the thread discards the completed processing and then performs the processing again. Here, a transaction refers to a series of processing operations executed by a single thread.
With this technique, when a binary sequence is rewritten, the binary sequence is written by a transaction. Thus, unless the processing of a transaction conflicts with the processing of another thread, the binary sequence can be rewritten. In addition, even if there is a conflict, the processing is re-executed from the beginning, so that the binary sequence can be rewritten unless a conflict occurs during the re-execution.
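The copy, validate, and write-at-once behavior described above can be approximated in software as follows. This is a sketch under stated assumptions and not an HTM implementation: a version counter checked under a lock emulates the conflict detection that the hardware would perform, and the CodeRegion class, the rewrite_transactionally function, and the on_attempt hook are hypothetical names introduced only for this example.

```python
import threading


class CodeRegion:
    """A rewritable instruction sequence guarded by a version counter.

    Assumption: the version check under a lock emulates in software the
    conflict detection that an HTM performs in hardware.
    """

    def __init__(self, instructions):
        self.instructions = list(instructions)
        self.version = 0
        self._commit_lock = threading.Lock()

    def snapshot(self):
        """Return the current version and a private copy of the sequence."""
        with self._commit_lock:
            return self.version, list(self.instructions)

    def commit(self, seen_version, new_instructions):
        """Publish all instructions at once if nobody wrote in between."""
        with self._commit_lock:
            if self.version != seen_version:
                return False    # conflict: the transaction aborts
            self.instructions = list(new_instructions)
            self.version += 1
            return True


def rewrite_transactionally(region, transform, on_attempt=None):
    """Rewrite the whole sequence, re-executing from the start on conflict."""
    attempts = 0
    while True:
        attempts += 1
        seen_version, local_copy = region.snapshot()  # local copy
        new_instructions = transform(local_copy)      # work on the copy
        if on_attempt is not None:
            on_attempt(attempts)   # hypothetical hook to inject a conflict
        if region.commit(seen_version, new_instructions):
            return attempts


def demo():
    """Force one conflict, then let the retried rewrite succeed."""
    region = CodeRegion(["load", "add", "store"])

    def interfere(attempt):
        if attempt == 1:           # another writer commits first
            version, code = region.snapshot()
            region.commit(version, code + ["nop"])

    attempts = rewrite_transactionally(
        region, lambda code: [ins.upper() for ins in code], interfere)
    return attempts, region.instructions


if __name__ == "__main__":
    print(demo())
```

In the demo, the first attempt is discarded because a conflicting write is committed before it, and the second attempt rewrites the entire sequence in one step. Unlike the single-word approaches, all instructions in the region are replaced together.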
However, the above techniques have their shortcomings. First, with the aforementioned technique to dynamically rewrite a binary sequence by using the CAS instruction, at most one or two instructions can be rewritten. With the CAS technique, a binary sequence including multiple instructions cannot be dynamically rewritten.
Second, with the aforementioned technique of the Alpern reference, which uses replacement with the branch instruction, other problems exist. For example, as the number of branch instructions increases, so does the code size. Thus, instruction cache misses increase. Moreover, with this technique as well, only one instruction is rewritten. Thus, a binary sequence including multiple instructions cannot be dynamically rewritten.
Third, with the aforementioned technique using the HTM of the 2009/0006750 reference, a binary sequence can be dynamically rewritten, but a function call cannot be included in a rewrite target binary sequence. For example, consider a case where thread 2 rewrites a binary sequence while thread 1 is executing the original binary sequence. If thread 1 is at that moment executing a function called from the binary sequence (a call destination function), the rewriting by thread 2 does not fail, because thread 1 is not executing the original binary sequence itself at this point. When thread 1 then returns to the call source, the binary sequence has already been rewritten. Accordingly, the correct execution result cannot be obtained.
Fourth, with respect to the optimization technique called “inlining,” which expands a function at the call site, other problems exist. With the inlining technique, for the purpose of preventing an increase in the code size, inlining is stopped once the code size reaches a certain threshold. In addition, inlining cannot be performed in a case where the entity of a call target function is unknown at the time of compiling, as in a language such as Java (registered trademark), which dynamically loads a class. For this reason, in order to inline a function call left without being inlined, the original function needs to be recompiled. Accordingly, there arises a problem that use of the inlining technique incurs a compile cost in this case.
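The two reasons a call can be left without being inlined may be illustrated by the following sketch. It is a simplified assumption and not the policy of any particular compiler: the function plan_inlining, its parameters, and the size figures are hypothetical, and a callee size of None is used here to represent a call target whose entity is unknown at the time of compiling.

```python
def plan_inlining(code_size, call_sites, size_threshold):
    """Decide which call sites to inline under a code-size threshold.

    call_sites is a list of (name, callee_size) pairs; callee_size is None
    when the call target is unknown at compile time (hypothetical encoding).
    Returns (inlined names, names left as calls, resulting code size).
    """
    inlined = []
    left_as_calls = []
    for name, callee_size in call_sites:
        if callee_size is None:
            # The target is not known yet, e.g. its class is not loaded.
            left_as_calls.append(name)
        elif code_size >= size_threshold:
            # The size threshold has been reached, so inlining has stopped.
            left_as_calls.append(name)
        else:
            code_size += callee_size    # expand the callee at the call site
            inlined.append(name)
    return inlined, left_as_calls, code_size


if __name__ == "__main__":
    sites = [("a", 30), ("b", None), ("c", 40), ("d", 10)]
    print(plan_inlining(100, sites, 150))
```

In this example, call “b” is left because its target is unknown and call “d” is left because the threshold has already been reached. Inlining either of them later would require recompiling the enclosing function, which is the compile cost noted above.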
Fifth, among the optimization techniques using the HTM, there is an optimization technique for optimizing processing involving a speculatively executed branch by removing a limitation due to control dependency from a hot trace. With this technique, a trace is generated by deleting a branch destination having an instruction causing an abort in a conditional branch at the time of compiling. Then, if the trace causes aborts frequently at the time of execution, the trace is discarded, and a new trace is generated by recompiling. Since recompiling is required in this case as well, there arises a problem that use of the technique incurs a compile cost.