1. Field of the Invention
The present invention relates to a computer program optimization method, and in particular, to a method for editing a program optimization area in order to efficiently perform optimization by rewriting a conditional branching portion of a program to obtain a portion for parallel execution.
2. Description of the Related Art
Generally, in the process during which the source code of a program written in a programming language is compiled, optimization of the program is performed in order to improve the speed at which the program can be executed by a computer.
A variety of optimization methods are in use today. For example, processors that are compatible with the IA-64 architecture, such as CPUs produced by Intel and Hewlett-Packard, and that are capable of performing parallel processing using VLIWs (Very Long Instruction Words), can execute predicated instructions and perform parallel processing at the instruction level. For these processors, a branching instruction can be deleted and the instruction groups at the branch destinations can be executed in parallel under complementary predicates (this conversion of an instruction group is hereinafter referred to as an IF conversion). As a result of an IF conversion, the number of instructions can be reduced and branch prediction failures can be avoided, and thus program execution efficiency can be increased.
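As an illustrative sketch only, the effect of an IF conversion can be modeled as follows. Python is used here merely as neutral pseudocode; the function names and the list of guarded operations are hypothetical and do not represent IA-64 instruction syntax. The first function uses a conditional branch, and the second issues the instructions of both arms under complementary predicates and commits only the result whose predicate is true.

```python
def run_branching(a, b):
    """Original form: a conditional branch selects one of two arms."""
    if a > b:
        r = a - b
    else:
        r = b - a
    return r


def run_predicated(a, b):
    """IF-converted form (hypothetical model): a compare sets a predicate p
    and its complement, the instructions of both arms are issued, and only
    the instruction whose predicate is true commits its result."""
    p = a > b                      # compare writes p and its complement
    guarded = [
        (p,     lambda: a - b),    # (p)  r = a - b
        (not p, lambda: b - a),    # (!p) r = b - a
    ]
    r = None
    for predicate, op in guarded:  # no branch: both slots are issued
        value = op()
        if predicate:              # commit only when the predicate holds
            r = value
    return r
```

Both forms compute the same value; the predicated form trades the branch, and the risk of a branch prediction failure, for the cost of issuing the instructions of both arms.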
An IF conversion may also deteriorate execution efficiency, depending on how the instructions for the conversion are selected. The causes of a deterioration in efficiency may be an excessive parallel level extending beyond the hardware limit, a rise in the register pressure, an inferior balance of the critical path length of the branching instruction, or the insertion of a rarely executed instruction.
In order to determine the program execution efficiency level while taking these factors into account, code scheduling must be executed for each branching instruction, both for the case where an IF conversion is performed and for the case where an IF conversion is not performed, and the number of actual instruction cycles must be estimated for each of the two cases and compared.
However, when the number of instruction cycles is estimated and compared for all the branching instructions in a program, the number of combinations becomes enormous, and termination of the estimation process within a realistic calculation time is not possible. Thus, an area for the execution of an optimization (e.g., hereinafter referred to as a hyperblock and shown using a quadrilateral) must be appropriately selected.
Therefore, conventionally, two methods are employed in order to determine within a realistic time whether an IF conversion should be performed. In a first method, an IF conversion is performed, as needed, only for branching instructions along the execution path for which the predicted execution probability is the highest (a main trace). In a second method, all the branching instructions are temporarily optimized by performing an IF conversion, and a conversion in the direction opposite to that of an IF conversion (a reverse IF-conversion) is performed, as needed, during list scheduling in order to reproduce the branching instructions.
The conventional technique for performing an IF conversion, as needed, only for branching instructions in the main trace (the first method) is disclosed in "Effective Compiler Support For Predicated Execution Using The Hyperblock" (S. A. Mahlke, R. E. Hank, R. A. Bringmann, in Proceedings of the 25th International Symposium on Microarchitecture, pp. 45-54, December 1992).
The conventional technique of the first method uses a heuristic to determine, step by step, the area for which parallel execution and an IF conversion should be performed. According to the article, first, the main trace is specified, and an IF conversion is unconditionally performed for this path. Then, a check is performed to determine whether each path other than the main trace (a sub trace) should be included in the same parallel execution area, and the area for which an IF conversion is performed is gradually increased.
Whether an IF conversion should be performed for a specific branching instruction is determined by considering four conditions: whether the sub trace contains an instruction that will disturb the pipeline; the probability that the sub trace will be executed relative to the main trace; the ratio of the number of machine language instructions in the sub trace to the number in the main trace; and the limits of the parallel processing capabilities of the hardware.
According to this method, when the number of branching instructions in the main trace is denoted by n, only the calculation amount proportional to n is required to terminate the determination performed to ascertain whether an IF conversion should be performed.
The conventional technique for temporarily optimizing all the branching instructions by performing an IF conversion and for performing a reverse IF-conversion, as needed, during the list scheduling and the reproduction of the branching instructions (the second method) is disclosed in "A Framework for Balancing Control Flow And Predication" (D. I. August, W. W. Hwu, S. A. Mahlke, in Proceedings of the 30th International Symposium On Microarchitecture, December 1997).
According to this article, first, the overall program is defined as a single parallel execution area, and an IF conversion is performed for all the branching instructions. Then, various optimization processes are performed for the obtained program, and a reverse IF-conversion is selectively performed. As a result, a state is provided where an IF conversion has been selectively performed for the branching instructions.
According to the second method, in coordination with the code scheduler, the number of execution cycles for each branching instruction is obtained both for a case where a reverse IF-conversion has been performed and a case where a reverse IF-conversion has not been performed. Then, the execution performances are compared to determine whether a reverse IF-conversion should be performed.
However, when this method is employed for all the branching instructions in a function, the number of target instruction combinations for a reverse IF-conversion becomes enormous, just as in the determination of whether an IF conversion should be performed. Thus, a reverse IF-conversion is performed only when the critical path is to be scheduled in coordination with the list scheduler, so that an increase in the number of calculations is suppressed.
In the above-mentioned article by August et al., a method is proposed where scheduling is performed approximately 2n times, where n denotes the number of branching instructions. The critical path is the longest instruction string, in a specific range within a program, of the series of instruction strings that cannot be executed in parallel.
However, while the first method, for performing an IF conversion as needed only for branching instructions in the main trace, can improve the execution efficiency for the main trace, it does not especially affect execution efficiency relative to the sub trace. Therefore, overall execution efficiency for a program cannot always be improved.
Further, where there is no particular path whose execution probability is high and for which a main trace can be strictly specified, it is difficult to determine a path along which an IF conversion should be performed. Additionally, even when an IF conversion is performed for a path selected according to some criterion, the execution probabilities for the other paths are also high. Thus, the program execution efficiency cannot be satisfactorily increased.
Furthermore, according to the second method for temporarily optimizing all the branching instructions through an IF conversion, and for performing reverse IF-conversions as needed during the list scheduling and the reproduction of the branching instructions, the target path for a reverse IF-conversion is selected during the list scheduling. However, whether an IF conversion should be performed (in this case, whether the branching instruction should be reproduced by a reverse IF-conversion) is determined for a predetermined path. Since this determination is made by executing code scheduling both for a case where an IF conversion is performed and for a case where an IF conversion is not performed, and by comparing the results, a large number of calculations are required.
That is, in this conventional technique, just as in the determination of whether an IF conversion should be performed, a tradeoff exists in principle between the size of the area for which an IF conversion is employed and the speed of the compiler.
As described above, according to these conventional techniques, overall execution efficiency cannot be satisfactorily improved by a practical general-purpose compiler, especially by a language processing system that limits the compiling time, such as the Just-In-Time Compiler for Java.
Since satisfactory execution time information cannot be obtained by an actual general-purpose language processing system, it is difficult to specify an exact path to be optimized, such as a main trace. Thus, this problem is critical.
In view of the foregoing and other problems of the conventional systems and methods, therefore, it is an object of the invention to provide an optimization method for dividing a wide area into hyperblocks within a limited time, and for improving the execution efficiency for many paths having execution probabilities that are high.
To achieve this object, according to the invention, the estimated shortest processing time required when an overall predetermined area (e.g., instruction queue a) of a program is executed in parallel is recursively calculated using the estimated shortest processing time for each part (e.g., instruction queue b or c) of the pertinent area. Then, the execution efficiency of the instruction queue a is compared with the execution efficiency attained when the instruction queues b and c are sequentially executed. When the execution efficiency for the instruction queue a is lower, the instruction queues b and c are formed as independent hyperblocks, and overall, the area in the program that is to be optimized is divided into multiple, appropriate hyperblocks.
According to the present invention, an optimization method for converting source code for a program written in a programming language into machine language, and for optimizing the program, includes: employing a basic block as a unit to estimate an execution time for the program to be processed; generating a nested tree that represents the connections of the basic blocks using a nesting structure; when a node in the nested tree is accompanied by a conditional branch, employing the execution time estimated by using the basic blocks as units to obtain an execution time at the node of the program both for a case where the conditional branching portion of the program is directly executed and for a case where the conditional branching portion is executed in parallel; and defining the node as one parallel execution area group when, as a result of the estimate, the execution time required for the parallel execution is shorter, or dividing multiple child nodes of the node into multiple parallel execution areas when the execution time required when the conditional branching portion is directly executed is shorter.
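The final step of this method, the choice between one parallel execution area group and multiple parallel execution areas, can be sketched as follows. This is a minimal illustration and not a definitive implementation: the function name `choose_areas` and its parameters are hypothetical, the two estimated times are assumed to be supplied by the preceding estimation step, and the handling of equal estimates (dividing) is an assumption not stated in the text.

```python
def choose_areas(node, children, t_parallel, t_branch):
    """Return the parallel execution areas chosen for `node`.

    t_parallel: estimated time when the conditional branching portion is
                executed in parallel (that is, after an IF conversion).
    t_branch:   estimated time when the conditional branching portion is
                directly executed.
    Both estimates are assumed to come from the basic-block-level estimate.
    """
    if t_parallel < t_branch:
        # Parallel execution is faster: the node is one area group.
        return [[node]]
    # Otherwise (ties included, by assumption) each child node becomes
    # its own parallel execution area.
    return [[child] for child in children]
```

For example, `choose_areas("a", ["b", "c"], 4, 6)` keeps `a` as a single area, whereas `choose_areas("a", ["b", "c"], 7, 6)` divides it into `b` and `c`.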
According to the optimization method, especially when the program is to be executed by a computer that can execute the predicated instruction and that can perform parallel execution at an instruction level, if the parallel execution at the instruction level is faster than the execution of the conditional branching portion, the conditional branching portion can be rewritten as a parallel execution portion.
The process of estimating the execution time for the program using the basic block as a unit includes employing the thus estimated execution time to obtain the critical path length for the program portion in the basic block, and the average parallel level of the program portion.
The basic block can be represented as a rectangle whose sides consist of the critical path length and the average parallel level. This rectangle can be deformed within a range where the side corresponding to the critical path length is not shorter than the critical path length.
The critical path length obtained for the basic block is the length of the longest instruction queue in the basic block that must be executed in order in accordance with the dependency. The average parallel level is the value obtained by dividing the number of all the instructions (corresponding to the total execution time) by the critical path length. That is, the average parallel level represents the parallel level that is required when the instruction queues for the basic block are executed in parallel while the critical path length is maintained.
When the average parallel level is defined in this manner and multiple basic blocks are executed in parallel, the required parallel level can be approximated by the linear sum of the average parallel levels of the basic blocks.
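As a minimal sketch of these two quantities, assuming that every instruction takes one cycle and that the dependencies inside a basic block are given as a mapping from each instruction to the instructions on which it depends (the function name `block_metrics` and the instruction labels are hypothetical):

```python
from functools import lru_cache


def block_metrics(deps):
    """Return (critical_path_length, average_parallel_level) for one block.

    deps maps each instruction to the instructions it depends on and must
    describe an acyclic dependency graph. Unit latency is an assumption
    made for this sketch.
    """
    @lru_cache(maxsize=None)
    def depth(instr):
        # Length of the longest dependency chain that ends at `instr`.
        return 1 + max((depth(d) for d in deps[instr]), default=0)

    critical = max(depth(i) for i in deps)
    return critical, len(deps) / critical


# Six instructions; the longest dependency chain is i1 -> i2 -> i4.
example = {
    "i1": [], "i2": ["i1"], "i3": ["i1"],
    "i4": ["i2", "i3"], "i5": [], "i6": [],
}
```

For `example`, the critical path length is 3 and the average parallel level is 6 / 3 = 2.0, so two instructions per cycle suffice, on average, to finish the block in three cycles.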
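The rectangle model and the linear-sum approximation can be sketched as follows. The sketch assumes that deforming the rectangle preserves its area (the number of instructions) and rounds the resulting length up to whole cycles; both points, as well as the function names, are assumptions made for illustration.

```python
import math


def deformed_length(critical_path, avg_level, width):
    """Length of a block's rectangle when it is squeezed to `width`.

    The area (critical_path * avg_level, the instruction count) is assumed
    to be preserved, and the length never drops below the critical path.
    """
    area = critical_path * avg_level
    return max(critical_path, math.ceil(area / width))


def required_parallel_level(blocks):
    """Approximate parallel level needed to run the blocks side by side:
    the linear sum of their average parallel levels.

    blocks is a list of (critical_path, avg_level) pairs.
    """
    return sum(level for _, level in blocks)
```

For a block with a critical path length of 3 and an average parallel level of 2.0, `deformed_length(3, 2.0, 1)` gives 6 (fully serialized), while any width of 2 or more gives 3, since the length cannot fall below the critical path length.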
The generating of the nested tree includes generating a dependency graph that represents the dependency existing between the basic blocks, generating a preceding limitation graph by removing redundant branches from the dependency graph, and generating the nested tree in the preceding limitation graph by using a nesting structure to represent a connection consisting of node relationships.
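The removal of redundant branches can be sketched as a transitive reduction of the dependency graph. Interpreting a "redundant branch" as an edge that is already implied by a longer path is an assumption of this sketch, as are the function name and the graph representation (a mapping from each basic block to the set of its direct successors).

```python
def transitive_reduction(graph):
    """Remove edges of an acyclic graph that are implied by longer paths.

    graph maps each node to the set of its direct successors.
    """
    def reachable(start):
        # All nodes reachable from `start` by one or more edges.
        seen, stack = set(), list(graph[start])
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(graph[n])
        return seen

    reduced = {}
    for u, succs in graph.items():
        keep = set(succs)
        for v in succs:
            # Whatever v can reach is reachable from u through v, so a
            # direct edge from u to such a node is redundant.
            keep -= reachable(v)
        reduced[u] = keep
    return reduced


# Diamond-shaped dependency: A -> D is implied by A -> B -> D.
dependency = {"A": {"B", "C", "D"}, "B": {"D"}, "C": {"D"}, "D": set()}
```

Applied to `dependency`, the reduction drops only the edge from `A` to `D`, leaving the diamond `A -> {B, C} -> D`.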
The nodes of the nested tree are the basic blocks of the program, series suites, and parallel suites. A series suite is a connection in series of basic blocks or other suites between which a dependency exists, and a parallel suite is a parallel arrangement of basic blocks or other suites that exhibit no dependency. That is, when the basic blocks or suites are grouped as series suites or parallel suites, this arrangement represents the nesting structure of the program.
The determining of the execution time for the conditional branching portion includes obtaining, for each of the parallel levels executable for the child node, the maximum execution time required when the child node is executed in parallel, and regarding a specific value derived from the maximum execution times for the parallel levels as the execution time that is required when the conditional branching portion is executed in parallel.
The determining of the execution time for the conditional branching portion also includes, before the determination of the execution time, correcting information concerning the execution times for the basic blocks by employing the dependency at the instruction level between the basic blocks that serve as the child nodes.
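One possible representation of such a nested tree is sketched below. The class names and the `shape` function are hypothetical. The combination rule in `shape` (lengths add and levels take the maximum in series; lengths take the maximum and levels add in parallel) is an assumption made for illustration, consistent with the linear-sum approximation of the parallel level.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Block:
    name: str
    critical_path: int
    avg_level: float


@dataclass
class Series:
    children: List["Node"]   # executed one after another (dependent)


@dataclass
class Parallel:
    children: List["Node"]   # no dependency between the children


Node = Union[Block, Series, Parallel]


def shape(node: Node) -> Tuple[int, float]:
    """Return an approximate (length, parallel level) for a subtree."""
    if isinstance(node, Block):
        return node.critical_path, node.avg_level
    shapes = [shape(child) for child in node.children]
    lengths = [length for length, _ in shapes]
    levels = [level for _, level in shapes]
    if isinstance(node, Series):
        return sum(lengths), max(levels)
    return max(lengths), sum(levels)


# An if-then-else diamond: A, then B and C side by side, then D.
tree = Series([
    Block("A", 2, 1.0),
    Parallel([Block("B", 3, 2.0), Block("C", 1, 1.0)]),
    Block("D", 2, 1.0),
])
```

For `tree`, the parallel suite of `B` and `C` has shape (3, 3.0), and the whole series suite has shape (7, 3.0).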
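One way to read this step is sketched below for a node with two child nodes. Several points are assumptions made for illustration and are not stated in the text: the hardware parallel level is shared between the two children, each child is modeled as an area-preserving rectangle, and the "specific value" is taken to be the minimum over all splits. The function names are hypothetical.

```python
import math


def length_at(block, width):
    """Execution time of a (critical_path, avg_level) block at `width`,
    using the area-preserving rectangle model (an assumption)."""
    critical_path, avg_level = block
    return max(critical_path, math.ceil(critical_path * avg_level / width))


def parallel_time(left, right, hw_width):
    """Estimated time when two child nodes are executed in parallel.

    For every split of the hardware parallel level (hw_width >= 2) between
    the children, the maximum of the two execution times is obtained; the
    minimum over all splits is returned (the choice of the minimum as the
    "specific value" is an assumption).
    """
    candidates = []
    for w in range(1, hw_width):
        candidates.append(max(length_at(left, w), length_at(right, hw_width - w)))
    return min(candidates)
```

With `left = (3, 2.0)` and `right = (2, 1.0)`, a hardware parallel level of 3 yields 3 cycles (two slots for `left`, one for `right`), whereas a hardware parallel level of 2 yields 6 cycles because `left` must be fully serialized.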
Specifically, for the basic blocks forming the series suite, when the dependency is established between an instruction other than the last instruction in a basic block and the first instruction in the following basic block, the series suite can be executed with a length (the critical path length of the series suite) that is shorter than that of the simple connection, in series, of the basic blocks.
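A minimal sketch of this correction for two basic blocks follows, assuming unit latency, sufficient hardware parallelism, and a single dependency from one instruction of the first block to the first instruction of the second block. The function name and the parameter `producer_depth` are hypothetical.

```python
def series_length(len_a, len_b, producer_depth):
    """Critical path length of block A followed by block B.

    len_a, len_b:   critical path lengths of A and B.
    producer_depth: 1-based position, along A's critical path, of the
                    instruction on which B's first instruction depends.
    """
    naive = len_a + len_b                       # simple connection in series
    overlapped = max(len_a, producer_depth + len_b)
    return min(naive, overlapped)
```

When the dependency comes from the last instruction of A (`producer_depth == len_a`), the result equals the simple sum; for example, `series_length(3, 2, 3)` is 5. When it comes from the first instruction, B can overlap with the rest of A, so `series_length(3, 2, 1)` is 3.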
Further, determining the parallel execution area for the program includes comparing, when multiple child nodes are sorted into multiple parallel execution areas, the execution times required for the child nodes when the child nodes are executed in parallel at the parallel level provided for the hardware, and defining, as independent parallel execution areas, child nodes other than the child node whose execution time is the shortest.
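This division step can be sketched as follows, assuming that the execution time of each child node at the hardware parallel level has already been estimated. The function name and the dictionary representation are hypothetical, and the treatment of ties (the first child with the shortest time is retained) is an assumption.

```python
def split_children(times):
    """Split child nodes into the retained child and independent areas.

    times maps each child node to its estimated execution time at the
    parallel level provided by the hardware. The child with the shortest
    time is retained; every other child becomes an independent parallel
    execution area.
    """
    shortest = min(times, key=times.get)
    independent = [child for child in times if child != shortest]
    return shortest, independent
```

For example, `split_children({"b": 5, "c": 3, "d": 7})` retains `c` and defines `b` and `d` as independent parallel execution areas.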
Furthermore, the present invention either can be provided as a storage medium, in which is stored a computer program that is prepared to permit a computer to employ the above described optimization methods for compiling a program, or can be provided as a transmission apparatus, for transmitting the computer program.
According to the invention, a compiler for converting source code for a program written in a programming language into machine language, and for optimizing the program, includes: a first code scheduler for estimating the execution time for the program by using basic blocks as units; a hyperblock generator for assembling the basic blocks to generate hyperblocks as parallel execution areas; an execution time estimation unit for supporting the generation of the hyperblocks by the hyperblock generator by estimating the processing time required when a predetermined area of the program is executed; and a second code scheduler for performing code scheduling for each of the generated hyperblocks. When a predetermined node of a nested tree, in which the connection of the basic blocks is represented using a nesting structure, is accompanied by a conditional branch, the execution time estimation unit estimates, based on the execution time that has been estimated by using the basic blocks as units, the execution time required for the node of the program both for a case where the conditional branching portion of the program is directly executed and for a case where the conditional branching portion is executed in parallel. The hyperblock generator defines, as one parallel execution area group, a node for which the execution time estimated for the parallel execution by the execution time estimation unit is shorter, and divides, into multiple parallel execution areas, the multiple child nodes of a node for which the execution time is shorter when the conditional branching portion is directly executed.
The first code scheduler employs the execution time estimated by using the basic blocks as units to obtain the critical path length for the program portion in the basic block and the average parallel level of the program portion.
Before the determination of the execution time, the execution time estimation unit corrects information concerning the execution times for the basic blocks based on the dependency, at the instruction level, between the basic blocks that constitute the child nodes.
The present disclosure relates to subject matter contained in Japanese Patent Application No. 2000-304618, filed Oct. 4, 2000, which is expressly incorporated herein by reference in its entirety.