The present invention relates to a system and method for optimizing program execution and more particularly to profiling of computation loops for dynamic optimization over subsequent runs of the profiled loops.
Computer systems having multiple processors allow computers to perform multiple tasks at the same time, thus reducing the time necessary to complete an operation. One type of multiprocessor system is a symmetric multiprocessor (SMP) system. Software designed to run on such systems is normally optimized to take advantage of the multiple processors. Typically, this involves coding parallel loops whose iterations are executed in parallel by a number of threads. This process is normally termed “parallelization”. For SMP systems, there are several ways to allow a program to use the parallel processors. A programmer may use automatic parallelization included as a compiler utility, parallelize the code by hand using directives such as those of a custom API, or use thread libraries such as POSIX threads. Automatic parallelization works at the level of loop nests, and can be very effective for programs that spend most of their time in nested loops.
Parallelization results in a certain amount of overhead. This overhead arises from setting up the parallel environment and from synchronizing at the end of the parallel segment. The potential benefits of parallelization can be realized only if this overhead is very small compared to the computation performed. Due to unknown loop bounds and the complexity of performance prediction, a compiler may parallelize some loops that will not benefit from parallelization. In addition, the user may have parallelized loops that do not benefit from parallelization because of the inherent overheads involved.
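By way of illustration only, the trade-off between overhead and computation may be expressed with an idealized cost model. The sketch below assumes that the work of the loop divides evenly across the threads and that setup and synchronization together cost a fixed overhead; the function name and its parameters are hypothetical and are introduced here solely for the example.

```python
def parallel_pays_off(serial_time, overhead, num_threads):
    """Return True if an idealized parallel run beats the serial run.

    Assumptions (not taken from the description above): the work divides
    evenly across the threads, and setting up the parallel environment
    plus the final synchronization cost a fixed `overhead`, expressed in
    the same time units as `serial_time`.
    """
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    parallel_time = serial_time / num_threads + overhead
    return parallel_time < serial_time
```

Under this model a loop whose serial time is comparable to the overhead does not benefit from parallelization, whereas a loop whose serial time is much larger than the overhead does.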
Thus, profiling was introduced in an attempt to identify and remove such parallelization in some cases. Profiling is an integral part of understanding and tuning an application for improving program performance and may be used to monitor the resource usage during the execution of the application program. A program profile is a characterization of the execution of a program. A profile may typically include the execution time, paging requirements, and cache misses for each subprogram in the application. These are typically some of the resources that are monitored using program profiling. The resource usage information collected by the executing program is then used to fine-tune the performance of the subprograms. This fine-tuning can be done either by the programmer manually or by the compiler. The profiling option ‘-p’ provided in UNIX® environments and the ‘PDF’ option available with the IBM® XL Fortran compilers are examples of programmer directed profiling. Using the profiling information, the compiler can generate code such that the paths that are executed most often are well optimized.
This form of profiling works in a multi-pass approach consisting of at least two passes. In the first pass, the profiling information is collected and that information is used to fine-tune the application for subsequent program executions. This type of approach is referred to as static profiling because the information gathered during the execution of the program is used after the program terminates.
There are, however, limitations to the static approach. The program is generally run once before optimization. The requirements for optimal performance determined from that run are then used on all subsequent program executions. This can lead to a problem if some of the loops in the program are data dependent, that is, if the choice between serial and parallel execution of the loops depends on the data set. In this case the programmer has to resort to dual or multi-path code based on the input data. The programmer thus chooses to execute the parallel version or the serial version of the application depending on the input data. This situation results in a familiar problem: the programmer may parallelize loops that should be run serially.
Serially run programs may also be optimized to run faster on uniprocessor systems. In this instance, loop unrolling may be applied at the compiler stage to generate faster executing code. Loop unrolling typically repeats the code in an inner loop a number of times. The number of times the code is replicated within the unrolled loop is termed the unroll factor. Once again one of the deficiencies of current optimizers is that once an optimum unroll factor is determined, it is used on all subsequent program executions. The unroll factor in a loop is dependent not only on the bounds of the loop but also on the machine characteristics which are difficult to model at compile time. At compile time, heuristics are used to select a loop unroll factor. Thus, the optimization is still programmer dependent and does not fully reflect changes in the dataset during execution of the application.
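For illustration, the effect of loop unrolling may be sketched as follows. The example is written in Python for readability only; in practice the transformation is applied by the compiler to generated code. The function names and the unroll factor of four are assumptions made for the example.

```python
def sum_rolled(values):
    """Original loop: one element is processed per iteration."""
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total


def sum_unrolled_by_4(values):
    """Same loop with the body replicated four times (unroll factor 4)."""
    total = 0
    n = len(values)
    limit = n - (n % 4)  # largest multiple of 4 not exceeding n
    for i in range(0, limit, 4):
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
    # Remainder loop handles the elements left over when n is not a
    # multiple of the unroll factor.
    for i in range(limit, n):
        total += values[i]
    return total
```

Both functions compute the same result; the unrolled version performs fewer loop-control steps per element, at the cost of a larger loop body and a remainder loop.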
It is an object of the present invention to obviate and mitigate some of these disadvantages.
The invention seeks to provide a solution to the problem of optimizing computation loops which are data dependent.
In accordance with this invention there is provided a method for optimizing a computer program, comprising the steps of executing an application program, profiling a loop of the executing program to determine a parameter for the loop, comparing the parameter to a threshold value for the loop and flagging the loop for applying an optimization on subsequent execution of the loop depending on said comparing step. Said method may also be provided wherein said optimization comprises a serialization of the loop. Further, said methods may be provided wherein said optimization comprises a parallelization of the loop. The above methods may also be provided wherein the loop includes a parallel and a serial version, and said serialization is a selection of the serial version for execution, and said parallelization is the selection of the parallel version for execution. The methods may also be provided wherein said step of profiling includes sampling the loop at a predetermined frequency. Said step of profiling may also include measuring an execution time of the loop. Further, the threshold value may include a sequential threshold value and a parallel threshold value, said sequential threshold value for determining an execution time above which a serially executing loop will be parallelized, and said parallel threshold value for determining an execution time below which a parallel executing loop will be serialized. The above methods may also be provided wherein said program executes on a computer system having a plurality of processors, and said optimization comprises the selection of a number of processors for execution of the loop and also wherein said threshold value is a preferred execution time for the number of processors selected. Also, the above method may further comprise the step of compiling the program with a plurality of unroll factors prior to execution and wherein said optimization comprises selection of one of said unroll factors for the loop.
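As a non-limiting sketch, the profiling, comparing and flagging steps, together with the sequential and parallel threshold values, might be arranged as shown below. The names ProfiledLoop and choose_next_version, the injectable clock, and the use of plain callables for the serial and parallel versions are assumptions made for the example and are not taken from the description above. The sketch times every execution of the loop; sampling at a predetermined frequency is not shown.

```python
import time

SERIAL = "serial"
PARALLEL = "parallel"


def choose_next_version(current, elapsed, seq_threshold, par_threshold):
    """Compare the measured execution time with the threshold values.

    A serially executing loop that ran longer than `seq_threshold` is
    flagged for parallel execution; a parallel executing loop that ran
    shorter than `par_threshold` is flagged for serial execution.
    Otherwise the current flag is kept.
    """
    if current == SERIAL and elapsed > seq_threshold:
        return PARALLEL
    if current == PARALLEL and elapsed < par_threshold:
        return SERIAL
    return current


class ProfiledLoop:
    """Holds a serial and a parallel version of a loop plus its flag."""

    def __init__(self, serial_fn, parallel_fn, seq_threshold,
                 par_threshold, clock=time.perf_counter):
        self.versions = {SERIAL: serial_fn, PARALLEL: parallel_fn}
        self.seq_threshold = seq_threshold
        self.par_threshold = par_threshold
        self.clock = clock   # hypothetical hook; eases deterministic tests
        self.flag = SERIAL   # version selected for the next execution

    def run(self, *args):
        body = self.versions[self.flag]
        start = self.clock()
        result = body(*args)
        elapsed = self.clock() - start
        # Flag the loop for its subsequent execution.
        self.flag = choose_next_version(
            self.flag, elapsed, self.seq_threshold, self.par_threshold)
        return result
```

In use, a loop would be wrapped once, for example as ProfiledLoop(serial_fn, parallel_fn, 2.0, 1.0), and each call to run would execute the currently flagged version and update the flag. Keeping the two threshold values distinct allows a band of execution times within which the flag does not change.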
There is also provided a computer system for optimizing program execution, comprising means for executing an application program; means for profiling a loop of the executing program to determine a parameter for the loop; means for comparing the parameter to a threshold value for the loop; and means for applying an optimization on subsequent execution of the loop depending on a result of said comparing means. The above computer system may also be provided wherein said optimization comprises a serialization of the loop. Further, said optimization may also comprise a parallelization of the loop. The computer system may also be provided wherein the loop includes a parallel and a serial version, and said serialization is a selection of the serial version for execution, and said parallelization is the selection of the parallel version for execution. The computer system may also be provided wherein said means for profiling includes means for sampling the loop at a predetermined frequency. Said means for profiling may also include means for measuring an execution time of the loop. The threshold value may also include a sequential threshold value and a parallel threshold value, said sequential threshold value for determining an execution time above which a serially executing loop will be parallelized, and said parallel threshold value for determining an execution time below which a parallel executing loop will be serialized. Further, the computer system may include a plurality of processors, and said optimization comprises the selection of a number of processors for execution of the loop. The computer system may also be provided wherein said threshold value is a preferred execution time for the number of processors selected. And, the computer system may further comprise means for compiling the program with a plurality of unroll factors prior to execution and wherein said optimization comprises selection of one of said unroll factors for the loop.
There is also provided an article of manufacture comprising a computer usable medium having computer readable program code embodied therein for optimizing program execution in a computer system, the computer readable program code in said article of manufacture comprising computer readable program code configured to cause a computer system to execute an application program, computer readable program code configured to cause a computer system to profile a loop of the executing program to determine a parameter for the loop, computer readable program code configured to cause a computer system to compare the parameter to a threshold value for the loop and computer readable program code configured to cause a computer system to flag the loop for applying an optimization on subsequent execution of the loop depending on said comparing code. The above article of manufacture may also be provided wherein said optimization comprises a serialization of the loop. Further, the article of manufacture may be provided wherein said optimization comprises a parallelization of the loop. There may also be provided an article of manufacture wherein the loop includes a parallel and a serial version, and said serialization is a selection of the serial version for execution, and said parallelization is the selection of the parallel version for execution. Further, said computer readable program code configured to cause a computer system to profile may include computer readable program code configured to cause a computer system to sample the loop at a predetermined frequency. Further, said computer readable program code configured to cause a computer system to profile may include computer readable program code configured to cause a computer system to measure an execution time of the loop. 
Also, the threshold value may include a sequential threshold value and a parallel threshold value, said sequential threshold value for determining an execution time above which a serially executing loop will be parallelized, and said parallel threshold value for determining an execution time below which a parallel executing loop will be serialized. There may also be provided the above article of manufacture wherein said computer system includes a plurality of processors, and said optimization comprises the selection of a number of processors for execution of the loop. The article of manufacture may be provided wherein said threshold value is a preferred execution time for the number of processors selected. And, the article of manufacture may further comprise computer readable program code configured to cause a computer system to compile the program with a plurality of unroll factors prior to execution and wherein said optimization comprises selection of one of said unroll factors for the loop.
There is also provided an article of manufacture comprising a computer usable medium having computer readable program code embodied therein for optimizing program execution in a computer system, the computer readable program code in said article of manufacture comprising computer readable program code configured to cause a computer system to execute an application program; computer readable program code configured to cause a computer system to monitor a parameter of a loop of said executing program; computer readable program code configured to cause a computer system to compare the monitored parameter to a threshold value for the loop; and computer readable program code configured to cause a computer system to apply an optimization on subsequent execution of the loop depending on a result of said comparison code.