1) Field of the Invention
The present invention relates to a method for measuring the execution time, iteration count, etc., of a specific portion of a program during its execution and for collecting the measured values as profile data so as to grasp the behavior of the program.
2) Description of the Related Art
Generally, programmers need a profiling tool in order to better understand the behavior of their programs. The profiling tool is a means for obtaining and reporting, as profile data, specific instruction execution times measured by counters or the like, as well as timing information at the procedure, basic block, and/or instruction level.
Specific program portions to be subjected to profiling are loops, timed regions, and conditionals. For loop profiling, the execution time and iteration count of a designated loop in a program are measured as profile data. For timed region profiling, the execution time of a designated timed region in the program is measured as profile data. For conditional (conditional branch) profiling, the results of judgment regarding a designated condition in the program are processed to obtain statistical information as profile data.
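The loop and conditional profile data described above can be illustrated by the following minimal sketch. The class `ProfileData`, its methods, and the identifiers `loop_1` and `cond_1` are hypothetical names introduced only for illustration and do not appear in the original disclosure.

```python
from collections import defaultdict


class ProfileData:
    """Hypothetical container for loop and conditional profile data."""

    def __init__(self):
        # Iteration count per designated loop.
        self.loop_iterations = defaultdict(int)
        # Per designated condition: [times false, times true].
        self.branch_outcomes = defaultdict(lambda: [0, 0])

    def count_iteration(self, loop_id):
        self.loop_iterations[loop_id] += 1

    def record_condition(self, cond_id, result):
        # Record the judgment result and pass it through unchanged.
        self.branch_outcomes[cond_id][1 if result else 0] += 1
        return result

    def true_ratio(self, cond_id):
        false_count, true_count = self.branch_outcomes[cond_id]
        total = false_count + true_count
        return true_count / total if total else 0.0


profile = ProfileData()
for i in range(10):
    profile.count_iteration("loop_1")               # loop profiling
    if profile.record_condition("cond_1", i % 4 == 0):  # conditional profiling
        pass
```

In this sketch the statistical information for a condition is simply the fraction of evaluations that were true; other statistics could be derived from the same two counters.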
When the above-described profiling is performed for a program executed by a single processor, the profiling is conventionally performed as follows. A profiling option is specified for a sequential compiler when the program is compiled by the compiler. As a result, during the sequential compilation, instrumentation codes for designating a portion to be measured for collection of profile data are inserted (instrumented) into the codes constituting the program. Then, the above-described measurement is performed to collect profile data.
In contrast, when a sequential program written by a user is executed by a plurality of processors in parallel, the profiling is performed as shown in FIG. 12. First, a program written by using an original code [input Fortran: e.g., HPF (High Performance Fortran) code] is inputted (step A1). Then, compilation is performed for the program so as to convert it to a form suitable for parallel processing (step A2), and the converted program is outputted in a target language code [output Fortran: e.g., Fortran 90] (step A3). In this specification, the original code will sometimes be referred to as an "original user code" or an "original source code".
After that, sequential compilation is performed for each of the programs for the processors so as to generate a program executable by the corresponding processor (step A4). Then, parallel processing is performed by the plurality of processors in accordance with the respective programs (step A5). During the compilation of the original code in step A2, code transformation is performed to realize faster processing, i.e., optimization is performed.
Even when profiling is performed in a parallel processing system, instrumentation codes for identifying the measurement start point and the measurement end point of an appropriate construct of the program must be inserted (instrumented) into the original code of the program so as to collect profile data. When the insertion of instrumentation codes is performed at the beginning of a compilation so as to carry out profiling, a possibility arises that the instrumentation slightly alters the optimization performed by the compiler. Therefore, the insertion of instrumentation codes must be performed after completion of all code conversions.
In the case where the above-described instrumentation codes are inserted, it is easy to map profile data that have not undergone the sequential compilation (step A4) to corresponding profile data that have undergone the sequential compilation. However, it is difficult to map profile data that have not undergone the compilation in which the above-described optimization is performed (step A2) to corresponding profile data that have undergone this compilation, because the original code is transformed by the compilation in step A2. Since a user inputs a program using the original code (input Fortran), the user cannot effectively utilize profiling results (i.e., cannot reflect the profiling results in the preparation of programs) if the profiling results cannot be obtained in a form corresponding to the original code.
When the above-described optimized code is debugged, the optimized code converted from the original code must be mapped to the user-written original code. To that end, a technique for causing a compiler to hold detailed histories of code transformations which will be debugged later is disclosed, for example, in "A New Approach to Debugging Optimized Code" (G. Brooks, G. Hansen, and S. Simmons, SIGPLAN '92 Conf., Programming Language Design and Implementation, pp. 1-11, 1992). However, unlike the case of debugging, a profiler does not need such detailed information about code transformation so as to identify constructs of a program to be profiled.
The prerequisite requirement of the profiling tool is to report accurately the behavior of a program being profiled. However, in order to collect profile data, instrumentation codes must be inserted into the original code, as described above, so as to activate and deactivate timers and to increment counters when predetermined constructs of a user-written program are executed.
Since subroutines (profile library subroutines, run-time library subroutines) which are called by the instrumentation codes are executed during the execution of the program written using the original code, a perturbation is produced, thereby skewing the behavior of the original program. In particular, when parallel programs are generated in the manner described with reference to FIG. 12, the perturbation becomes more significant than in the case of a sequential program, causing a problem of increased overhead in which other processors must wait for the slowest processor to complete its processing.
There are two methods for obtaining elapsed time in profile run-time systems: one samples the instruction pointer, and the other uses a timer (clock).
Generally, a processor executes a program while holding the address of an instruction presently being executed. The instruction pointer (program pointer) designates the address of the instruction presently being executed. Therefore, an elapsed time can be obtained as profile data by sampling the value of the pointer and measuring the elapsed time from a point in time when the pointer address leaves a previously designated address to another point in time when the pointer address reaches another address.
That is, in the first method, the instruction pointer is sampled at constant intervals (at a constant sampling period), and the elapsed time of a designated area is calculated by multiplying the number of samples in which the instruction pointer lies within the designated area by the sampling period.
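The calculation used in the first method can be sketched as follows. The function name `estimate_elapsed_us`, the address values, and the sampling period of 10,000 microseconds are assumptions chosen only for illustration; a real sampler would obtain the instruction pointer values from the processor or the operating system.

```python
def estimate_elapsed_us(samples, region_start, region_end, sampling_period_us):
    """Estimate the time spent in [region_start, region_end).

    samples: instruction pointer values captured once per sampling period.
    The estimate is (number of samples inside the region) * sampling period.
    """
    hits = sum(1 for ip in samples if region_start <= ip < region_end)
    return hits * sampling_period_us


# Hypothetical instruction pointer samples, one per 10,000 microseconds.
samples = [0x1000, 0x1008, 0x2004, 0x2010, 0x2020, 0x3000]

# Three samples fall inside the designated area 0x2000..0x2FFF.
estimate = estimate_elapsed_us(samples, 0x2000, 0x3000, 10_000)
```

Because the estimate can only be a whole multiple of the sampling period, a region that is entered and left between two samples contributes nothing to the result.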
This method brings about the advantage of making the sampling cost constant. However, this method has the following disadvantages: (1) potential inaccuracy of calculated elapsed time, (2) necessity of processing the object code, and (3) difficulty in classifying the location of the instruction pointer (e.g., communication, synchronization, global operation, and run-time library).
In contrast, the second method has the following advantages: (1) reported elapsed time is accurate; (2) no further processing of the code is necessary after instrumentation using a clock; and (3) marking function categories in the compiler run-time library is simple (trivial). However, this method increases the overhead. When the overhead increases due to profiling, the execution time of a program accompanied by profiling becomes very long.
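The second method can be sketched as follows, assuming a hypothetical `RegionTimer` class that reads a clock at the start point and the end point of a designated region. The use of `time.perf_counter` as the clock and the region name `region_1` are assumptions made for illustration.

```python
import time
from contextlib import contextmanager


class RegionTimer:
    """Hypothetical clock-based timer for designated regions."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.elapsed = {}      # accumulated elapsed time per region
        self.clock_reads = 0   # number of clock reads (a source of overhead)

    @contextmanager
    def region(self, name):
        start = self.clock()   # measurement start point
        self.clock_reads += 1
        try:
            yield
        finally:
            stop = self.clock()  # measurement end point
            self.clock_reads += 1
            self.elapsed[name] = self.elapsed.get(name, 0.0) + (stop - start)


timer = RegionTimer()
for _ in range(3):
    with timer.region("region_1"):
        sum(range(1000))  # stand-in for the code being measured
```

Each execution of the region costs two clock reads plus the bookkeeping, so in this sketch the added cost grows with the number of times the region is executed.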
Since the execution time of a program generally increases when profiling is performed, it is desired to make the overhead time due to the profiling as short as possible.
In general, when profiling is performed for a procedure, each pair of a calling function (hereinafter referred to as a "caller") and a called function (hereinafter referred to as a "callee") is identified, and the execution time and the iteration count are measured for each caller-callee pair.
If the maximum number of caller-callee pairs at the time of execution is known, a two-dimensional table of caller-callee pairs (a table for holding the caller-callee relationship of the functions) is prepared. In this case, the lookup and storage of information regarding the caller-callee pairs can be performed easily and at a small constant cost, without causing collisions of data.
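A two-dimensional caller-callee table of this kind can be sketched as follows. The class `CallerCalleeTable` and the assumption that every function is identified by a small integer index are hypothetical and are introduced only for illustration.

```python
class CallerCalleeTable:
    """Hypothetical fixed-size table indexed by (caller id, callee id)."""

    def __init__(self, num_functions):
        self.num_functions = num_functions
        # One cell per possible caller-callee pair, allocated up front.
        self.call_counts = [[0] * num_functions for _ in range(num_functions)]
        self.exec_times = [[0.0] * num_functions for _ in range(num_functions)]

    def record(self, caller, callee, elapsed):
        # Direct indexing: constant cost and no collisions.
        self.call_counts[caller][callee] += 1
        self.exec_times[caller][callee] += elapsed

    def allocated_cells(self):
        return self.num_functions * self.num_functions

    def used_cells(self):
        return sum(1 for row in self.call_counts for count in row if count > 0)


table = CallerCalleeTable(num_functions=100)
table.record(0, 1, 0.5)
table.record(0, 1, 0.25)
table.record(1, 2, 1.0)
```

Comparing `table.used_cells()` with `table.allocated_cells()` shows how many of the preallocated cells are actually occupied by pairs that occurred during execution.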
However, in order to determine the maximum number of caller-callee pairs, an additional module for examining and modifying the object code is needed at link time. If all caller-callee pairs are considered, the two-dimensional table becomes very large, and therefore a large memory area, most of which is never used, must be reserved during the initialization.
One alternative method is to use a data structure the capacity of which increases dynamically during the execution of a program. An example of such a data structure is a hash table. When a hash table is used, only the records of the presently existing caller-callee pairs are held, so that the memory space is not wasted. However, since the lookup time of the hash table depends on the length of the list, the lookup cost is not constant and is relatively large. Although it is possible to control the length of the list by dynamically rebuilding the hash table, doing so produces a non-constant and wasteful overhead.
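The dependence of the lookup cost on the length of the list can be illustrated by the following sketch of a hash table with chained lists. The class `ChainedPairTable`, the hash function, the bucket count of 4, and the integer function identifiers are all assumptions made for illustration.

```python
class ChainedPairTable:
    """Hypothetical hash table of caller-callee records with chained lists."""

    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]

    def _chain(self, caller, callee):
        # Simple illustrative hash of the pair of integer function ids.
        return self.buckets[(caller * 31 + callee) % len(self.buckets)]

    def record(self, caller, callee, elapsed):
        """Update the record for the pair; return its position in the list.

        The returned position (1-based) equals the number of list entries
        that a lookup of this pair has to visit.
        """
        chain = self._chain(caller, callee)
        for position, entry in enumerate(chain, start=1):
            if entry[0] == caller and entry[1] == callee:
                entry[2] += 1          # call count
                entry[3] += elapsed    # accumulated execution time
                return position
        # Pair not seen before: the record is created only now.
        chain.append([caller, callee, 1, elapsed])
        return len(chain)


pair_table = ChainedPairTable(num_buckets=4)
# These three pairs hash to the same bucket, so the list grows.
positions = [pair_table.record(0, callee, 0.1) for callee in (0, 4, 8)]
repeat_position = pair_table.record(0, 8, 0.1)
```

Memory is consumed only for pairs that actually occur, but the third pair in the list costs three comparisons on every later lookup, which is the non-constant cost described above.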