1. Technical Field
The present invention relates generally to computer programs and, in particular, to a method and apparatus for profiling computer program execution.
2. Background Description
Contemporary high-performance processors rely on superscalar, superpipelining, and/or very long instruction word (VLIW) techniques for exploiting instruction-level parallelism in programs (i.e., for executing more than one instruction at a time). In general, these processors contain multiple functional units, execute a sequential stream of instructions, are able to fetch from memory more than one instruction per cycle, and are able to dispatch for execution more than one instruction per cycle subject to dependencies and availability of resources.
The performance of programs can be greatly enhanced if information about the typical execution path of the programs is known so as to optimize program execution for such paths. To this end, program profile information is necessary which describes the typical execution behavior, such as, for example, the probability that a given branch is taken, the correlation between different branches and typical execution path information, the cache miss rate of a particular memory operation, and so forth.
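By way of illustration only, the branch-taken probability mentioned above may be derived from a recorded sequence of branch outcomes. The following is a minimal sketch under stated assumptions: the trace format (pairs of branch address and outcome), the function name `branch_taken_probabilities`, and the sample addresses are hypothetical and are not taken from any of the referenced works.

```python
from collections import defaultdict


def branch_taken_probabilities(trace):
    """Return {branch_address: fraction of executions in which it was taken}.

    `trace` is an iterable of (branch_address, was_taken) pairs; this
    format is an assumption made for the example.
    """
    taken = defaultdict(int)
    total = defaultdict(int)
    for address, was_taken in trace:
        total[address] += 1
        if was_taken:
            taken[address] += 1
    return {address: taken[address] / total[address] for address in total}


# Hypothetical trace: the branch at 0x40 is taken 3 times out of 4,
# the branch at 0x58 is never taken.
trace = [
    (0x40, True), (0x40, True), (0x40, False), (0x40, True),
    (0x58, False), (0x58, False),
]
profile = branch_taken_probabilities(trace)
```

An optimizer consuming such a profile might, for example, lay out the likely successor of the branch at 0x40 on the fall-through path.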
An exemplary overview of the use of profile information in the compilation of programs is described by Chang, et al., in “Using Profile Information to Assist Classic Code Optimizations”, Software Practice and Experience, Vol. 21(12), pp. 1301-21, December 1991.
Profiling can be used to optimize programs during static or dynamic compilation. The use of profile information in static compilation is described by Chang et al. in the above referenced article entitled “Using Profile Information to Assist Classic Code Optimizations”. The use of profile information for dynamic optimization at program runtime is described by: Ebcioglu et al., in “Execution-Based Scheduling for VLIW Architectures”, EuroPar '99 Parallel Processing—5th International Euro-Par Conference, Berlin, Germany, pub. Springer Verlag, pp. 1269-80, August 1999; and Gschwind et al., in “Dynamic and Transparent Binary Translation”, IEEE Computer, pp. 54-59, March 2000.
Many techniques have been proposed to perform profiling of executing programs. Traditionally, static (compile- and/or link-time) instrumentation of code has been used to modify code to generate and gather profile information. A separate run of the program is then performed, which generates and stores the information on disk. The profile is then read back in by the compiler back-end and used to optimize the code. This technique is implemented in tools such as XPROF and PIXIE. This technique has the disadvantage that the execution pass made for the express purpose of profiling typically has high overhead, and since it is conducted in laboratory conditions, may not gather the actual profile of the program under end-user control. Hence the usefulness of the technique is limited. Static instrumentation for profiling and the use of profile information for optimization is described by Chang et al., “Using Profile Information to Assist Classic Code Optimizations”, Software Practice and Experience, Vol. 21(12), pp. 1301-21, December 1991. PIXIE is described by M. Smith, in “Tracing with PIXIE”, No. CSL-TR-91-497, Center for Integrated Systems, Stanford University, pp. 1-29, November 1991.
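The instrument, run, store, and read-back cycle described above may be sketched as follows. This is an illustrative sketch only and does not reproduce XPROF or PIXIE: the probe names, the JSON file format, and the helper functions `save_profile` and `load_profile` are assumptions made for the example, and the counter updates are written by hand where a real tool would insert them at compile time or link time.

```python
import json
import os
import tempfile

# Counters that the instrumentation would add; the probe names are hypothetical.
COUNTS = {"loop_body": 0, "then_arm": 0, "else_arm": 0}


def count_positive(values):
    """Original program logic, with hand-inserted profiling probes."""
    positives = 0
    for v in values:
        COUNTS["loop_body"] += 1      # probe: loop iteration executed
        if v > 0:
            COUNTS["then_arm"] += 1   # probe: branch taken to the 'then' arm
            positives += 1
        else:
            COUNTS["else_arm"] += 1   # probe: branch fell to the 'else' arm
    return positives


def save_profile(path):
    """End of the profiling run: store the gathered counts on disk."""
    with open(path, "w") as f:
        json.dump(COUNTS, f)


def load_profile(path):
    """Later compilation step: read the stored profile back in."""
    with open(path) as f:
        return json.load(f)


# Separate profiling run on a training input.
result = count_positive([3, -1, 4, 1, -5, 9])
profile_path = os.path.join(tempfile.mkdtemp(), "profile.json")
save_profile(profile_path)
loaded = load_profile(profile_path)
```

Every probe executes on the mainline path of the program, which is the source of the overhead noted above, and the resulting counts reflect only the training input used for the profiling run.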
Dynamic instrumentation of program code, which is an extension of the static instrumentation technique, inserts the instrumentation code at run-time. This approach suffers from the drawback that most of the syntactic and semantic information that the compiler has about the program statically is unavailable at run-time. Hence, the instrumenting tool can only make crude guesses about the nature of the instrumentation to be inserted into the program. Further, the instrumentation code also slows the mainline execution of the program, just as in the static case. The SHADE emulator on the Sun SPARC architecture performs dynamic instrumentation to some extent. A reference describing this emulator is provided hereinbelow.
Emulation of an architecture can be used to run a program, and profile information can be collected using access methods to the internal architectural state of the emulated machine. This approach has two drawbacks: (1) the emulation is quite slow (typically 10 to 100 emulator instructions per emulated instruction), and (2) the profile information is only accurate at the ISA level; none of the microarchitectural bottlenecks can be captured and identified under the emulation technique. Various emulators have been described in the literature, such as, for example: Keppel et al., in “SHADE: A Fast Instruction-set Simulator for Execution Profiling”, Proceedings of the 1994 Conference on Measurement and Modeling of Computer Systems, Nashville, Tenn., SIGMETRICS, pp. 128-137, May 1994.
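The emulation approach may be illustrated with a toy interpreter that gathers profile information as a side effect of executing each emulated instruction. This is a minimal sketch under stated assumptions and is unrelated to the internals of SHADE: the three-instruction ISA (`addi`, `bnez`, `halt`), the tuple encoding, and the function name `emulate` are all hypothetical.

```python
from collections import Counter


def emulate(program, regs):
    """Interpret `program` and return (execution counts, taken-branch counts).

    Both results are keyed by program-counter value. The instruction set
    is a made-up example, not a real architecture.
    """
    pc = 0
    exec_counts = Counter()
    branch_taken = Counter()
    while True:
        op = program[pc]
        exec_counts[pc] += 1          # profile: instruction at pc executed
        if op[0] == "addi":           # addi reg, imm: reg += imm
            _, reg, imm = op
            regs[reg] += imm
            pc += 1
        elif op[0] == "bnez":         # bnez reg, target: branch if reg != 0
            _, reg, target = op
            if regs[reg] != 0:
                branch_taken[pc] += 1  # profile: branch at pc was taken
                pc = target
            else:
                pc += 1
        elif op[0] == "halt":
            break
        else:
            raise ValueError(f"unknown opcode: {op[0]}")
    return exec_counts, branch_taken


# Countdown loop: r0 = 3; do { r0 -= 1 } while (r0 != 0); halt
program = [
    ("addi", "r0", 3),
    ("addi", "r0", -1),
    ("bnez", "r0", 1),
    ("halt",),
]
exec_counts, branch_taken = emulate(program, {"r0": 0})
```

Each emulated instruction costs several host operations for decoding, dispatch, and bookkeeping, which is the source of the slowdown noted above. The counts describe architectural (ISA-level) behavior only; pipeline stalls and cache misses of a real implementation are not visible to such an interpreter.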
Dedicated counters are available on modern processors such as the PowerPC 604e and Pentium Pro, and can be programmed to watch for specific hardware events and count them. Using dedicated counters is desirable because they do not perturb other system state (such as the data cache) when counting is performed. However, there are some drawbacks to this approach. The counters cannot distinguish between multiple user-mode programs, and so lose some level of accuracy. Also, the information gathered is summary information, at a coarse level of granularity. The approach is described in the International Business Machines Corp. PowerPC 604e User's Manual, IBM Order No. SA14-2044-00, IBM Microelectronics, Essex Junction, Vt. Keeping the counters in memory instead is poorly suited to profiling, because the counters then reside in the memory of the machine, which means they are accessed (read from and written to) through the data caches. This perturbs the very behavior of the program that the instrumentation code attempts to measure.
The use of special instructions to support profiling is another technique, a flavor of which was described in a proposal for the recently unveiled IA-64 from Intel. According to this approach, the IA-64 uses an “initprof” instruction for initializing a memory area for collecting profile information. The instruction encodes enough information for the machine hardware to accurately gather and store away relevant profile information. This technique can be seen as a variant of the static instrumentation techniques, but with less overhead. The drawback of this technique is that the application still must be instrumented with these special instructions, a proposition that software developers are unlikely to accept for the final, production versions of code that are shipped to end customers. In addition, the counters are stored in the memory of the machine, which again leads to the data-cache perturbation problem. The initprof instruction is further described by Lee et al., in “An Efficient Software-Hardware Collaborative Profiling Technique for Wide-Issue Processors”, Proceedings of the 1999 Workshop on Binary Translation, Newport Beach, Calif., Oct. 18, 1999, IEEE Computer Society Technical Committee on Computer Architecture Newsletter, pp. 34-42, December 1999.
A method of profiling, referred to as PROFILEME, tracks a sample of instructions in an out-of-order microarchitecture. The technique enables “observation” of all of the work that is performed on behalf of an arbitrary instruction that flows through the pipeline of an OOO processor core. The main focus is not to collect the aggregate information, but to observe the behavior of a given instruction as the instruction flows. This view is orthogonal to the technique of the invention. PROFILEME is described by Chrysos et al., “PROFILEME: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors”, Proceedings of the 30th Symposium on Microarchitecture (Micro-30), pp. 292-301, December 1997.
Therefore, it is evident that there is a need for a method and/or apparatus for profiling which: (1) can provide accurate resolution of profile information for a significant number of simultaneously profiled events; (2) does not disturb the execution behavior of the program being profiled; (3) offers high performance; (4) is usable to profile in real-time; (5) does not require changes to the application being profiled; and (6) provides profile information for use in dynamic optimization at program runtime.