1. Technical Field of the Invention
The present invention generally relates to compiler systems. More particularly, the present invention is directed to an adaptive optimization system and method for utilizing hardware performance monitors to improve an application's performance during its execution on a particular micro-architecture.
2. Description of the Prior Art
With rapidly changing hardware, modern compilers have to target a variety of architectures or architecture implementations. To make this task easier, most compilers consist of an architecture-independent frontend section and an architecture-specific backend section, the result of which is an executable application. More particularly, while the frontend section transforms a source application into an intermediate representation (i.e., “IR”), the backend section transforms the IR generated by the frontend section into a sequence of machine instructions (i.e., instruction schedule) for a particular instruction set architecture (i.e., “ISA”), such as PowerPC, which is to be executed on a specific implementation of the ISA (i.e., micro-architecture), such as PowerPC 604e.
Typically, the backend section views the particular ISA as a collection of resources (e.g., caches, registers and the like) and constraints between them, which comprise a micro-architectural model for the particular ISA. The backend section utilizes the micro-architectural model to select a “better” instruction schedule, which takes the fewest clock cycles to execute. In general, the more precise the micro-architectural model the better the instruction schedule that can be generated by the backend section, but the more time that is required by the backend section to generate the instruction schedule. It is noted that the increase in time associated with generating a better instruction schedule is often nonlinear. Notwithstanding the preciseness of the micro-architectural model actually used, it will nevertheless have some imprecision, if for no other reason than some facts that are dependent on the application's execution behavior are impossible to ascertain at compile time, such as the location of data in main memory and the like.
Two kinds of compilers exist: static and dynamic. For static compilers, where the instruction schedule is generated before the application executes and the cost of compilation is amortized over many application executions, the execution time of the backend section is less important and the precision of the micro-architectural model is more important because the static compiler has only one opportunity to guess the correct schedule of instructions. If the guess is wrong, the consequence is poor performance of the application during its execution. For dynamic compilers, the instruction schedule is generated at application execution time and the compilation time is counted as part of the application's execution time. Typically, with the dynamic compiler the precision of the micro-architectural model is not as important as the saving of the execution time and thus the precision of the micro-architectural model may be sacrificed in favor of savings in the execution time. One such compiler is a just-in-time (i.e., “JIT”) compiler, which compiles/optimizes the application only once. Just like the static compiler, the JIT has only one opportunity to guess the instruction schedule, and an incorrect “guess” may also lead to poor application performance during application execution. An alternative dynamic compilation strategy is an adaptive optimization system (i.e., “AOS”). In the AOS, the backend section of the compiler has an opportunity to “guess” multiple times to try to get a better instruction schedule. After guessing, the AOS may evaluate the guess and guess again if appropriate, thereby helping to eliminate poor application performance due to one or more bad guesses. Furthermore, the backend section may only have to guess again for parts of the application that have the potential to make a performance difference, and for these parts the backend section can spend more time on compilation because they are a small fraction of the total application size.
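The difference between a single guess and repeated guessing can be illustrated with a minimal Python sketch. It is not taken from any particular AOS; the function name `adaptive_optimize`, the candidate schedule names and the cycle counts are all hypothetical and serve only to illustrate the guess, evaluate and guess-again loop described above.

```python
def adaptive_optimize(candidates, measure, max_attempts):
    """Try up to max_attempts candidate schedules and keep the cheapest.

    candidates: candidate instruction schedules (hypothetical labels here).
    measure: callable returning the observed cost of a schedule, e.g. cycles.
    max_attempts: how many guesses the compiler may make; 1 models a
                  one-shot compiler such as a JIT, more than 1 models an AOS.
    """
    best, best_cost = None, float("inf")
    for guess in candidates[:max_attempts]:
        cost = measure(guess)          # evaluate the guess
        if cost < best_cost:           # keep it only if it is an improvement
            best, best_cost = guess, cost
    return best, best_cost


# Hypothetical measured cycle counts for three candidate schedules.
observed_cycles = {"schedule_1": 12, "schedule_2": 9, "schedule_3": 10}

one_shot = adaptive_optimize(list(observed_cycles), observed_cycles.get, 1)
adaptive = adaptive_optimize(list(observed_cycles), observed_cycles.get, 3)
```

With a single attempt the first guess is kept even though it is not the cheapest; with three attempts the loop settles on the schedule with the lowest observed cost.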
Many of the currently available microprocessors provide hardware performance monitors (i.e., “HPMs”), which count the number of times that a micro-architectural event that captures some behavior of a particular micro-architectural resource occurs on the micro-architecture. For example, the micro-architectural resources whose events may be counted typically include caches or functional units within a micro-architecture. Functional units represent stages in a pipelined superscalar micro-architecture. The stages may include fetch/decode, dispatch, execute and complete on the particular micro-architecture. The execute stage may include integer and floating-point units, as well as branch and load/store units. The typical events that may be counted include the number of times a micro-architectural resource starts, completes and stalls. For example, an instruction's execution may stall if a value that is an input to the instruction at the time of execution is not available, or if an underlying micro-architectural resource that is required by the instruction is not yet available. The HPMs may be used to generate offline information, which determines where execution time is spent in the application and which may be used to identify parts of the application that should be modified to improve micro-architectural resource utilization. In order to generate offline information, the application is executed to collect HPM data and after the application completes, the HPM data is analyzed to determine how to modify the application's behavior for subsequent executions of the application.
A drawback associated with utilizing offline HPM data from one execution to modify the behavior of an application for subsequent executions is that the modification may not result in improved performance of the application when subsequent executions have different behaviors. For example, the application's behavior may differ from one execution to the next because of different input to the application. In addition, because offline information is aggregated, the behavior of individual application components may not be obvious. For example, if an application has phase shifts, the phases may not be apparent in the offline information that is collected across all phases. Therefore, offline information may be imprecise and thus may not be useful for modifying application behavior.
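The way aggregation can hide phase behavior may be seen in a small numeric sketch. The counter values, the phase names and the helper `stall_ratio` below are hypothetical and are chosen only to make the arithmetic easy to follow; they are not measurements from any real HPM.

```python
# Hypothetical HPM samples: (phase name, total cycles, stall cycles).
samples = [
    ("phase_1", 1000, 100),
    ("phase_2", 1000, 700),
]


def stall_ratio(cycles, stalls):
    """Fraction of cycles in which the monitored resource was stalled."""
    return stalls / cycles


# Offline view: one ratio computed from counters summed over the whole run.
total_cycles = sum(cycles for _, cycles, _ in samples)
total_stalls = sum(stalls for _, _, stalls in samples)
aggregate_ratio = stall_ratio(total_cycles, total_stalls)

# Per-phase view: one ratio for each phase of the execution.
per_phase_ratio = {
    name: stall_ratio(cycles, stalls) for name, cycles, stalls in samples
}
```

Under these assumed numbers the aggregated stall ratio is 0.4, which describes neither the first phase (0.1) nor the second phase (0.7).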
FIG. 1 illustrates prior art compiler system 100 without use of hardware performance monitors. The compiler system 100 comprises a static compiler 116, which includes frontend 104, intermediate representation (i.e., “IR”) 106, selection/scheduling heuristics 108, machine model 110 and backend 112, all of which are described in detail below. In the compiler system 100, the source code 102 represents an application's source code, which is to be compiled by the static compiler 116. The frontend section 104 of the static compiler 116 takes as input or reads in the application's source code 102, parses the source code 102 and generates an IR 106, which breaks down instructions in the source code 102 into a plurality of low-level abstract operations that are more conducive to optimization. The IR 106 is a sequence of operations that has implied data and control dependencies between the operations. For example, on a reduced instruction set computer (i.e., “RISC”) microprocessor, the low-level abstract operations comprise loads and stores of memory values into registers and subsequent computations on the values in the registers. It is noted that at this point, the registers are symbolic registers, which are subsequently translated into actual hardware registers by the backend section 112. The backend section 112 of the static compiler 116 reads in the IR 106 and generates an executable (i.e., “EXE”) 114, which represents a schedule of macro-architectural instructions, i.e., assembly language instructions for a particular instruction set architecture (ISA), e.g., PowerPC. As aforementioned, the ISA defines a particular target architecture to which a user-level application must conform. More particularly, the backend section 112 selects micro-architectural instructions for the IR operations 106, orders the instructions via instruction scheduling and maps symbolic registers into physical registers via register allocation.
Further with reference to FIG. 1, during EXE 114 generation, the backend section 112 consults the machine model 110, which describes characteristics about the particular target micro-architecture, e.g., PowerPC 604e. For example, the machine model 110 may include micro-architectural resources that are available (e.g., fetch/decode, dispatch, execute, and complete phases of a pipelined superscalar microprocessor), a number of instances of a particular micro-architectural resource (e.g., there may be a number of integer functional units which partially comprise the execute stage), and clock cycles that are required for a value to flow from one micro-architectural resource to another (e.g., the number of cycles for an instruction to be executed after it has been dispatched). More particularly, the machine model 110 provides detailed information about the underlying micro-architecture that the backend 112 uses to determine latencies and constraints between instructions for a particular instruction schedule. For example, if the expected latency or delay of an instruction is d clock cycles for a particular micro-architectural resource, then the backend 112 will attempt to schedule all instructions that are dependent on the value generated by the instruction at least d clock cycles later. Furthermore, if there is more than one integer functional unit, the backend 112 may schedule more than one integer instruction to be executed in the same clock cycle.
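The latency rule and the use of multiple functional units described above can be sketched as a simple greedy, cycle-by-cycle scheduler in Python. This is an illustrative sketch only and is not the scheduling method of any particular backend; the function `list_schedule`, the instruction names and the latencies are hypothetical, all functional units are assumed to be identical, and the dependencies are assumed to be acyclic and to refer only to instructions in the list.

```python
def list_schedule(instrs, deps, latency, num_units):
    """Assign an issue cycle to each instruction.

    instrs: instruction names in program order.
    deps: maps an instruction to the instructions whose values it uses.
    latency: maps an instruction to its expected latency d in clock cycles.
    num_units: number of identical functional units (issues per cycle).
    Assumes deps is acyclic; otherwise the loop would not terminate.
    """
    issue = {}
    remaining = list(instrs)
    cycle = 0
    while remaining:
        slots = num_units
        for ins in list(remaining):
            if slots == 0:
                break  # every functional unit is busy in this cycle
            producers = deps.get(ins, [])
            # A consumer may issue only d cycles after each producer issued.
            ready = all(
                p in issue and issue[p] + latency[p] <= cycle
                for p in producers
            )
            if ready:
                issue[ins] = cycle
                remaining.remove(ins)
                slots -= 1
        cycle += 1
    return issue


# Hypothetical four-instruction example.
instrs = ["load_a", "load_b", "add", "mul"]
deps = {"add": ["load_a", "load_b"], "mul": ["add"]}
latency = {"load_a": 2, "load_b": 2, "add": 1, "mul": 3}

two_units = list_schedule(instrs, deps, latency, num_units=2)
one_unit = list_schedule(instrs, deps, latency, num_units=1)
```

With two units both loads issue in cycle 0 and the dependent add waits until cycle 2; with one unit the second load is pushed to cycle 1, which delays the add to cycle 3 and the multiply to cycle 4.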
Yet further with reference to FIG. 1, in addition to the machine model 110, the backend 112 further consults selection/scheduling heuristics 108, which are used for instruction selection and instruction scheduling in the EXE 114. For example, there may be a plurality of instructions that could be selected at any given time and the heuristics 108 help the backend section 112 select the optimal instructions among the plurality of instructions so that the instruction schedule will finish executing in the fewest possible number of clock cycles on the particular target micro-architecture. One goal of the backend 112 is to generate a “valid” instruction schedule that orders the instructions so that their execution will maintain the data dependencies between instructions. For example, if the value generated by executing instruction A is used by instruction B, then instruction A must be scheduled for execution before instruction B.
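The validity condition described above can be checked with a few lines of Python. The function name `is_valid_schedule` and the instruction labels are hypothetical, and the sketch checks ordering only; it does not model latencies or resources.

```python
def is_valid_schedule(order, deps):
    """Return True if every producer is placed before each of its consumers.

    order: instruction names in scheduled order.
    deps: maps a consumer instruction to the instructions whose values it uses.
    """
    position = {ins: index for index, ins in enumerate(order)}
    for consumer, producers in deps.items():
        for producer in producers:
            if position[producer] >= position[consumer]:
                return False  # the consumer would run before its input exists
    return True


# Instruction B uses the value generated by instruction A.
deps = {"B": ["A"]}
valid = is_valid_schedule(["A", "B"], deps)
invalid = is_valid_schedule(["B", "A"], deps)
```

Here the order A, B is accepted and the order B, A is rejected, which matches the example of instructions A and B given above.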
In view of the foregoing, there is a need in the art for providing a system and method for utilizing hardware performance monitors to evaluate and modify the behavior of an application during its execution. More particularly, there is a need in the art for providing an adaptive optimization system and method for utilizing hardware performance monitors to improve an application's performance while the application is executing.