The present invention relates to highly parallel computer architectures such as graphics processing units (GPUs), and in particular to a method of estimating the degree by which a program will speed up when ported to a highly parallel architecture, for example, from a different, less parallel architecture, using statically measured program characteristics.
Current high-performance computers may employ processor systems having a range of different architectures. One processor system may be in the form of one or more CPUs (central processing units) each having a general instruction set intended for serial execution of tasks, and another processor system may be a GPU (graphics processing unit) having many hundreds of processing elements and a specialized instruction set intended for parallel execution of tasks, typically associated with graphics processing. Often these two processor systems are combined in the same computer.
The ability of the GPU to handle not only graphics tasks but also generalized computational tasks that can be parallelized, for example, by stream processing, has led to so-called “heterogeneous processing” in which the GPU handles non-graphics program tasks normally performed by the CPU. In this regard, some programs can experience multiple factors of “speed-up” when moved (“ported”) from the CPU to a GPU.
Porting a program from a CPU to a GPU requires substantial restructuring of the software and data organization to match the GPU's many-threaded programming model. Code optimization of such ported programs can be very time-consuming and require specialized tools and expertise. The costs of porting programs to a GPU make it desirable to know if program speed-up will justify the effort before substantial effort is expended. Unfortunately, the performance advantage of such porting is not known until the GPU code has been written and optimized.
U.S. patent application Ser. No. 14/212,711 filed Mar. 14, 2014, hereby incorporated by reference and assigned to the assignee of the present application, describes a system that can estimate the amount of speed-up that can be obtained in a program by moving it between architectures, such as from a CPU to a GPU. This system makes detailed “dynamic” measurements of a target program to be ported, that is, measurements taken when the target program is operating on the first architecture (e.g., CPU). The system then applies these dynamic measurements to a machine learning model that can output an estimate of program speed-up when the program is ported to the second architecture (e.g., GPU).
Dynamic measurements, that is, measurements of the program as it is actually executing, can reveal, for example, which way branch conditions are resolved during program execution. When a branch condition has data dependence, meaning that the branch depends on the value of data from main memory, the direction of the branch cannot be determined from static analysis of the program but may require knowledge of values that are not resolved until the program runs.
Knowing how branch conditions are resolved provides information about which instructions are executed and how many times they are executed (for example, in a branch-controlled loop). This latter information in turn reveals the predominant type of instructions that are executed and the amount of main memory access, information that strongly affects how the program will perform on a given architecture type.
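By way of illustration only, the relationship between branch resolution and instruction mix described above may be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the instrumentation of the referenced application: the function name `run_instrumented`, the counter categories, and the sample inputs are all hypothetical. The sketch shows that the same program text can yield different executed-operation counts for different input data, which is why the counts cannot be read from the source alone.

```python
from collections import Counter


def run_instrumented(data, threshold):
    """Hypothetical kernel with a data-dependent branch.

    The counters are hand-inserted, illustrative "instrumentation";
    the category names are assumptions, not taken from the source.
    """
    counts = Counter()
    total = 0
    for value in data:
        counts["load"] += 1  # reading `value` stands in for a main-memory access
        if value > threshold:  # direction depends on the data, not on the code
            counts["branch_taken"] += 1
            total += value * value
            counts["arithmetic"] += 2  # one multiply and one add
        else:
            counts["branch_not_taken"] += 1
    return total, counts


# Same program text, different data, different dynamic behavior.
total_a, counts_a = run_instrumented([1, 5, 9, 2], threshold=4)
total_b, counts_b = run_instrumented([1, 2, 3, 0], threshold=4)
print(total_a, dict(counts_a))
print(total_b, dict(counts_b))
```

In this sketch the first input takes the branch twice and the second input never takes it, so the arithmetic count differs even though both runs perform the same number of loads.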
While dynamic measurements reasonably appear to be necessary for accurate estimation of the dynamic property of program execution speed when run on a given architecture, such dynamic measurements are not always practical. The software “instrumentation” used to make dynamic measurements can slow execution of the programs being measured, thereby interfering with the measurement itself. Implementing the instrumentation to acquire dynamic measurements can also require a substantial amount of execution time and programmer effort.