Conventional dynamic program optimization systems use profiling information to compile optimized regions of a program, and direct program control flow to the optimized regions during execution of the program. Execution of the optimized regions typically results in higher performance for the program. While some program optimizations may be determined without profiling information, the profiling information enables execution-specific optimizations.
A specific example of a dynamic program optimization is trace scheduling. The profiling information may record the number of times that common paths through the program are taken by an execution thread. When trace scheduling is performed, the most frequently executed paths are identified and optimized by placing the frequently executed paths in sequence and implementing instruction scheduling along the entire path (rather than along any other path that might intersect the selected trace). Once the optimized trace is translated into binary instructions, the binary instructions corresponding to the optimized trace may be executed instead of the original binary instructions.
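As an illustration of how profiling information may identify a frequently executed path, the following minimal sketch greedily follows the most frequently taken edge out of each basic block. The function name `select_trace`, the dictionary layout of the profile, and the block labels are hypothetical and chosen only for illustration; a real trace scheduler would typically apply further heuristics and would then schedule instructions along the selected trace.

```python
def select_trace(edge_counts, entry):
    """Greedily build a trace starting at `entry`.

    `edge_counts` maps a block label to a dict of {successor: times taken}.
    This layout is an assumption made for the sketch, not a real profiler format.
    """
    trace = [entry]
    visited = {entry}
    current = entry
    while True:
        successors = edge_counts.get(current, {})
        if not successors:
            break  # no recorded outgoing edge: the trace ends here
        # Follow the outgoing edge that the profile recorded most often.
        hottest = max(successors, key=successors.get)
        if hottest in visited:
            break  # stop at a back edge so the trace stays acyclic
        trace.append(hottest)
        visited.add(hottest)
        current = hottest
    return trace


# Hypothetical profile: counts of how often each control-flow edge was taken.
profile = {
    "A": {"B": 90, "C": 10},
    "B": {"D": 85, "E": 5},
    "C": {"D": 10},
    "D": {"A": 95},
}
hot_trace = select_trace(profile, "A")
```

With this hypothetical profile, the blocks in `hot_trace` would be the ones placed in sequence and scheduled as a single unit.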
To enable execution of the optimized trace, the conventional dynamic program optimization system halts execution of the program. While the program is halted, the region of the program may be replaced with the optimized representation of the region, so that the optimized representation is subsequently executed in place of the original region. Specifically, the program may be patched so that branches to the original binary instructions are redirected to the binary instructions corresponding to the optimized trace.
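The branch patching described above can be sketched as follows, under the simplifying assumption that the program is modeled as a list of instruction records rather than binary code. The `Instruction` class, the `patch_branches` function, and the addresses are hypothetical; the sketch also assumes that the program is halted while the loop runs, as in the conventional approach.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instruction:
    opcode: str
    target: Optional[int] = None  # branch target address; None for non-branches


def patch_branches(code, original_entry, optimized_entry):
    """Redirect every branch aimed at `original_entry` to `optimized_entry`.

    Assumes no thread is executing `code` while it is modified.
    Returns the number of branches that were patched.
    """
    patched = 0
    for inst in code:
        if inst.opcode == "branch" and inst.target == original_entry:
            inst.target = optimized_entry
            patched += 1
    return patched


# Hypothetical program: two branches reach the original trace at address 100.
code = [
    Instruction("branch", 100),
    Instruction("add"),
    Instruction("branch", 200),
    Instruction("branch", 100),
]
count = patch_branches(code, original_entry=100, optimized_entry=500)
```

Only the branches that targeted the original region are changed; the branch to address 200 is left untouched.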
The conventional dynamic program optimization process typically relies on switching back and forth between execution and optimization phases. The optimization phase requires exclusive access to the program that is being optimized, so execution of the program is halted during the optimization phase. Halting execution of a program during optimization phases is not a large burden when the program is executed on a sequential processor, such as a conventional single-threaded central processing unit (CPU). The execution and optimization phases correspond to two different threads, and the sequential processor can execute only one thread at a time, so the program would not be executing during the optimization phase in any case.
In contrast with conventional CPUs, parallel systems, such as graphics processing units (GPUs), are implemented with a large number of cores arranged in a highly parallel architecture. These processors are typically specialized to process large sets of data in parallel, especially graphics data. For example, a highly parallel GPU may be configured with eight or more cores, and each core may be configured to simultaneously execute at least 32 threads, so that the GPU may simultaneously execute at least 256 threads.
As previously explained, in a sequential processor, branches may be redirected to execute an optimized trace the next time that the original binary instructions are executed. However, in parallel processors that execute multiple threads simultaneously, a thread may at any time be executing a particular branch that will be redirected as a result of an optimization. To ensure that no thread is executing a branch while that branch is being modified, the instruction memory pages that contain the branch instruction to be patched should be read-protected. Read-protecting the instruction memory pages causes threads that access those pages to fault and be suspended by the system software until the modification is completed. Read protection enables correct execution during dynamic program optimization, but it also introduces high overhead for parallel processors with a large number of threads, because many threads may be suspended. Suspending execution of 256 or more threads to perform dynamic program optimization may result in a performance reduction that cannot be overcome by the optimization. Thus, there is a need for addressing these issues and/or other issues associated with the prior art.
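The read-protection overhead described above can be illustrated with a small model that counts how many threads would fault when the pages holding the patched branches are protected. The page size, the addresses, the function names, and the assumption that a thread faults whenever its current program counter lies on a protected page are all hypothetical simplifications, not a description of any particular GPU.

```python
PAGE_SIZE = 4096  # assumed instruction page size in bytes


def pages_to_protect(branch_addresses, page_size=PAGE_SIZE):
    """Return the set of page numbers that contain a branch to be patched."""
    return {addr // page_size for addr in branch_addresses}


def suspended_threads(thread_pcs, protected_pages, page_size=PAGE_SIZE):
    """Return the ids of threads whose program counter lies on a protected page."""
    return [
        tid
        for tid, pc in thread_pcs.items()
        if pc // page_size in protected_pages
    ]


# Hypothetical state: 256 threads all executing the same hot loop,
# which occupies a small range of addresses starting at 0x2000.
thread_pcs = {tid: 0x2000 + 4 * (tid % 64) for tid in range(256)}

# Two branches to be patched; both happen to fall on the page of the hot loop.
protected = pages_to_protect([0x2010, 0x2F00])
stalled = suspended_threads(thread_pcs, protected)
```

In this model, protecting a single page is enough to stall every one of the 256 threads, because hot code tends to be where both the threads and the branches to be patched are located.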