1. Field of the Invention
The present application relates generally to an improved data processing apparatus and method and, more specifically, to an apparatus and method for runtime dependence-aware scheduling of independent iterations, so that the independent iterations are scheduled and executed ahead of time in parallel with other iterations.
2. Background of the Invention
Typically, loops within the executable code of an application account for most of the application's execution time; therefore, parallelizing loops is very important for improving application performance. Current parallelizing compiler infrastructure analyzes code at compilation time to identify loops that are amenable to parallelization. For a loop to qualify, all of its iterations should be independent, i.e., no two iterations access the same data where at least one of the accesses is a write. Once the iterations are determined to be independent, the compiler outlines the loop body as a function. At runtime, a symmetric multiprocessing (SMP) runtime controls how the iterations are distributed to multiple threads that run simultaneously, such that the execution of the loop is parallelized.
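The outlining and distribution steps described above can be illustrated with a minimal sketch. It assumes a loop whose iterations are known to be independent (each iteration writes only its own element). The names `outlined_body` and `run_parallel`, the chunked distribution policy, and the use of a Python thread pool in place of an SMP runtime are all hypothetical choices made for illustration; they are not taken from any particular compiler or runtime.

```python
from concurrent.futures import ThreadPoolExecutor


def outlined_body(a, b, lo, hi):
    """Loop body outlined as a function over the iteration range [lo, hi)."""
    for i in range(lo, hi):
        # Each iteration reads a[i] and writes only b[i], so no two
        # iterations touch the same element with a write.
        b[i] = 2 * a[i]


def run_parallel(a, b, num_threads=4):
    """Stand-in for the runtime: split iterations into chunks, one per task."""
    n = len(a)
    chunk = max(1, (n + num_threads - 1) // num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [
            pool.submit(outlined_body, a, b, lo, min(lo + chunk, n))
            for lo in range(0, n, chunk)
        ]
        for f in futures:
            f.result()  # wait for completion and surface any exception
    return b
```

Because the iterations are independent, the result does not depend on how the chunks are assigned to threads or in what order they finish.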
A major difficulty for loop parallelization is the uncertainty of memory accesses across iterations, which are often impossible to determine at compilation time. Several obstacles may prevent the compiler from properly deriving the dependences, such as:
1. Pointer accesses that may not be determined statically,
2. Uncertain control flow that may bypass some memory accesses,
3. Array elements indexed by complicated computations, or
4. Array elements indexed by other arrays (indirect array accesses).
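As an illustration of the last obstacle, the sketch below shows a loop with an indirect array access. Whether its iterations are independent depends entirely on the contents of the index array, which are typically known only at runtime. The function names `indirect_update` and `conflicting_iterations` are hypothetical and introduced only for this example.

```python
from collections import defaultdict


def indirect_update(data, idx):
    """Loop with an indirect access: iteration i writes data[idx[i]]."""
    for i in range(len(idx)):
        data[idx[i]] += 1
    return data


def conflicting_iterations(idx):
    """Map each element written by more than one iteration to those iterations."""
    touched = defaultdict(list)
    for i, target in enumerate(idx):
        touched[target].append(i)
    return {t: its for t, its in touched.items() if len(its) > 1}
```

With `idx = [0, 1, 2, 3]` every iteration writes a distinct element and the loop could be parallelized, whereas with `idx = [0, 1, 1, 3]` iterations 1 and 2 both write `data[1]`. The loop text is identical in both cases, so a compile-time analysis that cannot see the values of `idx` must conservatively assume a dependence.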
Thread Level Speculation (TLS) may be used to deal with unknown dependences. Using hardware to detect conflicting memory accesses across iterations relieves the compiler from analyzing the dependences. However, once a conflict is detected, the loop must be rolled back in order to allow the earlier thread to finish. Rollback is typically expensive, especially for loops with a significant number of conflicting memory accesses. In addition, TLS relies heavily on hardware support that may increase the latency on other data paths, and currently no chip manufacturer provides real hardware support for it. Most importantly, the compiler can often provide valuable information regarding the independence of some iterations, yet TLS tends to discard all such information by relying completely on the hardware to detect dependences.
Some early research proposed inspectors that perform dependence computation before the loop is executed. Because an inspector executes before the main loop, an upfront cost of extra execution time is paid whether the loop turns out to be parallelizable or not. Also, the inspector only checks whether the loop is completely parallelizable. Oftentimes, a loop is only partially parallelizable, i.e., a subset of its iterations can be identified that may be executed in parallel. The inspector approach is not able to capture such partial parallelization. In addition, with processor chips that comprise multiple processing cores, the number of cores may be larger than the amount of parallelism available in the loop. While having multiple cores provides a great opportunity to speed up dependence computation, inspectors do not take advantage of this capability.
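The all-or-nothing behavior of the inspector approach can be sketched as follows. The example assumes a hypothetical loop of the form `data[write_idx[i]] = data[read_idx[i]] + 1`; the function name `inspector` and the two index arrays are illustrative assumptions rather than a description of any specific prior system.

```python
def inspector(write_idx, read_idx):
    """Return True only if the whole loop is parallelizable.

    The inspected loop is assumed to be:
        for i: data[write_idx[i]] = data[read_idx[i]] + 1
    The scan runs serially, before the loop itself, and yields a single
    yes/no answer for the entire iteration space.
    """
    writer = {}
    for i, w in enumerate(write_idx):
        if w in writer:
            return False  # two iterations write the same element
        writer[w] = i
    for i, r in enumerate(read_idx):
        j = writer.get(r)
        if j is not None and j != i:
            return False  # iteration i reads an element another iteration writes
    return True
```

For `write_idx = [0, 1, 2, 3]` and `read_idx = [4, 0, 6, 7]`, the inspector returns `False` because iteration 1 reads the element written by iteration 0. Iterations 2 and 3 are independent of every other iteration, but that information is lost in the single boolean result, and the inspection pass itself runs on one thread regardless of how many cores are idle.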