The present application relates generally to an improved data processing apparatus and method and more specifically to an apparatus and method for detecting task complete dependencies using underlying speculative multi-threading hardware.
Typically, loops within the executable code of an application account for most of the application's execution time. Therefore, parallelizing loops is very important for improving application performance. Current parallelizing compiler infrastructures analyze code at compilation time to identify loops that are amenable to parallelization. To qualify, all iterations of the loop should be independent, i.e. no two iterations access the same data where one or more of the accesses is a write. Once the independent iterations are determined, the compiler outlines the loop body as a function. At runtime, a symmetric multiprocessing (SMP) runtime controls how iterations are distributed to multiple threads that run simultaneously, such that the execution of the loop is parallelized.
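By way of illustration only, the outlining and distribution described above may be sketched as follows. This is a minimal sketch, not the output of any particular compiler or SMP runtime: the names `loop_body`, `run_chunk`, and `parallel_loop`, the block-wise chunking policy, and the use of a Python thread pool are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def loop_body(i, a, b, c):
    # Outlined loop body (hypothetical): iteration i reads a[i], b[i] and
    # writes only c[i], so no two iterations touch the same written location.
    c[i] = a[i] + b[i]

def run_chunk(start, stop, a, b, c):
    # One thread executes a contiguous block of iterations.
    for i in range(start, stop):
        loop_body(i, a, b, c)

def parallel_loop(n, num_threads, a, b, c):
    # Stand-in for the runtime: split n iterations into blocks and hand
    # each block to a worker thread (block-wise policy is an assumption).
    chunk = max(1, (n + num_threads - 1) // num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [
            pool.submit(run_chunk, s, min(s + chunk, n), a, b, c)
            for s in range(0, n, chunk)
        ]
        for f in futures:
            f.result()  # re-raise any exception from a worker

a = list(range(8))
b = [10 * x for x in range(8)]
c = [0] * 8
parallel_loop(8, 4, a, b, c)
```

Because each iteration writes a distinct element of `c`, the result does not depend on the order in which the threads run.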
A major difficulty for loop parallelization is the uncertainty of memory accesses across iterations, which are often impossible to determine at compilation time. Several obstacles may prevent the compiler from properly deriving the dependencies, such as pointer accesses that may not be determined statically, uncertain control flow that may bypass some memory accesses, array elements indexed by complicated computations, or array elements indexed by other arrays (indirect array accesses).
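The indirect array access case may be illustrated with a short sketch. It assumes a hypothetical loop of the form `a[idx[i]] += 1`, in which each iteration reads and writes the element selected by `idx[i]`; the function name `has_cross_iteration_conflict` is introduced here for illustration only.

```python
def has_cross_iteration_conflict(idx):
    # For the hypothetical loop "for i: a[idx[i]] += 1", two iterations
    # conflict exactly when they select the same element of a, i.e. when
    # idx contains a repeated value.
    return len(set(idx)) != len(idx)

# The same loop is parallelizable for one input and not for another,
# which is why the question cannot be settled at compilation time.
independent_case = has_cross_iteration_conflict([0, 1, 2, 3])
dependent_case = has_cross_iteration_conflict([0, 1, 1, 3])
```

Whether the iterations are independent depends entirely on the contents of `idx`, which are generally known only at runtime.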
Thread Level Speculation (TLS) may be used to deal with unknown dependencies. Using hardware to detect conflicting memory accesses across iterations relieves the compiler from analyzing the dependencies. However, once a conflict is detected, the conflicting thread must be rolled back in order to allow the earlier thread to finish. Rollback, or “squashing” of the thread, is typically expensive, especially for loops with a significant number of conflicting memory accesses.
With speculative multi-threading (SMT), tasks can be speculatively executed even in the presence of data dependencies. Dedicated hardware keeps track of the data locations read and written by each speculative thread and aborts, i.e. rolls back or squashes, threads that are shown to have violated an actual data dependency. While this approach has been shown to work fairly well in program code where a compiler could not prove data independence between tasks, it generally performs sub-optimally in code where there are some or many dependencies between the tasks. This is because, in the presence of dependencies, speculative tasks start to be aborted in significant numbers, thus yielding little of the advantage of parallelism while incurring many of the disadvantages of speculative parallelism, e.g., increased memory footprint pressure at the version cache level, wasted compute cycles, wasted resources, wasted energy, and the like.
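The effect of dependencies on speculative execution may be illustrated with a simplified software model. This is a sketch under stated assumptions and not a description of any actual hardware: it assumes that all tasks start at the same time, that each task's read and write locations are given as sets, and that a task is squashed whenever it reads a location written by any earlier task in program order. The name `count_squashes` is hypothetical.

```python
def count_squashes(tasks):
    # tasks: list of (reads, writes) pairs of sets, in program order.
    # Simplified model (assumption): a task that read a location written
    # by any earlier task has consumed a stale value and is squashed.
    squashed = 0
    for j, (reads_j, _) in enumerate(tasks):
        if any(reads_j & writes_i for _, writes_i in tasks[:j]):
            squashed += 1
    return squashed

# Independent tasks: each touches only its own location.
independent = [({0}, {0}), ({1}, {1}), ({2}, {2})]

# Dependent chain: each task reads what the previous task writes.
chain = [({0}, {1}), ({1}, {2}), ({2}, {3})]

independent_squashes = count_squashes(independent)
chain_squashes = count_squashes(chain)
```

In this model the independent tasks are never squashed, whereas every task after the first in the dependent chain is squashed, so the speculative work on those tasks is wasted.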