The performance of computer systems depends on both hardware and software. Parallel systems, such as machines with multi-threaded processors, are increasingly common. Two trends are broadening their use from systems for a specialized community of engineers and scientists to commonplace desktop systems. First, as the geometric dimensions of on-chip devices and metal routes shrink, it has become common to have larger caches, multi-threading capability on processor cores, multiple cores on-chip, and special-purpose accelerators such as digital signal processors (DSPs) or cryptographic engines on-chip. These systems will have many hardware threads but are not expected to run at much higher clock frequencies. Second, techniques for automatic parallelization have been advancing. These capabilities may increase system performance by executing multiple processes, and their corresponding threads, simultaneously.
The extent to which the available hardware parallelism can be exploited may depend on the amount of parallelism inherent in a given software application. In addition to advances in hardware design, advances in compiler design attempt to extract further parallelism from applications and thereby reduce inefficient code execution. Automatic parallelization, which has been well studied in the past, seeks to parallelize sequential programs such that the resulting executable(s) may have improved performance on multi-threaded machines. Little or no parallelization effort is required from the user, as most of the work is done by the compiler and an accompanying runtime library.
One optimization that may be performed by the compiler is augmenting the source code with additional instructions at a location in the code before an identified parallel region. Identifying a parallel region may comprise detecting one or more of the following: a “for” or “while” loop construct, a user-specified directive such as an OpenMP pragma, a first function call with no data dependencies on a second function call, and a first basic block with no data dependencies on a second basic block.
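The last two detection criteria above (independent function calls and independent basic blocks) amount to a dependence test between two code regions. As a minimal sketch, assuming each region has already been summarized as the set of variables it reads and the set it writes, the check below applies Bernstein's conditions, a standard textbook test that the text above does not itself name. The dictionary layout and the name `can_run_in_parallel` are hypothetical and chosen only for illustration; a real compiler would derive such sets from its intermediate representation and alias analysis.

```python
def can_run_in_parallel(block_a, block_b):
    """Return True if two code regions have no data dependence on each other.

    Each block is a dict with 'reads' and 'writes' sets of variable names
    (a hypothetical summary format chosen for this illustration).
    """
    flow = block_a["writes"] & block_b["reads"]     # read-after-write
    anti = block_a["reads"] & block_b["writes"]     # write-after-read
    output = block_a["writes"] & block_b["writes"]  # write-after-write
    return not (flow or anti or output)


# x = a + b  and  y = c * d  touch disjoint data.
block_1 = {"reads": {"a", "b"}, "writes": {"x"}}
block_2 = {"reads": {"c", "d"}, "writes": {"y"}}
# z = x + 1  reads the x that block_1 writes.
block_3 = {"reads": {"x"}, "writes": {"z"}}

print(can_run_in_parallel(block_1, block_2))  # True
print(can_run_in_parallel(block_1, block_3))  # False
```

The test is conservative by construction: any overlap between one region's writes and the other region's reads or writes is treated as a dependence, even if the values involved would happen to be the same at run time.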
Modern automatic parallelization techniques parallelize a loop construct if the compiler is certain that all loop iterations can be executed simultaneously. This is possible for loops having no cross-iteration dependencies. Loops known to satisfy this condition may be referred to as DOALL loops. For example, a loop can be executed in fully parallel form, without synchronization, if the desired outcome of the loop does not depend upon the execution ordering of the data accesses in different iterations. To determine whether the execution order of the data accesses affects the semantics of the loop, the data dependence relations between the statements in the loop body may be analyzed. Accordingly, the dependence analysis can be used to categorize loops as DOALL or non-DOALL loops.
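The distinction can be made concrete with two small loop bodies. The sketch below is an illustration only, with hypothetical names (`run_loop`, `doall_body`, `non_doall_body`, `is_order_independent`): it runs each loop with its iterations in program order and in two permuted orders, then compares the results. This is an empirical experiment, not the static dependence analysis a compiler performs; a mismatch shows that a loop is not DOALL, but matching results on a few orders do not prove that it is.

```python
def run_loop(body, n, order):
    """Run the loop body once per index in `order` on freshly initialized arrays."""
    a = [0] * n
    b = list(range(n))
    for i in order:
        body(a, b, i)
    return a


def doall_body(a, b, i):
    # Iteration i writes a[i] and reads only b[i]; no other iteration touches a[i].
    a[i] = 2 * b[i]


def non_doall_body(a, b, i):
    # Iteration i reads a[i - 1], which iteration i - 1 writes:
    # a cross-iteration dependence.
    a[i] = a[i - 1] + b[i]


def is_order_independent(body, n):
    """Compare program order against two permuted orders of iterations 1..n-1."""
    forward = list(range(1, n))
    backward = forward[::-1]
    odds_then_evens = forward[::2] + forward[1::2]
    expected = run_loop(body, n, forward)
    return all(run_loop(body, n, order) == expected
               for order in (backward, odds_then_evens))


print(is_order_independent(doall_body, 8))      # True
print(is_order_independent(non_doall_body, 8))  # False
```

In the second loop, any order that runs iteration i before iteration i - 1 reads a stale `a[i - 1]`, which is why the result depends on execution order.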
For DOALL loops, traditional automatic parallelization techniques can reliably be used. For non-DOALL loops, cross-iteration dependencies (or even the potential for cross-iteration dependencies) can frustrate the applicability of many traditional automatic parallelization techniques. Thus, to extract further instruction-level parallelism (ILP) from an application when non-DOALL loops are involved, additional or alternative techniques may be used.
One traditional technique for attempting to parallelize non-DOALL loops is to use helper threading, whereby a helper thread executes an abbreviated (or otherwise trimmed-down) version of an original loop construct on a different hardware thread, performing preparatory work ahead of the actual execution of the loop. For example, memory reference address calculations and prefetching of data may occur ahead of the work that performs the algorithm or method of the loop. The non-DOALL loop may be segmented into a main thread and one or more non-main threads to be executed sequentially in program order.
A separate helper thread and the main thread typically share at least one level of the cache. The helper thread attempts to prefetch data into the shared cache so that the main thread can retrieve data directly from the shared cache rather than accessing lower-level memory after a miss in the shared cache. An example of helper threading is provided in Y. Song et al., Design and Implementation of a Compiler Framework for Helper Threading on Multi-core Processors, IEEE PACT, 2005, pp. 99-109.
While helper threads accelerate the execution of the loop by prefetching and/or other techniques, they do not typically perform any of the loop computations. Accordingly, the level of acceleration realized from the use of helper threads may be reduced when a loop involves highly complex computations. Further, helper thread techniques may limit cache utilization potential. For example, for the helper thread to deliver data to the cache of the main thread, the helper thread and main thread may both have to be running on a single core and using only that core's cache hierarchy.
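The division of labor in helper threading can be sketched in software. The following is a simplified simulation under stated assumptions, not the compiler framework of the cited paper: a Python dictionary stands in for the shared cache, a list stands in for lower-level memory, and the access pattern `address_of` and all other names are hypothetical. The helper runs a trimmed-down copy of the loop that only computes addresses and loads data, while the main loop performs the actual computation. Because Python threads are scheduled unpredictably, the number of misses seen by the main loop varies from run to run unless the helper is given a full head start.

```python
import threading


def address_of(i, n):
    # Hypothetical irregular access pattern standing in for the loop's
    # memory reference address calculation.
    return (i * 7) % n


def helper_loop(memory, n, shared_cache, lock):
    # Trimmed-down copy of the loop: address calculation and loads only,
    # none of the loop's computation.
    for i in range(n):
        addr = address_of(i, n)
        value = memory[addr]            # access to "lower-level memory"
        with lock:
            shared_cache[addr] = value  # prefetch into the shared cache


def main_loop(memory, n, shared_cache, lock):
    total, misses = 0, 0
    for i in range(n):
        addr = address_of(i, n)
        with lock:
            value = shared_cache.get(addr)
        if value is None:               # miss: the helper has not got here yet
            misses += 1
            value = memory[addr]
            with lock:
                shared_cache[addr] = value
        total += value * value          # the loop's actual work
    return total, misses


def run_with_helper(memory, head_start=False):
    n = len(memory)
    shared_cache, lock = {}, threading.Lock()
    helper = threading.Thread(target=helper_loop,
                              args=(memory, n, shared_cache, lock))
    helper.start()
    if head_start:
        helper.join()  # let the helper finish first so every access hits
    total, misses = main_loop(memory, n, shared_cache, lock)
    helper.join()
    return total, misses


total, misses = run_with_helper(list(range(1, 101)), head_start=True)
print(total, misses)  # 338350 0
```

The sketch also reflects the limitation noted above: all of the computation (`value * value`) stays on the main thread, so the helper can hide memory latency but cannot shorten the computation itself.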
Another traditional technique for attempting to parallelize non-DOALL loops is to use speculative automatic parallelization. In speculative automatic parallelization, hardware transactional memory support (such as the underlying hardware's checkpoint/commit capability) may be used to speculatively execute a loop. Loop iterations may be divided among the main thread and non-main threads. Each non-main thread will attempt to speculatively execute the loop body, where the loop body is encapsulated inside a checkpoint/commit region. A transaction failure will trigger either a retry of the same speculative execution or a wait to execute the work non-speculatively after the previous logical thread has completed its work.
It may often be difficult to detect and/or recover from transaction failures (e.g., errors in speculative execution). For example, if loop variables in iteration K of the loop are affected by computations during a previous iteration J of the loop, speculative computations of the Kth iteration may be incorrect. The technique must be able both to reliably detect the incorrectly pre-computed values and to reliably roll back execution of the program to an appropriate execution location. Accordingly, speculative automatic parallelization techniques may involve additional costs, including additional hardware support and additional time and resources expended in unused pre-computations, clean-up, and bookkeeping.
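The detect-and-roll-back requirement can be illustrated with a small software simulation. This is a sketch under assumptions, not hardware transactional memory: the `Transaction` class, the read-set validation rule, and the loop body `mem[dst[i]] = mem[src[i]] + 1` are all hypothetical choices made for illustration. Each chunk of iterations runs against a checkpoint of memory taken before the loop, buffering its writes and recording its reads. Chunks then commit in program order; a chunk that read a location written by an earlier chunk has its buffered results discarded and is re-executed non-speculatively.

```python
class Transaction:
    """Buffers writes and records reads made against a view of memory."""

    def __init__(self, view):
        self.view = view
        self.read_set = set()
        self.write_buffer = {}

    def read(self, addr):
        if addr in self.write_buffer:   # value produced earlier in this chunk
            return self.write_buffer[addr]
        self.read_set.add(addr)
        return self.view[addr]

    def write(self, addr, value):
        self.write_buffer[addr] = value


def execute_chunk(view, iterations, src, dst):
    txn = Transaction(view)
    for i in iterations:
        # Loop body: mem[dst[i]] = mem[src[i]] + 1
        txn.write(dst[i], txn.read(src[i]) + 1)
    return txn


def run_speculative(mem, src, dst, chunk_size):
    mem = list(mem)
    checkpoint = list(mem)  # every chunk starts from the same pre-loop state
    chunks = [range(start, min(start + chunk_size, len(src)))
              for start in range(0, len(src), chunk_size)]
    speculative = [execute_chunk(checkpoint, c, src, dst) for c in chunks]

    committed_writes, failures = set(), 0
    for chunk, txn in zip(chunks, speculative):  # commit in program order
        if txn.read_set & committed_writes:
            # An earlier chunk wrote a location this chunk read: discard the
            # buffered results and re-execute against up-to-date memory.
            failures += 1
            txn = execute_chunk(mem, chunk, src, dst)
        for addr, value in txn.write_buffer.items():
            mem[addr] = value
        committed_writes |= set(txn.write_buffer)
    return mem, failures


# Iterations 1..3 each read what the previous iteration wrote.
print(run_speculative([10, 0, 0, 0, 0, 0, 0, 0],
                      src=[0, 4, 5, 6], dst=[4, 5, 6, 7], chunk_size=2))
# ([10, 0, 0, 0, 11, 12, 13, 14], 1)
```

In this run the second chunk speculatively reads location 5 before the first chunk has written it, so its buffered results are wasted work. That discarded execution, together with the read-set and write-buffer bookkeeping, corresponds to the additional costs described above.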