1. Field of the Invention
This invention relates to computer systems, and more particularly, to providing an efficient method of automatically parallelizing a computer program for simultaneous execution using multiple threads.
2. Description of the Relevant Art
The performance of computer systems depends on both hardware and software. Parallel systems, such as multi-threaded processor machines, are increasingly common. Two trends are broadening their use from a specialized community of engineers and scientists to commonplace desktop systems. First, due to the reduction in geometric dimensions of on-chip devices and metal routes, it is common to have larger caches, multi-threading capability on processor cores, multiple cores on-chip, and special-purpose accelerators such as digital signal processors (DSPs) or cryptographic engines on-chip. These systems will have many hardware threads but are not expected to run at much higher clock frequencies. Second, techniques for automatic parallelization have been advancing. These capabilities may increase system performance by simultaneously executing multiple processes and their corresponding threads.
The extent to which available hardware parallelism can be exploited is directly dependent on the amount of parallelism inherent in a given software application. In addition to advances in hardware design, advances in compiler design attempt to extract further parallelism available in applications to reduce inefficient code execution. Automatic parallelization, which has been well studied, seeks to parallelize sequential programs so that the resulting executable(s) may have improved performance on multi-threaded machines. Little or no parallelization effort is required from the user, as most of the work is done by the compiler and an accompanying runtime library.
One optimization that may be performed by the compiler is augmenting the source code with additional instructions at a location in the code before an identified parallel region. Identifying a parallel region may comprise detecting one or more of the following: a “for” or “while” loop construct, a user-specified directive such as an OpenMP pragma, a first function call with no data dependencies on a second function call, and a first basic block with no data dependencies on a second basic block.
Modern automatic parallelization techniques parallelize a loop construct if the compiler is certain that all loop iterations can be executed simultaneously. Such loops may be referred to as DOALL loops. A loop can be executed in fully parallel form, without synchronization, if the desired outcome of the loop does not depend upon the execution ordering of the data accesses from different iterations. In order to determine whether or not the execution order of the data accesses affects the semantics of the loop, the data dependence relations between the statements in the loop body must be analyzed.
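The distinction between a loop whose outcome is independent of iteration order and one that carries a cross-iteration dependence may be illustrated with a minimal sketch. The function names run_in_order, doall_body, and non_doall_body are hypothetical and chosen only for this illustration. Comparing two execution orders is not a substitute for compiler dependence analysis; it only shows the property that such analysis is meant to establish.

```python
def run_in_order(body, n, order):
    """Run the loop body once per index in `order` on a fresh array; return the array."""
    a = list(range(n))
    for i in order:
        body(a, i)
    return a

def doall_body(a, i):
    # Each iteration reads and writes only a[i]: no cross-iteration dependence.
    a[i] = a[i] * 2

def non_doall_body(a, i):
    # Iteration i reads a[i - 1], which iteration i - 1 writes:
    # a loop-carried (cross-iteration) dependence.
    if i > 0:
        a[i] = a[i] + a[i - 1]

n = 8
forward = list(range(n))
backward = forward[::-1]

# A DOALL loop yields the same result in either order; the non-DOALL loop does not.
doall_same = run_in_order(doall_body, n, forward) == run_in_order(doall_body, n, backward)
non_doall_same = run_in_order(non_doall_body, n, forward) == run_in_order(non_doall_body, n, backward)
print(doall_same, non_doall_same)
```

In this sketch the first loop could be divided among threads without synchronization, while the second could not without changing its result.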
After dependence analysis, loops are generally categorized as either DOALL loops or non-DOALL loops, and modern automatic parallelization techniques may be applied to the DOALL loops. In order to extract further thread level parallelism (TLP) from an application, subsequent techniques may be used to attempt to parallelize the non-DOALL loops despite cross-iteration dependences. Examples include helper threading and speculative automatic parallelization.
Regarding the first example, in helper threading, a helper thread executes an abbreviated version of an original loop on a different hardware thread in order to perform preparatory work ahead of the actual execution of the loop. For example, memory reference address calculations and prefetching of data may occur ahead of the work that performs the algorithm or method of the loop. The helper thread and the main thread typically share at least one level of the cache. The helper thread attempts to prefetch data into the shared cache so that the main thread can retrieve data directly from the shared cache rather than accessing lower-level memory after a miss in the shared cache. An example of helper threading is provided in Y. Song et al., Design and Implementation of a Compiler Framework for Helper Threading on Multi-core Processors, IEEE PACT, 2005, pp. 99-109.
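The structure of helper threading may be sketched as follows. This is an illustration only, not the framework of the cited reference. The names memory, shared_cache, address_of, load, helper_loop, and main_loop are hypothetical, and a Python dictionary stands in for the shared cache level. Python threads do not model real cache behavior or timing, so the sketch shows only the division of work between the two threads.

```python
import threading

# Hypothetical stand-ins: `memory` plays the role of slow lower-level memory,
# `shared_cache` the cache level shared by the helper and main threads.
memory = {addr: addr * addr for addr in range(1000)}
shared_cache = {}
cache_lock = threading.Lock()

def address_of(i):
    # Memory reference address calculation used by both loop versions.
    return (i * 7) % 1000

def load(addr):
    """Return the value at addr, filling the shared cache on a miss."""
    with cache_lock:
        if addr not in shared_cache:
            shared_cache[addr] = memory[addr]
        return shared_cache[addr]

def helper_loop(n):
    # Abbreviated loop: address calculation and prefetch only, no real work.
    for i in range(n):
        load(address_of(i))

def main_loop(n):
    # Original loop: the same memory references plus the actual computation.
    total = 0
    for i in range(n):
        total += load(address_of(i)) + 1
    return total

n = 200
helper = threading.Thread(target=helper_loop, args=(n,))
helper.start()              # helper runs alongside the main loop
result = main_loop(n)
helper.join()
print(result)
```

The helper thread never changes the values the main loop computes; it only affects whether a given reference is already present in the shared cache when the main loop reaches it.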
Regarding the second example, in speculative automatic parallelization, hardware transactional memory support, such as the underlying hardware's checkpoint/commit capability, may be used to speculatively execute a loop. The loop iterations may be divided among the main thread and non-main threads. Each non-main thread speculatively executes the loop body, wherein the loop body is encapsulated inside a checkpoint/commit region. A transaction failure triggers either a retry of the same speculative execution or a wait to execute the work non-speculatively after the previous logical thread has completed its work. This technique may utilize additional hardware support to detect a transaction failure and trigger suitable remedial action.
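The checkpoint/commit behavior described above may be approximated by a single-threaded software simulation. This is a sketch under stated assumptions, not a model of any particular hardware: the class Txn and the functions speculative_loop and body are hypothetical, conflicts are detected in software by comparing read sets against earlier commits, and a failed iteration is always re-executed non-speculatively rather than retried speculatively.

```python
class Txn:
    """Hypothetical checkpoint/commit region: buffers writes and records reads."""
    def __init__(self, base, reads, writes):
        self.base, self.reads, self.writes = base, reads, writes

    def read(self, key):
        if key in self.writes:          # value produced by this same iteration
            return self.writes[key]
        self.reads.add(key)
        return self.base[key]

    def write(self, key, value):
        self.writes[key] = value

def speculative_loop(body, state, n, width=4):
    """Simulate `width` iterations at a time running as speculative regions."""
    failures = 0
    for start in range(0, n, width):
        checkpoint = dict(state)        # state every speculative iteration sees
        dirty = set()                   # keys committed so far in this group
        for i in range(start, min(start + width, n)):
            reads, writes = set(), {}
            body(Txn(checkpoint, reads, writes), i)      # speculative execution
            if reads & dirty:
                # Transaction failure: an earlier iteration wrote a value read here.
                failures += 1
                reads, writes = set(), {}
                body(Txn(state, reads, writes), i)       # redo on committed state
            state.update(writes)        # commit in program order
            dirty.update(writes)
    return failures

def body(txn, i):
    # Non-DOALL loop body: every third iteration depends on its predecessor.
    txn.write(i, i * i)
    if i > 0 and i % 3 == 0:
        txn.write(i, txn.read(i) + txn.read(i - 1))

n = 12
state = {i: 0 for i in range(n)}
failures = speculative_loop(body, state, n, width=4)
print(failures, state[3], state[6], state[9])
```

In this sketch, iterations 3, 6, and 9 each read a location written by an earlier iteration in the same group, so each one fails and is re-executed after its predecessor has committed. The final state matches sequential in-order execution.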
In order to further extract TLP from software applications and increase system performance of multi-threaded architectures, a method may be desired that performs further preparatory work of each of the main and non-main threads with reduced design complexity and overhead to monitor and manage conflicts. Also, a method may be desired that increases system throughput by sequential in-program order execution of threads of non-DOALL loops without additional hardware support for detecting transaction failures of speculatively executed iterations of non-DOALL loops. In view of the above, efficient methods and mechanisms for automatically controlling run-time parallelization of a software application are desired.