The present invention relates generally to parallel programs, and specifically, to a method for automatic parallelization of certain programs.
Parallel programming is a difficult and expensive process. While tools have been developed to assist with parallelization, these tools still require that the user have a deep knowledge of parallelizationxe2x80x94a topic not within the grasp of most users who could benefit from parallelizing, especially when the programs at issue are inherently parallel. Furthermore, the target platforms for these programs is often a specialty multiprocessing computer.
A program which is inherently parallel implements algorithms that lend themselves to straightforward parallel implementation. This type of program does not require use of a specialty multiprocessing computer, and can be effectively implemented on a network of workstations (NOW) instead. In parallelizing programs for a NOW, a number of difficulties arise: communication is not cheap, faults and slowdowns are common, and machines may join and leave the computation at any time.
The most straightforward way to provide for communication in a parallel program is through the use of a shared memory. This is easy to implement on a multiprocessor, because every memory location has a unique id in the global address space. On a NOW, a software layer called a Distributed Shared Memory (DSM) can create an illusion of a physical global address space.
Further, a method known as Two-Phase Idempotent Eager Scheduling (TIES) was created to address the problem of slowdowns and failures, as well as machines joining and leaving computation, when parallelizing on a NOW. The traditional method of parallelization called for division of a sequential application into alternating serial and parallel steps. Each parallel step was composed of a number of independent threads of control that could be executed simultaneously. While serial steps would be executed locally, individual threads of parallel steps were scheduled on available processors. When the last thread terminates, the parallel step is finished. The TIES system instead simultaneously assigns several copies of individual threads to several machines to guarantee termination as long as at least one continues to progress. When the first has successfully reported the results, the other machines are freed for other tasks. In order to guarantee correctness, the starting memory state must be identical for each copy, and thus, this original memory state is preserved until the parallel step has successfully concluded. Not only does this address the issues of failure and slowdown and machines joining and leaving the computation, but also the TIES scheme minimizes communications during the parallel step because the complete memory image for each thread is available at the start of the parallel step.
An embarrassingly parallel program is a program whose execution time is dominated by a small number of coarse-grained parallel loops. The lack of internode communication requirements within the main loops of the programs is ideal for use with the TIES methodology, where no inter-iteration communication is permitted.
Another problem that arises in parallelization is the issue of data dependence, which occurs particularly if two iterations access the same memory location and at least one access is a write. Some dependencies can be satisfied through a technique called privatization, which allocates individual copies of a variable to each loop iteration. However, if an earlier iteration writes a value which is to be read by a later iteration, a flow dependence is indicated, and a loop containing this type of dependence may not be parallelizable. A dependence may also be categorized as either loop-independent, if it exists independently of any loop inside which it is nested, or loop-carried, if not. While a loop-independent dependence does not inhibit parallelization, a loop-carried dependence may inhibit parallelization.
Detecting dependencies has traditionally been done by comparing each possibly overlapping array reference with every other. Array summarization techniques, representing aggregate activity for a set of accesses have been proposed. However, in real world applications, flow dependence has been shown to be undecidable. In part, this is because static analysis cannot completely address analysis issues which arise in complex real-world applications.
The following program illustrates this problem.
The main loop of this program has different dependencies depending on the runtime value of argv[1]. If argv[1] is between xe2x88x925 and xe2x88x921, there is a loop-carried flow dependence. If argv[1] is between 1 and 5, there is a loop-carried anti-dependence, which means an earlier iteration reads a value which is written by a later iteration. If argv[1] is 0, there is a loop-independent flow dependence. Other values of argv[1] show no dependence. Clearly static analysis may not be able to provide sufficient information to permit parallelization.
Because static analysis is not adequate, various approaches of runtime analysis have been suggested. In the previous example, a runtime test on the value of argv[1] Would determine whether a serial or parallel version of the loop was selected. One strategy is to use xe2x80x9cinspector loopsxe2x80x9d, or side-effect-free skeletons of the original loop in which address calculations are performed. The inspector loop executes quickly, and then the pattern of memory accesses can be used to allow or disallow parallelization of the original loop. The drawback to this strategy is that the inspector loop may be difficult or impractical to extract. Another strategy is speculative parallelization, in which the loops are parallelized with possible dependencies, the actual existence of which is checked at run-time. A shadow array is allocated for every array that might contain a flow dependence. As the elements in the original array are accessed, the shadow array records the access type. When the parallel step is over, the access patterns of all iterations can be compared to determine if there is an actual flow dependence. If it turns out to be illegal, a backup copy of the shared memory is used to re-execute the loop sequentially. Drawbacks to this method include mangling of code and high overhead for parallelization of coarse grained loops.
Thus, a method that will automatically parallelize code for implementation on a NOW, while avoiding the problems and drawbacks detailed above is desirable.
The present invention provides a scalable, automated, network friendly method for building parallel applications from embarrassingly parallel serial programs. The steps of an exemplary method follow: First, the application loops with significant potential parallelism are identified. Second, from the set of loops identified, those loops which can statically be shown to not be parallelizable are disqualified. Next, the program is transformed into a parallel form in which the remaining identified loops are optimistically parallelized and packaged into per-iteration functions. Variables that need to be shared are segregated into a particular section of memory via simple source code transformation. The previous steps are performed by the compiler, which further inserts calls to the runtime system for scheduling and shared memory allocation and set-up.
The server application, then, contains all code outside of any parallel loop as well as calls to the scheduling subsystem that replace the parallelized loop. The client applications contain the loop bodies of the parallelized loops (per-iteration functions) Each client application is constructed to accept a consecutive range of iteration space as well as the iteration step as input parameters.
At run-time, many copies of the per-iteration functions are spawned on different machines to simultaneously perform different iteration ranges. The shared memory segment is available to all the machines through DSM. During execution, the memory access statistics on the granularity of individual data types are collected as the loop iterations are performed. As the ranges of iterations are concluded, shared memory updates and access statistics are returned to the memory server. The server then dynamically checks that no memory dependencies were in fact violated. If a violation is detected, the program can either abort in ordinary applications or can gracefully continue serially in sensitive applications. In subsequent runs, the program is re-parallelized, but the parallelization of the offending loop is disabled. Information obtained through the program failures is entered into a program database which guides future re-parallelizations.