A typical distributed computer system includes multiple interconnected nodes. Each node in the distributed computer system may include a separate processor. Accordingly, applications that execute in parallel on the distributed computer system are able to exploit the combined processing power of the interconnected processors. For example, a given computation may be executed much faster by splitting the computation into multiple segments and executing each segment in parallel across the nodes, rather than executing the application serially on a single node.
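As a minimal sketch of splitting a computation into segments, the following divides a summation into contiguous index ranges and hands each range to a worker. The names `segment_sum` and `parallel_sum` and the choice of four workers are illustrative assumptions, and threads within one process stand in for the nodes of a distributed system. In CPython, threads may not speed up pure-Python arithmetic, so the sketch shows the decomposition rather than a measured speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def segment_sum(values, start, stop):
    """Sum one contiguous segment of the input (hypothetical helper)."""
    return sum(values[start:stop])


def parallel_sum(values, workers=4):
    """Split the summation into segments and combine the partial results."""
    n = len(values)
    # Ceiling division so that every element falls into exactly one segment.
    step = max(1, (n + workers - 1) // workers)
    bounds = [(s, min(s + step, n)) for s in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda b: segment_sum(values, *b), bounds)
        return sum(partials)
```

Because each segment reads a disjoint slice and writes only its own partial result, the segments do not depend on one another and the partial sums can be combined in any order.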
Executing an application across several nodes typically involves determining which portions of the application should be performed serially and which portions of the application may be performed in parallel (i.e., which portions are safe to perform in parallel). A portion of the application is deemed parallelizable if the portion may be divided into discrete segments such that each segment may be executed by an individual thread at the same time as the other segments. In contrast, portions of the application that, when parallelized, would result in thread interdependencies (i.e., data dependencies between threads), such as multiple reads and writes to the same memory space by different threads, typically are not parallelized.
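The distinction can be illustrated with two small loops. In this sketch, the `order` argument is an assumed stand-in for the order in which threads might happen to execute iterations; the function names are hypothetical. A loop is safe to parallelize only if its result does not depend on that order.

```python
def scale(x, order):
    """Independent iterations: iteration i touches only x[i] and out[i]."""
    out = [0] * len(x)
    for i in order:
        out[i] = 2 * x[i]
    return out


def running_total(x, order):
    """Dependent iterations: iteration i reads acc[i - 1], which
    iteration i - 1 writes, so the result depends on execution order."""
    acc = list(x)
    for i in order:
        if i > 0:
            acc[i] = acc[i - 1] + x[i]
    return acc


x = [1, 2, 3, 4]
forward = range(len(x))
backward = range(len(x) - 1, -1, -1)

# Order does not matter for the independent loop.
same = scale(x, forward) == scale(x, backward)
# Order changes the result of the dependent loop.
differs = running_total(x, forward) != running_total(x, backward)
```

Here `scale` produces the same output for either order, while `running_total` does not, which is the kind of interdependency that typically prevents a loop from being parallelized.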
One method of parallelizing an application is for a programmer to analyze the application and determine how to parallelize it. For example, the programmer may analyze a loop in the application to determine whether there are potential data dependencies between iterations of the loop. Once the programmer has determined how to parallelize the loop, the programmer may add specific instructions, such as Message Passing Interface (MPI) calls, to the application to parallelize the loop.
Another solution is for a compiler to add instructions for parallelizing the application statically at compile time. To add these instructions, the compiler must analyze the application for possible data dependencies and determine how to break the application into discrete portions. Ensuring that all data dependencies are known is challenging, if not impossible, because many commonly occurring loops have memory accesses that preclude automatic parallelization. Specifically, a loop may have memory references that are determined only at execution time, such as subscripted subscripts (e.g., A[C[i]]=D[i]) and pointer variables (e.g., *ptr=0.50; ptr++).
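The subscripted-subscript case can be sketched as follows. The loop body is the `A[C[i]]=D[i]` assignment from the text; the function names, the sample index arrays, and the use of an explicit `order` argument to imitate different thread schedules are assumptions made for illustration. Whether two iterations write the same element of `A` depends entirely on the contents of `C`, which a compiler generally cannot see.

```python
def scatter(C, D, size, order):
    """Execute A[C[i]] = D[i] for each i, visiting iterations in `order`."""
    A = [None] * size
    for i in order:
        A[C[i]] = D[i]
    return A


def has_duplicate_targets(C):
    """True if two iterations would write the same element of A."""
    return len(set(C)) != len(C)


D = [10, 20, 30, 40]
safe_C = [0, 1, 2, 3]    # every iteration writes a distinct element
unsafe_C = [0, 2, 2, 1]  # iterations 1 and 2 both write A[2]

forward = range(4)
backward = range(3, -1, -1)
```

With `safe_C`, both orders yield the same array. With `unsafe_C`, the final value of `A[2]` is whichever write happens last, so the two orders disagree. The loop text is identical in both cases; only the run-time data differs.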
Another possible solution is to perform the analysis at execution time under the assumption that the loop is parallelizable. When thread interdependencies are discovered, the loop may be restarted from the beginning, either serially or with a new attempt at parallelizing the loop with synchronization.
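One way such an execution-time approach might look is sketched below for the loop `A[C[i]] = D[i]`. This is an illustrative assumption, not a description of any particular system: each thread buffers its writes instead of modifying `A`, the buffered targets are compared once the threads finish, and the loop is restarted serially if two threads wrote the same element. The function name, the chunking scheme, and the after-the-fact conflict check are all hypothetical choices.

```python
from concurrent.futures import ThreadPoolExecutor


def speculative_scatter(A, C, D, workers=4):
    """Return (result, ran_in_parallel) for the loop A[C[i]] = D[i]."""
    n = len(C)
    step = max(1, (n + workers - 1) // workers)
    chunks = [range(s, min(s + step, n)) for s in range(0, n, step)]

    def run_chunk(chunk):
        # Speculative phase: record intended writes without touching A.
        return [(C[i], D[i]) for i in chunk]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        logs = list(pool.map(run_chunk, chunks))

    # Detect interdependencies: the same element written by two threads.
    seen = set()
    conflict = False
    for log in logs:
        targets = {t for t, _ in log}
        if seen & targets:
            conflict = True
            break
        seen |= targets

    result = list(A)
    if not conflict:
        # Commit the buffered writes; chunk order preserves serial order.
        for log in logs:
            for t, v in log:
                result[t] = v
        return result, True

    # Assumption was wrong: discard the logs and restart from the beginning.
    for i in range(n):
        result[C[i]] = D[i]
    return result, False
```

For example, `speculative_scatter([0] * 4, [0, 1, 2, 3], [5, 6, 7, 8])` commits the parallel result, while an index array such as `[0, 2, 2, 1]` triggers the serial restart. Buffering writes keeps the original array intact, so the restart needs no explicit rollback, at the cost of extra memory for the logs.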