It is hard to program parallel computers. Dealing with many processors at the same time, either explicitly or implicitly, makes parallel programs harder to design, analyze, build, and evaluate than their serial counterparts. However, using a fast serial computer to avoid the problems of parallelism is often not enough. There are always problems that are too big, too complex, or whose results are needed too soon.
Ideally, a parallel programming model or language should provide the same advantages we seek in serial languages: portability, efficiency, and ease of expression. However, it is typically impractical to extract parallelism from sequential languages. In addition, previous parallel languages have generally ignored the issue of nested parallelism, where the programmer exposes multiple simultaneous sources of parallelism in an algorithm. Supporting nested parallelism is particularly important for irregular algorithms, which operate on non-uniform data structures (for example, sparse arrays, trees and graphs).
Parallel Languages and Computing Models
The wide range of parallel architectures make it difficult to create a parallel computing model that is portable and efficient across a variety of architectures. Despite shifts in market share and the demise of some manufacturers, users can still choose between tightly-coupled shared-memory multiprocessors such as the SGI Power Challenge, more loosely coupled distributed-memory multicomputers such as the IBM SP2, massively-parallel SIMD machines such as the MasPar MP-2, vector supercomputers such as the Cray C90, and loosely coupled clusters of workstations such as the DEC SuperCluster. Network topologies are equally diverse, including 2D and 3D meshes on the Intel Paragon and ASCI Red machine, 3D tori on the Cray T3D and T3E, butterfly networks on the IBM SP2, fat trees on the Meiko CS-2, and hypercube networks on the SGI Origin2000. With extra design axes to specify, parallel computers show a much wider range of design choices than do serial machines, with each choosing a different set of tradeoffs in terms of cost, peak processor performance, memory bandwidth, interconnection technology and topology, and programming software.
This tremendous range of parallel architectures has spawned a similar variety of theoretical computational models. Most of these are variants of the original CRCW PRAM model (Concurrent-Read Concurrent-Write Parallel Random Access Machine), and are based on the observation that although the CRCW PRAM is probably the most popular theoretical model amongst parallel algorithm designers, it is also the least likely to ever be efficiently implemented on a real parallel machine. That is, it is easily and efficiently portable to no parallel machines, since it places more demands on the memory system in terms of access costs and capabilities than can be economically supplied by current hardware. The variants handicap the ideal PRAM to resemble a more realistic parallel machine, resulting in the locality-preserving H-PRAM, and various asynchronous, exclusive access, and queued PRAMs. However, none of these models have been widely accepted or implemented.
Parallel models which proceed from machine characteristics and then abstract away details--that is, "bottom-up" designs rather than "top-down"--have been considerably more successful, but tend to be specialized to a particular architectural style. For example, LogP is a low-level model for message-passing machines, while BSP defines a somewhat higher-level model in terms of alternating phases of asynchronous computation and synchronizing communication between processors. Both of these models try to accurately characterize the performance of any message-passing network using just a few parameters, in order to allow a programmer to reason about and predict the behavior of their programs.
However, the two most successful recent ways of expressing parallel programs have been those which are arguably not models at all, being defined purely in terms of a particular language or library, with no higher-level abstractions. Both High Performance Fortran and the Message Passing Interface have been created by committees and specified as standards with substantial input from industry, which has helped their widespread adoption. HPF is a full language that extends sequential Fortran with predefined parallel operations and parallel array layout directives. It is typically used for computationally intensive algorithms that can be expressed in terms of dense arrays. By contrast, MPI is defined only as a library to be used in conjunction with an existing sequential language. It provides a standard message-passing model, and is a superset of previous commercial products and research projects such as PVM and NX. Note that MPI is programmed in a control-parallel style, expressing parallelism through multiple paths of control, whereas HPF uses a data-parallel style, calling parallel operations from a single thread of control.
Nested and Irregular Parallelism
Neither HPF or MPI provide direct support for nested parallelism or irregular algorithms. For example, consider the quicksort algorithm set forth below. The irregularity comes from the fact that the two subproblems that quicksort creates are typically not of the same size; that is, the divide-and-conquer algorithm is unbalanced.
procedure QUICKSORT(S):
if S contains at most one element PA1 then PA1 else PA1 end
return S PA2 begin PA2 return (QUICKSORT(S.sub.1) followed by S.sub.2 followed by QUICKSORT(S.sub.3))
choose an element .alpha. randomly from S; PA3 let S.sub.1, S.sub.2 and S.sub.3 be the sequences of elements in S PA4 less than, equal to, and greater than .alpha., respectively;
Although it was originally written to describe a serial algorithm, the pseudocode shown above contains both data-parallel and control-parallel operations. Comparing the elements of the sequence S to the pivot element .alpha., and selecting the elements for the new subsequences S.sub.1, S.sub.2, and S.sub.3 are inherently data-parallel operations. Meanwhile, recursing on S.sub.1 and S.sub.3 can be implemented as a control-parallel operation by performing two recursive calls in parallel on two different processors.
Note that a simple data-parallel quicksort (such as one written in HPF) cannot exploit the control parallelism that is available in this algorithm, while a simple control-parallel quicksort (such as one written in a sequential language and MPI) cannot exploit the data parallelism that is available. For example, a simple control-parallel divide-and-conquer implementation would initially put the entire problem onto a single processor, leaving the rest of the processors unused. At the first divide step, one of the subproblems would be passed to another processor. At the second divide step, a total of four processors would be involved, and so on. The parallelism achieved by this algorithm is proportional to the number of threads of control, which is greatest at the end of the algorithm. By contrast, a data-parallel divide-and-conquer quicksort would serialize the recursive applications of the function, executing one at a time over all of the processors. The parallelism achieved by this algorithm is proportional to the size of the subproblem being operated on at any instant, which is greatest at the beginning of the algorithm. Towards the end of the algorithm there will be fewer data elements in a particular function application than there are processors, and so some processors will remain idle.
By simultaneously exposing both nested sources of parallelism, a nested parallel implementation of quicksort can achieve parallelism proportional to the total data size throughout the algorithm, rather than only achieving full parallelism at either the beginning (in data parallelism) or the end (in control parallelism) of the algorithm. The benefits of a nested parallel implementation are illustrated in more detail in Hardwick, Jonathan, C., Practical Parallel Divide-and-Conquer Algorithms, PhD Thesis, Carnegie Mellon University, December 1997, CMU-CS-97-197.
Divide-and-Conquer Algorithms
Divide-and-conquer algorithms solve a problem by splitting it into smaller, easier-to-solve parts, solving the subproblems, and then combining the results of the subproblems into a result for the overall problem. The subproblems typically have the same nature as the overall problem (for example, sorting a list of numbers), and hence can be solved by a recursive application of the same algorithm. A base case is needed to terminate the recursion. Note that a divide-and-conquer algorithm is inherently dynamic, in that we do not know all of the subtasks in advance.
Divide-and-conquer has been taught and studied extensively as a programming paradigm, and can be found in any algorithm textbook. In addition, many of the most efficient and widely used computer algorithms are divide-and-conquer in nature. Examples from various fields of computer science include algorithms for sorting, such as merge-sort and quicksort, for computational geometry problems such as convex hull and closest pair, for graph theory problems such as traveling salesman and separators for VLSI layout, and for numerical problems such as matrix multiplication and fast Fourier transforms.
The subproblems that are generated by a divide-and-conquer algorithm can typically be solved independently. This independence allows the subproblems to be solved simultaneously, and hence divide-and-conquer algorithms have long been recognized as possessing a potential source of control parallelism. Additionally, all of the previously described algorithms can be implemented in terms of data-parallel operations over collection-oriented data types such as sets or sequences, and hence, data parallelism can be exploited in their implementation. However, there is still a need to define a model in which to express this parallelism. Previous models have included fork-join parallelism, and the use of processor groups, but both of these models have severe limitations. In the fork-join model available parallelism and the maximum problem size is greatly limited, while group-parallel languages have been limited in their portability, performance, and/or ability to handle irregular divide-and-conquer algorithms.