Exploiting parallelism to accelerate a computation typically involves dividing it into many small tasks that can be assigned to different processing elements. Many scientific computations decompose naturally in this way, so that when the application is parallelized, the tasks can be distributed across the nodes or processors of a multi-node system. Examples of such tasks include updating the positions and velocities of individual particles in a particle simulation, or updating the local state variables of a Navier-Stokes fluid dynamics model.
When the number of tasks per processor is large and the data is readily available, the tasks can be efficiently executed in a series of tight loops, each of which evaluates a single type of task for many different data inputs. The overhead of invoking these loops is small compared to the total compute time. When there are few tasks per processor, however, or when the tasks must wait for the arrival of data from other processors, the overheads of communication latency and synchronization can become a significant portion of the overall computation time, and it is much more challenging to keep the processors busy with useful work.
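As an illustration of the tight-loop case, the following minimal sketch applies a single type of task, a particle update, to many data inputs in one loop. The names `Particle` and `advance_all`, the one-dimensional state, the constant force, and the simple time-stepping rule are all assumptions made for this example and are not taken from any particular simulation code.

```python
from dataclasses import dataclass


@dataclass
class Particle:
    # Hypothetical one-dimensional particle state, for illustration only.
    position: float
    velocity: float


def advance_all(particles, force, mass, dt):
    """Apply the same update task to every particle in one tight loop.

    Assumes a constant force acting on every particle and a simple
    time-stepping rule: the velocity is updated first, then the
    position is advanced using the new velocity.
    """
    acceleration = force / mass
    for p in particles:
        p.velocity += acceleration * dt
        p.position += p.velocity * dt


particles = [Particle(0.0, 1.0), Particle(2.0, 0.0)]
advance_all(particles, force=2.0, mass=1.0, dt=0.5)
```

Because the function is invoked once for the whole collection, the cost of the call is shared across all particles. With only a handful of particles per processor, that fixed cost would account for a larger share of the total run time.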
For instance, an individual node in a multi-node system typically takes input data, performs various computations on that data, and then passes the results to other nodes that are waiting for them. In this arrangement, a processing core within a node is sometimes ready to carry out a computational task, but the data required to perform the task is not yet available. The core then sits idle until the data arrives, which is an inefficient use of computational resources.
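The cost of such waiting can be estimated with a simple model. The sketch below assumes a single core that processes its tasks in a fixed order and cannot start a task before that task's input data has arrived. The function name `core_idle_time` and the arrival and compute times are hypothetical values chosen for illustration, not measurements.

```python
def core_idle_time(arrival_times, compute_times):
    """Return (idle_time, finish_time) for one core running tasks in order.

    arrival_times[i] is when the input data for task i becomes available;
    compute_times[i] is how long task i takes once it starts.
    """
    clock = 0.0
    idle = 0.0
    for arrival, compute in zip(arrival_times, compute_times):
        if arrival > clock:
            # The core is ready, but the data is not: it waits.
            idle += arrival - clock
            clock = arrival
        clock += compute
    return idle, clock


# Hypothetical schedule: the data for the second task arrives late.
idle, finish = core_idle_time([0.0, 5.0, 6.0], [2.0, 2.0, 2.0])
utilization = (finish - idle) / finish
```

In this hypothetical schedule the core performs 6 time units of useful work over a span of 9, so it is idle for one third of the time.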