Computer systems typically comprise a combination of hardware, such as semiconductors, transistors, chips, and circuit boards, and computer programs. As increasing numbers of smaller and faster transistors can be integrated on a single chip, new processors are designed to use these transistors effectively to increase performance. Currently, many computer designers opt to use the increasing transistor budget to build ever bigger and more complex uni-processors. Alternatively, multiple smaller processor cores can be placed on a single chip, which is beneficial because a single, simple processor core is less complex to design and verify. The verification process is therefore less costly and less complex, because a once-verified module, the processor core, is repeated multiple times on a chip.
A technique known as parallel computing takes advantage of multi-processors. Parallel computing is the partitioning or dividing of an algorithm into units, often called threads, which are simultaneously or concurrently executed on multiple processors. The intermediate results of these multiple threads are then combined into a final result. Thus, parallel computing is based on the idea that the process of solving a problem usually can be divided into smaller tasks, which may be carried out simultaneously with some coordination. Parallel computing is valuable because performing a large task by the parallel execution of smaller tasks can be faster than performing the large task via one serial (non-parallel) algorithm.
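The partition, execute, and combine pattern described above can be sketched as follows. This is a minimal illustration and not a method taken from this document: the function names `partition` and `parallel_sum_of_squares`, the choice of a sum of squares as the task, and the default worker count are all assumptions made for the example. Note that in standard CPython, threads show the structure of the technique rather than a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, parts):
    """Split data into `parts` nearly equal contiguous chunks (hypothetical helper)."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_sum_of_squares(data, workers=4):
    """Compute sum(x*x) by dividing the work among several threads."""
    chunks = partition(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each thread produces an intermediate result for its own chunk.
        partials = list(pool.map(lambda chunk: sum(x * x for x in chunk), chunks))
    # The intermediate results are combined into the final result.
    return sum(partials)
```

Calling `parallel_sum_of_squares(list(range(10)), workers=3)` divides the ten inputs into three chunks, computes three partial sums, and adds them together.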
The parallel threads are often implemented on computer systems that include multiple processors and/or on multiple computer systems (often called compute nodes or simply nodes) that comprise processors, which run the parallel threads or local instances of global applications to accomplish tasks. The parallel thread or threads local to a particular node need a way to communicate with other parallel threads, which is often accomplished via a technique known as message passing. To ensure proper communication between various nodes, a standard known as the Message Passing Interface (MPI) has been developed.
Under the MPI standard, an MPI program consists of autonomous processes, executing their own code, which need not be identical. Each process or application communicates via calls to MPI communication primitives. Typically, each process executes in its own address space, although shared-memory implementations are also possible. Such message passing allows the local processors comprising the node and applications running thereon (a thread or instance of the global application or process) to cooperate with each other. MPI is available on a wide variety of platforms, ranging from networks of workstations to massively parallel systems.
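The message-passing style described above can be approximated with the following sketch. It does not use MPI itself: as an assumption for illustration, threads stand in for the autonomous processes and one `queue.Queue` per participant stands in for a message channel. The names `run_ranks`, `send`, and `recv` are hypothetical and are not MPI calls.

```python
import queue
import threading


def run_ranks(num_ranks, values):
    """Each rank computes locally, then ranks 1..n-1 send their result to rank 0."""
    mailboxes = [queue.Queue() for _ in range(num_ranks)]  # one inbox per rank
    result = {}

    def send(dest, payload):
        mailboxes[dest].put(payload)

    def recv(rank):
        return mailboxes[rank].get()  # blocks until a message arrives

    def worker(rank):
        local = values[rank] * 2  # local computation on this rank's own data
        if rank == 0:
            # Rank 0 runs different code: it collects and combines the messages.
            total = local
            for _ in range(num_ranks - 1):
                _source, payload = recv(0)
                total += payload
            result["total"] = total
        else:
            send(0, (rank, local))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["total"]
```

For example, `run_ranks(4, [1, 2, 3, 4])` has each rank double its own value and rank 0 add up the four results. The ranks cooperate only through explicit messages, not through shared variables, which is the property that carries over to a real MPI program.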
Massively parallel systems often use Direct Memory Access (DMA) technology, which reduces processor workload in the management of memory operations required for messaging. DMA engines, also known as message units, work in conjunction with a local thread to implement the MPI application. Workload that would normally need to be processed by a processor at a node is instead handled by the DMA engine.
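The division of labor between a local thread and a message unit can be modeled, very loosely, in software. The sketch below is a toy model under stated assumptions and does not describe any actual DMA hardware: the `MessageUnit` class, its `post` method, and the descriptor layout (source, destination, offset, completion flag) are all hypothetical. A background thread plays the role of the DMA engine and performs the memory copy so that the posting thread does not have to.

```python
import queue
import threading


class MessageUnit:
    """Toy stand-in for a DMA engine: copies data described by posted descriptors."""

    def __init__(self):
        self._descriptors = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def post(self, src, dst, offset):
        """Queue a copy request and return at once with a completion flag."""
        done = threading.Event()
        self._descriptors.put((src, dst, offset, done))
        return done

    def _run(self):
        while True:
            src, dst, offset, done = self._descriptors.get()
            dst[offset:offset + len(src)] = src  # the copy the processor is spared
            done.set()


def demo():
    unit = MessageUnit()
    receive_buffer = bytearray(16)
    done = unit.post(b"hello", receive_buffer, 4)
    # The posting thread is free to do other work here.
    done.wait()  # block only when the transferred data is actually needed
    return bytes(receive_buffer[4:9])
```

The design point illustrated is that `post` returns immediately: the local thread only describes the transfer, and it waits on the completion flag only at the point where it needs the data.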