The subject matter generally relates to high performance computing (HPC), which in turn involves the use of parallel supercomputers and/or computer clusters. A computer cluster is a computing system that consists of multiple (usually mass-produced) processors linked together to form a single system.
Parallel computing typically refers to the simultaneous use of multiple computer resources to solve a computational problem. The multiple computer resources could be a single computer with multiple processors, an arbitrary number of computers or nodes connected via a network, or a combination thereof.
Parallel computing saves time and is advantageous for solving large problems. Parallel computing is currently used in a number of industry segments, including, for example, the energy industry (for seismic analysis and reservoir analysis), the financial industry (for derivative analysis, actuarial analysis, asset liability management, portfolio risk analysis, and statistical analysis), manufacturing (for mechanical or electric design, process simulation, finite element analysis, and failure analysis), life sciences (for pharmaceutical discovery, protein folding, and medical imaging), media (for bandwidth consumption analysis, digital rendering, and gaming), and government (for collaborative research, weather analysis, and high energy physics). Use of such parallel computing in other areas is of course possible.
In high performance computing, multiple types of parallel computer architectures exist, including, for example, shared multiprocessor systems and distributed memory systems. A Shared Multi-Processor (SMP) system typically includes multiple processors sharing a common memory system.
In a distributed memory system, a cluster is defined by multiple nodes that communicate with each other using a high speed interconnect. A node typically includes a collection of cores or processors that share a single address space. Each node has its own CPU, memory, operating system, and I/O subsystem (for example, a computer box with one or multiple processors or cores is a node). In a distributed memory system, a master node is typically assigned, which is configured to divide work between several slave nodes communicatively connected to the master node. The slave nodes work on their respective tasks and communicate among themselves if there is any need to do so. The slave nodes then return their results to the master node. The master node assembles the results and distributes further work.
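The master/slave division of work described above can be sketched on a single machine as follows. This is a minimal illustration only: worker threads stand in for the slave nodes, and the names split_work, worker_task, and run_master, as well as the sum-of-squares task, are hypothetical and chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_work(items, num_workers):
    """Master step: divide the items into num_workers nearly equal chunks."""
    chunk, extra = divmod(len(items), num_workers)
    chunks, start = [], 0
    for i in range(num_workers):
        # The first `extra` chunks each take one additional item.
        end = start + chunk + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def worker_task(chunk):
    """Worker step: compute a partial result (here, a sum of squares)."""
    return sum(x * x for x in chunk)


def run_master(items, num_workers=4):
    """Distribute chunks to the workers, then assemble the partial results."""
    chunks = split_work(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_task, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(run_master(list(range(10))))
```

In an actual cluster, the chunks and partial results would travel over the interconnect between nodes rather than between threads in one address space.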
In high performance computing, there are multiple programming models, including a single program multiple data (SPMD) model and a multiple program multiple data (MPMD) model. In an SPMD model, a single program is run on multiple processors with different data. In an MPMD model, different programs are run on different processors and different tasks may use different data.
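The difference between the two models can be illustrated with a short sketch. It assumes threads in place of processors, and the helper names run_spmd, run_mpmd, scale, and total are hypothetical, introduced only for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def scale(values):
    """Example program: double every value."""
    return [2 * v for v in values]


def total(values):
    """Example program: add up the values."""
    return sum(values)


def run_spmd(program, partitions):
    """SPMD: the same program runs on every partition of the data."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(program, partitions))


def run_mpmd(programs, partitions):
    """MPMD: each task pairs its own program with its own data."""
    with ThreadPoolExecutor(max_workers=len(programs)) as pool:
        futures = [pool.submit(p, d) for p, d in zip(programs, partitions)]
        return [f.result() for f in futures]


if __name__ == "__main__":
    data = [[1, 2], [3, 4]]
    print(run_spmd(total, data))           # one program, two data partitions
    print(run_mpmd([scale, total], data))  # two programs, two data partitions
```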
For SPMD, in order to have an executable program run on multiple CPUs, a protocol or interface is required to obtain parallelism. Methods to obtain parallelism include automatic parallelization (auto-parallel), requiring no source code modification, open multi-processing (OpenMP), requiring slight source code modification, or a message passing system such as Message Passing Interface (MPI), a standard requiring extensive source code modification. Hybrids such as auto-parallel and MPI or OpenMP and MPI are also possible.
MPI is a language-independent communications protocol used to program high performance computing applications and is ubiquitous in HPC environments. MPI has become a de facto standard for communication among processes that model a parallel program running on a distributed memory system. Most MPI implementations consist of a specific set (library) of routines (API) that can be called from Fortran, C, C++, or from any other language capable of interfacing with such routine libraries.
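The general idea of message passing among ranked processes can be sketched as follows. This is not MPI and does not use any MPI library: threads and in-memory queues stand in for processes and the interconnect, and run_ranks, sum_to_root, send, and recv are hypothetical names chosen for this illustration.

```python
import queue
import threading


def run_ranks(size, program):
    """Start `size` threads, each running program(rank, size, send, recv)."""
    mailboxes = [queue.Queue() for _ in range(size)]  # one inbox per rank
    results = [None] * size

    def send(dest, message):
        # Deliver a message to the inbox of rank `dest`.
        mailboxes[dest].put(message)

    def make_recv(rank):
        # Block until a message arrives in this rank's inbox (5 s timeout).
        return lambda: mailboxes[rank].get(timeout=5)

    def target(rank):
        results[rank] = program(rank, size, send, make_recv(rank))

    threads = [threading.Thread(target=target, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def sum_to_root(rank, size, send, recv):
    """Every rank contributes a value; rank 0 receives and sums them."""
    local_value = rank + 1
    if rank == 0:
        return local_value + sum(recv() for _ in range(size - 1))
    send(0, local_value)
    return None


if __name__ == "__main__":
    print(run_ranks(4, sum_to_root))
```

Every rank runs the same program and branches on its own rank number, which is the usual shape of a message passing program written in the SPMD style.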
The assignee of the present application is an implementer of the MPI standard. Also, an implementation known as MPICH is available from the Argonne National Laboratory. The Argonne National Laboratory has continued developing MPICH, and now offers MPICH 2, which is an implementation of the MPI standard. Specifics regarding MPI can easily be learned by reviewing readily available information about MPI.
Most power management techniques currently focus on reducing the compute capacity of a system or group of systems in a cluster to save or limit total power usage. Saving power or energy in idle or underutilized systems is a well-known technique.