High performance computing (HPC) involves the use of parallel supercomputers and/or computer clusters. A computer cluster is a computing system that consists of multiple (usually mass-produced) processors linked together forming a single system.
Parallel computing typically refers to the simultaneous use of multiple computer resources to solve a computational problem. The multiple computer resources could be a single computer with multiple processors, an arbitrary number of computers or nodes connected via a network, or a combination thereof.
Parallel computing can reduce time to solution and makes it practical to solve larger problems. Parallel computing is currently used in a number of industry segments, including, for example, the energy industry (e.g., for seismic analysis and reservoir analysis), the financial industry (e.g., for derivative analysis, actuarial analysis, asset liability management, portfolio risk analysis, and statistical analysis), manufacturing (e.g., for mechanical or electric design, process simulation, finite element analysis, and failure analysis), life sciences (e.g., for drug discovery, protein folding, and medical imaging), media (e.g., for bandwidth consumption analysis, digital rendering, and gaming), and government (e.g., for collaborative research, weather analysis, and high energy physics). Uses of such parallel computing in other areas are of course possible.
In high performance computing, multiple types of parallel computer architectures exist, including, for example, shared memory multiprocessor systems and distributed memory systems. A shared memory multiprocessor (SMP) system typically includes multiple processors sharing a common memory system.
In a distributed memory system, a cluster is defined by multiple nodes that communicate with each other using a high speed interconnect. A node typically includes a collection of cores or processors that share a single address space. Each node has its own CPU, memory, operating system, and I/O subsystem (e.g., a computer box with one or multiple processors or cores is a node). In a distributed memory system, a master node is typically assigned, which is configured to divide work among several slave nodes communicatively connected to the master node. The slave nodes work on their respective tasks and communicate among themselves when needed. The slave nodes then return their results to the master node. The master node assembles the results and distributes further work.
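The divide, compute, and assemble cycle described above can be sketched in plain Python. This is only an illustration under stated assumptions: threads stand in for the slave nodes, the sum-of-squares task is a placeholder computation, and the names split_work, worker_task, and master are hypothetical rather than part of any cluster software.

```python
from concurrent.futures import ThreadPoolExecutor


def split_work(items, n_workers):
    """Master step: divide items into n_workers contiguous, near-equal chunks."""
    base, extra = divmod(len(items), n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        # The first `extra` workers each take one additional item.
        end = start + base + (1 if w < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def worker_task(chunk):
    """Worker step: placeholder computation (sum of squares) on one chunk."""
    return sum(x * x for x in chunk)


def master(items, n_workers):
    """Distribute chunks to workers, then assemble the partial results."""
    chunks = split_work(items, n_workers)
    # Assumption: threads stand in for separate nodes on an interconnect.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_task, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(master(list(range(10)), 3))
```

On a real cluster each chunk would travel over the interconnect to a separate node; here the chunks are simply handed to threads in one process.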
An SMP system is more expensive and less scalable than a Massively Parallel Processor (MPP) system, which is built on distributed memory. However, programming is easier in an SMP system because all data is available to all processors.
A disadvantage of a distributed memory system is that each node has direct access only to its own memory. A further disadvantage is that data structures must be duplicated and sent over the network if other nodes need access to them, which adds network traffic and communication overhead.
In high performance computing, there are multiple programming models, including a single program multiple data (SPMD) model and a multiple program multiple data (MPMD) model. In an SPMD model, a single program is run on multiple processors with different data. In an MPMD model, different programs are run on different processors, and different tasks may use different data.
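The difference between the two models can be sketched as follows. This is a sequential plain-Python simulation, assuming a simple loop in place of real processors; the names spmd_program, run_spmd, producer, consumer, and run_mpmd are hypothetical and chosen only for illustration.

```python
def spmd_program(rank, size, data):
    """SPMD: the same program on every rank; the rank selects the data."""
    my_data = data[rank::size]  # every size-th element, starting at rank
    return sum(my_data)


def run_spmd(size, data):
    """Run the single program once per rank (simulated sequentially)."""
    return [spmd_program(rank, size, data) for rank in range(size)]


def producer(data):
    """MPMD program A: doubles each value."""
    return [x * 2 for x in data]


def consumer(data):
    """MPMD program B: finds the maximum value."""
    return max(data)


def run_mpmd(data_a, data_b):
    """MPMD: different programs, each paired with its own data."""
    programs = [(producer, data_a), (consumer, data_b)]
    return [program(data) for program, data in programs]


if __name__ == "__main__":
    print(run_spmd(2, [1, 2, 3, 4, 5]))
    print(run_mpmd([1, 2], [3, 7, 5]))
```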
For SPMD, in order to have an executable program run on multiple CPUs, a protocol or interface is required to obtain parallelism. Methods to obtain parallelism include automatic parallelization (auto-parallel), requiring no source code modification, open multi-processing (OpenMP), requiring slight source code modification, or a message passing system such as Message Passing Interface (MPI), a standard requiring extensive source code modification. Hybrids such as auto-parallel and MPI or OpenMP and MPI are also possible.
Two versions of the MPI standard are currently popular: Version 1.2 (MPI-1), and Version 2.1 (MPI-2). MPI has become a de facto standard for communication among processes that model a parallel program running on a distributed memory system. Most MPI implementations consist of a specific set (library) of routines (API) that can be called from Fortran, C, C++, or from any other language capable of interfacing with such routine libraries.
The assignee of the present application is an implementer of the MPI standard. Also, an implementation known as MPICH is available from the Argonne National Laboratory's website www.anl.gov. Argonne National Laboratory has continued developing MPICH, and now offers MPICH 2, which is an implementation of the MPI standard.
An example of an MPI call is int MPI_Init(int *argc, char ***argv), which initializes MPI and is typically the first MPI routine called.
Different processes have ID numbers known as ranks. Ranks are used to identify the source and destination of a message, as well as to allow different processes to execute different code simultaneously. A rank is a number ranging from 0 to size-1 (where size is the total number of processes), which identifies a process uniquely. Each running process in an MPI application obtains its rank at runtime through the MPI call MPI_Comm_rank( ). The ranks remain unchanged throughout the lifetime of the MPI application.
Point-to-point communication is communication between two processes. A source process sends a message to a destination process, and the destination process receives the message. Communication takes place within a communicator. The destination process is identified by its rank within the communicator. MPI's send calls include MPI_Send (Standard), which lets MPI decide whether outgoing messages will be buffered; MPI_Bsend (Buffered), which can be started whether or not a matching receive has been posted and which may complete before a matching receive has been posted; MPI_Ssend (Synchronous), which can be started whether or not a matching receive has been posted and which will complete successfully only if a matching receive is posted; and MPI_Rsend (Ready), which completes immediately and which can be started only if the matching receive has already been posted.
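The rank-addressed exchange described above can be sketched with a toy model. This plain-Python example does not call MPI: threads stand in for processes, and the Communicator class, its send and recv methods, and run_ping are hypothetical names. The send shown returns as soon as the message is queued, which loosely resembles a buffered send; the other send modes are not modeled.

```python
import queue
import threading


class Communicator:
    """Toy communicator: one FIFO mailbox per rank (not the MPI API)."""

    def __init__(self, size):
        self.size = size
        self._mailboxes = [queue.Queue() for _ in range(size)]

    def send(self, source, dest, message):
        # Returns once the message is queued, before any receive is posted.
        self._mailboxes[dest].put((source, message))

    def recv(self, rank, timeout=5.0):
        # Blocks until a message addressed to `rank` arrives (or times out).
        return self._mailboxes[rank].get(timeout=timeout)


def run_ping(size=2):
    """Rank 0 sends one message to rank 1; returns what rank 1 received."""
    comm = Communicator(size)
    received = {}

    def process(rank):
        # Same code on every rank; the rank decides which branch executes.
        if rank == 0:
            comm.send(source=0, dest=1, message="hello from rank 0")
        elif rank == 1:
            received[rank] = comm.recv(rank)

    threads = [threading.Thread(target=process, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    print(run_ping())
```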
MPI_Bcast is an MPI call by which a selected process broadcasts a message to all other processes in the communicator. MPI_Scatter( ) distributes an array among the processes. The source is an array on the sending process. Each receiver, including the sender, gets the piece of the array corresponding to its rank in the communicator.
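The data movement performed by these two collective calls can be illustrated with lists indexed by rank. This is a plain-Python sketch of the semantics only, not of the MPI routines themselves; the function names bcast and scatter and their parameters are hypothetical, and the sketch assumes the array length is divisible by the number of processes so that every rank receives an equal piece.

```python
def bcast(buffers, root):
    """Return per-rank buffers after every rank receives the root's value.

    buffers[r] is the value held by rank r before the broadcast.
    """
    return [buffers[root] for _ in buffers]


def scatter(array, size):
    """Split the root's array into `size` equal pieces, one per rank.

    The piece at index r of the result is what rank r would receive;
    the sender's own rank gets a piece too.
    """
    if len(array) % size != 0:
        raise ValueError("array length must be divisible by size")
    chunk = len(array) // size
    return [array[r * chunk:(r + 1) * chunk] for r in range(size)]


if __name__ == "__main__":
    print(bcast([None, 42, None], root=1))
    print(scatter([10, 20, 30, 40, 50, 60], 3))
```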
These are just a few of the many function calls available in MPI. Others can easily be learned by reviewing readily available information about MPI.