1. Field
This disclosure relates generally to a high performance computing cluster and, more specifically, to techniques for dynamically assigning jobs to processors in a high performance computing cluster.
2. Related Art
The term high performance computing (HPC) has typically been used to refer to a parallel computing system that includes multiple processors linked together with commercially available interconnects. Usually, computing systems that operate at or above the teraflops (10^12 floating point operations/second) region are considered HPC systems. HPC systems increasingly dominate the world of supercomputing due to their flexibility, power, and relatively low cost. HPC has commonly been associated with scientific research and engineering applications. Recently, HPC has been applied to business uses of cluster-based supercomputers, e.g., data warehouses, line-of-business (LOB) applications, and transaction processing. A computer cluster is a group of loosely coupled computers that work closely together. The components of a computer cluster are frequently connected to each other through fast local area networks (LANs). Computer clusters are usually deployed to improve performance and/or availability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed and/or availability.
A number of commercially available software applications are known that perform job scheduling for computer systems. For example, Portable Batch System™ is a software application that performs job scheduling. A primary task of Portable Batch System™ is to allocate batch jobs among available computing resources. Portable Batch System™ is supported as a job scheduler mechanism by several meta schedulers, which are designed to optimize computational workloads by combining multiple distributed resource managers into a single aggregated manager, allowing batch jobs to be directed to a best location for execution. As another example, LoadLeveler™ is a software application that performs job scheduling for batch jobs, while attempting to match job requirements with a best available computer resource for execution. As yet another example, Load Sharing Facility™ is a software application that performs job scheduling.
Message passing interface (MPI), which has been employed in computer clusters, is an application programmer interface (API) that facilitates communication between processors of a computer cluster. MPI includes point-to-point message passing and collective (global) operations, which may be directed to a user-specified group of processes. MPI has become the de facto standard for communication among processes that model a parallel program running on a distributed memory system. MPI provides a communication library that enables parallel programs to be written in various programming languages, e.g., C, C++, Fortran, etc. The advantages of MPI over older message passing libraries are portability (due to the fact that MPI has been implemented for almost every distributed memory architecture) and speed (as each implementation is in principle optimized for the hardware on which it runs).
There are two versions of MPI that are currently popular: MPI-1, which emphasizes message passing and employs a static runtime environment; and MPI-2, which also includes features such as parallel input/output (I/O), dynamic process management, and remote memory operations. MPI is often compared with parallel virtual machine (PVM), which is a legacy message passing system that provided motivation for standard parallel message passing systems such as MPI. PVM is an open source software application that employs transmission control protocol/internet protocol (TCP/IP) network communications to create a virtual supercomputer (i.e., an HPC cluster) using TCP/IP connected computer systems.
The MPI interface is designed to provide virtual topology, synchronization, and communication functionality between a set of processes (that have been mapped to processors) in a language independent way, with language specific syntax (bindings). Each process may be mapped to a different processor as part of a mapping activity, which usually occurs at runtime, through an agent that starts the MPI API. MPI facilitates point-to-point rendezvous-type send/receive operations, choosing between a Cartesian or graph-like logical process topology, exchanging data between process pairs (send/receive operations), combining partial results of computations (gathering and reduction operations), synchronizing processor nodes (barrier operations), as well as obtaining network-related information such as the number of processes in a computing session, current processor identity to which a process is mapped, neighboring processes accessible in a logical topology, etc.
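The Cartesian logical topology mentioned above can be illustrated with a minimal sketch. The code below does not call any MPI library; it only models, in plain Python, how a rank may be converted to grid coordinates in row-major order and how neighboring ranks along one axis may be found. The function names cart_coords, cart_rank, and cart_shift are hypothetical helpers chosen for this illustration, and None stands in for the null-process value that an MPI implementation would return at a non-periodic boundary.

```python
from typing import List, Optional, Tuple


def cart_coords(rank: int, dims: List[int]) -> Tuple[int, ...]:
    """Convert a rank to grid coordinates, assuming row-major ordering."""
    coords = []
    for extent in reversed(dims):
        coords.append(rank % extent)
        rank //= extent
    return tuple(reversed(coords))


def cart_rank(coords: Tuple[int, ...], dims: List[int]) -> int:
    """Convert grid coordinates back to a rank (inverse of cart_coords)."""
    rank = 0
    for coord, extent in zip(coords, dims):
        rank = rank * extent + coord
    return rank


def cart_shift(rank: int, dims: List[int], periodic: List[bool],
               axis: int, disp: int = 1) -> Tuple[Optional[int], Optional[int]]:
    """Return (source, destination) neighbor ranks along one axis.

    None models "no neighbor" at the edge of a non-periodic axis.
    """
    coords = list(cart_coords(rank, dims))

    def neighbor(offset: int) -> Optional[int]:
        c = coords[axis] + offset
        if periodic[axis]:
            c %= dims[axis]           # wrap around on a periodic axis
        elif not 0 <= c < dims[axis]:
            return None               # fell off a non-periodic edge
        moved = coords.copy()
        moved[axis] = c
        return cart_rank(tuple(moved), dims)

    return neighbor(-disp), neighbor(disp)


if __name__ == "__main__":
    dims = [2, 3]                     # a 2 x 3 process grid
    periodic = [False, True]          # only the second axis wraps around
    for r in range(6):
        print(r, cart_coords(r, dims), cart_shift(r, dims, periodic, axis=1))
```

In this sketch, a process at rank 4 of a 2 x 3 grid sits at coordinates (1, 1), and its neighbors along the second axis are ranks 3 and 5.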
MPI also specifies thread safe interfaces, which have cohesion and coupling strategies that usually avoid manipulation of unsafe hidden states within the interface. Multi-threaded collective communication may be accomplished by using multiple copies of communicators, which are groups of processes in an MPI session. In general, the groups of processes each have rank order and their own virtual communication fabric for point-to-point operations. Communicators also have independent communication addressability for collective communication. MPI groups are mainly utilized to organize and reorganize subsets of processes before another communicator is made. MPI facilitates single group intra-communicator operations, as well as bi-partite (two-group) inter-communicator operations. In MPI-1, single group operations are most prevalent. In MPI-2, bi-partite operations are more widely employed to facilitate collective communication and dynamic process management.
Communicators can be partitioned using several commands in MPI. These commands include a graph-coloring-type algorithm (MPI_COMM_SPLIT), which is commonly used to derive topological and other logical subgroupings in an efficient way. A number of important functions in the MPI API involve communication between two specific processes. For example, an MPI_Send interface allows one specified process to send a message to a second specified process. Point-to-point operations are particularly useful in master-slave program architectures, where a master node might be responsible for managing data-flow of a collection of slave nodes. Typically, the master node sends specific batches of instructions or data to each slave node, and possibly merges results upon completion. Collective functions in the MPI API involve communication between all processes in a process group (which may include an entire process pool or a program-defined subset).
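The partitioning of a communicator by color can be sketched as follows. This is a plain Python model of the grouping rule, not a call into an MPI library: each process supplies a color and a key, processes with the same color land in the same subgroup, and ranks within a subgroup are ordered by key with ties broken by the old rank. The function name comm_split and the use of None in place of an "undefined" color are assumptions made for this illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def comm_split(requests: Dict[int, Tuple[Optional[int], int]]) -> Dict[int, List[int]]:
    """Model a color/key split of one communicator.

    requests maps each old rank to its (color, key) pair.
    The result maps each color to the old ranks of its members,
    listed in new-rank order (index 0 is new rank 0, and so on).
    """
    groups: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for old_rank, (color, key) in requests.items():
        if color is None:
            continue                  # process opts out of every subgroup
        groups[color].append((key, old_rank))
    # Sorting (key, old_rank) tuples orders by key, then by old rank on ties.
    return {color: [old for _, old in sorted(members)]
            for color, members in groups.items()}


if __name__ == "__main__":
    # Six processes split into two rows of a hypothetical 2 x 3 grid.
    by_row = comm_split({rank: (rank // 3, rank) for rank in range(6)})
    print(by_row)
```

With six processes and color chosen as rank // 3, the sketch yields one subgroup holding old ranks 0 to 2 and another holding old ranks 3 to 5.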
An MPI_Bcast call (MPI broadcast) takes data from one specially identified node and sends that message to all processes in a process group. A reverse operation is the MPI_Reduce call, which is designed to take data from all processes in a group, perform a user-chosen operation (like summing), and store the results on one individual node. This MPI_Reduce call is also useful in master-slave architectures, where a master node may want to sum results from all slave nodes to arrive at a final result.
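The broadcast and reduce pattern in a master-slave arrangement can be modeled with a short sketch. The code below uses Python threads and queues as stand-ins for processes and messages; it does not use MPI, and the names run_master_worker, inboxes, and results are hypothetical. The master broadcasts one value to every worker, sends each worker its own batch of data, and then combines one partial result per worker with a user-chosen operation (summing by default).

```python
import operator
import queue
import threading
from functools import reduce
from typing import Callable, List


def run_master_worker(data: List[int], num_workers: int, scale: int,
                      op: Callable[[int, int], int] = operator.add) -> int:
    """Broadcast `scale`, distribute `data`, and reduce the partial results."""
    inboxes = [queue.Queue() for _ in range(num_workers)]  # master -> worker
    results: queue.Queue = queue.Queue()                   # worker -> master

    def worker(rank: int) -> None:
        factor = inboxes[rank].get()   # value broadcast by the master
        chunk = inboxes[rank].get()    # this worker's batch of data
        results.put(sum(factor * x for x in chunk))

    threads = [threading.Thread(target=worker, args=(rank,))
               for rank in range(num_workers)]
    for thread in threads:
        thread.start()

    for rank, inbox in enumerate(inboxes):
        inbox.put(scale)                     # broadcast: same value to all
        inbox.put(data[rank::num_workers])   # point-to-point: distinct batch

    for thread in threads:
        thread.join()

    # Reduce: combine exactly one contribution per worker on the master.
    partials = [results.get() for _ in range(num_workers)]
    return reduce(op, partials)


if __name__ == "__main__":
    print(run_master_worker(list(range(10)), num_workers=3, scale=2))
```

Because the partial results may arrive in any order, the sketch assumes the chosen operation is commutative and associative, as summing and taking a maximum are.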
Researchers have proposed implementing MPI directly in the hardware of a system by building MPI operations into the micro-circuitry of random access memory (RAM) chips in each node. Another approach has proposed adding hardware acceleration to one or more parts of an MPI operation. For example, MPI queues may be processed with hardware, or remote direct memory access (RDMA) may be employed to directly transfer data between memory and a network interface without processor or kernel intervention. Many MPI implementations allow multiple, different executables to be started in the same MPI job. The processes of a job may be mapped to a single physical processor, to N physical processors, where N is the total number of processors available, or to some number in between. More processors are used to maximize the potential for parallel speedup, but the ability to separate the mapping from program design is valuable for development, as well as for practical situations where resources are limited.
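The separation between program design and process-to-processor mapping can be illustrated with a minimal sketch. The function map_processes and the policy names "cyclic" and "block" are assumptions introduced for this example; actual MPI launchers provide their own placement options. The same set of process ranks is placed on one processor, on all available processors, or on any number in between, without changing the program itself.

```python
from typing import Dict, List


def map_processes(num_processes: int, num_processors: int,
                  policy: str = "cyclic") -> Dict[int, List[int]]:
    """Assign process ranks to processors under a simple placement policy.

    Returns a mapping from processor index to the ranks placed on it.
    """
    if num_processes < 0 or num_processors < 1:
        raise ValueError("need num_processes >= 0 and num_processors >= 1")

    mapping: Dict[int, List[int]] = {p: [] for p in range(num_processors)}
    for rank in range(num_processes):
        if policy == "cyclic":
            # Deal ranks out one at a time, like cards.
            processor = rank % num_processors
        elif policy == "block":
            # Keep consecutive ranks together on the same processor.
            processor = rank * num_processors // num_processes
        else:
            raise ValueError(f"unknown policy: {policy!r}")
        mapping[processor].append(rank)
    return mapping


if __name__ == "__main__":
    for processors in (1, 3, 8):
        print(processors, map_processes(8, processors, policy="block"))
```

Running the sketch with eight processes shows all of them sharing one processor, being spread over three processors, or each receiving a processor of its own.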
New architectures are being developed that have greater internal concurrency (multi-core), better fine-grain concurrency control (multi-threading), and more levels of memory hierarchy. This has resulted in separate complementary standards for symmetric multiprocessors (SMPs), e.g., OpenMP™. In general, the MPI standard provides little guidance on how multi-threaded programs should be written. While multi-threaded capable MPI implementations exist, multi-threaded message passing applications are somewhat limited.