1. Field of the Invention
The present invention generally relates to data processing. More specifically, the present invention relates to a process for parallel application load balancing and distributed work management in parallel computer systems.
2. Description of the Related Art
One approach to developing very powerful computer systems is to design highly parallel systems where the processing activity of thousands of processors may be coordinated to perform computing tasks. These systems have proved to be highly useful for a broad variety of applications, including financial modeling, hydrodynamics, quantum chemistry, astronomy, weather modeling and prediction, geological modeling, and prime number factoring, to name but a few examples.
One family of parallel computing systems has been (and continues to be) developed by International Business Machines (IBM) under the name Blue Gene. The Blue Gene/L system is a scalable system, and currently Blue Gene/L systems have been configured with as many as 65,536 (2^16) compute nodes. Each compute node includes a single application specific integrated circuit (ASIC) with two CPUs and memory. The Blue Gene architecture has been extremely successful and on Oct. 27, 2005, IBM announced that a Blue Gene/L system had reached an operational speed of 280.6 teraflops (280.6 trillion floating-point operations per second), making it the fastest computer in the world at that time. Further, as of June 2005, Blue Gene/L installations at various sites world-wide accounted for 5 of the 10 most powerful computers in the world.
IBM is currently developing a successor to the Blue Gene/L system, named Blue Gene/P. Blue Gene/P is expected to be the first computer system to operate at a sustained 1 petaflops (1 quadrillion floating-point operations per second). Like the Blue Gene/L system, the Blue Gene/P system is scalable, with a planned system having 73,728 compute nodes. Each Blue Gene/P node includes a single application specific integrated circuit (ASIC) with four CPUs and memory. A complete Blue Gene/P system is projected to include 72 racks with 32 node boards per rack.
In addition to the Blue Gene architecture developed by IBM, other computer systems may have similar architectures or otherwise provide a parallel architecture using hundreds, thousands or even hundreds of thousands of processors. Other examples of parallel computing systems include clustered systems and grid based systems. For example, the Beowulf cluster is one well known clustering architecture. A Beowulf cluster is a group of computer systems each running a Unix-like operating system (typically a version of the Linux® or BSD operating systems). Nodes of the cluster are connected over high speed networks and have libraries and programs installed which allow processing to be shared among the nodes. Essentially, the processing power of multiple commodity computer systems is chained together to function cooperatively. Libraries such as the Message Passing Interface (MPI) library may be used for node-to-node communications. MPI provides a standard library for communication among the nodes running a parallel program on a distributed memory system. MPI implementations consist of a library of routines that can be called from Fortran, C, C++ and Ada programs. Further, computer systems are available that provide support for symmetric multiprocessing (SMP) using multiple CPUs in a single system, and single CPUs are available with multiple processing cores.
Each of these architectures allows for parallel computing. Generally, parallel computing refers to a process of executing a single task on multiple processors to obtain results more quickly. Parallel computing techniques typically solve a large problem by dividing it into smaller tasks which may be executed simultaneously in a coordinated manner. For example, a common design pattern encountered in parallel computing problems is performing essentially the same calculations or operations for different data sets or work units. For these types of applications, a master node may divide a problem into individual work units and distribute the work units to a collection of worker nodes. Each worker node then performs the appropriate operations on the work units assigned to that node. Because tens, hundreds, or thousands of nodes are performing the same calculations (on different data sets), extremely large data sets may be processed in a relatively short period of time. Many software programs have been developed that use this master/worker paradigm, whether used in supercomputing applications developed for a Blue Gene or similar system, or for applications developed for clusters, multi-processor SMP systems or multi-core processors.
The idea behind the master/worker design pattern is that one node is designated as the “master” node, and other nodes are designated as workers. The master generates work units and distributes them to the worker pool. In turn, an available (or selected) worker node consumes each work unit. Depending on the workload, there are several strategies for workload distribution. Among the most common are the round-robin and next-available strategies.
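The two distribution strategies named above can be illustrated with a minimal sketch. It is a simulation under stated assumptions, not an implementation from any particular system: each work unit is represented only by a hypothetical processing cost, and the function names round_robin, next_available and makespan are illustrative.

```python
import heapq
from itertools import cycle


def round_robin(work_units, num_workers):
    """Assign work units to workers in a fixed rotation."""
    assignments = [[] for _ in range(num_workers)]
    for worker, unit in zip(cycle(range(num_workers)), work_units):
        assignments[worker].append(unit)
    return assignments


def next_available(work_units, num_workers):
    """Assign each work unit to the worker that becomes free first.

    Assumption: each work unit is a number giving its processing cost.
    """
    # Heap of (time at which the worker becomes free, worker index).
    free_at = [(0, w) for w in range(num_workers)]
    heapq.heapify(free_at)
    assignments = [[] for _ in range(num_workers)]
    for cost in work_units:
        t, w = heapq.heappop(free_at)
        assignments[w].append(cost)
        heapq.heappush(free_at, (t + cost, w))
    return assignments


def makespan(assignments):
    """Time at which the most heavily loaded worker finishes."""
    return max(sum(units) for units in assignments)


# Hypothetical workload: two expensive units among several cheap ones.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
print(makespan(round_robin(costs, 2)))     # 18
print(makespan(next_available(costs, 2)))  # 11
```

In this example the round-robin strategy sends both expensive units to the same worker, while the next-available strategy balances the load because it reacts to how long each unit actually takes.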
The master/worker approach is an excellent technique for developing programs to run in a parallel environment. However, this approach does not scale well when the master node must coordinate and manage the activity of large numbers of worker nodes. Depending on the workload, the generation of work units by the master can easily become a bottleneck in completing a computing task, as many workers may sit idle waiting for work units to be assigned or made available. For example, depending on the problem, the master node may take a long time to generate a work unit relative to the time it takes a worker node to process one. In this case, a high master-to-worker ratio is required. At the worker end of the master/worker paradigm, when the time required to consume a work unit is very small, the overhead of producing an adequate supply of work units can become a bottleneck on overall system performance. In this case, a high master-to-worker ratio is also required. However, the nodes used as master nodes may be unavailable for work unit consumption, leading to system inefficiency.
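The relationship between generation time, consumption time and the required master-to-worker ratio can be made concrete with a simple steady-state throughput model. The model and the function names below are assumptions made for illustration: each master is assumed to produce one work unit every generate_time, each worker to consume one every consume_time, and communication cost is ignored.

```python
import math


def worker_utilization(num_workers, generate_time, consume_time, num_masters=1):
    """Fraction of time workers are busy in the steady state.

    Assumed model: supply and demand are compared as rates, and
    workers can never be more than fully busy.
    """
    supply_rate = num_masters / generate_time   # units produced per time unit
    demand_rate = num_workers / consume_time    # units workers could consume
    return min(1.0, supply_rate / demand_rate)


def masters_needed(num_workers, generate_time, consume_time):
    """Smallest number of masters that keeps every worker busy."""
    return math.ceil(num_workers * generate_time / consume_time)


# Hypothetical figures: 1000 workers, 1 time unit to generate a work
# unit, 100 time units to consume one.
print(worker_utilization(1000, 1.0, 100.0))                   # 0.1
print(masters_needed(1000, 1.0, 100.0))                       # 10
print(worker_utilization(1000, 1.0, 100.0, num_masters=10))   # 1.0
```

Under these assumed figures a single master keeps workers busy only one tenth of the time, and ten nodes would have to be set aside as masters to saturate the worker pool.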
Further, if the time required for a node to process a work unit varies, there can be a skew in the finishing times of the workers. Near the end of a computing task, some nodes may be idle while others are still consuming work units. Given the number of nodes in highly parallel supercomputers or large clusters and grids, even small amounts of idle time on individual nodes often add up to large amounts of lost time for the system as a whole. One approach in such a situation is to divide the work units into smaller chunks so they are more evenly distributed. However, this division puts more stress on the master node, which, as described, leads to bottlenecks in system performance. Due to these factors, the master/worker paradigm can lead to poor use of resources in some cases.
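The trade-off between finishing-time skew and master load can be sketched with a small simulation. It assumes a next-available assignment policy and hypothetical integer processing costs; the function names finish_times, idle_time and split_units are illustrative only.

```python
import heapq


def finish_times(costs, num_workers):
    """Finishing time of each worker under next-available assignment."""
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    done = [0] * num_workers
    for cost in costs:
        t, w = heapq.heappop(heap)   # worker that becomes free first
        done[w] = t + cost
        heapq.heappush(heap, (done[w], w))
    return done


def idle_time(done):
    """Total time workers spend waiting for the last worker to finish."""
    end = max(done)
    return sum(end - t for t in done)


def split_units(costs, parts):
    """Divide every work unit into `parts` equal chunks.

    Assumption: each cost is an integer divisible by `parts`.
    """
    return [cost // parts for cost in costs for _ in range(parts)]


# Hypothetical workload with variable unit costs, run on 3 workers.
coarse = [12, 3, 6, 3]
fine = split_units(coarse, 3)

print(finish_times(coarse, 3), idle_time(finish_times(coarse, 3)))  # [12, 6, 6] 12
print(finish_times(fine, 3), idle_time(finish_times(fine, 3)))      # [8, 8, 8] 0
print(len(coarse), len(fine))                                       # 4 12
```

In this example the finer division removes the idle time at the end of the task, but the master must generate and distribute three times as many work units.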
Accordingly, there is a need in the art for techniques for parallel application load balancing and distributed work management in a parallelized computer system.