1. Technical Field
The disclosure and claims herein generally relate to hybrid-architecture, multi-node computer systems, and more specifically relate to using accelerators for checkpointing in a hybrid architecture system.
2. Background Art
Supercomputers and other multi-node computer systems continue to be developed to tackle sophisticated computing jobs. One type of multi-node computer system being developed is a High Performance Computing (HPC) cluster. A HPC cluster is a scalable performance cluster based on commodity hardware, on a private system network, with open source software (Linux) infrastructure. The system is scalable to improve performance proportionally with added machines. The commodity hardware can be any of a number of mass-market, stand-alone compute nodes, as simple as two networked computers each running Linux and sharing a file system or as complex as 1024 nodes with a high-speed, low-latency network. One type of HPC cluster is a Beowulf cluster. However, the HPC clusters as described herein have considerably more power than what was originally considered in the Beowulf concept.
A HPC cluster is being developed by International Business Machines Corporation (IBM) for Los Alamos National Laboratory and the US Department of Energy under the name Roadrunner Project, which is named after the New Mexico state bird. (References to the IBM HPC cluster herein refer to this supercomputer.) In the IBM HPC cluster, chips originally designed for video game platforms work in conjunction with systems based on x86 processors from Advanced Micro Devices, Inc. (AMD). IBM System X™ 3755 servers based on AMD Opteron™ technology are deployed in conjunction with IBM BladeCenter® H systems with Cell Enhanced Double precision (Cell EDP) technology. Designed specifically to handle a broad spectrum of scientific and commercial applications, this HPC cluster design includes new, highly sophisticated software to orchestrate over 13,000 AMD Opteron™ processor cores and over 25,000 Cell EDP processor cores. The supercomputer will employ advanced cooling and power management technologies and will occupy only 12,000 square feet of floor space.
Computer systems such as the IBM HPC have a hybrid architecture. A hybrid architecture consists of a cluster of homogeneous processing elements, each of which may have multiple accelerator elements of a different processing architecture available to it for use. These accelerator elements may be specialized units designed for specific problems or more general purpose processors. In the IBM HPC, each hybrid node has a host processor and two accelerators. Each node has a hierarchical communication network in which the homogeneous element, often called the host element or host processor, serves as the root of the communication tree. In the IBM HPC, the host processor is an Advanced Micro Devices (AMD) Opteron™ processor core and the two accelerators are Cell EDP processor cores.
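By way of illustration only, the hybrid node organization described above may be modeled as a small data structure. The following Python sketch is not taken from the IBM HPC software. The names Accelerator, HybridNode, communication_tree and build_cluster are hypothetical, and the architecture labels are plain strings chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Accelerator:
    """An accelerator element attached to a host (hypothetical model)."""
    name: str
    architecture: str


@dataclass
class HybridNode:
    """One hybrid node: a host element plus its accelerator elements."""
    node_id: int
    host_architecture: str
    accelerators: List[Accelerator] = field(default_factory=list)

    def communication_tree(self) -> Tuple[str, List[str]]:
        # The host element is the root of the node's communication tree;
        # the accelerators are its children.
        root = f"node{self.node_id}/host"
        children = [f"node{self.node_id}/{a.name}" for a in self.accelerators]
        return root, children


def build_cluster(node_count: int, accelerators_per_node: int = 2) -> List[HybridNode]:
    """Build a cluster of identical hosts, each with its own accelerators.

    The default of two accelerators per node mirrors the node layout
    described in the text; the architecture strings are illustrative.
    """
    cluster = []
    for node_id in range(node_count):
        accels = [
            Accelerator(name=f"accel{i}", architecture="accelerator-arch")
            for i in range(accelerators_per_node)
        ]
        cluster.append(HybridNode(node_id, "host-arch", accels))
    return cluster
```

In this sketch, build_cluster(4) yields four nodes whose hosts share one architecture, while each node's accelerators carry a different architecture label and are reachable only through that node's host.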
As the size of clusters continues to grow, the mean time between failures (MTBF) of clusters drops to the point that runtimes for an application may exceed the MTBF. Thus, long running jobs may never complete. The solution to this is to periodically checkpoint application state so that applications can be restarted and continue execution from known points. Typical checkpointing involves bringing the system to a known state, saving that state, then resuming normal operations. Restart involves loading a previously saved system state, then resuming normal operations. MTBF also limits system scaling. The larger a system is, the longer it takes to checkpoint. Thus, efficient checkpointing is critical to support larger systems. Otherwise, large systems would spend all of the time checkpointing.
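The checkpoint and restart cycle described above may be sketched, for a single process, as follows. This is a minimal illustration under stated assumptions and is not the checkpointing method of the IBM HPC cluster: the application state is assumed to be a small Python dictionary, the function names save_checkpoint, load_checkpoint and run_job are hypothetical, and the fail_at parameter exists only to simulate a failure.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Save state so that a failure mid-write never corrupts the last checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Atomically replace the previous checkpoint with the new one.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Load a previously saved state."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(total_steps, checkpoint_every, path, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    On start, the job resumes from the checkpoint at `path` if one exists.
    `fail_at` (hypothetical, for demonstration) raises an error at that step.
    """
    if os.path.exists(path):
        state = load_checkpoint(path)      # restart: load saved state
    else:
        state = {"step": 0, "total": 0}    # fresh start

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        # Between steps the state is consistent (a known state), so save it.
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["total"]
```

If a job of 100 steps that checkpoints every 10 steps fails at step 57, a second call to run_job with the same path resumes from the checkpoint taken at step 50 rather than from step 0. A smaller checkpoint interval loses less work on failure but spends more time saving state, which is the trade-off that makes checkpoint efficiency critical as systems grow.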
What is needed are efficient checkpointing methods for multi-node computer systems. In a multi-node computer system or cluster, checkpointing may substantially reduce the efficiency of the overall computer system or cluster. Without a way to more efficiently checkpoint applications, multi-node computer systems will continue to suffer from reduced efficiency.