1. Technical Field
The disclosure and claims herein generally relate to multi-node computer systems, and more specifically relate to scheduling work in a multi-node computer system based on checkpoint characteristics for an application stored in a checkpoint profile.
2. Background Art
Supercomputers and other multi-node computer systems continue to be developed to tackle sophisticated computing jobs. One type of multi-node computer system being developed is a High Performance Computing (HPC) cluster called a Beowulf Cluster. A Beowulf Cluster is a scalable performance cluster based on commodity hardware, on a private system network, with open source software (Linux) infrastructure. The system is scalable so that performance improves proportionally as machines are added. The commodity hardware can be any of a number of mass-market, stand-alone compute nodes, ranging from as simple as two networked computers each running Linux and sharing a file system to as complex as 1024 nodes with a high-speed, low-latency network.
A Beowulf cluster is being developed by International Business Machines Corporation (IBM) for the US Department of Energy under the name Roadrunner. Chips originally designed for video game platforms work in conjunction with systems based on x86 processors from Advanced Micro Devices, Inc. (AMD). IBM System x™ 3755 servers based on AMD Opteron™ technology are deployed in conjunction with IBM BladeCenter® H systems with Cell Enhanced Double precision (Cell eDP) technology. Designed specifically to handle a broad spectrum of scientific and commercial applications, the Roadrunner supercomputer design includes new, highly sophisticated software to orchestrate over 13,000 AMD Opteron™ processor cores and over 25,000 Cell eDP processor cores. The Roadrunner supercomputer will be capable of a peak performance of over 1.6 petaflops (or 1.6 thousand trillion calculations per second). The Roadrunner system will employ advanced cooling and power management technologies and will occupy only 12,000 square feet of floor space.
As the size of clusters continues to grow, the mean time between failures (MTBF) of clusters drops to the point that the runtime of an application may exceed the MTBF. Thus, long running jobs may never complete. The solution to this is to periodically checkpoint application state so that applications can be restarted and continue execution from known points. Typical checkpointing involves bringing the system to a known state, saving that state, then resuming normal operations. Restart involves loading a previously saved system state, then resuming normal operations. MTBF also limits system scaling. The larger a system is, the longer it takes to checkpoint. Thus, efficient checkpointing is critical to support larger systems. Otherwise, large systems would spend all of their time checkpointing.
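The checkpoint and restart cycle described above can be illustrated with a minimal, single-process sketch. It is an illustration only and not the mechanism of any particular cluster: the function names, the file name job.ckpt, the use of Python's pickle module, and the fail_at parameter that simulates a node failure are all assumptions introduced for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Save state atomically: write a temp file, then rename it over the target."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}


def run_job(path, n_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every few steps.

    fail_at is a hypothetical knob that simulates a node failure by
    raising an exception just before that step executes.
    """
    state = load_checkpoint(path)  # restart: resume from the last known point
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)  # known state reached: save it
    return state["total"]


def demo():
    """Run a job that fails once, then restart it from its last checkpoint."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "job.ckpt")
        try:
            run_job(path, n_steps=100, checkpoint_every=10, fail_at=57)
        except RuntimeError:
            pass  # work done after the last checkpoint is lost
        resumed_from = load_checkpoint(path)["step"]
        total = run_job(path, n_steps=100, checkpoint_every=10)
        return resumed_from, total
```

In this sketch the restarted job resumes from the last saved step rather than from the beginning, so only the work performed after that checkpoint is repeated. A smaller checkpoint_every loses less work per failure but spends more time saving state, which is the trade-off that makes checkpoint efficiency matter as systems grow.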
What is needed are efficient checkpointing methods for multi-node clusters. In a shared node cluster there may be many applications or jobs running simultaneously on a given node. Some of these applications may want checkpoint support, while others may not. The required frequency of checkpointing may also vary from one application to another. Without a way to more efficiently checkpoint applications, multi-node computer systems will continue to suffer from reduced efficiency.