1. Technical Field
The disclosure and claims herein generally relate to multi-node computer systems, and more specifically relate to optimized peer-to-peer file transfers on a multi-node computer system.
2. Background Art
Supercomputers and other multi-node computer systems continue to be developed to tackle sophisticated computing jobs. One type of multi-node computer system being developed is a High Performance Computing (HPC) cluster called a Beowulf Cluster. A Beowulf Cluster is a scalable performance cluster based on commodity hardware, on a private system network, with an open source software (Linux) infrastructure. The system is scalable so that performance improves proportionally with added machines. The commodity hardware can be any of a number of mass-market, stand-alone compute nodes, ranging from as simple as two networked computers each running Linux and sharing a file system to as complex as 1024 nodes with a high-speed, low-latency network.
A Beowulf cluster is being developed by International Business Machines Corporation (IBM) for the US Department of Energy under the name Roadrunner. In a first-of-a-kind design, chips originally designed for video game platforms work in conjunction with systems based on x86 processors from Advanced Micro Devices, Inc. (AMD). IBM System X™ 3755 servers based on AMD Opteron™ technology are deployed in conjunction with IBM BladeCenter® H systems with Cell Enhanced Double precision (Cell eDP) technology. Designed specifically to handle a broad spectrum of scientific and commercial applications, the Roadrunner supercomputer design includes new, highly sophisticated software to orchestrate over 13,000 AMD Opteron™ processor cores and over 25,000 Cell eDP processor cores in tackling some of the most challenging problems in computing. The Roadrunner supercomputer will be capable of a peak performance of over 1.6 petaflops (or 1.6 thousand trillion calculations per second). Also designed with space and power consumption in mind, the Roadrunner system will employ advanced cooling and power management technologies and will occupy only 12,000 square feet of floor space, or approximately the size of three basketball courts.
Computer systems such as Roadrunner have a large number of nodes, each with its own processor and local memory but no disk drive for mass storage of data. The nodes are connected to one or more file servers by a communication network having several levels of Ethernet switches. In multi-node, diskless clusters, such as the Roadrunner cluster, large amounts of data must be delivered to each node during the boot process. The file servers provide data, application, and operating system kernel files to the nodes. The enormous amount of data sent while booting nodes can affect the normal operation and administration of other nodes on the cluster due to loading on the networks and network switches.
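The loading described above can be illustrated with a rough estimate. The following Python sketch is not part of the disclosed system; the function name and all values (node count, boot image size, link speed) are hypothetical assumptions chosen only for illustration. It computes a lower bound on the time needed when every node receives its boot files directly from the file servers, ignoring switch contention and protocol overhead.

```python
def central_boot_seconds(num_nodes, image_bytes, server_link_bytes_per_s, num_servers=1):
    """Lower bound on the time for file servers to send one boot image to every node.

    Hypothetical model: each server's outgoing link is the only bottleneck,
    and the work is split evenly across the servers.
    """
    total_bytes = num_nodes * image_bytes
    aggregate_rate = server_link_bytes_per_s * num_servers
    return total_bytes / aggregate_rate


if __name__ == "__main__":
    nodes = 1024                 # assumed node count
    image = 200 * 10**6          # assumed 200 MB of boot data per node
    link = 125 * 10**6           # assumed 1 Gbit/s server link (about 125 MB/s)
    seconds = central_boot_seconds(nodes, image, link)
    print(f"{seconds / 60:.1f} minutes with one file server")
```

Under these assumed values, a single file server needs tens of minutes to boot the cluster, and every byte crosses the switch levels between the server and the nodes, which is the loading the preceding paragraph describes.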
What is needed is an efficient way to distribute operating system kernels and files to the nodes to improve boot times and reduce switch loading, while reducing the hardware cost and network complexity of the cluster. Without a way to more efficiently distribute data to multiple nodes, multi-node computer systems will continue to suffer from reduced efficiency.