1. Technical Field
This invention generally relates to computer systems and development, and more specifically relates to improving the block allocation time in a supercomputer or distributed computer system via system image and data file pre-loading.
2. Background Art
Supercomputers continue to be developed to tackle sophisticated computing jobs. These computers are particularly useful to scientists for high performance computing (HPC) applications including life sciences, financial modeling, hydrodynamics, quantum chemistry, molecular dynamics, astronomy and space research, and climate modeling. Supercomputer developers have focused on massively parallel computer structures to meet this need for increasingly complex computation. A family of such massively parallel computers is being developed by International Business Machines Corporation (IBM) under the name Blue Gene. The Blue Gene/L system is a scalable system in which the current maximum number of compute nodes is 65,536. The Blue Gene/P system is a scalable system in which the maximum projected number of compute nodes is 73,728. The Blue Gene/P node consists of a single ASIC (application specific integrated circuit) with 4 CPUs and memory. The full computer would be housed in 72 racks or cabinets with 32 node boards in each.
In the Blue Gene supercomputers, and other supercomputers, the compute nodes are arranged in clusters of compute and I/O nodes that communicate over a service network to a control system in a service node. One or more clusters of computer hardware are allocated into a block to run a software application. The compute and I/O nodes have volatile memory for their operating systems, so the operating system images must be loaded via the service network each time a block of hardware is allocated for a software application to be run. This prior art approach to block allocation results in each job taking longer to run, because system time is spent loading the operating system images to the hardware, waiting for the I/O node and compute node kernels to complete their boot, and then waiting for all the hardware to report to the control system. It is only after this process has completed that the control system may begin the process of loading the application or job to the block for execution. On a massively parallel supercomputer system like Blue Gene, utilization of the system is important due to the high cost of the overall system. Thus it is advantageous to be able to decrease the overall system down time by reducing the block allocation time.
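The strictly sequential nature of the prior art block allocation described above can be illustrated with a minimal sketch. The phase names follow the description; the `Phase` type, the function names, and all durations are hypothetical placeholders chosen only for illustration and do not represent measured values or any actual control system interface.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    seconds: float  # hypothetical duration, not a measured value


# Phases of prior art block allocation, in the order described above.
# All durations are illustrative assumptions.
PHASES = [
    Phase("load operating system images over the service network", 60.0),
    Phase("I/O node and compute node kernels complete boot", 30.0),
    Phase("all hardware reports to the control system", 10.0),
]


def sequential_timeline(phases):
    """Return (name, start, end) for each phase when run one after another."""
    timeline = []
    clock = 0.0
    for phase in phases:
        timeline.append((phase.name, clock, clock + phase.seconds))
        clock += phase.seconds
    return timeline


def time_until_job_load(phases):
    """Time that elapses before the application can begin loading to the block."""
    return sum(phase.seconds for phase in phases)


if __name__ == "__main__":
    for name, start, end in sequential_timeline(PHASES):
        print(f"{start:6.1f}s to {end:6.1f}s  {name}")
    print(f"application load may begin at {time_until_job_load(PHASES):.1f}s")
```

Because each phase starts only when the previous one ends, the delay before the application can be loaded is the sum of all phase durations, and that delay is incurred every time a block is allocated.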
Distributed computer systems have an overall architecture similar to that of a massively parallel computer system. However, instead of a set of possibly identical compute nodes interconnected in a single location, the distributed computer has a number of compute nodes that may not be homogeneous and may be remotely located. A distributed computer system can have a problem similar to the one described above, in that work cannot be allocated to a distributed compute node or block of nodes until the compute entity has the system and data files necessary for performing designated tasks.
Without a way to reduce the block allocation time, supercomputers and distributed computers will continue to wait for operating system images and data files to be loaded into all hardware blocks before proceeding with the process of loading applications, thereby wasting potential computer processing time.