1. Technical Field
This disclosure generally relates to parallel computing systems, and more specifically relates to dynamically allocating a job or a processing unit (part of a job) on a multi-nodal, parallel computer system based on application specific metrics.
2. Background Art
Large, multi-nodal computer systems (e.g., grids, supercomputers, and commercial clusters) continue to be developed to tackle sophisticated computing jobs. One such multi-nodal parallel computer being developed by International Business Machines Corporation (IBM) is the Blue Gene system. The Blue Gene system is a scalable system with 65,536 or more compute nodes. Each node consists of a single ASIC (application specific integrated circuit) and memory, typically 512 megabytes of local memory. The full computer is housed in 64 racks or cabinets, each holding 32 node boards. Each node board has 32 processors and the associated memory for each processor. As used herein, a massively parallel computer system is a system with more than about 10,000 processor nodes.
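The node count quoted above follows from the rack layout, which the short sketch below checks. It assumes that each of the 32 processors on a node board corresponds to one compute node, and that every node carries the typical 512 megabytes, so the aggregate memory figure is a derived estimate rather than a number stated in this disclosure. The function and constant names are illustrative only.

```python
# Layout figures taken from the description above.
RACKS = 64
NODE_BOARDS_PER_RACK = 32
NODES_PER_BOARD = 32          # assumption: one compute node per processor
MEMORY_PER_NODE_MB = 512      # "typical" value; real nodes may differ


def total_nodes(racks=RACKS, boards=NODE_BOARDS_PER_RACK, nodes=NODES_PER_BOARD):
    """Return the number of compute nodes implied by the rack layout."""
    return racks * boards * nodes


def aggregate_memory_gb(node_count, mb_per_node=MEMORY_PER_NODE_MB):
    """Return total local memory in gigabytes, assuming uniform nodes."""
    return node_count * mb_per_node / 1024


if __name__ == "__main__":
    n = total_nodes()
    print(n)                       # 64 * 32 * 32 = 65536
    print(aggregate_memory_gb(n))  # 65536 * 512 / 1024 = 32768.0
```

Under these assumptions the full 64-rack system reaches exactly the 65,536 nodes cited, well above the roughly 10,000-node threshold used herein for a massively parallel system.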
These new systems are dramatically changing the way programs and businesses are run. Because of the large amounts of data that must be processed, current systems simply cannot keep up with the workload. The computer industry is therefore increasingly turning to distributed capacity, or distributed computing. An application, or sometimes a part of an application, is often referred to as a “job”. In distributed computing, a job may be broken up into separate run time units (referred to herein as processing units) and executed on different nodes of the system. Each processing unit is assigned to a node in the distributed system by a job scheduler or job optimizer.
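The relationship between a job, its processing units, and the nodes they run on can be illustrated with a minimal sketch. The class and function names below (ProcessingUnit, Node, split_job, schedule_round_robin) are hypothetical, and the round-robin placement is only a placeholder policy chosen for simplicity; it is not the allocation method of this disclosure and takes no account of application specific metrics.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingUnit:
    """One run time unit of a job (hypothetical representation)."""
    job_name: str
    index: int


@dataclass
class Node:
    """A compute node holding the processing units assigned to it."""
    node_id: int
    assigned: list = field(default_factory=list)


def split_job(job_name, unit_count):
    """Break a job into unit_count processing units."""
    return [ProcessingUnit(job_name, i) for i in range(unit_count)]


def schedule_round_robin(units, nodes):
    """Assign each processing unit to a node in turn (placeholder policy)."""
    if not nodes:
        raise ValueError("at least one node is required")
    for position, unit in enumerate(units):
        nodes[position % len(nodes)].assigned.append(unit)
    return nodes


if __name__ == "__main__":
    nodes = [Node(node_id=i) for i in range(3)]
    schedule_round_robin(split_job("job-a", 7), nodes)
    for node in nodes:
        print(node.node_id, [u.index for u in node.assigned])
```

A scheduler of this kind places units without regard to what each unit actually does, which is the limitation that a metric-driven job scheduler or job optimizer would aim to address.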