1. Technical Field
The disclosure and claims herein generally relate to computer process allocation and distribution on a multi-node computer system, and more specifically relate to dynamic resource adjustment of a distributed computer process on a multi-node computer system.
2. Background Art
Supercomputers and other multi-node computer systems continue to be developed to tackle sophisticated computing jobs. One type of multi-node computer system is a massively parallel computer system. A family of such massively parallel computers is being developed by International Business Machines Corporation (IBM) under the name Blue Gene. The Blue Gene/L system is a high density, scalable system in which the current maximum number of compute nodes is 65,536. The Blue Gene/L node consists of a single ASIC (application specific integrated circuit) with 2 CPUs and memory. The full computer is housed in 64 racks or cabinets with 32 node boards in each rack.
Computer systems such as Blue Gene have a large number of nodes, each with its own processors and local memory. The nodes are connected with several communication networks. One communication network connects the nodes in a logical tree network. In the logical tree network, the nodes are connected to an input-output (I/O) node at the top of the tree. In Blue Gene, each node card holds 2 compute nodes, and each compute node has 2 processors. A node board holds 16 node cards, and each rack holds 32 node boards. A node board has slots to hold 2 I/O cards, each of which has 2 I/O nodes.
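Multiplying the packaging figures given above through the rack, node board, and node card hierarchy reproduces the stated maximum of 65,536 compute nodes. The same arithmetic gives the corresponding processor count and, assuming every I/O card slot is populated, the upper bound on I/O nodes (one I/O node per 8 compute nodes):

```latex
\begin{aligned}
N_{\text{compute}} &= 64\ \text{racks} \times 32\ \tfrac{\text{node boards}}{\text{rack}}
  \times 16\ \tfrac{\text{node cards}}{\text{node board}}
  \times 2\ \tfrac{\text{compute nodes}}{\text{node card}} = 65{,}536 \\
N_{\text{CPU}} &= 65{,}536\ \text{compute nodes} \times 2\ \tfrac{\text{CPUs}}{\text{compute node}} = 131{,}072 \\
N_{\text{I/O}} &\le 64 \times 32\ \text{node boards}
  \times 2\ \tfrac{\text{I/O cards}}{\text{node board}}
  \times 2\ \tfrac{\text{I/O nodes}}{\text{I/O card}} = 8{,}192
\end{aligned}
```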
A distributed process is a computer application, program, or portion of a computer program in which one or more portions of the distributed process are allocated to different hardware resources. In a process distributed across many nodes, a traditional program can be thought of as the execution of “processing units” that are dispersed and executed over multiple nodes. In this type of distributed environment, one is often unaware of which node a given processing unit is running on. Processing units are often detached from one another and may be unaware of where other processing units are running. In this type of distributed environment, adjusting the priorities of processing units or adjusting compute resources is not a simple task. Simply moving compute resources from node to node in reaction to current needs or current job priorities is inadequate. For example, in a distributed environment a piece of code or a processing unit may be executed on behalf of many different applications or jobs. In some cases, these processing units will have higher priority than others, but in many cases they will not. Furthermore, an application may or may not have a consistent priority throughout its execution. In some cases, the priority of the application may be more appropriately determined by the data that it is handling or by changes in the means and mechanisms needed to carry out the entire job.
Without an efficient way to allocate resources to processing units in a distributed computer system environment, complex computer systems will continue to suffer from reduced performance and increased power consumption.