The invention addresses energy consumption in computer systems, for example the amount of energy consumed by programs running on large-scale computer clusters, particularly high performance computing (HPC) systems. HPC clusters consist of a large number of quasi-independent computers (nodes) that are coupled together and may be used to execute one large, monolithic application. Applications (or other jobs) are split into a number of communicating tasks; a Resource Manager allocates these tasks to computers and can also manage the power configuration of the computer hardware to reduce the energy consumed.
The Resource Manager is responsible for the efficient running of the HPC system, making sure that the nodes and any other resources deliver the required level of service. The overriding service requirement is generally to complete the application as fast as possible but, since the nodes may be high-end computers and use a lot of energy, a secondary requirement of rising importance is to use as little energy as possible. The compute nodes have a variety of mechanisms that can be used to reduce the energy consumption, but these usually come at the cost of computational performance. Thus, each node will in general have a plurality of possible operating modes or “power configurations” which differ in their respective performance and energy usage.
The programs executed in High Performance Computing (HPC) have a number of features that distinguish HPC from other forms of large-scale computing, such as web-scale computing. One of these is that an HPC program can be thought of as one computation distributed across a large number of separate computers (in contrast, web-scale programs are, essentially, a large number of independent computations executed simultaneously on the separate computers). As any computation proceeds through the program phases or stages, the demands on the resources that it uses change. In other words, the demands on the hardware vary according to the current “state” of the application, by which is meant not only the stage of completion of the application but also the type of operation demanded at that time. Typically, execution of a program proceeds through initialization and iteration phases, repeating the same sections of code many times in the same execution, the resource demands being similar or identical at each repetition.
In web-scale computing, as the computations execute essentially independently, the changes in resource demand also occur independently with an averaging effect that results in the resource use pattern over the whole set of computers becoming fairly static. HPC programs, on the other hand, will show coordinated patterns of resource use change as the program changes to the same state simultaneously across all the occupied computers.
These differing patterns of resource use change impact the best design of Resource Managers. For web-scale programs, resource management decisions can be made local to the computers. However, for HPC-style programs resource management must occur at the global level in addition to local-level management.
Increases in the capability of the microprocessors that form the core of the computers result in increased energy use. However, modern microprocessor designs also have the ability to reduce power consumption when the highest performance is not required. This is achieved through a variety of techniques, such as voltage and frequency scaling or turning off unused sections of a microprocessor. The energy reduction decisions are made and implemented locally in each computer, responding to the observed conditions of the code currently operating on the computer. Different computers may be in different energy and performance configurations (power configurations).
Performance and energy management therefore happens at two levels in HPC systems: across a program spanning multiple computers and locally to each computer. These management decisions operate at different timescales. Local management is fast, responding to short time changes. Global management time scales are longer as they are determined by the time that it takes to communicate information from the individual computers to the centralized Resource Manager program, by the time taken to process the gathered data to make a management decision, and by the time taken to distribute that decision to the computers.
Changing power configurations takes time, and poor coordination of power configuration changes across computers can have adverse effects on the performance of an HPC program. Consequently, local power management is often turned off or operated very conservatively, which ensures acceptable performance but misses many opportunities for energy savings. On the other hand, centralized management decisions cannot be distributed quickly enough to exploit all energy saving opportunities.
When distributing an HPC application over the nodes, each portion of the application is decoupled as much as possible from the other portions, although some communication between the nodes is always necessary. Communication leads to the nodes having to coordinate their execution as they wait for the arrival of messages from other nodes. The degree of coordination varies with application and algorithm type, but there are usually many times during an application's execution where nodes will wait for the completion of computations on other nodes. A number of these waits for coordination require global communications (i.e. across all nodes allocated to the application) and so any delays to the execution by one node will cause the whole application to wait for the slowest node to catch up—referred to as a global coordination barrier or simply “global barrier”.
To illustrate this, FIG. 1 shows execution phases of a simple program executing on four nodes A to D, with time represented by the length along the horizontal bars. The program has four phases: execute, wait, execute and wait. The wait dependencies are shown as arrows between the nodes. The first wait phase models a local coordination barrier, since the nodes only communicate in pairs, that is, A with B and C with D. The second wait models a global barrier; every node waits on all the other nodes before completion. This is labeled “Global Wait” in the Figure to indicate that all nodes (in general) must wait here. A node which is in the waiting phase may be said to have “entered” the barrier at the time represented by the start of the wait phase.
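The four-phase pattern of FIG. 1 can be reproduced with a short calculation. The following is a minimal sketch only: the function name barrier_waits and all durations are hypothetical values chosen for illustration and are not taken from the Figure.

```python
def barrier_waits(first_exec, second_exec, pairs):
    """Return per-node wait times at the local barrier and the global barrier."""
    # Local barrier: each node is released when the slower node of its pair arrives.
    local_release = {}
    for pair in pairs:
        release = max(first_exec[node] for node in pair)
        for node in pair:
            local_release[node] = release
    local_wait = {node: local_release[node] - first_exec[node] for node in first_exec}

    # The second execute phase starts when the node leaves the local barrier.
    arrival = {node: local_release[node] + second_exec[node] for node in first_exec}

    # Global barrier: every node is released when the last node arrives.
    global_release = max(arrival.values())
    global_wait = {node: global_release - arrival[node] for node in arrival}
    return local_wait, global_wait


# Hypothetical execute-phase durations (arbitrary time units) for nodes A to D.
first_exec = {"A": 4, "B": 6, "C": 5, "D": 3}
second_exec = {"A": 5, "B": 2, "C": 4, "D": 4}

local_wait, global_wait = barrier_waits(
    first_exec, second_exec, pairs=[("A", "B"), ("C", "D")]
)
print(local_wait)   # {'A': 2, 'B': 0, 'C': 0, 'D': 2}
print(global_wait)  # {'A': 0, 'B': 3, 'C': 2, 'D': 2}
```

With these assumed durations, node A is the last to enter the global barrier and does not wait at all, while nodes B, C and D are idle for 3, 2 and 2 time units respectively.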
As there are imbalances in the loads and run times over the computers, individual computers will reach the barrier at different times, with the result that some computers will wait at the global barrier for other computers to catch up. As the waiting computers are not calculating, they can be placed into a low energy power configuration, with no effect on performance, until all computers are at the global barrier. However, late arrivals at the global barrier should remain in a high power/fast performance power configuration, since the time taken to change configuration and then change back would slow the program's performance. This decision cannot be made locally, as each computer does not know where it is in the arrival queue; therefore the decision must be made centrally, but as fast as possible to save as much energy as possible.
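The trade-off described above can be expressed as a simple rule. The sketch below is illustrative only and is not the method of the invention. It assumes that a central component can supply a predicted release time for the barrier, and that the times needed to enter and leave the low energy configuration are known. All names and numbers are hypothetical.

```python
def choose_configuration(expected_wait, switch_down, switch_up):
    """Return "low" only if the expected wait covers both configuration changes."""
    if expected_wait > switch_down + switch_up:
        return "low"
    return "high"


def plan_barrier(arrivals, predicted_release, switch_down, switch_up):
    """Choose a power configuration for each node that has entered the barrier."""
    return {
        node: choose_configuration(predicted_release - arrived, switch_down, switch_up)
        for node, arrived in arrivals.items()
    }


# Hypothetical barrier arrival times; node A is the last to arrive.
arrivals = {"A": 11.0, "B": 8.0, "C": 9.0, "D": 10.0}
plan = plan_barrier(arrivals, predicted_release=11.0, switch_down=0.5, switch_up=1.0)
print(plan)  # {'A': 'high', 'B': 'low', 'C': 'low', 'D': 'high'}
```

Node D waits for only 1.0 time unit, which is less than the assumed 1.5 units needed to switch down and back up, so it stays in the high power configuration together with the last arrival A. The predicted release time is exactly the information that an individual computer lacks, which is why the rule cannot be evaluated locally.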
The time scales at the centralized management area (Resource Manager) are much longer than the local time scales. This is due to the time taken to receive enough messages to make effective decisions and the time taken to process the messages, more particularly the time taken to execute optimization algorithms, which scales with the number of computers, and the time taken to access parts of data structures. As the number of computers in a cluster scales to 100,000 to 1,000,000, the time taken for any central management will become longer too.
The problem to solve is to make and distribute central decisions fast enough with large numbers of participating computers to exploit short-lived opportunities for energy savings during a computation.
Many Resource Managers expose the ability to set the energy saving mode of a job to the user of the HPC cluster. This capability is not well used since the users are not motivated to save energy (by, for example, being charged for energy used in addition to other resources used). Also, the users will not want to reduce the execution performance of their applications (i.e. they want to minimize the time to solution). Therefore, Resource Managers need to automatically apply energy saving modes where it can be guaranteed that performance will not be affected.
Similarly, the local operating systems have the ability to directly set the power configuration of their underlying hardware. This hardware capability is exposed by most operating systems used on HPC clusters. Use of this capability is generally limited, i.e. it is usually turned off, since the locally determined changes to execution rate have unpredictable effects on the computation rate of a distributed application, which can accumulate across the computers to yield large overall delays in execution.
The state of the art in “energy-aware” Resource Managers is represented by US patent applications US 20120216205 and US 20090100437. These describe a Resource Manager with a database of the performance characteristics of previously submitted jobs and a computational model of the effect of placing a cluster into energy saving mode on a job with the given performance characteristics. If the Resource Manager can identify a job as being the same as one previously submitted, it uses the recorded characteristics to compute the best energy configuration that does not impact performance.
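The general idea of a history-based selection can be sketched as follows. This sketch is not taken from the cited applications. The job identifiers, the power configurations, the slowdown model and the tolerance parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerConfig:
    name: str
    relative_frequency: float  # 1.0 = full clock speed (assumed scale)
    relative_power: float      # 1.0 = full power draw (assumed scale)


# Hypothetical power configurations, ordered from fastest to slowest.
CONFIGS = [
    PowerConfig("full", 1.0, 1.0),
    PowerConfig("medium", 0.8, 0.7),
    PowerConfig("low", 0.6, 0.5),
]

# Hypothetical job history: fraction of run time that scales with clock speed.
JOB_HISTORY = {"memory_bound_job": 0.05, "compute_bound_job": 0.90}


def predicted_slowdown(compute_fraction, config):
    """Toy model: only the compute-bound fraction stretches as the clock slows."""
    return compute_fraction / config.relative_frequency + (1.0 - compute_fraction)


def select_config(job_id, tolerance=0.02):
    """Pick the lowest-power configuration whose slowdown is within tolerance."""
    compute_fraction = JOB_HISTORY.get(job_id)
    if compute_fraction is None:
        return CONFIGS[0]  # unknown job: stay at full power
    acceptable = [
        config for config in CONFIGS
        if predicted_slowdown(compute_fraction, config) <= 1.0 + tolerance
    ]
    return min(acceptable, key=lambda config: config.relative_power)


print(select_config("memory_bound_job").name)   # medium
print(select_config("compute_bound_job").name)  # full
print(select_config("never_seen_before").name)  # full
```

A selection of this kind fixes one configuration for the whole job, so it cannot react to changes in resource demand that occur while the job is running.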
A problem not addressed in the prior art is that the optimal configuration for energy efficiency changes during an application's execution, often with high frequency. Setting a single configuration for the duration of an application is always a compromise, losing some performance and not achieving highest energy efficiency.
It has been proposed to distribute responsibility for management through a hierarchy of resource managers (e.g. CN103268261 and Argo, http://www.mcs.anl.gov/project/argo-exascale-operating-system). Distributing the computation and data collection load amongst a number of smaller components reduces the individual computation burden and so increases responsiveness. However, as each of the sub-managers only receives data from a limited subset of computers, it cannot ensure the optimal behavior of the whole system.
Therefore, it is desirable to provide energy-aware management of a distributed computer system which avoids the above problems.