1. Field
The present application relates to energy-aware resource management in a computer system, such as a high-performance computer system (HPC system).
2. Description of the Related Art
The invention addresses energy consumption in computer systems, for example the amount of energy consumed by programs running on large-scale computer clusters, particularly high-performance computing (HPC) systems. HPC clusters consist of a large number of quasi-independent computers (nodes) that are coupled together and may be used to execute one large, monolithic application. Applications (or other jobs) are split into a number of communicating tasks, and these tasks are allocated to nodes by a Resource Manager.
The Resource Manager is responsible for the efficient running of the HPC system, making sure that the nodes and any other resources deliver a service that meets requirements. The overriding service requirement is generally to complete the application as fast as possible but, since the nodes may be high-end computers that use a lot of energy, a secondary requirement of rising importance is to use as little energy as possible. The compute nodes have a variety of mechanisms that can be used to reduce energy consumption, but these usually come at the cost of computational performance.
Increases in the capability of the microprocessors that form the core of the servers result in increased energy use. However, modern microprocessor designs also have the ability to reduce power consumption when the highest performance is not required. This is achieved through a variety of techniques, such as voltage and frequency scaling or turning off unused sections of a microprocessor. This capability is made available through the operating system to the Resource Managers so that they can reduce the power consumption of their cluster, but use of this capability must be balanced against the need to maintain performance of the applications executing on the cluster.
The overall performance of an application depends on a number of factors. Computational performance is important in many HPC applications, and these are said to be “compute-bound” or CPU-bound. Some applications are dominated by other factors, such as reading data from and writing data to memory or communicating data between nodes: these are said to be IO-bound (input/output-bound). The performance of compute-bound applications is adversely affected by placing nodes into energy-saving states, but IO-bound applications can have their nodes placed into energy-saving states with minimal impact on performance. A Resource Manager has no a priori way of knowing whether an application under its control is compute-bound or IO-bound, and so cannot place any application into an energy-saving mode whilst guaranteeing no reduction in performance. Applications may also pass through various phases during their execution, and these phases can have different characteristics: one phase might be compute-bound and the following phase IO-bound.
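The difference between compute-bound and IO-bound behaviour can be illustrated with a minimal sketch. It assumes a deliberately simplified model that is not taken from the present description: the compute portion of the run time scales inversely with the processor clock frequency, and the IO portion is unaffected by it. The function names and the `freq_ratio` parameter (reduced frequency divided by full frequency) are hypothetical.

```python
def predicted_runtime(compute_seconds, io_seconds, freq_ratio):
    """Estimate run time at a reduced clock frequency.

    Assumption (illustrative only): compute time scales as 1/freq_ratio,
    while time spent waiting on IO does not change.
    """
    if not 0.0 < freq_ratio <= 1.0:
        raise ValueError("freq_ratio must be in the interval (0, 1]")
    return compute_seconds / freq_ratio + io_seconds


def slowdown(compute_seconds, io_seconds, freq_ratio):
    """Return predicted run time divided by the full-frequency run time."""
    baseline = compute_seconds + io_seconds
    if baseline <= 0.0:
        raise ValueError("total run time must be positive")
    return predicted_runtime(compute_seconds, io_seconds, freq_ratio) / baseline
```

Under this model, halving the clock frequency of a job that spends 90 s computing and 10 s on IO gives `slowdown(90.0, 10.0, 0.5)`, a factor of about 1.9, whereas the same setting applied to a job with 10 s of computing and 90 s of IO gives a factor of about 1.1. Real applications, and individual phases within them, may deviate considerably from such a simple model.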
Many Resource Managers expose to the user of the HPC cluster the ability to set the energy-saving mode of a job. This capability is not, however, always well used, since users may not be motivated to save energy (unless, for example, they are being charged for energy used in addition to other resources used). Also, users will not want to reduce the execution performance of their applications (i.e. they want to minimize the time to solution). Therefore, it is desirable for Resource Managers to automatically apply energy-saving modes where it can be guaranteed that performance will not be affected.
The state of the art in such “energy-aware” Resource Managers is represented by US patent applications US 20120216205 and US 20090100437. These describe a Resource Manager with a database of the performance characteristics of previously submitted jobs and a computational model of the effect of placing a cluster into energy saving mode on a job with the given performance characteristics. If the Resource Manager can identify a job as the same as one previously submitted, it uses the recorded characteristics to compute the best energy configuration that does not impact performance.
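The general idea of such a history-based approach can be sketched as follows. This is an illustrative simplification only and is not the method of the cited applications: the `JobProfile` record, the `choose_freq_ratio` function, the dictionary standing in for the database, the simple scaling model and the 5% slowdown tolerance are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class JobProfile:
    """Hypothetical record of a previously submitted job."""
    compute_seconds: float  # time attributed to computation
    io_seconds: float       # time attributed to IO and communication


def predicted_slowdown(profile: JobProfile, freq_ratio: float) -> float:
    # Assumed model: compute time scales as 1/freq_ratio, IO time is constant.
    baseline = profile.compute_seconds + profile.io_seconds
    if baseline <= 0.0:
        raise ValueError("profile must have a positive total run time")
    scaled = profile.compute_seconds / freq_ratio + profile.io_seconds
    return scaled / baseline


def choose_freq_ratio(job_key: str,
                      history: Dict[str, JobProfile],
                      freq_ratios: Sequence[float],
                      max_slowdown: float = 1.05) -> float:
    """Pick the lowest frequency ratio whose predicted slowdown is tolerated.

    A job with no recorded profile runs at the highest available ratio.
    """
    profile = history.get(job_key)
    if profile is None:
        return max(freq_ratios)
    acceptable = [r for r in freq_ratios
                  if predicted_slowdown(profile, r) <= max_slowdown]
    return min(acceptable) if acceptable else max(freq_ratios)
```

For example, with `history = {"io_heavy": JobProfile(2.0, 98.0)}` and `freq_ratios = [0.5, 0.75, 1.0]`, the call `choose_freq_ratio("io_heavy", history, freq_ratios)` selects the ratio 0.5, while a job key that is absent from `history` is run at the ratio 1.0. The choice depends entirely on the recorded profile still describing the job that is actually submitted.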
The problem that these patent applications do not address is that there is no guarantee that a resubmitted job will have the same performance characteristics as the previous submission. Resource Managers cannot know from outside an application what the optimal settings are for energy efficiency. Even though the same application may have been submitted previously, there could have been changes not visible to the Resource Manager that affect the energy use characteristics. These changes include modifications to the code, selection of a different solution algorithm, a different mapping of load to compute nodes, a change to the problem size or a change to enabled options. All of these changes can affect the energy use characteristics of an application and the interaction between configurations and computational performance, so past experience with an application is not a reliable guide to the execution of the current application with the current initialization files.
Therefore, it is desirable to provide energy-aware execution which does not rely on previously submitted jobs.