A distributed computer system may perform parallel computing by using multiple nodes simultaneously to execute a computational assignment referred to as a job. Each node may include one or more processors, memory, an operating system, and one or more input/output (I/O) components. The nodes may communicate with each other through a high-speed network fabric and may use shared file systems or storage. The job may be divided into thousands of parallel tasks distributed over thousands of nodes. These nodes may synchronize with each other hundreds of times per second.
Future distributed computer systems are projected to require tens of megawatts of power, making power management a foremost concern in the industry. These systems will be expected to deliver exascale performance within limited power and energy budgets. Current distributed computer systems may apply power capping to adhere to those budgets. However, even with power capping in place, the power allocated to a distributed computer system (“the system”) may be decreased to the point that the power being consumed by the system exceeds the power allocated to it.
The management of currently running jobs, suspended jobs, and newly requested jobs in a queue of the system is critical to maintaining the expected performance of the system and to ensuring that the power consumed by the system remains less than the power allocated to the system. As the power allocated to the system fluctuates, there may be a need to suspend and/or terminate one or more currently running jobs, resume one or more suspended jobs, and/or start one or more new jobs already in the queue. In addition, the system may be required to adhere to priorities regarding power allocation to certain types of jobs.
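The decision described above, namely which jobs to suspend, resume, or start when the allocation changes, may be illustrated with a minimal sketch. It is not the method of any particular system: the `Job` fields, the convention that a larger `priority` value is more important, the per-job `power_watts` estimate, and the greedy priority-ordered policy are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"
    QUEUED = "queued"


@dataclass
class Job:
    name: str
    priority: int        # hypothetical: larger value means more important
    power_watts: float   # hypothetical: estimated draw while the job runs
    state: State


def rebalance(jobs, allocated_watts):
    """Return {job name: action} keeping total running power below the allocation."""
    # Action to take when a job fits in the remaining budget, by current state.
    on_fit = {State.RUNNING: "keep", State.SUSPENDED: "resume", State.QUEUED: "start"}
    actions = {}
    used = 0.0
    # Visit the most important jobs first; on equal priority, prefer jobs
    # that are already running so they are not suspended needlessly.
    ordered = sorted(jobs, key=lambda j: (-j.priority, j.state is not State.RUNNING))
    for job in ordered:
        if used + job.power_watts < allocated_watts:
            used += job.power_watts
            actions[job.name] = on_fit[job.state]
        elif job.state is State.RUNNING:
            # A running job that no longer fits must give up its power.
            actions[job.name] = "suspend"
        else:
            # Suspended or queued jobs that do not fit simply keep waiting.
            actions[job.name] = "wait"
    return actions
```

Calling `rebalance` again each time the allocation changes yields a fresh set of actions. A greedy pass like this is simple but not optimal: it does not model termination, checkpoint cost, or per-job-type power reservations, each of which a real policy would likely need to consider.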