Generally, a High Performance Computing (HPC) system performs parallel computing by simultaneously using multiple nodes to execute a computational assignment referred to as a job. Each node typically includes processors, memory, an operating system, and I/O components. The nodes communicate with each other through a high-speed network fabric and may use shared file systems or storage. The job is divided into thousands of parallel tasks distributed over thousands of nodes. These tasks synchronize with each other hundreds of times a second. An HPC system usually consumes megawatts of power.
Conventional HPC systems and other big data systems are agnostic to power. A top HPC system consumes about 20 megawatts (MW) of power while delivering 33 petaflops (PF) of performance. This performance is expected to grow at about an exponential rate, while available power is expected to stay at or below about 20 MW. Typically, the power allocation is unlikely to be 20 MW and may change as often as every 15 minutes.
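The figures above imply an energy efficiency that can be worked out directly. The following is a minimal sketch using only the 33 PF and 20 MW values quoted in the text; the variable names and the tenfold growth factor are illustrative assumptions, not values from the source.

```python
# Figures quoted in the text.
PERFORMANCE_FLOPS = 33e15   # 33 petaflops
POWER_WATTS = 20e6          # 20 megawatts

# Energy efficiency in floating-point operations per second per watt.
efficiency_flops_per_watt = PERFORMANCE_FLOPS / POWER_WATTS
efficiency_gflops_per_watt = efficiency_flops_per_watt / 1e9

# Assumption for illustration only: performance grows tenfold while the
# power budget stays at 20 MW. Efficiency must then grow by the same factor.
GROWTH_FACTOR = 10.0
required_gflops_per_watt = (PERFORMANCE_FLOPS * GROWTH_FACTOR) / POWER_WATTS / 1e9

print(f"current:  {efficiency_gflops_per_watt:.2f} GFLOPS/W")
print(f"required: {required_gflops_per_watt:.2f} GFLOPS/W")
```

With a fixed power budget, every increase in delivered performance has to come from higher efficiency, which is why power-agnostic scheduling becomes a limiting factor.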
An existing HPC job scheduler cannot limit a job's power while delivering deterministic performance. A typical job scheduler simply sets a power cap for a job. Nodes of the HPC system running the same job may then run at different frequencies, resulting in imbalance and nondeterministic behavior.
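The imbalance described above can be illustrated with a small model. This is a sketch under simplifying assumptions that are not in the source: each node does the same amount of work per step, the time a node needs is inversely proportional to its frequency, and a step ends only when every node has reached the synchronization point. The function name and all numbers are hypothetical.

```python
def step_time(work_per_node, freqs_ghz):
    """Duration of one synchronized step.

    Every node waits for the slowest one, so the step takes as long as
    the largest per-node time (work divided by frequency).
    """
    return max(w / f for w, f in zip(work_per_node, freqs_ghz))


# Hypothetical job: four nodes with equal work per step.
work = [1.0, 1.0, 1.0, 1.0]

# Case 1: all nodes run at the same frequency.
uniform_freqs = [2.0, 2.0, 2.0, 2.0]

# Case 2: a power cap leaves one node at a much lower frequency,
# even though the average frequency is higher than in case 1.
mixed_freqs = [2.5, 2.5, 2.5, 1.0]

t_uniform = step_time(work, uniform_freqs)
t_mixed = step_time(work, mixed_freqs)

# Time each node in case 2 spends idle, waiting for the slowest node.
idle_times = [t_mixed - w / f for w, f in zip(work, mixed_freqs)]

print(t_uniform, t_mixed, idle_times)
```

In this model the mixed configuration has a higher average frequency than the uniform one, yet each step takes longer because the three faster nodes sit idle until the slowest node finishes.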
Currently, a job's power cap is fixed, even though the facility power allocation may change, some jobs may complete, and some jobs may be suspended. Current HPC systems do not dynamically change a job's power cap based on the facility power limit and the priority of suspended jobs.
In conventional HPC systems, a system-level power limit is achieved by limiting power to jobs. Typically, a computation is divided into thousands of chunks and distributed to thousands of nodes. These nodes synchronize with each other hundreds of times a second before making forward progress. The slowest node in the system makes all other nodes wait. The traditional approach to this challenge is to run all nodes at the same frequency. Depending on the computation, the power consumed by the nodes can go up and down. In conventional HPC systems, to ensure that a job does not consume more power than is allocated to it, it is assumed that all nodes will consume maximum power, and a lowest frequency is selected for all nodes. However, this means that some of the nodes in the system must operate at a reduced frequency even if the system has power headroom. In a conventional system, a job does not use all the power allocated or reserved for it. This allocated but unused power is called stranded power. Non-zero stranded power is a waste of critical and scarce energy resources.
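The conventional approach and the resulting stranded power can be sketched as follows. This is an illustrative model, not an implementation from the source: the frequency-to-power table, the function names, the job budget, and the measured node power are all hypothetical values chosen to make the arithmetic easy to follow.

```python
# Hypothetical worst-case power draw of one node (watts) at each
# available frequency (GHz). Real values depend on the hardware.
WORST_CASE_NODE_POWER_W = {1.2: 150.0, 1.6: 200.0, 2.0: 260.0, 2.4: 330.0}


def pick_uniform_frequency(job_budget_w, node_count):
    """Pick one frequency for all nodes, assuming every node draws
    its worst-case power, so the job can never exceed its budget."""
    feasible = [
        freq
        for freq, node_power in WORST_CASE_NODE_POWER_W.items()
        if node_power * node_count <= job_budget_w
    ]
    if not feasible:
        raise ValueError("budget too small for any available frequency")
    return max(feasible)


def stranded_power(job_budget_w, measured_node_power_w):
    """Power allocated to the job but not actually consumed."""
    return job_budget_w - sum(measured_node_power_w)


# Hypothetical job: 200 nodes sharing a 40 kW budget.
JOB_BUDGET_W = 40_000.0
NODE_COUNT = 200

freq = pick_uniform_frequency(JOB_BUDGET_W, NODE_COUNT)

# Assumption: at the chosen frequency the workload actually draws
# 160 W per node, below the 200 W worst case used for planning.
measured = [160.0] * NODE_COUNT
stranded = stranded_power(JOB_BUDGET_W, measured)

print(f"uniform frequency: {freq} GHz, stranded power: {stranded} W")
```

In this example the worst-case assumption holds every node at 1.6 GHz, and 8 kW of the 40 kW allocation goes unused even though that headroom could have supported a higher frequency.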