In large-scale computing systems, the operations and activities of the system as a whole can create various undesired effects. For example, the simultaneous start of many different nodes in such a system can create an undesired spike in the power needed to carry out operations. Further, during an application run, huge swings in overall power consumption can be observed. Due to the cumulative effect of multiple components or nodes of a large-scale system operating simultaneously, the magnitude of these swings can be severe enough to cause a variety of issues and concerns.
Typically, prior to the launch of a computing application, the assigned compute nodes are idle. In most situations, idle nodes have been tuned to consume as little power as possible. At launch, however, these nodes go from minimal power consumption to maximum power consumption nearly instantaneously. In large installations, with many nodes and a wide variety of applications, the magnitude of this sudden increase in power over a short period of time can create a heavy drain on the power source. In some extreme cases, this large need for instantaneous power to support a large application launch can potentially lead to system failure. In certain facilities and/or sites, the system operator may also be contractually obligated to minimize the rate of change in its power consumption over time, in order to allow its service provider to maintain a certain quality of service for all of the provider's customers. The launch-time power spike described above can potentially put this contractual obligation at risk.
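The scale of the launch-time effect can be illustrated with simple arithmetic. The following is a minimal sketch only: the function name `aggregate_ramp` and every figure used (node count, idle and peak wattage, transition time) are hypothetical assumptions chosen for illustration, not values taken from any particular installation.

```python
def aggregate_ramp(num_nodes, idle_watts, peak_watts, ramp_seconds):
    """Return (total power step in watts, average ramp rate in watts per second).

    Assumes every node makes the same idle-to-peak transition over the
    same interval, which is the worst case described in the text.
    """
    if ramp_seconds <= 0:
        raise ValueError("ramp_seconds must be positive")
    step = num_nodes * (peak_watts - idle_watts)
    return step, step / ramp_seconds


# Hypothetical figures: 10,000 nodes, 100 W idle, 400 W at full load,
# all nodes transitioning within 2 seconds of launch.
step, rate = aggregate_ramp(10_000, 100.0, 400.0, 2.0)
print(f"step = {step / 1e6:.1f} MW, ramp rate = {rate / 1e6:.2f} MW/s")
```

Under these assumed figures, a 300 W change that is unremarkable for a single node becomes a 3 MW step for the installation as a whole, which is the collective effect at issue.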
In these large installations, it is thus beneficial to avoid or minimize large spikes or dramatic changes in power consumption for at least the reasons outlined above. Given the large number of devices included in these installations, however, this control requires consideration of the collective operating power demands of all components involved.
Generally, at least three spans of an application life cycle can present possibilities for undesired spikes in power: launch, runtime, and exit. At application launch, the target node set can go from minimal power consumption to maximal power consumption over a very short period of time. As suggested above, application runtime is a second portion of the application life cycle where undesired power spikes can occur. For example, massive swings in power consumption can occur during synchronization, where constituent parts of a parallel application often race towards synchronization barriers and then wait for the rest of the application to catch up. No work is done while waiting at these synchronization barriers, so minimal power is consumed. Once the remaining parts catch up, the application proceeds and again requires power to carry out its tasks. The rate at which the constituent parts of the application reach the barrier, however, could be significant enough to cause power ramp rate issues. These issues are even more likely at the point in time when all parts have reached the barrier and are released in unison. At this point, power consumption can jump nearly instantaneously from near minimum to near maximum.
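The barrier behavior described above can be sketched as a simple power trace. This is an illustrative model only, under stated assumptions: each part draws a fixed peak power while computing and a fixed idle power while waiting, and all parts are released the instant the last one arrives. The function names `barrier_power_trace` and `largest_step`, and the wattage and timing figures, are hypothetical.

```python
def barrier_power_trace(arrival_times, idle_watts, peak_watts):
    """Return total power after each barrier event as (time, watts) pairs.

    Assumed model: a part runs at peak_watts until it reaches the barrier,
    then waits at idle_watts. When the last part arrives, every part is
    released and returns to peak_watts at once.
    """
    if not arrival_times:
        raise ValueError("arrival_times must not be empty")
    n = len(arrival_times)
    arrivals = sorted(arrival_times)
    trace = [(0.0, n * peak_watts)]  # all parts computing at the start
    waiting = 0
    for t in arrivals[:-1]:
        waiting += 1
        trace.append((t, (n - waiting) * peak_watts + waiting * idle_watts))
    # Last arrival releases the barrier: everything returns to peak.
    trace.append((arrivals[-1], n * peak_watts))
    return trace


def largest_step(trace):
    """Largest absolute change in total power between consecutive events."""
    return max(abs(b[1] - a[1]) for a, b in zip(trace, trace[1:]))


# Hypothetical example: four parts arriving one second apart.
trace = barrier_power_trace([1.0, 2.0, 3.0, 4.0], idle_watts=100.0, peak_watts=400.0)
for t, watts in trace:
    print(f"t={t:.1f}s total={watts:.0f} W")
print("largest step:", largest_step(trace), "W")
```

In this model, power declines in small increments as parts arrive at the barrier, while the release produces a single step roughly as large as all of those increments combined, consistent with the observation that the release in unison is the more severe case.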
Another runtime circumstance where power ramp rate issues (or undesirable abrupt power swings) may occur is when many parts of an application stall while waiting for blocking I/O. In this circumstance, once the I/O is serviced, all stalled operations can potentially proceed at the same instant. Again, the collective operation of many nodes carrying out this same process can create undesired power ramp rate issues.
Additional runtime circumstances or conditions also have the potential to create concerns. These may include cases where a debugger is in use, such as processes utilizing breakpoints or single-stepping execution. In addition, nothing prevents an application writer from voluntarily suspending parts of the application, for whatever reason, at any point in time. This can create ramp rate issues due to the collective starting and stopping of application steps.
Application exit presents yet another potentially problematic situation, as the target node set goes from maximum power consumption to minimum power consumption. Exit is generally the third and last span of the application life cycle. In addition to the normal termination of an application, abnormal termination due to programmatic errors, system violations, or general system failures creates identical concerns and has the potential to create similar abrupt swings in power.
As the above situations and examples illustrate, several possible operations can create undesired spikes or abrupt transitions in the overall power being supplied to large-scale computing systems. The effect of these operations or activities, when considered individually, may not create issues or concerns. When the collective effects are considered, however, the power-related concerns are greatly amplified during operation of a large-scale computing system.