Large computer systems or large computing installations typically include several components integrated with one another so that they operate in a coordinated manner. Physically, this often involves multiple cabinets (also referred to as “racks” by some organizations) in a large-scale computing center, with each cabinet supporting several different computer cards. Each card may include one or more processors, and the cards are typically networked with one another. As is well recognized, the individual processors are often referred to as nodes, with several nodes contained within various slots of a cabinet. In certain circumstances, and particularly in high-performance computing environments, the coordinated operation of these multiple nodes, slots, cabinets and/or systems helps the overall system operate more efficiently. As one example, system events typically need to be coordinated so that the various nodes cooperate in an effective manner.
While coordinated operation of large-scale systems or large installations is necessary for effective operation, overall power control becomes a significant consideration, and one that is not typically monitored. Often, a “power on” or “power off” cycle involves all components and/or processors transitioning from one power state to another. Most often, this is done simultaneously, without concern for any potential adverse effects. Similarly, system boot-up operations (which often require additional processing power) are also often carried out without concern for collective adverse effects. When multiple systems are involved, however, and especially large-scale systems involving many different cabinets, card slots, processors, etc., the overall cumulative power effects can be significant.
As will be appreciated, rapid increases or decreases in the power consumed by large-scale systems can cause problems. Due to the number of systems involved, the collective effect can create megawatt-scale power fluctuations in very short periods of time (e.g., multi-megawatt changes in less than a second). This has the potential to create problems in the local power systems, the infrastructure (e.g., cooling systems), the power grid, and other power-related systems. In some instances, there are negative economic consequences as well: the power utility company may increase rates for customers with high power demand, or the fluctuation may violate the power contract between the customer and the utility. Related stresses on the power infrastructure can also cause service failures, power outages, and other negative effects. These problems are largely due to the inability of the power system to handle large swings or fluctuations in power demand in a very short period of time. In some instances, this may include multi-megawatt fluctuations in minutes or seconds. Again, this potential for rapid increases or decreases in power demand typically arises in large installations or large systems due to the number of components involved. These problems may not be readily apparent, since the operations of an individual processor or individual system are often considered in isolation. As such, there is a need to consider the causes of power swings and their cumulative effects in large-scale systems.
In many instances, software entities control power to some degree. Considered individually, this is entirely acceptable and does not create issues. That said, the collective effect in large-scale systems can be detrimental and undesirable. This is especially true when no overall system control is provided. Due to the typical operation of these software entities, bulk changes in component power states often result, thus generating significant power swings. Examples include the power cycling of all nodes at one particular time, the simultaneous powering of all slots, and/or the boot-up of a majority of nodes. Again, each of these operations, when carried out in a large-scale system, has the potential to cause significant power swings in a very short period of time.
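To make the cumulative effect concrete, the following minimal sketch estimates the aggregate change in power demand when every node in an installation changes power state at the same moment. It is illustrative only: the function names, the cabinet and node counts, the per-node wattages, and the transition time are all hypothetical assumptions and are not taken from any particular installation.

```python
def aggregate_swing_mw(node_count, watts_before, watts_after):
    """Total change in demand (megawatts) if every node changes state together."""
    return node_count * (watts_after - watts_before) / 1_000_000


def ramp_rate_mw_per_s(swing_mw, transition_seconds):
    """Average rate of change in demand (megawatts per second)."""
    return swing_mw / transition_seconds


# Hypothetical installation: 50 cabinets, each holding 192 nodes.
nodes = 50 * 192

# Hypothetical per-node draw: 50 W in standby, 400 W while booting.
swing = aggregate_swing_mw(nodes, watts_before=50, watts_after=400)

# Hypothetical transition time: all nodes begin booting within half a second.
rate = ramp_rate_mw_per_s(swing, transition_seconds=0.5)

print(f"aggregate swing: {swing:.2f} MW")   # aggregate swing: 3.36 MW
print(f"ramp rate: {rate:.2f} MW/s")        # ramp rate: 6.72 MW/s
```

Under these assumed figures, a per-node change of only 350 W becomes a multi-megawatt step in well under a second once it is multiplied across the whole installation, which is the kind of swing that is not apparent when a single node is considered in isolation.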
In light of the above-recognized possibilities for severe power swings, there is a need to provide some level of oversight and overall coordination. More specifically, a supervisory system is necessary to coordinate the operations of large-scale computing systems and thereby avoid undesirable operating conditions. In particular, there is a need to avoid severe power swings, that is, very significant changes in power consumption over short periods of time. This includes both significant increases and significant decreases in power, since either can create problems.