Organizations such as on-line retailers, Internet service providers, search providers, financial institutions, universities, and other computing-intensive organizations often conduct computer operations from large scale computing facilities. Such computing facilities house and accommodate a large amount of server, network, and computer equipment to process, store, and exchange data as needed to carry out an organization's operations. Typically, a computer room of a computing facility includes many server racks. Each server rack, in turn, includes many servers and associated computer equipment.
Because the computer room of a computing facility may contain a large number of servers, a large amount of electrical power may be required to operate the facility. In addition, the electrical power is distributed to a large number of locations spread throughout the computer room (e.g., many racks spaced from one another, and many servers in each rack). Usually, a facility receives a power feed at a relatively high voltage. This power feed is stepped down to a lower voltage (e.g., 110V). A network of cabling, bus bars, power connectors, and power distribution units is used to deliver the power at the lower voltage to numerous specific components in the facility.
From time to time, elements in the power chain providing power to electrical systems fail or shut down. For example, if a power distribution unit that provides power to electrical systems is overloaded, an overload protection device in the power distribution unit (for example, a fuse or breaker) may trip, shutting down all of the electrical systems that are receiving power through that line of the power distribution unit.
When a breaker protecting a branch of a power distribution system has tripped, power may be lost to all the electrical systems that receive power from that branch until the breaker has been reset. Maintenance personnel typically need to physically go to the rack to restore service to electrical systems (for example, by addressing the overload condition and manually resetting the breaker). The down-time associated with troubleshooting and correcting rack power distribution unit faults and shutdowns may result in a significant loss of computing resources. In some critical systems such as hospital equipment and security systems, down-time may result in significant disruption and, in some cases, adversely affect health and safety.
In many cases, moreover, a circuit breaker may trip at a relatively high level in the power distribution chain. For example, a circuit breaker may trip in a floor power distribution unit that supplies power to an entire rack, even though the source of the fault is limited to one load (for example, a short in a single power supply unit in a single server). Thus, the zone of impact of a fault condition (for example, the number of computing devices taken down) may extend well beyond the location of any particular fault.
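The relationship between the level at which a breaker trips and the resulting zone of impact may be illustrated by a minimal sketch. The sketch assumes a simple tree-shaped power chain; the class `PowerNode`, the function `zone_of_impact`, and all element names (`floor-pdu`, `rack-pdu`, `branch-a`, `server-1`, and so on) are hypothetical and chosen only for illustration. Every load fed through a tripped element is treated as losing power.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PowerNode:
    """One element of a hypothetical tree-shaped power distribution chain."""
    name: str
    children: List["PowerNode"] = field(default_factory=list)

    def loads(self) -> List[str]:
        """Return the names of all leaf loads fed through this element."""
        if not self.children:
            # An element with nothing downstream is treated as a load.
            return [self.name]
        result: List[str] = []
        for child in self.children:
            result.extend(child.loads())
        return result


def zone_of_impact(tripped: PowerNode) -> List[str]:
    """Loads that lose power when the breaker at `tripped` opens."""
    return tripped.loads()


# Assumed topology: one floor PDU feeding one rack PDU with two branches.
branch_a = PowerNode("branch-a", [PowerNode("server-1"), PowerNode("server-2")])
branch_b = PowerNode("branch-b", [PowerNode("server-3"), PowerNode("server-4")])
rack_pdu = PowerNode("rack-pdu", [branch_a, branch_b])
floor_pdu = PowerNode("floor-pdu", [rack_pdu])

# A short in server-1 that trips only its branch breaker affects two servers;
# the same fault tripping the floor PDU breaker affects all four.
impact_at_branch = zone_of_impact(branch_a)
impact_at_floor = zone_of_impact(floor_pdu)
```

In this assumed topology, the same single-server fault takes down twice as many servers when it is cleared at the floor power distribution unit rather than at the branch, which is the widening of the zone of impact described above.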
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.