Organizations such as on-line retailers, Internet service providers, search providers, financial institutions, universities, and other computing-intensive organizations often conduct computer operations from large scale computing facilities. Such computing facilities house and accommodate a large amount of server, network, and computer equipment to process, store, and exchange data as needed to carry out an organization's operations. Typically, a computer room of a computing facility includes many server racks. Each server rack, in turn, includes many servers and associated computer equipment.
Because the computer room of a computing facility may contain a large number of servers, a large amount of electrical power may be required to operate the facility. In addition, the electrical power is distributed to a large number of locations spread throughout the computer room (e.g., many racks spaced from one another, and many servers in each rack). Usually, a facility receives a power feed at a relatively high voltage. This power feed is stepped down to a lower voltage (e.g., 110V). A network of cabling, bus bars, power connectors, and power distribution units is used to deliver the power at the lower voltage to numerous specific components in the facility.
From time to time, elements in the power chain providing power to electrical systems fail or shut down. For example, if a power distribution unit that provides power to electrical systems is overloaded, an overload protection device in the power distribution unit (for example, a fuse or breaker) may trip, shutting down all of the electrical systems that are receiving power through that line of the power distribution unit.
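By way of illustration only, the overload scenario described above may be sketched in a few lines of Python. The class name `PduBranch`, its fields, the 16-amp breaker rating, and the per-server current draws are all hypothetical values chosen for the example; they do not describe any particular power distribution unit. The sketch shows only that a single trip on one line removes power from every system fed through that line.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PduBranch:
    """One breaker-protected line of a hypothetical power distribution unit."""

    breaker_rating_amps: float                      # assumed trip threshold
    loads_amps: Dict[str, float] = field(default_factory=dict)
    tripped: bool = False

    def total_draw(self) -> float:
        """Sum of the current drawn by every system on this line."""
        return sum(self.loads_amps.values())

    def connect(self, name: str, amps: float) -> None:
        """Attach a system; trip the breaker if the line becomes overloaded."""
        self.loads_amps[name] = amps
        if self.total_draw() > self.breaker_rating_amps:
            self.tripped = True

    def powered_systems(self) -> List[str]:
        """Systems still receiving power; none once the breaker has tripped."""
        return [] if self.tripped else sorted(self.loads_amps)


# Illustrative values only: a 16 A line feeding three 6 A servers.
branch = PduBranch(breaker_rating_amps=16.0)
branch.connect("server-1", 6.0)
branch.connect("server-2", 6.0)
print(branch.powered_systems())   # ['server-1', 'server-2']
branch.connect("server-3", 6.0)   # 18 A exceeds 16 A, so the breaker trips
print(branch.powered_systems())   # []
```

In this simplified model, the third server does not merely fail to power on; the trip also takes down the two servers that were previously operating normally on the same line.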
In many data centers, a system operator's first hint that a power distribution unit has failed or shut down is simply a message that one or more servers (or a rack of servers) are down or that communication from the server has been lost. Based on the “server down” or “rack down” message, the operator may have no immediate way of knowing the reason for the failure. For example, the processor on the server or an I/O component on the server(s) may have failed, the server may have been accidentally disconnected, etc. As such, maintenance personnel may need to physically go to the server in question and determine the source of the problem. The down-time associated with troubleshooting rack power distribution unit faults and shut downs may result in a significant loss of computing resources. In some critical systems such as hospital equipment and security systems, down-time may result in significant disruption and, in some cases, adversely affect health and safety.
Moreover, in many cases, the ultimate user of computing operations on a server may not even have physical access to the server or to elements in the power chain for the system. For example, customers of computing services conducted at co-location facilities may not have immediate physical access to the computing facility, and therefore must rely on the co-location operator to take corrective action on a power distribution failure or shut down.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.