Organizations such as on-line retailers, cloud computing providers, Internet service providers, search providers, financial institutions, universities, and other computing-intensive organizations often conduct computer operations from large-scale computing facilities. Such computing facilities house a large amount of server, network, and computer equipment to process, store, and exchange data as needed to carry out an organization's operations. Typically, a computer room of a computing facility includes many server racks. Each server rack, in turn, includes many servers and associated computer equipment.
Because the computer room of a computing facility may contain a large number of servers, a large amount of electrical power may be required to operate the facility. In addition, the electrical power is distributed to a large number of locations spread throughout the computer room (e.g., many racks spaced from one another, and many servers in each rack). Usually, a facility receives a power feed at a relatively high voltage. This power feed is stepped down to a lower voltage (e.g., 110 volts). A network of cabling, bus bars, power connectors, and power distribution units is used to deliver the power at the lower voltage to numerous specific components in the facility.
Some systems include dual power servers that provide redundant power for computing equipment. In some systems, an automatic transfer switch (“ATS”) device provides switching from a primary power system to a secondary (e.g., back-up) power system. In a typical system, the automatic transfer switch automatically switches the computing equipment to the secondary power system upon detecting a fault in the primary power. To maintain the computing equipment in continuous operation, the automatic transfer switch may need to make the transfer to the secondary power system rapidly (for example, within about 16 milliseconds).
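The transfer behavior described above can be sketched in simplified form. The function and names below are illustrative assumptions, not part of the source; the sketch only shows the decision an ATS device makes when the primary feed faults, and the consequence of missing the transfer window.

```python
# A minimal sketch (names and structure assumed, not from the source) of the
# decision an ATS device applies: keep the load on the primary feed while it
# is healthy, and on a fault, transfer to the secondary feed within the
# transfer window so the computing equipment stays in continuous operation.

TRANSFER_WINDOW_MS = 16  # approximate limit noted above for continuous operation


def select_feed(primary_ok: bool, elapsed_ms: float) -> str:
    """Return which power feed should carry the load.

    primary_ok -- whether the primary feed is within tolerance
    elapsed_ms -- time elapsed since the primary fault was detected
    """
    if primary_ok:
        return "primary"
    if elapsed_ms <= TRANSFER_WINDOW_MS:
        # Transfer completed within the window; the load stays powered.
        return "secondary"
    # Transfer too slow; equipment may have lost power during the gap.
    return "secondary-late"


# Example: a fault handled 5 ms after detection keeps the load powered.
print(select_feed(True, 0.0))    # primary
print(select_feed(False, 5.0))   # secondary
print(select_feed(False, 40.0))  # secondary-late
```

In a real ATS the fault detection and mechanical or solid-state transfer happen in hardware; the sketch only captures the timing constraint that motivates the roughly 16 millisecond figure.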
In some systems, if the ATS device coupled to a rack system fails (for example, due to an overcurrent condition in the automatic transfer switch), the system may no longer be able to automatically switch to back-up power during a primary system failure. Monitoring and maintenance of ATS devices can thus mitigate the risk of such a loss of capability.
In some cases, ATS devices, as well as various other devices in a data center, may be physically monitored and manually maintained by technicians at the data center. For an organization that operates multiple data center devices, such as ATS devices, at multiple locations, such as multiple data centers, monitoring and maintaining all of the devices may require multiple technicians for the multiple locations and may be time-consuming, due to the quantity and distribution of the devices.
Furthermore, where a finite number of technicians are available, monitoring of a given device may occur infrequently, occasionally, etc. For example, an ATS device may be tested and monitored during installation of a new rack computing system, but otherwise may simply be visually inspected at other times. In addition, in some data centers, devices such as ATS devices may be installed underneath a raised floor, making visual inspection more difficult.
In some cases, maintenance is performed manually on a data center device, such as an ATS device. As a result, maintenance, such as firmware upgrades, may need to be implemented on each device one-by-one. For an organization that operates multiple data center devices at multiple locations, such manual maintenance may be time-consuming and costly.
In addition, with regard to many data center devices, such as ATS devices, health issues with an individual device, such as performance losses, absent physical monitoring by a physically-present technician, may not become evident to a technician until the device fails to perform one or more functions, at which point an alarm may be generated from a device in a rack computing system, such as a Top of Rack (“TOR”) switch. Such an alarm may indicate a general performance issue associated with the rack computing system but may not indicate with particularity the nature of the failure, nor the particular cause, such that a technician arriving at the rack computing system may be required to manually access the failing device to diagnose the cause of the failure.
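The diagnostic gap described above can be illustrated with a hypothetical example. All names, fields, and values below are invented for illustration: a rack-level alarm from a TOR switch reports only a general symptom, while a device-level report would name the specific cause, sparing the technician an on-site diagnosis.

```python
# Hypothetical illustration (all names and fields invented) of why a
# rack-level alarm lacks diagnostic particularity: the TOR switch can report
# only that the rack lost power, while a device-level health report would
# identify the failing component and the specific cause.

rack_alarm = {
    "source": "TOR-switch",
    "event": "rack power loss",   # general symptom only
}

ats_report = {
    "source": "ATS-17",
    "event": "transfer failure",
    "cause": "overcurrent condition on primary input",  # specific, actionable
}


def needs_onsite_diagnosis(alarm: dict) -> bool:
    """A technician must manually inspect the device when no cause is given."""
    return "cause" not in alarm


print(needs_onsite_diagnosis(rack_alarm))  # True
print(needs_onsite_diagnosis(ats_report))  # False
```

The contrast motivates remote, device-level monitoring: when the report itself carries the cause, the failure can be diagnosed without manually accessing the device.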
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.