The emergence of the cloud for computing applications has increased the demand for off-site installations, known as data centers, that store data and run applications accessed by remotely connected computer device users. A typical data center has physical chassis structures with attendant power and communication connections. Each rack may hold multiple network devices such as computing and storage servers and may constitute a multi-node server system.
A conventional multi-node chassis server system typically includes a chassis management controller (CMC), a plurality of computing nodes, a cluster of hard disks (termed the storage node), a cluster of all of the power supply units (PSUs) on a power distribution board (PDB), and a midplane to connect all the functional boards. Each of the computing nodes can include a baseboard management controller (BMC), a platform controller hub (PCH), and one or more central processing units (CPUs). The BMC manages power and operating parameters for the node. The CMC communicates with the BMC of each node by Intelligent Platform Management Interface (IPMI) commands. The CMC obtains information relating to the multi-node system in order to control or monitor the power supply units on the PDB.
The power supply units supply electrical power to an entire multi-node chassis server system. The primary function of a power supply unit is converting electric power from an AC source to the correct DC voltage and DC current for powering components on the server system. The power from the power supply unit is supplied via mechanical components, such as cables, to other server system boards, such as those for computing nodes, storage devices, and fans.
One effect occurring with a multi-component chassis is a temperature rise generated by large currents flowing through mechanical components to the nodes. The temperature rise is generated primarily from connectors or cables that have larger electrical contact or conductive resistance. According to the Joule effect, when large currents flow through mechanical components, the temperature will increase. Such temperature rises cause plastic aging and insulation recession in connectors and cables, thereby resulting in damage or burnout of the server system.
In prior server system designs, more mechanical components are used to meet high-current design specifications (such as a system full loading current rate and a temperature rise of less than 30 degrees) and thereby compensate for the effects of temperature rise. The standard protection against temperature rise is to over-design mechanical components for reliability. Such over-design results in more expensive components.
All mechanical components carrying current in normal use have a resistance. Current passing through the mechanical components causes a voltage drop and thus a temperature rise. The voltage drop results in a power loss equal to the product of the voltage drop and the current flow. The voltage drop, V, may be calculated by V = I × R, where V = the voltage drop across a connector or cable, I = the system loading current, and R = the resistance of the connector or cable. The power loss, P, may be calculated by P = V × I = (I × R) × I = I²R, where P = the power loss of the system.
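The two relations above can be worked through numerically. The following is a minimal sketch in Python; the function names and the operating point (50 A through a 2 milliohm connector) are hypothetical values chosen only for illustration and do not come from any datasheet.

```python
def voltage_drop_v(current_a: float, resistance_ohm: float) -> float:
    """Voltage drop across a connector or cable: V = I x R."""
    return current_a * resistance_ohm


def power_loss_w(current_a: float, resistance_ohm: float) -> float:
    """Power dissipated in the connector or cable: P = V x I = I^2 x R."""
    return voltage_drop_v(current_a, resistance_ohm) * current_a


# Hypothetical operating point (assumed for illustration only).
LOADING_CURRENT_A = 50.0
CONNECTOR_RESISTANCE_OHM = 0.002

drop = voltage_drop_v(LOADING_CURRENT_A, CONNECTOR_RESISTANCE_OHM)
loss = power_loss_w(LOADING_CURRENT_A, CONNECTOR_RESISTANCE_OHM)
print(f"voltage drop: {drop:.3f} V, power loss: {loss:.2f} W")
```

Because the loss scales with the square of the current, doubling the loading current in this sketch quadruples the power dissipated in the connector.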
The PSUs compensate for the loss of power from voltage drops associated with the temperature rise of mechanical components by reading a remote sensing signal to determine the voltage drop. In known power systems, the output of a PSU is therefore raised to a higher voltage level by adjusting a feedback signal derived from the remote output voltage of the PSU. For a load drawing a given power, raising the system voltage reduces the current, which in turn reduces the temperature rise in the mechanical components of the power system and extends the lifetime of these components.
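The compensation described above can be illustrated with a small calculation. The sketch below assumes a constant-power load, which is an assumption made here to show why a higher voltage implies a lower current. The function names, the 1200 W load, and the 2 milliohm path resistance are likewise hypothetical.

```python
def load_current_a(load_power_w: float, load_voltage_v: float) -> float:
    """Current drawn by a constant-power load: I = P / V."""
    return load_power_w / load_voltage_v


def path_loss_w(current_a: float, path_resistance_ohm: float) -> float:
    """Power dissipated in the connector or cable: P = I^2 x R."""
    return current_a ** 2 * path_resistance_ohm


def psu_setpoint_v(load_voltage_v: float, current_a: float,
                   path_resistance_ohm: float) -> float:
    """PSU output needed so that the remote sense point sees load_voltage_v."""
    return load_voltage_v + current_a * path_resistance_ohm


# Hypothetical operating point (assumed for illustration only).
LOAD_POWER_W = 1200.0
PATH_RESISTANCE_OHM = 0.002

for load_voltage_v in (12.0, 12.3):
    current = load_current_a(LOAD_POWER_W, load_voltage_v)
    loss = path_loss_w(current, PATH_RESISTANCE_OHM)
    setpoint = psu_setpoint_v(load_voltage_v, current, PATH_RESISTANCE_OHM)
    print(f"{load_voltage_v:.1f} V at load: {current:.1f} A, "
          f"{loss:.1f} W lost in path, PSU output {setpoint:.2f} V")
```

Under these assumed values, the higher load voltage corresponds to a lower current and a lower loss in the path, which is the effect the paragraph attributes to raising the PSU output.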
In system design, de-rating is an intentional process applied to every component of a server system to reduce the likelihood that a component experiences more stress than it is capable of withstanding. Based on de-rating considerations, the mechanical components selected (such as cables with a lower American Wire Gauge [AWG] number, i.e., thicker wire) must meet the system design requirements (e.g., full loading current and voltage level). The relevant document for assessing temperature rises is the EIA 364 D: TP-70B paper, titled “Temperature Rises vs. Currents of Electrical Connectors and Sockets” (June 1997), published by the Electronic Components Industry Association (ECIA). As explained in this paper, the current rating is based on the temperature rise of a connector under current flow. The temperature rise is defined as the difference between the ambient temperature and the hottest point, the hot spot, on the energized contact. The most common temperature rise criterion is a 30-degree Centigrade difference. FIG. 1 is a graph showing conventional temperature rise charted against current per contact. The graph shown in FIG. 1 includes a curve 10 that represents temperature rise in relation to current for a four-pin power assignment and a curve 12 that represents temperature rise in relation to current for a six-pin power assignment.
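The temperature rise definition and the 30-degree criterion above can be expressed as a simple check. This is a minimal sketch. The function names and sample temperatures are hypothetical, and treating the criterion as a strict "less than" comparison is an assumption made here.

```python
TEMP_RISE_LIMIT_C = 30.0  # common criterion cited in the text


def temperature_rise_c(hot_spot_c: float, ambient_c: float) -> float:
    """Temperature rise: hot spot on the energized contact minus ambient."""
    return hot_spot_c - ambient_c


def meets_rise_criterion(hot_spot_c: float, ambient_c: float,
                         limit_c: float = TEMP_RISE_LIMIT_C) -> bool:
    """True when the rise stays below the limit (strict comparison assumed)."""
    return temperature_rise_c(hot_spot_c, ambient_c) < limit_c


# Hypothetical measurements (assumed for illustration only).
print(meets_rise_criterion(hot_spot_c=60.0, ambient_c=35.0))
print(meets_rise_criterion(hot_spot_c=70.0, ambient_c=35.0))
```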
FIG. 2 is a resistance curve graph for a conventional connector. The graph in FIG. 2 shows a line 20 that represents the resistance of the connector over time. Aging is defined as the cumulative effects that occur over time to mechanical components. If unchecked, these effects can lead to loss of functionality and potential reliability issues. The effects may be charted from a power connector datasheet or from vendor-provided data relating power cycles to DC resistance. Although the effects of temperature rise are generally known, there is no way to predict when such effects will impede the operation of mechanical components, resulting in system failure.
Power supply units convert the AC voltage to the DC voltage according to the system design, and a remote sensor compensates the output for the sensed voltage drop. The output voltage of the power supply unit is guaranteed to meet certain upper and lower limit values of a predetermined operational zone. For example, a 12 V power supply unit may have a typical output of 12 V, a minimum output of 11.4 V, and a maximum output of 12.6 V.
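The operational zone in the 12 V example can be sketched as a range check. The limits below are the ones given in the text, which correspond to plus or minus 5% of the 12 V typical output. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputWindow:
    """Guaranteed output range of a PSU rail, in volts."""
    typical_v: float
    min_v: float
    max_v: float

    def contains(self, measured_v: float) -> bool:
        """True when the measured output lies inside the operational zone."""
        return self.min_v <= measured_v <= self.max_v


# Limits taken from the 12 V example in the text.
WINDOW_12V = OutputWindow(typical_v=12.0, min_v=11.4, max_v=12.6)

for measured_v in (11.3, 12.0, 12.7):
    status = "in range" if WINDOW_12V.contains(measured_v) else "out of range"
    print(f"{measured_v:.1f} V: {status}")
```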
The conditions for over-voltage protection are generally detected locally. The power supply generally shuts down in a latch-off mode upon an over-voltage condition on the DC output. This latch may be cleared by toggling the PSON signal or by an AC input re-cycle/re-plug. The PSU output voltage levels are measured at the pins of the PSU card edge receptacle with minimum and maximum output loads. Traditional power sensing and feedback designs neither detect power connector status nor predict power connector thermal aging and lifetime. Thus, prior art systems suffer from the effect of repetitive transients on insulation lifetime and dielectric capabilities. In past designs, there is no detection or monitoring of power connector temperature rise and the subsequent voltage drop. Therefore, the system does not detect the effect of temperature rises on the mechanical components.
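The latch-off behavior described above can be modeled as a small state machine. This is an illustrative sketch only. The class name, method names, and the 13.5 V threshold are hypothetical and do not represent any particular PSU's firmware or specification.

```python
class LatchingOvpModel:
    """Toy model of a PSU that latches off on over-voltage."""

    def __init__(self, ovp_threshold_v: float) -> None:
        self.ovp_threshold_v = ovp_threshold_v
        self.latched_off = False

    def sample_output(self, output_v: float) -> bool:
        """Check one local output reading; return True if output stays enabled."""
        if output_v > self.ovp_threshold_v:
            # Over-voltage on the DC output: shut down and stay latched.
            self.latched_off = True
        return not self.latched_off

    def toggle_pson(self) -> None:
        """Toggling the PSON signal clears the latch."""
        self.latched_off = False

    def ac_recycle(self) -> None:
        """Re-cycling or re-plugging the AC input also clears the latch."""
        self.latched_off = False


# Hypothetical threshold (assumed for illustration only).
psu = LatchingOvpModel(ovp_threshold_v=13.5)
print(psu.sample_output(12.0))  # normal reading
print(psu.sample_output(14.0))  # over-voltage trips the latch
print(psu.sample_output(12.0))  # still off: the latch holds
psu.toggle_pson()
print(psu.sample_output(12.0))  # enabled again after the PSON toggle
```

Note that this model reacts only to the locally measured output voltage. It has no input for connector temperature or connector voltage drop, which mirrors the limitation of past designs described in the paragraph.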
Thus, there is a need for feedback voltage drop reporting across all nodes of a multi-node system at a particular node for detecting temperature rises in mechanical connection components. There is a further need for a system that allows adjusting power to address temperature rise effects in mechanical connection components. There is a further need for a detection system to predict when mechanical connector components may fail because of temperature rise effects. There is also a need for an intelligent neural network to determine optimal values to address temperature rise effects and provide data to predict the failure of mechanical components from temperature rise effects.