Field of the Invention
The present invention relates to methods for extending the time to failure of components in a server.
Background of the Related Art
A datacenter may contain and facilitate the use of a large number of computer servers or compute nodes. Each compute node includes a large number of individual components that support the compute node in performing a workload. The overall capacity of a compute node is a function of the capacity and the number of the individual components.
When one of the individual components fails or experiences a high error rate, the overall capacity of the compute node declines. At some point it may be necessary to replace the damaged component in order to regain the full capacity or functionality of the compute node. Such replacement requires the compute node to be taken out of service for a period of time and incurs both the cost of the replacement component and the cost of the labor to install it.
One approach to reducing component failures is to design more robust components having enhanced reliability and an extended life. However, such components are generally more expensive, and the system within which the component is installed will generally become obsolete after a period of years, so the added component life may go unused. Another approach is to provide redundant components so that a failure does not lead to loss of data or system downtime. However, the extra components needed to provide redundancy similarly increase the cost of the system, and a failed component must still be replaced in order to maintain the same level of redundancy.