1. Field of the Invention
The present invention relates to techniques for detecting impending problems in computer systems. More specifically, the present invention relates to a method and apparatus for predicting the remaining useful life of a system component or a computer system.
2. Related Art
For many safety-critical applications of computers, it is not sufficient to know whether a component, a group of components, or a computer is healthy or at risk; the user also needs to know, with a high confidence factor, the “remaining useful life” (RUL) of the components or the computer. RUL estimation capability is important, for example, in scenarios such as the following. Suppose one is planning a mission-critical operation (for example, a battle situation) that may last 72 hours. Before committing an asset and one or more human lives to the operation, one needs to know whether the RUL of every computer aboard the asset is longer than 72 hours, and it is useful to know this with a quantitative confidence factor.
In addition to being crucial for life-critical applications, RUL estimation is also beneficial for many commercial applications that use enterprise servers. For example, consider a scenario where a server at a customer data center starts issuing warning flags in the middle of a busy workday. In this situation, the account team would likely want to know whether the problematic field replaceable unit (FRU) needs to be swapped as soon as possible, or whether the server could continue operating until a scheduled shutdown on Saturday night. RUL estimation capability could provide a significant return on investment in such situations.
Currently, a commonly used technique for assessing the reliability of a system component or a computer system is to estimate a mean-time-between-failure (MTBF) for the system component or the computer system. However, an MTBF estimation is a fairly crude measure that provides little insight into how long a computer system or a system component is likely to continue operating based on the current operational state of the computer system.
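The limitation of an MTBF figure can be illustrated with a short sketch. The sketch assumes a constant failure rate (an exponential lifetime model), which is a common simplification and is not stated in the text above; the function names and the numeric values are hypothetical and chosen only for illustration. Under that assumption, the probability of surviving a 72-hour mission is the same for a new unit and for a unit that has already run for years, so the estimate carries no information about the current operational state.

```python
import math


def survival_probability(mission_hours: float, mtbf_hours: float) -> float:
    """Probability of no failure during the mission.

    Assumes a constant failure rate (exponential model): R(t) = exp(-t / MTBF).
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if mission_hours < 0:
        raise ValueError("mission_hours must be non-negative")
    return math.exp(-mission_hours / mtbf_hours)


def conditional_survival(age_hours: float, mission_hours: float,
                         mtbf_hours: float) -> float:
    """Probability of surviving the mission given the unit has already
    survived age_hours, computed as R(age + mission) / R(age)."""
    return (survival_probability(age_hours + mission_hours, mtbf_hours)
            / survival_probability(age_hours, mtbf_hours))


if __name__ == "__main__":
    MTBF_HOURS = 50_000.0   # hypothetical vendor figure
    MISSION_HOURS = 72.0    # mission length from the scenario above

    new_unit = survival_probability(MISSION_HOURS, MTBF_HOURS)
    aged_unit = conditional_survival(30_000.0, MISSION_HOURS, MTBF_HOURS)

    # Both values agree to within floating-point error: under this model
    # the unit's age and present condition do not change the estimate.
    print(f"new unit:  {new_unit:.6f}")
    print(f"aged unit: {aged_unit:.6f}")
```

Because the two printed probabilities coincide, an estimate derived solely from MTBF cannot distinguish a healthy unit from one that is already showing signs of degradation.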
Another existing technique that can be used to provide RUL predictions for a system component involves directly monitoring the operation of individual system components. However, while this technique can provide an accurate RUL estimate for a single component, it is not always feasible to apply this technique to a large number of components. Furthermore, it is difficult to make accurate predictions for a set of components or a system based on measurements from only a few components.
Hence, what is needed is a method and an apparatus that can provide users with an accurate RUL prediction for a component or a computer system without the above-described problems.