With the increase in electronic commerce, enterprise computing systems are becoming more complex, and businesses and organizations are expected to rely on them increasingly. Thus, any failure of a computer system can be very costly in terms of time and money and can undermine consumer confidence in the business to which the failed computer system belongs. However, conventional maintenance approaches, which exploit idle or low-load periods to replace components identified as faulty, have major limitations.
For example, conventional ways of scheduling system maintenance require knowledge of the exact form of the degraded failure rate of a given computer system. Such information is not only nearly impossible to obtain in practice, but can also change after each replacement of a system component. Furthermore, conventional monitoring of physical variables in computer servers, such as temperature, voltage, current, and revolutions per minute (RPM), is performed via threshold limit rules that generate an alarm condition if a variable starts to go out of specification. A threshold crossing event triggers a maintenance action. However, such threshold limit rules suffer from high false and/or missed alarm rates that significantly diminish the value of preventive maintenance. Also, conventional approaches rely on passive fault detection in computer systems and typically do not actively probe the condition of electronic components.
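The threshold limit rule described above can be sketched as follows. This is a minimal illustration only, not taken from any particular server monitoring product; the class name `ThresholdRule`, the function `check_readings`, the variable names, and all limit values are hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRule:
    """A fixed specification window for one monitored variable (hypothetical)."""
    variable: str
    low: float
    high: float

    def is_alarm(self, value: float) -> bool:
        # An alarm is raised as soon as a reading leaves the window.
        return value < self.low or value > self.high


def check_readings(rules, readings):
    """Return the names of variables whose current reading is out of specification."""
    alarms = []
    for rule in rules:
        value = readings.get(rule.variable)
        if value is not None and rule.is_alarm(value):
            alarms.append(rule.variable)
    return alarms


# Illustrative limits only; real limits depend on the hardware.
RULES = [
    ThresholdRule("temperature_c", 10.0, 75.0),
    ThresholdRule("voltage_v", 11.4, 12.6),
    ThresholdRule("fan_rpm", 2000.0, 12000.0),
]
```

Because each reading is compared against fixed limits in isolation, a single noisy sample above a limit raises an alarm (a false alarm), while a variable that drifts steadily but stays inside its window raises none (a missed alarm).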
FIG. 1 illustrates a prior art scheme 100 for providing schedule-based computer server maintenance. The illustrated traditional scheme 100 provides a computer maintenance system for server 102 that follows a pre-specified maintenance schedule 104 (e.g., change a server cooling air filter every 3 months). During server operation, when a server failure 112 occurs for server 102 or a pre-specified maintenance action is due 114 according to the pre-specified maintenance schedule 104, maintenance action selector module 108 receives and processes a request for maintenance. This scheme is known as the conventional schedule-based maintenance strategy. One prominent drawback of this scheme 100 is that, in numerous situations, the maintenance that is due 114 is performed even when the computer system does not require it, which adds to the cost of maintaining the computer system and leads to additional system failures that are maintenance-induced.
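The triggering logic of such a schedule-based scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of scheme 100; the function name `maintenance_request`, its parameters, and the returned labels are hypothetical.

```python
from datetime import date, timedelta


def maintenance_request(today, last_service, interval_days, server_failed):
    """Return the reason for a maintenance request, or None if none is raised.

    A request is raised either by a server failure or because the
    pre-specified interval since the last service has elapsed.
    """
    if server_failed:
        return "failure"
    if today >= last_service + timedelta(days=interval_days):
        # The actual condition of the server is never consulted here.
        return "scheduled"
    return None
```

For example, with a 90-day interval and a last service on January 1, 2024, a check on April 1, 2024 yields a scheduled request whether or not the component actually needs replacing, which is the drawback noted above.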