1. Field of the Invention
The present invention generally relates to server maintenance and more specifically to predicting server failure based on historic server failures.
2. Description of the Related Art
Servers have become the heart and soul of modern computing infrastructures. Most businesses maintain servers to perform business functions, as well as to provide core Information Technology (IT) services. For example, an e-commerce business may maintain a server containing the business website. The e-commerce server may perform business functions including displaying products, handling online orders and inventory management. The server may also perform critical IT functions including email, file storage, print and database services. Because such businesses are highly dependent on the proper functioning of such a server, the reliability of the server becomes critical to ensure the smooth running of the business.
However, servers are inherently prone to failure, which may be caused by hardware faults, software faults, or both. The loss of service and the fault correction costs associated with such failures may prove very expensive for users of high-end servers, where customer demand is high and incessant. Therefore, it is necessary to identify and understand the causes of server failures and to take corrective action before a failure occurs.
Server failures fall into two categories: predictable failures and unpredictable failures. Predictable failures are characterized by the degradation of an attribute over time, resulting in eventual server failure. It may be possible to make a reasonably accurate prediction of threshold values at which server failure may occur. Therefore, it may be possible to avoid server failure by monitoring attribute values and taking corrective measures as the values approach a predetermined threshold.
Mechanical failures, which account for sixty percent of hard disk failures, are typically considered predictable. Monitoring the physical attributes of components may therefore facilitate failure prediction and prevention. For example, it is possible to monitor, in real time, attributes of a hard disk such as disk spin time, temperature, distance from head to disk, etc. If values for these attributes approach threshold values, a user may be prompted with a warning to take corrective measures such as backing up data and replacing the disk.
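The threshold monitoring described above may be illustrated with a minimal sketch. The attribute names, threshold values, and the 90 percent warning margin below are hypothetical and chosen only for illustration. The sketch also assumes that larger readings are worse, so an attribute such as head-to-disk distance, where smaller values are worse, would need the comparison reversed.

```python
from dataclasses import dataclass


@dataclass
class AttributeLimit:
    """Hypothetical failure threshold for one monitored attribute."""
    name: str
    threshold: float            # value at which failure is assumed likely
    warn_fraction: float = 0.9  # warn once a reading reaches this share of the threshold


def attributes_near_threshold(readings, limits):
    """Return the names of attributes whose readings approach their thresholds.

    Assumes larger readings are worse. Attributes with no reading are skipped.
    """
    near = []
    for limit in limits:
        value = readings.get(limit.name)
        if value is None:
            continue
        if value >= limit.threshold * limit.warn_fraction:
            near.append(limit.name)
    return near


# Illustrative limits only; real values would come from the component vendor.
LIMITS = [
    AttributeLimit("temperature_c", threshold=60.0),
    AttributeLimit("spin_up_ms", threshold=8000.0),
]

if __name__ == "__main__":
    sample = {"temperature_c": 56.0, "spin_up_ms": 4000.0}
    for name in attributes_near_threshold(sample, LIMITS):
        print(f"Warning: {name} is approaching its threshold; "
              "consider backing up data and replacing the disk.")
```

In this sketch a warning is raised before the threshold itself is reached, which corresponds to prompting the user while corrective measures are still possible.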
However, because software does not fatigue, wear out, or burn out, software failures may be more difficult to predict. Software problems, unlike hardware problems, tend to be event-driven or input-driven rather than time-driven. Furthermore, software problems may be much more complex than hardware problems. Some common causes of software problems include software design flaws, unexpected or mishandled events, and corrupt data.
While current forecasting approaches can predict the number of faults expected for a software server, these approaches are not able to predict when such faults are likely to appear. Therefore, they provide no solution for preventing software failures. Moreover, predicting software failures may require developing a set of constraints for a particular software configuration, and those constraints may have to be found within the complicated code of the software. Because software changes frequently (through software updates, for example), this tedious analysis may have to be repeated at each change, which may be impractical.
Therefore, what is needed is a method and system for predicting software server failures before they happen.