Networks or clusters of computers are used for all types of applications in the modern world. In order for these clusters to function efficiently and effectively, it is important that the individual computers that make up the cluster function properly. If any individual computer in the cluster unexpectedly fails, the effect on the cluster can be catastrophic and cascading. Once an error in the cluster has been detected, it is often possible to take corrective measures to minimize the harm to the overall functioning of the cluster. However, given the speed of modern business and the importance of certain computer clusters, even small amounts of downtime can prove extremely costly. Therefore, it would be very advantageous to be able to predict errors or system failures and take corrective action prior to their occurrence.
Several techniques have previously been proposed in the literature for using proactive system management to improve the performance of computer clusters. Some of these techniques have included attempts to predict the occurrence of failures and the use of software rejuvenation. Successful prediction of errors in a computer system, in particular, offers the promise of enabling significantly improved system management. However, prior techniques for predicting errors have been unreliable and have had several other deficiencies that have prevented them from being widely accepted. Therefore, what is needed is an improved method of predicting the occurrence of errors in a computer cluster and transforming the system to minimize the impact of the predicted errors.