1. Technical Field
This invention generally relates to data processing, and more specifically relates to apparatus and methods for predicting failures in networked computer systems and acting on the predicted failures.
2. Background Art
Electronics and computers have greatly enhanced worker productivity in our modern information age. Much attention and many resources have been directed to making electronic and computer systems more reliable. When an electronic component or system fails, it can lead to system-wide failures that can cost companies millions of dollars. In this context, preventing system failures has become a necessity.
Early attempts at preventative maintenance simply replaced certain components after a specified period of time in use. While this approach is effective if the time periods are selected with care, it is also very expensive, and leads to replacing components that do not need to be replaced. Some individual components can perform without problems for much longer than the mean time of operation without failures. By replacing components that do not need to be replaced, the cost of maintenance becomes excessive.
Another approach to preventative maintenance is to monitor components and replace a component when its operating parameters indicate that it may fail soon. This approach was pioneered by IBM in the early 1990s, and led to a concept referred to as Predictive Failure Analysis (PFA). Predictive failure analysis was first applied to hard disk drives. PFA status is in a normal state when the disk drive is operating correctly, and is in a “tripped” state when the PFA in the drive indicates that a failure will occur soon. A hard drive that has PFA capability monitors its internal functions, and indicates when the functions are outside of predefined limits by “tripping” a signal that indicates that the disk drive is about to fail. For example, a PFA status may be tripped if the fly height of a head is outside of specified limits, or if the error rate in the hard disk drive exceeds a specified limit. By indicating via the PFA status on a disk drive that a failure will likely happen soon, the system administrator has enough time to copy the contents of the disk drive to a backup source, replace the drive, and write the data from the backup to the new drive. PFA is thus an important tool that allows replacing a disk drive that may fail soon without loss of data.
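The tripping behavior described above can be illustrated with a minimal Python sketch. It is not the logic of any actual drive firmware: the names `PFAStatus`, `DriveLimits`, and `pfa_status`, and every numeric limit, are hypothetical values chosen only for illustration. The sketch reports a tripped status whenever the fly height leaves its permitted range or the error rate exceeds its limit.

```python
from dataclasses import dataclass
from enum import Enum


class PFAStatus(Enum):
    NORMAL = "normal"
    TRIPPED = "tripped"


@dataclass(frozen=True)
class DriveLimits:
    # Hypothetical limits for illustration; real limits are vendor-specific.
    min_fly_height_nm: float = 10.0
    max_fly_height_nm: float = 30.0
    max_error_rate: float = 1e-6


def pfa_status(fly_height_nm: float,
               error_rate: float,
               limits: DriveLimits = DriveLimits()) -> PFAStatus:
    """Return TRIPPED if any monitored value is outside its predefined limits."""
    fly_height_ok = (limits.min_fly_height_nm
                     <= fly_height_nm
                     <= limits.max_fly_height_nm)
    error_rate_ok = error_rate <= limits.max_error_rate
    if fly_height_ok and error_rate_ok:
        return PFAStatus.NORMAL
    return PFAStatus.TRIPPED
```

Under these assumed limits, a call such as `pfa_status(5.0, 1e-9)` yields `PFAStatus.TRIPPED` because the fly height is below the permitted range, even though the error rate is acceptable.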
Recognizing the value of predicting failures in disk drives, some competitors of IBM have implemented disk drives that have a S.M.A.R.T. interface, which stands for Self-Monitoring, Analysis and Reporting Technology. The S.M.A.R.T. interface is a specification of a set of registers in a device that contains information relating to the device's operation. No details are provided regarding the specific types of measurements that should be made or the values that indicate an impending failure. For this reason, S.M.A.R.T. compatible disk drives are much less sophisticated than IBM disk drives that include Predictive Failure Analysis.
Predictive Failure Analysis has been implemented into components such as disk drives and printers. Communication of information relating to predicted failures has been limited so far to the box level of a computer system, which means that a component inside a computer reports predictive failure information within its own box, but this information has typically not been used or communicated outside of a particular computer system. With the popularity of computer networks, it would be useful to share predictive failure analysis information between computer systems on a network. Furthermore, by detecting when certain computer systems may fail, it may be possible to reroute a network request to avoid a computer system or network path that may fail according to its predictive failure information. Without an apparatus and method for communicating predictive failure information between computer systems on a network and for dynamically rerouting a network request to avoid computer systems and network paths that may fail, the computer industry will continue to suffer from predictive failure information that is isolated within a system, with the result that failures that were predicted in individual computer systems may cause errors in inter-system communications over the network.
According to the preferred embodiments, an apparatus and method share predictive failure information between computer systems in a computer network. The shared predictive failure information allows dynamically rerouting a network request to avoid a computer system that may fail according to its predictive failure information. According to a first embodiment, if the requested resource on the network has predictive failure information that indicates the resource may soon fail, a message is returned to the requesting computer with information that includes possible alternative sites from which the information may be obtained. If there is an alternative site, the requesting computer system may access the alternative site, thereby avoiding the computer system that may soon fail. If there is no alternative site, the requesting computer system may return an error message, or may simply access the original resource on the chance that it has not yet failed. According to a second embodiment, a router in the network may indicate one or more alternative paths to a resource if the predictive failure information for the router indicates it may soon fail. The requesting computer system may then access the requested resource via the alternative path. In this manner, predictive failure information can be used in rerouting network traffic between computer systems on a network to minimize the effect of a failing computer system.
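The request flow of the first embodiment can be sketched in Python as follows. This is an illustrative model only, not the claimed implementation: the classes `Server` and `Response`, the function `fetch`, the `force` and `fallback_to_original` parameters, and the in-memory dictionary standing in for the network are all assumptions introduced for this example. A resource whose predictive failure status is tripped answers with a list of alternative sites rather than the data; the requester then tries each alternative, and if none is usable either accesses the original resource anyway or reports an error.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Response:
    ok: bool
    data: Optional[str] = None
    alternatives: List[str] = field(default_factory=list)


@dataclass
class Server:
    name: str
    content: str
    may_fail_soon: bool = False  # predictive failure status (hypothetical flag)
    alternatives: List[str] = field(default_factory=list)

    def handle(self, force: bool = False) -> Response:
        # If failure is predicted, return alternative sites instead of data,
        # unless the requester explicitly insists on this resource.
        if self.may_fail_soon and not force:
            return Response(ok=False, alternatives=list(self.alternatives))
        return Response(ok=True, data=self.content)


def fetch(network: Dict[str, Server], target: str,
          fallback_to_original: bool = True) -> Optional[str]:
    """Request `target`, rerouting to an alternative site when one is offered."""
    first = network[target].handle()
    if first.ok:
        return first.data
    for alt in first.alternatives:
        server = network.get(alt)
        if server is None:
            continue  # alternative site is unknown to this requester
        reply = server.handle()
        if reply.ok:
            return reply.data
    if fallback_to_original:
        # No usable alternative: access the original resource on the
        # chance that it has not yet failed.
        return network[target].handle(force=True).data
    raise LookupError(f"{target} may soon fail and no alternative site exists")
```

The second embodiment could be modeled the same way, with a router returning alternative paths in place of alternative sites; that variant is omitted here for brevity.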
The foregoing and other features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings.