1. Technical Field
This invention generally relates to data processing, and more specifically relates to apparatus and methods for predicting failures in networked computer systems and acting on the predicted failures.
2. Background Art
Electronics and computers have greatly enhanced worker productivity in our modern information age. Much attention and many resources have been directed to making electronic and computer systems more reliable. When an electronic component or system fails, it can lead to system-wide failures that can cost companies millions of dollars. Preventing system failures has therefore become a priority.
Early attempts at preventative maintenance simply replaced certain components after a specified period of time in use. While this approach is effective if the time periods are selected with care, it is also very expensive, and leads to replacing components that do not need to be replaced. Some individual components can perform without problems for much longer than the mean time of operation without failures. By replacing components that do not need to be replaced, the cost of maintenance becomes excessive.
Another approach to preventative maintenance is to monitor components and replace a component when its operating parameters indicate that it may fail soon. This approach was pioneered by IBM in the early 1990s, and led to a concept referred to as Predictive Failure Analysis (PFA). Predictive failure analysis was first applied to hard disk drives. PFA status is in a normal state when the disk drive is operating correctly, and is in a “tripped” state when the PFA in the drive indicates that a failure will occur soon. A hard drive that has PFA capability monitors its internal functions, and indicates when the functions are outside of predefined limits by “tripping” a signal that indicates that the disk drive is about to fail. For example, a PFA status may be tripped if the fly height of a head is outside of specified limits, or if the error rate in the hard disk drive exceeds a specified limit. By indicating via the PFA status on a disk drive that a failure will likely happen soon, the system administrator has enough time to copy the contents of the disk drive to a backup source, replace the drive, and write the data from the backup to the new drive. PFA is thus an important tool that allows replacing a disk drive that may fail soon without loss of data.
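The tripping behavior described above can be illustrated with a minimal sketch. It assumes just the two example parameters mentioned in the text (head fly height and error rate). The names `DriveReading`, `PfaLimits`, and `pfa_status`, and the units and numeric limits, are hypothetical and chosen only for illustration; they are not taken from any actual drive firmware or from IBM's PFA implementation.

```python
from dataclasses import dataclass


@dataclass
class DriveReading:
    """One sample of monitored drive parameters (hypothetical fields and units)."""
    fly_height_nm: float
    error_rate: float


@dataclass
class PfaLimits:
    """Predefined limits for the monitored parameters (hypothetical)."""
    min_fly_height_nm: float
    max_fly_height_nm: float
    max_error_rate: float


def pfa_status(reading: DriveReading, limits: PfaLimits) -> str:
    """Return "tripped" if any monitored parameter is outside its limits, else "normal"."""
    fly_height_ok = (
        limits.min_fly_height_nm <= reading.fly_height_nm <= limits.max_fly_height_nm
    )
    error_rate_ok = reading.error_rate <= limits.max_error_rate
    return "normal" if fly_height_ok and error_rate_ok else "tripped"


# Illustrative limits only; real limits would be set by the drive manufacturer.
limits = PfaLimits(min_fly_height_nm=8.0, max_fly_height_nm=15.0, max_error_rate=1e-6)
healthy = pfa_status(DriveReading(fly_height_nm=10.0, error_rate=1e-8), limits)
failing = pfa_status(DriveReading(fly_height_nm=6.5, error_rate=1e-8), limits)
```

In this sketch a single out-of-range parameter is enough to trip the status, which matches the description of a drive signaling as soon as any monitored function leaves its predefined limits.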
Recognizing the value of predicting failures in disk drives, some competitors of IBM have implemented disk drives that have a S.M.A.R.T. interface, which stands for Self-Monitoring, Analysis and Reporting Technology. The S.M.A.R.T. interface is a specification of a set of registers in a device that contain information relating to the device's operation. The specification provides no details regarding the specific types of measurements that should be made or the values that indicate an impending failure. For this reason, S.M.A.R.T. compatible disk drives are much less sophisticated than IBM disk drives that include Predictive Failure Analysis.
Predictive Failure Analysis has been implemented in components such as disk drives and printers. Communication of information relating to predicted failures has been limited so far to the box level of a computer system: a component inside a computer reports predictive failure information within its own box, but this information has typically not been used or communicated outside of that particular computer system. With the popularity of computer networks, it would be useful to share predictive failure analysis information between computer systems on a network. Furthermore, by detecting when certain computer systems may fail, it may be possible to re-route a network request to avoid a computer system or network path that may fail according to its predictive failure information. Without an apparatus and method for communicating predictive failure information between computer systems on a network and for dynamically rerouting a network request to avoid computer systems and network paths that may fail, the computer industry will continue to suffer from predictive failure information that is isolated within a system. As a result, failures that were predicted in individual computer systems may cause errors in inter-system communications over the network.
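The idea of rerouting a network request around systems whose predictive failure status has tripped can be sketched as follows. This is only an illustration under stated assumptions: the network is modeled as a hypothetical adjacency map `links`, the set `tripped` is assumed to hold the names of systems that have reported a tripped status, and a plain breadth-first search stands in for whatever routing mechanism an actual network would use. The function name `find_route` is likewise hypothetical.

```python
from collections import deque


def find_route(links, tripped, source, destination):
    """Breadth-first search for a path that avoids systems with a tripped status.

    links:   dict mapping each system name to the systems it connects to (assumed model)
    tripped: set of system names whose predictive failure status is tripped
    Returns a list of system names from source to destination, or None if no
    route avoids the tripped systems.
    """
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == destination:
            return path
        for neighbor in links.get(node, ()):
            # Skip systems already explored and systems predicted to fail.
            if neighbor not in visited and neighbor not in tripped:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical four-system network with two alternative paths from A to D.
links = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
normal_route = find_route(links, set(), "A", "D")
rerouted = find_route(links, {"B"}, "A", "D")
```

The sketch depends on each system's tripped status being visible to the system choosing the route, which is exactly the information that remains isolated at the box level in the approaches described above.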