1. Technical Field
This invention generally relates to data processing, and more specifically relates to apparatus and methods for predicting failures in computer systems.
2. Background Art
Electronics and computers have greatly enhanced worker productivity in our modern information age. Much attention and many resources have been directed to making electronic and computer systems more reliable. When an electronic component or system fails, it can lead to system-wide failures that can cost companies millions of dollars. In this context, preventing system failures has become essential.
Early attempts at preventative maintenance simply replaced certain components after a specified period of time in use. While this approach is effective if the time periods are selected with care, it is also very expensive, and leads to replacing components that do not need to be replaced. Some individual components can perform without problems for much longer than the mean time of operation without failures. By replacing components that do not need to be replaced, the cost of maintenance becomes excessive.
Another approach to preventative maintenance is to monitor components and replace a component when its operating parameters indicate that it may fail soon. This approach was pioneered by IBM in the early 1990s, and led to a concept referred to as Predictive Failure Analysis (PFA). Predictive failure analysis was first applied to hard disk drives. PFA status is in a normal state when the disk drive is operating correctly, and is in a “tripped” state when the PFA in the drive indicates that a failure will occur soon. A hard drive that has PFA capability monitors its internal functions, and indicates when the functions are outside of predefined limits by “tripping” a signal that indicates that the disk drive is about to fail. For example, a PFA status may be tripped if the fly height of a head is outside of specified limits, or if the error rate in the hard disk drive exceeds a specified limit. Because the PFA status on a disk drive indicates that a failure will likely happen soon, the system administrator has enough time to copy the contents of the disk drive to a backup source, replace the drive, and write the data from the backup to the new drive. PFA is thus an important tool that allows replacing a disk drive that may fail soon without loss of data.
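The tripping behavior described above can be illustrated with a minimal sketch. The names `PfaStatus`, `DriveLimits`, and `pfa_status`, the units, and the idea of checking only two parameters are assumptions made for illustration; an actual drive monitors vendor-specific parameters against vendor-specific limits.

```python
from dataclasses import dataclass
from enum import Enum


class PfaStatus(Enum):
    NORMAL = "normal"
    TRIPPED = "tripped"


@dataclass(frozen=True)
class DriveLimits:
    # Hypothetical limits chosen for illustration only.
    min_fly_height_nm: float
    max_fly_height_nm: float
    max_error_rate: float


def pfa_status(fly_height_nm: float, error_rate: float, limits: DriveLimits) -> PfaStatus:
    """Return TRIPPED if any monitored parameter is outside its limits."""
    fly_height_ok = limits.min_fly_height_nm <= fly_height_nm <= limits.max_fly_height_nm
    # The status trips only when the error rate exceeds the limit.
    error_rate_ok = error_rate <= limits.max_error_rate
    if fly_height_ok and error_rate_ok:
        return PfaStatus.NORMAL
    return PfaStatus.TRIPPED
```

In this sketch the status is recomputed from the current measurements on each call; whether a real drive latches the tripped state once it is set is not specified here.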
Recognizing the value of predicting failures in disk drives, some competitors of IBM have implemented disk drives that have a S.M.A.R.T. interface, which stands for Self-Monitoring, Analysis and Reporting Technology. The S.M.A.R.T. interface is a specification of a set of registers in a device that contain information relating to the device's operation. No details are provided regarding the specific types of measurements that should be made or the values that indicate an impending failure. For this reason, S.M.A.R.T. compatible disk drives are much less sophisticated than IBM disk drives that include Predictive Failure Analysis.
Predictive Failure Analysis has been implemented in components such as disk drives and printers. However, communication of information relating to predicted failures has so far been limited to the box level of a computer system: a component inside a computer reports predictive failure information within its own box, but this information has not been used or communicated outside of that particular computer system. With the popularity of computer networks, it would be useful to share predictive failure analysis information between computer systems on a network. Without an apparatus and method for communicating predictive failure information between computer systems on a computer network, the computer industry will continue to suffer from predictive failure information that is isolated within a system, with the result that failures that were predicted in individual systems may cause errors in inter-system communications over the network.
According to the preferred embodiments, an apparatus and method shares predictive failure information between computer systems in a computer network. The shared predictive failure information allows a requester of a network resource to determine whether the resource will be available to perform the request based on its predictive failure information. According to a first embodiment, predictive failure information is written by each computer system on the network to a common storage that is accessible by one or more other computer systems on the network. When a computer system on the network needs a resource on another system, the requesting computer system can check the predictive failure status of the system that contains the needed resource by reading the predictive failure information in the common storage. If the predictive failure information indicates that the resource may perform the requested function, the requesting computer system issues the request to the resource. In a second embodiment, one or more network protocols for communicating between computer systems on the network are modified so that messages given in response to resource requests include the predictive failure status of the requested system. Thus, if a requester needs data from another computer system, a message returned from that system in response to the request preferably includes predictive failure status or information indicating whether or not the request can be granted. If the predictive failure status or information indicates that the request can be granted, the requester performs the operation on the requested computer system. In this manner, predictive failure information can be used in granting access to resources between computer systems on a network, which allows accesses to be prevented if the predictive failure information indicates that the resource is likely to fail before completing the request.
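The two embodiments summarized above can be sketched as follows. This is an illustrative sketch only: the names `CommonStorage`, `Response`, `request_via_common_storage`, `handle_request`, and `request_via_protocol` are hypothetical, predictive failure information is reduced to a single boolean per system, and treating a system with no recorded status as unavailable is an assumption rather than something the description requires.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


class CommonStorage:
    """First embodiment: storage readable by other systems on the network."""

    def __init__(self) -> None:
        self._failure_predicted: Dict[str, bool] = {}

    def write(self, system_id: str, failure_predicted: bool) -> None:
        # Each system writes its own predictive failure status.
        self._failure_predicted[system_id] = failure_predicted

    def read(self, system_id: str) -> Optional[bool]:
        # None means no status has been recorded for this system.
        return self._failure_predicted.get(system_id)


def request_via_common_storage(
    storage: CommonStorage, system_id: str, operation: Callable[[], str]
) -> Optional[str]:
    """Issue the request only if the stored status predicts no failure."""
    failure_predicted = storage.read(system_id)
    if failure_predicted is None or failure_predicted:
        # Assumption: an unknown status is treated like a predicted failure.
        return None
    return operation()


@dataclass(frozen=True)
class Response:
    """Second embodiment: a protocol response carrying the status."""

    granted: bool
    failure_predicted: bool


def handle_request(failure_predicted: bool) -> Response:
    # The requested system reports its status in the response message.
    return Response(granted=not failure_predicted, failure_predicted=failure_predicted)


def request_via_protocol(
    remote_failure_predicted: bool, operation: Callable[[], str]
) -> Optional[str]:
    """Perform the operation only if the response says it can be granted."""
    response = handle_request(remote_failure_predicted)
    return operation() if response.granted else None
```

In both functions the requester receives `None` when access is prevented, which stands in for whatever fallback a real requester would take, such as selecting a different resource.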
The foregoing and other features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings.