The present invention relates to digital data storage devices, and in particular, to methods and apparatus for self-monitoring the condition of a digital data storage device.
The latter half of the twentieth century has been witness to a phenomenon known as the information revolution. While the information revolution is a historical development broader in scope than any one event or machine, no single device has come to represent the information revolution more than the digital electronic computer. The development of computer systems has surely been a revolution. Each year, computer systems grow faster, store more data, and provide more applications to their users.
The extensive data storage needs of modern computer systems require large capacity mass data storage devices. While various data storage technologies are available, the rotating magnetic rigid disk drive has become by far the most ubiquitous.
As computer users have come to rely more and more on their machines, they have come to expect an ever higher degree of reliability from the computers, including each component thereof. In the realm of reliability, the data storage device occupies a special place. For, in most cases, more important than the continued operation and availability of the computer itself is the integrity of the data. In today's marketplace, erratic or malfunctioning hardware components can often be replaced cheaply and easily. But data may be far more valuable. It is often the result of countless hours of human effort, and if lost, would require enormous resources to replace.
A disk drive data storage device is an extremely complex piece of machinery, containing precision mechanical parts, ultra-smooth disk surfaces, high-density magnetically encoded data, and sophisticated electronics for encoding/decoding data and controlling drive operation. Each disk drive is therefore a miniature world unto itself, containing multiple systems and subsystems, each one of which is needed for proper drive operation, and the failure of any of which may cause the entire drive to malfunction. At the same time, the demands of the marketplace for increasing data capacity and faster data access require disk drive designers to push systems to their limits. Although enormous engineering resources have been devoted to the design of disk drives, and improvements over the years have been impressive, given the complexity of the designs themselves and the demands to which the drives are put, it is not surprising that disk drives can, and do, fail.
In order to avoid catastrophic loss of data stored on disk drive storage devices, users have resorted to various practices and devices. It is now well known to maintain data on multiple disk drives in a redundant form, using any of several types of disk drive collections commonly referred to as "RAID" (Redundant Arrays of Independent Disks). RAIDs have many varying features and characteristics, but in general, a RAID has the capability to reconstruct data stored on any single disk drive, in the event of a failure of that disk drive, from the data stored on other disk drives in the RAID. It is also well known to frequently back up data to tapes, diskettes, or other storage media, so that in the event of a disk drive failure, only the data added since the last backup need be recovered. However, both RAIDs and frequent backups have drawbacks in terms of consumption of hardware resources, impact on system performance, human intervention required for backup, etc.
In recent years, disk drive manufacturers have attempted to reduce the scope of this problem by including self-monitoring capability in disk drives, whereby a drive itself can predict that failure may be imminent. A user, being warned of imminent failure, can off-load the data to another storage device, and replace the disk drive about to fail with a new one. Such capability has the potential to reduce or eliminate the need for costly and time-consuming back-ups or RAID systems. Furthermore, even where RAIDs, periodic back-ups, or other techniques are used, the capability to predict impending failure of a disk drive improves the robustness of the system, and makes scheduled maintenance easier and less costly.
Although the concept of self-monitoring capability has great potential, conventional self-monitoring systems are very limited. In general, these systems are encoded in the programming code of a disk drive controller. The disk drive controller program works by examining one or more operating parameters of the disk drive, and comparing these to some threshold(s) to determine whether the drive is nearing end of life. The problem with such an approach is that it requires the disk drive designer to have nearly perfect knowledge in advance of the common modes of disk drive failure, and the parameter thresholds that signal impending failure. It is possible to make certain broad generalizations about parameters that may signal problems with a disk drive. But when a new disk drive design is introduced, it is nearly impossible to say in advance what factors will take precedence, what thresholds will have greater significance, and how certain factors may interact with others. Typically, this information is only acquired after actual experience with a new disk drive design in the field, i.e., in actual use by customers. Furthermore, even when data is acquired from customer experience, it is not always obvious why drives are failing and what significance to accord various measured parameters.
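The conventional approach described above can be pictured with a minimal sketch. The parameter names and threshold values below are hypothetical, chosen only for illustration; they are not taken from any actual drive. The point of the sketch is that the limits are fixed in the controller program, so revising them means revising the code.

```python
# Hypothetical parameters and limits, fixed at design time (illustrative only).
FIXED_THRESHOLDS = {
    "seek_error_rate": 0.05,
    "spin_up_time_ms": 9000.0,
    "reallocated_sectors": 50.0,
}


def threshold_warnings(measurements, thresholds=FIXED_THRESHOLDS):
    """Return the sorted names of parameters whose measured value exceeds its limit."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if measurements.get(name, 0.0) > limit
    )


# Assumed sample measurements: only the spin-up time is over its limit.
sample = {
    "seek_error_rate": 0.01,
    "spin_up_time_ms": 9500.0,
    "reallocated_sectors": 12.0,
}
print(threshold_warnings(sample))
```

Each parameter is judged in isolation against its own limit, so interactions between parameters, which the passage above identifies as hard to foresee, are not captured by a scheme of this kind.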
This lack of foreknowledge places disk drive designers in a dilemma. If they measure too many parameters and establish too many thresholds or thresholds which are too low, many perfectly good disk drives may predict impending failure unnecessarily. On the other hand, if a parameter is ignored, it may turn out to be very significant in later experience.
A need exists for more accurate and improved self-monitoring capability in disk drives.
In accordance with the present invention, a digital data storage device such as a rotating magnetic disk drive contains an on-board condition monitoring system. The condition monitoring system comprises a neural network coupled to multiple inputs, the inputs being derived from measured parameters of disk drive operation. The neural network computes one or more quantities representing disk drive condition as a function of the various inputs.
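As a rough illustration of how a neural network might compute a condition quantity from several inputs, the following sketch evaluates a small feed-forward network. The single hidden layer, the sigmoid activation, the bias terms, and the function names are all assumptions made for this example; the text above does not specify the network topology, and this sketch returns only one quantity where the description allows one or more.

```python
import math


def sigmoid(x):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def drive_condition(inputs, hidden_weights, output_weights):
    """Compute one condition score from normalized measured parameters.

    inputs:         list of input values derived from drive measurements
    hidden_weights: one row per hidden node; each row holds one weight per
                    input followed by a bias term (assumed layout)
    output_weights: one weight per hidden node followed by a bias term
    """
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
        for row in hidden_weights
    ]
    total = sum(w * h for w, h in zip(output_weights[:-1], hidden))
    return sigmoid(total + output_weights[-1])


# Two hypothetical inputs, two hidden nodes, one output.
score = drive_condition(
    [0.2, 0.9],
    [[1.5, 0.0, -0.5], [0.5, 2.0, -1.0]],
    [1.0, 2.0, -1.5],
)
print(score)
```

Because each hidden node combines several inputs before the output is formed, a network of this shape can reflect interactions between parameters rather than judging each one separately.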
In the preferred embodiment, a configurable set of weights determines the significance accorded by the neural network to each respective input, alone or in combination with other inputs. The set of weights is stored in a configuration table, which can be overwritten by the host computer system. A disk drive is sold and installed with a default set of weights, based on the then existing knowledge of the disk drive designers. As the designers acquire a history of experience with actual field failures and other problems, the field data can be used to construct a new, more accurate, set of weights. This new set of weights can then be propagated to existing disk drives in the field by simply writing the weights to the configuration tables of the disk drives, without altering disk drive control code or other disk drive features. It is also possible to propagate hidden node functions in the same manner.
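The separation between fixed control code and an overwritable configuration table can be sketched as follows. The class name, method names, and the use of a single weighted sum in place of a full network are hypothetical simplifications for illustration only; the actual table layout and host interface are not described in the text above.

```python
class ConditionMonitor:
    """Toy monitor whose behavior is set entirely by its configuration table."""

    def __init__(self, default_weights):
        # Default weights shipped with the drive (assumed to be a flat list).
        self.config_table = list(default_weights)

    def write_config_table(self, new_weights):
        """Host-side update: replace the weights, leaving the code untouched."""
        if len(new_weights) != len(self.config_table):
            raise ValueError("weight count must match the number of inputs")
        self.config_table = list(new_weights)

    def score(self, inputs):
        """Weighted sum of inputs, standing in for the network evaluation."""
        return sum(w * x for w, x in zip(self.config_table, inputs))


monitor = ConditionMonitor([1.0, 0.0])      # default weights at release
before = monitor.score([0.5, 0.25])
monitor.write_config_table([1.0, 2.0])      # revised weights from field data
after = monitor.score([0.5, 0.25])
print(before, after)
```

The same measurements yield a different score after the table is rewritten, although the `score` method itself has not changed.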
Preferably, the disk drive designers include any measurable parameter which might conceivably be useful in predicting failure as an input to the neural network, even if the designers believe at the time of initial design that the parameter has no significance. In this case, the designers can assign the parameter a weight of zero during initial release. If subsequent experience then shows that the parameter has some significance not predicted by the designers, the self-monitoring neural network can be corrected simply by changing weighting factors, without any alteration to the control programming code.
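The effect of a zero weight can be shown with a short sketch. The input names and values are hypothetical, and the weighted sum stands in for the input stage of a single network node: an input with weight zero contributes nothing until the weight is revised.

```python
def weighted_input_sum(inputs, weights):
    """Sum of each named input multiplied by its weight."""
    return sum(weights[name] * value for name, value in inputs.items())


# Hypothetical normalized inputs; temperature is thought insignificant at release.
initial_weights = {"seek_error_rate": 1.0, "ambient_temperature": 0.0}
revised_weights = {"seek_error_rate": 1.0, "ambient_temperature": 2.0}

cool = {"seek_error_rate": 0.5, "ambient_temperature": 0.25}
warm = {"seek_error_rate": 0.5, "ambient_temperature": 0.75}

# With a zero weight, a change in temperature leaves the sum unchanged.
print(weighted_input_sum(cool, initial_weights), weighted_input_sum(warm, initial_weights))
# After the weight is revised, the same change in temperature is reflected.
print(weighted_input_sum(cool, revised_weights), weighted_input_sum(warm, revised_weights))
```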
The self-monitoring method and apparatus described herein provides more accurate prediction of failure and evaluation of disk drive condition, reducing the occurrences of false warnings of impending doom, and increasing the probability of detecting actual impending failures. It further provides a practical means for correcting conditions in the field without replacing drives or loading new control code into a drive. Furthermore, it provides an improved method of transcribing experience data to a predictive algorithm, because it is not necessary for designers to understand why certain parameters interact in the way they do, but only necessary to compute neural network weights based on experience data.