This invention relates to enterprise-wide data storage systems, and in particular, to methods and systems for selectively correcting errors in data stored in such a system.
When we store data on a disk, we often take it for granted that we will one day be able to recover that identical data from the disk. In reality, there are many errors made in storing data on a disk. Fortunately, modern data storage systems provide error-management utilities for largely eliminating the undesirable effects of these data errors. These error-management utilities include both scanning utilities that periodically scan the disk for data errors, and error-correction utilities that repair errors identified by the scanning utilities.
The error-management utilities operate unobtrusively in the background. Periodically, the scanning utility scans the entire disk for data errors. When the scanning utility identifies a data error, it writes information descriptive of that data error to an output device, such as a printer or a monitor. This information takes the form of an unstructured stream of text.
It is not the case that every data error identified by the scanning utility will be repaired by an error-correction utility. In some cases, a data error is so severe that it cannot be repaired at all. In other cases, repair of a particular data error can result in other, more serious errors. Thus, an error-correction utility generally does not blindly repair all disk errors identified by a scanning utility. Instead, there is typically a filtering step in which the error-correction utility is made to repair only selected data errors. This filtering step is performed by a human operator who monitors the data errors as they are listed at the output device and compiles a list of those data errors that are to be repaired.
Once the scan is complete, the human operator executes the error-correction utility once for each data error on the list of data errors to be repaired. In doing so, the operator provides the error-correction utility with an argument list that causes the error-correction utility to repair that particular error.
The foregoing method is practicable when the number of errors is relatively small. However, as data storage systems have become progressively larger, the number of data errors encountered during a disk scan has likewise become proportionately larger. As a result, it has become increasingly difficult for a human operator to digest a list of data errors and to prepare instructions for an error-correction utility within the time constraints required for reliable operation of the data storage system.
As data storage systems continue to grow in storage capacity, it is foreseeable that a human operator will no longer be able to complete execution of the error-correction utility for a particular scan before it is time to begin the next scan.
The invention provides a method of scanning a mass-storage device in a manner that makes information obtained during that scan directly available to an error-correction utility. This enables the error-correction utility to directly determine, with a minimum of human intervention, whether to repair particular data errors.
In a system incorporating the invention, a system scan buffer is allocated in a global memory in data communication with a mass-storage device. The mass-storage device is then scanned by a scanning utility. As the scanning utility performs the scan, it detects data errors in the mass-storage device. When it does so, it writes information descriptive of those data errors to the scan buffer. This information is thus available for later access by an error-correction utility or by a human operator.
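By way of illustration only, the scanning step described above can be sketched as follows. The sketch assumes that a data error is detected by comparing a CRC-32 checksum of each block against a stored value; the detection technique, the names ScanBuffer and scan_device, and the dictionary layout of each entry are hypothetical and are not taken from the description.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class ScanBuffer:
    """Stands in for the scan buffer allocated in global memory (hypothetical)."""
    entries: list = field(default_factory=list)

    def record(self, description: dict) -> None:
        # Keep the description so that a later reader can retrieve it.
        self.entries.append(description)


def scan_device(blocks, expected_checksums, buffer):
    """Check every block and write each detected error to the scan buffer.

    Checksum comparison is an assumed detection method, used only so that
    the sketch has something concrete to detect.
    """
    for address, data in enumerate(blocks):
        if zlib.crc32(data) != expected_checksums[address]:
            buffer.record({"address": address,
                           "error_code": "CHECKSUM_MISMATCH"})


blocks = [b"alpha", b"beta", b"gamma"]
checksums = [zlib.crc32(block) for block in blocks]
blocks[1] = b"bxta"  # simulate a corrupted block

buffer = ScanBuffer()
scan_device(blocks, checksums, buffer)
```

After the scan, buffer.entries holds one structured entry per detected error, in contrast to an unstructured stream of text, and can be read by either a program or an operator.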
The information written to the scan buffer can include an error code indicative of a type of data error. This is useful because it enables an error-correction utility to automatically determine whether or not the data error is of the type that it ought to repair.
The information written to the scan buffer can also include a status flag indicative of whether the data error has been repaired or a repair flag indicative of whether the data error is to be repaired. The status flag enables the data error to remain in the scan buffer even though it may have already been repaired. The repair flag provides a mechanism for allowing a human operator to override decisions made by an error-correction utility.
Because certain error-correction utilities are only capable of repairing data errors identified by particular scanning utilities, each entry in the scan buffer can also include a signature identifying the scanning utility that detected the data error.
An error-correction utility functions more effectively when it knows where the data error occurred. To provide this information, each entry in the scan buffer can also include an address code indicative of a logical location of the data error in the mass storage medium.
In some data storage systems, a plurality of mass-storage devices is in communication with the global memory. For such systems, information from the various mass-storage devices can be interleaved in the scan buffer. In this case, the information descriptive of a data error includes information identifying the mass-storage device in which the data error occurred.
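A minimal sketch of such an interleaved scan buffer appears below. The tuple layout, the device identifiers, the error codes, and the function errors_by_device are all hypothetical; the sketch shows only that a device identifier carried in each entry is sufficient to separate the interleaved entries again.

```python
from collections import defaultdict

# Each entry is (device_id, address, error_code). Entries from two
# devices are interleaved in the order in which they were detected.
scan_buffer = [
    ("disk0", 0x10, "E1"),
    ("disk1", 0x04, "E2"),
    ("disk0", 0x2A, "E1"),
]


def errors_by_device(buffer):
    """Group interleaved entries by the device in which each error occurred."""
    grouped = defaultdict(list)
    for device_id, address, error_code in buffer:
        grouped[device_id].append((address, error_code))
    return dict(grouped)


grouped = errors_by_device(scan_buffer)
```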
The invention also encompasses a method of repairing a data error in a mass storage system having a global memory in communication with at least one mass-storage device. In this method, an error-correction utility retrieves information descriptive of the data error from a scan buffer in global memory. On the basis of this information, the error-correction utility determines whether the data error is to be repaired. If the data error is to be repaired, the error-correction utility attends to the repair. Otherwise, the error-correction utility proceeds to obtain information about other data errors, if any, in the mass-storage device.
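The repair method just described can be sketched, under stated assumptions, as a loop over the entries in the scan buffer. The set REPAIRABLE, the dictionary layout of each entry, and the repair callback are hypothetical stand-ins; the description does not specify how the determination or the repair is actually carried out.

```python
# Hypothetical rule: only these error codes are considered repairable.
REPAIRABLE = {"CHECKSUM_MISMATCH"}


def should_repair(entry):
    """Decide, from the entry alone, whether the data error is to be repaired."""
    return entry["error_code"] in REPAIRABLE and not entry["repaired"]


def run_error_correction(scan_buffer, repair):
    """Visit every entry; repair those selected and move on past the rest."""
    repaired_addresses = []
    for entry in scan_buffer:
        if should_repair(entry):
            repair(entry)             # device-specific repair, supplied by caller
            entry["repaired"] = True  # the entry remains in the buffer
            repaired_addresses.append(entry["address"])
    return repaired_addresses


scan_buffer = [
    {"address": 3, "error_code": "CHECKSUM_MISMATCH", "repaired": False},
    {"address": 9, "error_code": "MEDIA_FAILURE", "repaired": False},
]
# A no-op repair callback keeps the sketch self-contained.
repaired_addresses = run_error_correction(scan_buffer, repair=lambda entry: None)
```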
The error-correction utility can implement a programmed rule for deciding, on the basis of the information descriptive of the data error, whether the data error is to be repaired. Such information is preferably embodied in the form of a flag. Alternatively, the information descriptive of the data error can be displayed to a system operator. The system operator then makes a manual determination of whether or not that data error is to be repaired. If it is, the system operator alters the entry corresponding to that data error so that the error-correction utility will recognize that that data error is to be repaired.
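One possible way of combining a programmed rule with an operator's manual determination is sketched below. It assumes a three-valued repair flag in which None means that the operator has expressed no decision; this encoding, the rule itself, and the names programmed_rule and is_to_be_repaired are hypothetical.

```python
from typing import Optional


def programmed_rule(entry: dict) -> bool:
    """Hypothetical rule: repair only error codes known to be safe."""
    return entry["error_code"] in {"E1"}


def is_to_be_repaired(entry: dict) -> bool:
    """An operator-set repair flag takes precedence over the programmed rule."""
    flag: Optional[bool] = entry.get("repair_flag")
    return programmed_rule(entry) if flag is None else flag


entry = {"address": 0x2A, "error_code": "E7", "repair_flag": None}
before = is_to_be_repaired(entry)  # the rule declines to repair code E7
entry["repair_flag"] = True        # the operator alters the entry
after = is_to_be_repaired(entry)   # the utility now recognizes the request
```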
The invention also includes within its scope a data storage system having a mass-storage device and a global memory in data communication with the mass-storage device. The global memory contains a scan buffer containing information descriptive of data errors in the mass-storage device. Typically, the information is organized in the scan buffer into a sequence of error entries, each one of which corresponds to a data error. The individual error entries are divided into fields that contain information used by an error-correction utility for deciding whether or not the particular data error associated with that error entry is to be repaired.
The error entry is structured to contain one or more fields containing particular types of information. These fields can include an error-class field containing information indicative of a type of data error, a status-flag field containing information indicative of whether the data error has been repaired, a repair-flag field containing information indicative of whether the data error is to be repaired, a signature field containing information identifying the scanning utility that detected the data error, a time-stamp field containing information indicative of when the data error was recorded in the buffer, and an address field containing a logical location of the data error in the mass storage medium.
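The fields enumerated above might be laid out as in the following sketch. The field types and the sample values are assumptions chosen for illustration; the description names the fields but does not specify their widths or encodings.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ErrorEntry:
    """One error entry in the scan buffer (field types are assumed)."""
    error_class: int       # type of data error
    status_flag: bool      # True once the data error has been repaired
    repair_flag: bool      # True if the data error is to be repaired
    signature: str         # identifies the scanning utility that detected it
    time_stamp: datetime   # when the data error was recorded in the buffer
    address: int           # logical location of the data error


entry = ErrorEntry(
    error_class=7,
    status_flag=False,
    repair_flag=True,
    signature="scanner-A",  # hypothetical utility name
    time_stamp=datetime(2000, 1, 1, tzinfo=timezone.utc),
    address=0x2A,
)
```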
These and other features of the invention will be apparent from the following detailed description and the accompanying figures, in which: