1. Technical Field
The present invention relates to identifying defective components in a computing system. More particularly, the invention concerns storing information concerning the paths of data objects in a computing system to facilitate identifying defective components.
2. Description of Related Art
Important data is frequently stored in computing systems. If a data object becomes corrupted, it is desirable to be able to quickly identify the cause of the data corruption, so the problem can be eliminated. A problem may be eliminated, for example, by replacing a defective component. Quickly identifying and replacing defective components can limit the amount of corrupted data and associated costs. The task of identifying the cause of data corruption is particularly challenging in computing systems that utilize a large number of storage devices and that have a large number of paths over which data objects may travel, such as when a storage area network (SAN) is utilized.
Utilizing a storage area network increases the complexity of a computing system. When a data object is stored in a storage area network, the data path is rarely a simple point-to-point connection; instead, it may involve multiple interfaces and devices. Consequently, a data object may travel over any of a number of paths between a source and a destination when being stored. In this case, when an error is detected in stored data, it is often difficult to determine the cause of the error. Computing environments with directly attached storage suffer from similar problems.
One traditional method of error tracking involves examining information that devices in a computing system provide for diagnosing problems. For example, device logs may be examined to try to identify one or more devices that have experienced an error, and to try to identify the type of error that has occurred. Server error reports may also be examined. One problem with this technique is that an error investigation may take place after the error has been flushed from the relevant error logs, and consequently the error cannot be determined. Also, even if a device that has experienced an error is identified, it is often difficult to determine whether the path of a particular data object included the device that experienced the error.
It is difficult and time-consuming to examine every device in a storage area network or large network of locally attached devices. The number of devices connected to a storage area network may be large, and the number of possible connections between devices grows rapidly as the number of storage devices increases: the number of possible pairwise links grows quadratically, and the number of possible multi-hop paths grows faster still.
Consequently, with large storage area networks, examining error logs and determining whether, and when, a device handled a data object is a daunting, if not impossible, task. The difficulty is compounded in heterogeneous computing environments. Frequently, storage management software is erroneously blamed for data errors.
Cyclic Redundancy Checking (CRC) is another known technique for error detection. However, the usefulness of CRC for error detection is limited because many computing environments cannot tolerate the performance cost of computing a CRC at each transfer. Additionally, a CRC mismatch only narrows an investigation to the set of devices in the data path between two checkpoints; it does not identify which of those devices is defective.
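By way of illustration only, the following minimal Python sketch shows how a CRC comparison can reveal that a data object was corrupted in transit without revealing which component caused the corruption. It uses the standard `zlib.crc32` function; the `transfer` function, its `flip_bit` parameter, and the sample data are hypothetical stand-ins for a hop through a defective component and are not taken from any described system.

```python
import zlib
from typing import Optional


def checksum(data: bytes) -> int:
    """Return the CRC-32 of data as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def transfer(data: bytes, flip_bit: Optional[int] = None) -> bytes:
    """Hypothetical hop through a component.

    If flip_bit is given, that bit is inverted to simulate a defective
    component that silently corrupts the data object.
    """
    if flip_bit is None:
        return data
    buf = bytearray(data)
    buf[flip_bit // 8] ^= 1 << (flip_bit % 8)
    return bytes(buf)


original = b"important data object"
expected = checksum(original)  # computed at the source

clean_copy = transfer(original)                  # healthy path
corrupt_copy = transfer(original, flip_bit=5)    # path with a faulty hop

clean_ok = checksum(clean_copy) == expected      # CRC matches
corrupt_ok = checksum(corrupt_copy) == expected  # CRC mismatch exposes the flip

print("clean copy verified:", clean_ok)
print("corrupt copy verified:", corrupt_ok)
```

In this sketch the mismatch shows only that corruption occurred somewhere between the point where the CRC was computed and the point where it was verified. Localizing the fault further would require a CRC computation at every intermediate transfer, which is the performance cost noted above.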
In another known technique for error tracking, devices in a storage area network are relied upon to report data transfer errors to a storage manager server, so the server can notify a client to retry an operation if necessary. However, the success of this technique is dependent on errors being reliably reported to the server, which often does not occur. For example, a defective device itself may not detect an error, and therefore will not make an entry in its error log, and will not report the error to the server or a calling application. Examples of undetected errors that devices may fail to report include flipped bits and the failure to store a file.
In addition to the factors discussed above, traditional error tracking methods are often inadequate when data storage errors are intermittent and are associated with individual data objects, which is frequently the case. In summary, known error tracking techniques are generally inadequate for quickly and accurately identifying malfunctioning components in a computing system.