1. Field of the Invention
The present invention relates to a technology for identifying a failure module in a disk controller including a plurality of modules.
2. Description of the Related Art
Conventionally, a storage system (for example, a storage device) including a plurality of disk devices, which can increase memory space and enhance input/output performance, has been suggested. When a failure occurs in a disk device, only the failed disk device needs to be replaced so that the storage device can continue operating.
To achieve a fault-tolerant storage device, components other than the disk device (that is, a module such as a controller) also need to be provided in redundancy. When a failure occurs in a module, only the failed module needs to be replaced so that the storage device can continue operating.
For example, Japanese Patent Application Laid-Open No. H11-306644 discloses a technology for detaching a failed disk device, and diagnosing the failure of the detached disk device. Moreover, Japanese Patent Application Laid-Open No. S60-10328 discloses a technology for determining, when a failure occurs, whether the failure occurred in the disk device itself or a channel device connected to the disk device.
The conventional technology can detect a failure by using a module equipped with a failure detecting mechanism. However, the conventional technology cannot identify the module where the failure occurred, because modules without the failure detecting mechanism exist on the same data path.
FIG. 11 is a conceptual diagram for explaining a conventional failure detecting method. It is assumed that the method is performed in a disk array device. A server writes/reads data in/from a disk device. The integrity of the data is guaranteed so that corrupted data is not written or returned.
The disk array device includes a channel adapter (CA) that controls a connection with a server, a device adapter (DA) that controls a connection with a disk device, a controller module (CM) that controls the entire disk array device and typically includes a memory functioning as a disk cache, and a router (RT) that interconnects the CA, the DA, and the CM.
Each of the modules is provided redundantly. Thus, when a failure occurs in a module, the failed module can be replaced while the disk array device continues operating.
Data passing through the modules is checked to guarantee its integrity. For example, the CA and the DA perform a cyclic redundancy check (CRC) on the data. The CRC is performed by appending a CRC code of 16 bits to 32 bits to the data, and detecting a bit error in the data by using the CRC code. With the CRC, an error can be detected even when a plurality of bits changes. Thus, the CRC is often used for checking data in a disk controller.
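The appending and checking of a CRC code described above can be sketched as follows. This is a minimal illustration, not the method of any particular disk controller: it assumes the 32-bit CRC provided by Python's standard zlib module, and the helper names append_crc, check_crc, and flip_bits are hypothetical.

```python
import zlib

def append_crc(data: bytes) -> bytes:
    """Return the data followed by its 32-bit CRC code (big-endian)."""
    return data + zlib.crc32(data).to_bytes(4, "big")

def check_crc(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the stored code."""
    payload, stored = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored

def flip_bits(frame: bytes, positions) -> bytes:
    """Simulate corruption by inverting the bits at the given bit positions."""
    buf = bytearray(frame)
    for p in positions:
        buf[p // 8] ^= 1 << (p % 8)
    return bytes(buf)

# Hypothetical payload standing in for data written by the server.
frame = append_crc(b"block written by the server")

# Two bits of the payload change while the data is in transit.
corrupted = flip_bits(frame, [3, 17])

intact_ok = check_crc(frame)         # the unmodified frame passes the check
corrupted_ok = check_crc(corrupted)  # the two-bit error is detected
```

In this sketch the check fails for the corrupted frame even though more than one bit changed, which is the property that makes the CRC attractive for the CA and the DA.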
On the other hand, the CM and the RT typically perform only a parity check. A parity check reliably detects a 1-bit error, but it misses an error in which an even number of bits change. The disk array device includes modules that only perform a parity check and modules that do not (cannot) check the data at all.
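The limitation of the parity check can be illustrated with a short sketch. It assumes a single even-parity bit computed over the whole data word; the function names parity_bit and flip_bits and the sample bytes are hypothetical and chosen only for illustration.

```python
def parity_bit(data: bytes) -> int:
    """Even-parity bit: 1 when the number of set bits in the data is odd."""
    return sum(bin(b).count("1") for b in data) % 2

def flip_bits(data: bytes, positions) -> bytes:
    """Simulate corruption by inverting the bits at the given bit positions."""
    buf = bytearray(data)
    for p in positions:
        buf[p // 8] ^= 1 << (p % 8)
    return bytes(buf)

original = b"\x5a\x3c"           # hypothetical data word
stored_parity = parity_bit(original)

one_bit_error = flip_bits(original, [0])
two_bit_error = flip_bits(original, [0, 9])

# A mismatch between the recomputed and stored parity signals an error.
detected_one = parity_bit(one_bit_error) != stored_parity
detected_two = parity_bit(two_bit_error) != stored_parity
```

Here detected_one is True and detected_two is False: the second flipped bit restores the original parity, so the corrupted word passes a module that relies on parity alone.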
When a module performing the CRC (CA or DA) detects a data error, the error might have originated in another module on the same data path (CM or RT). However, because the CM and the RT do not perform the CRC, the location of the error cannot be identified.
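The reason the location cannot be identified can be shown with a small model. This sketch assumes a write path of CA, RT, CM, RT, DA, which is an illustrative ordering and is not stated in the description above; the names WRITE_PATH, CRC_CAPABLE, and suspect_modules are likewise hypothetical.

```python
# Assumed order in which write data traverses the modules (illustrative only).
WRITE_PATH = ["CA", "RT", "CM", "RT", "DA"]

# Modules that perform the CRC, per the description above.
CRC_CAPABLE = {"CA", "DA"}

def suspect_modules(path, detector_index):
    """List the modules that may have corrupted the data when the module at
    detector_index reports a CRC error.

    Every module from the last upstream CRC check up to and including the
    detecting module remains a suspect, because no CRC was verified in between.
    """
    last_checked = max(
        (i for i in range(detector_index) if path[i] in CRC_CAPABLE),
        default=0,
    )
    return path[last_checked:detector_index + 1]

# The DA (index 4) detects a CRC error on data that passed the CA's check.
suspects = suspect_modules(WRITE_PATH, 4)
```

Under these assumptions the list of suspects spans the whole path, including the RT and the CM, so a CRC error reported by the DA does not by itself indicate which module should be replaced.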
Thus, the conventional method cannot identify a module where a failure occurred, and therefore cannot determine which module is to be replaced. As a result, the disk array device cannot be recovered quickly and efficiently after a failure. Specifically, when a failure occurs, maintenance staff have to refer to failure logs to identify the module where the failure occurred and then replace the failed module, which can lead to a system shutdown. However, the accelerating progress of data processing systems calls for fault-tolerant systems in which a module with an error is identified and replaced quickly and efficiently to avoid a system shutdown.