A storage apparatus that stores data transmitted from a host has a controller module (hereinafter called a “CM”) that stores the received data in a hard disk. The CM has a volatile memory. For faster processing, received data is first stored in the memory of the CM and is then written from the memory to the hard disk.
In recent years, to meet the demand for improved reliability, a plurality of CMs have been provided in a storage apparatus, and the CMs are mutually connected so that the data in their memories is duplexed and processing may continue even when one CM fails.
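The write path and duplexing described above may be illustrated by the following minimal sketch. It is not taken from any actual apparatus: the class `ControllerModule`, its `write` and `flush` methods, and the use of a dictionary as the disk are all hypothetical names and simplifications introduced only for illustration.

```python
class ControllerModule:
    """Hypothetical model of a CM with a volatile cache and one peer CM."""

    def __init__(self, name):
        self.name = name
        self.cache = {}    # volatile memory: block address -> data
        self.peer = None   # the other CM holding the duplexed copy

    def write(self, address, data):
        # Store the received data in the local memory first ...
        self.cache[address] = data
        # ... and duplex it into the memory of the peer CM.
        if self.peer is not None:
            self.peer.cache[address] = data

    def flush(self, disk):
        # Write cached data to the disk, then drop both cached copies.
        for address, data in list(self.cache.items()):
            disk[address] = data
            del self.cache[address]
            if self.peer is not None:
                self.peer.cache.pop(address, None)


cm0 = ControllerModule("CM#0")
cm1 = ControllerModule("CM#1")
cm0.peer, cm1.peer = cm1, cm0

disk = {}                    # assumption: the hard disk modeled as a dict
cm0.write(7, b"payload")     # host data received by CM #0

# Suppose CM #0 fails before flushing; CM #1 still holds the duplexed copy
# and can complete the write to the disk on its own.
cm1.flush(disk)
```

In this sketch the data reaches the disk even though CM #0 never flushes, which is the purpose of duplexing the memories.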
In such duplexing, it has been assumed that an error occurring in one CM does not influence the inside of the other duplexed CM. Accordingly, a failing CM has been identified on the basis of device error information in that CM, and isolation processing, including shutting down the CM, has been performed.
However, it has emerged that some types of errors may influence the inside of the other CM. In this case, not only the failing CM but also a CM that is not failing is determined to have a failure, and both CMs are shut down, which is a problem. The types of errors that cause this problem include, for example, a failure in a PCIe (Peripheral Component Interconnect Express) bridge.
The operation of a storage apparatus when a PCIe bridge fails will be described more specifically below. Here, the storage apparatus has two CMs, a CM #0 and a CM #1, for example.
It is assumed that an error occurs at a PCIe bridge in the CM #1. The error may be caused by a data parity error, for example. The CM #1 then determines from the error information that the CM #1 has a failure and shuts itself down.
On the other hand, an error is also detected at a PCIe bridge in the CM #0. The CM #0 determines from the error information that the CM #1 has an error.
Here, the error data that has passed through the PCIe bridge of the CM #0 reaches a memory controller and causes a parity error there. The CM #0 then detects the parity error in the memory controller. In this case, the CM #0 determines that its memory controller is failing and shuts itself down.
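The sequence above may be summarized by the following simplified sketch, given only for illustration. The class `CM`, the function `isolate_by_local_errors`, and the error labels are hypothetical; the sketch assumes that each CM shuts itself down whenever a device error that it attributes to itself is recorded locally, and it omits the bridge error that the CM #0 correctly attributes to the CM #1.

```python
from dataclasses import dataclass, field


@dataclass
class CM:
    """Hypothetical model of a CM and the device errors it blames on itself."""
    name: str
    local_errors: list = field(default_factory=list)
    running: bool = True


def isolate_by_local_errors(cm: CM) -> None:
    # Conventional rule (assumed here): any locally attributed device
    # error means that this CM is failing and is shut down.
    if cm.local_errors:
        cm.running = False


cm0 = CM("CM#0")
cm1 = CM("CM#1")

# The fault originates at the PCIe bridge in CM #1.
cm1.local_errors.append("pcie_bridge_parity")

# The error data crosses into CM #0 and trips the parity check of its
# memory controller, so CM #0 records an error against itself as well.
cm0.local_errors.append("memory_controller_parity")

for cm in (cm0, cm1):
    isolate_by_local_errors(cm)

# Names of the CMs that are still running after isolation.
survivors = [cm.name for cm in (cm0, cm1) if cm.running]
```

Under this rule no CM remains running, although only the CM #1 actually had a failing part.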
Before the CM #0 is shut down, the memory controller that received the error data also writes the error data to a memory without blocking it. Although the memory is ECC (Error Check and Correct)-protected, the memory controller generates the ECC on the basis of the error data. As a result, the error in the written data may go undetected.
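The reason the error goes undetected may be illustrated by the following minimal sketch. A single even-parity bit is used here as a stand-in for a real ECC, which is an assumption made only to keep the example short; the function names `parity_bit`, `store`, and `check` are hypothetical.

```python
def parity_bit(data: bytes) -> int:
    """Even-parity bit over all bits of data (a stand-in for a real ECC)."""
    bit = 0
    for byte in data:
        bit ^= bin(byte).count("1") & 1
    return bit


def store(data: bytes):
    """Model of the memory controller: the check bit is generated from
    whatever data arrives, whether or not that data is already corrupted."""
    return data, parity_bit(data)


def check(stored) -> bool:
    """Model of a later read: recompute the check bit and compare."""
    data, bit = stored
    return parity_bit(data) == bit


original = b"\x12\x34"
# One bit is flipped upstream, before the data reaches the memory controller.
corrupted = bytes([original[0] ^ 0x01, original[1]])

stored = store(corrupted)
passes_check = check(stored)               # check code matches the corrupted data
matches_original = stored[0] == original   # but the data is not what the host sent
```

Because the check code is computed after the corruption has occurred, the stored data and its check code are mutually consistent, and a later read has no means of noticing that the data differs from what was originally sent.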
Against this background, various techniques have been proposed to improve the fault tolerance of storage apparatuses. For example, in one related art, when an error occurs during data writing from a CM to a disk, the path from the failing CM to the disk is reconstructed to complete the data writing. Another related art determines the failing part when a failure occurs. Another identifies a part having a failure and restarts it. Yet another selects a preferable path to improve access performance. Japanese Laid-open Patent Publication Nos. 2006-107053, 2000-181887, 2009-266119 and 2007-293448 are examples of related art.