A storage apparatus is configured, for example, as a disk array apparatus. A disk array apparatus employs redundant array of inexpensive disks (RAID) technology, which combines a plurality of disks (hard disk drives (HDDs) or the like) and manages the combined disks as one virtual disk (RAID group). Employing the RAID technology prevents, among other things, the loss of data stored on a disk. In addition, under the RAID technology, the data arrangement and the redundancy on each disk differ in accordance with the RAID level (RAID 1 to 6).
A RAID apparatus is a disk array apparatus that uses the RAID technology. In the RAID apparatus, from the viewpoint of data assurance, the control units that control the RAID apparatus are made redundant, and one pair of control units is mounted. Each control unit is called a controller module (hereinafter referred to as a CM). Each CM controls a storage unit including the plurality of disks described above in accordance with input/output requests (I/O requests and commands) from a host apparatus.
The pair of CMs is connected so that the CMs can communicate with each other through a communication channel (data transmission channel). As the communication channel, for example, Peripheral Component Interconnect Express (PCIe) is used. Each CM is provided with a PCIe switch (PCIeSW) that is connected to the communication channel and controls communication over it. Here, the path that links the CMs, consisting of the communication channel and the PCIeSWs connected to both of its ends, is called an inter-CM path. Hereinafter, the pair of CMs may be denoted by CM #0 and CM #1.
In a case where an abnormality occurs in the PCIeSW of one of the pair of CMs, the other, normal CM retracts (degenerates) the suspicious CM (abnormal CM), that is, the CM including the PCIeSW in which the abnormality has occurred, so that it is cut off, and the operation of the RAID apparatus is continued using only the normal CM.
However, in a case where an abnormality occurs on the inter-CM path, the characteristics of PCIe make it difficult to specify precisely in which of the pair of CMs the abnormality has occurred. It is nevertheless possible to determine which of the pair of CMs is more likely to have the abnormality, and that CM is specified as the suspicious CM.
Accordingly, there is a possibility that a normal CM is erroneously specified as a suspicious CM. The operation in a case where the normal CM #1 is erroneously specified as a suspicious CM, although the abnormality has actually occurred in the PCIeSW of the CM #0, will be described with reference to reference signs A1 to A8 illustrated in FIG. 14. FIG. 14 is a sequence diagram that illustrates the operation. In a case where an inter-CM path abnormality occurring on the CM #0 side (see reference sign A1) is detected with the CM #1 regarded as the suspicious CM (see reference sign A2), the normal CM #1 is retracted so as to be cut off from the RAID apparatus (see reference sign A3), and maintenance of the cut-off CM #1 is performed (see reference sign A4).
Meanwhile, the surviving CM #0 continues the operation of the RAID apparatus with the abnormality still remaining in its PCIeSW. At this time, although the abnormality remains in the PCIeSW of the surviving CM #0, the surviving CM #0 does not perform communication between CMs using the inter-CM path. Accordingly, the operation can be continued using only one CM without the operation of the RAID apparatus being affected.
However, when maintenance of the erroneously specified suspicious CM #1 has been performed, the CM #1 after the maintenance is inserted into the RAID apparatus, and communication between the CMs using the inter-CM path is restarted, a communication abnormality occurs again due to the abnormality remaining in the PCIeSW of the CM #0 (see reference sign A5). Accordingly, the CM #1 after the maintenance is again erroneously specified as a suspicious CM and is retracted and cut off (see reference sign A6). In a case where the maintenance fails in this way, the power of the RAID apparatus is turned off, and, after maintenance/replacement of the CM #0 is performed (see reference sign A7), the power of the RAID apparatus is turned on again (see reference sign A8).
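The sequence indicated by reference signs A1 to A8 can be traced with a small simulation. The following Python sketch is illustrative only: the class `ControllerModule`, the functions `inter_cm_path_ok` and `simulate`, and all field names are hypothetical and do not correspond to any actual RAID apparatus firmware. It assumes that the inter-CM path works only when both CMs are attached and neither PCIeSW has an abnormality, and that CM #1 is always the CM specified as suspicious, as in the case described above.

```python
from dataclasses import dataclass


@dataclass
class ControllerModule:
    """Hypothetical model of one CM, reduced to the state the sketch needs."""
    name: str
    pciesw_faulty: bool = False  # abnormality present in this CM's PCIeSW
    attached: bool = True        # False once the CM is retracted and cut off


def inter_cm_path_ok(cm_a: ControllerModule, cm_b: ControllerModule) -> bool:
    """Assumption: the path is healthy only if both CMs are attached
    and neither PCIeSW has an abnormality."""
    return (cm_a.attached and cm_b.attached
            and not (cm_a.pciesw_faulty or cm_b.pciesw_faulty))


def simulate():
    cm0 = ControllerModule("CM #0", pciesw_faulty=True)  # the actual fault
    cm1 = ControllerModule("CM #1")                      # normal CM
    log = []
    operation_stopped = False

    # A1/A2: the path abnormality is detected, but CM #1 is blamed.
    if not inter_cm_path_ok(cm0, cm1):
        log.append("A1: inter-CM path abnormality occurs on the CM #0 side")
        log.append("A2: CM #1 is erroneously specified as the suspicious CM")

    # A3: CM #1 is retracted; CM #0 runs alone and never uses the path,
    # so the abnormality in its PCIeSW stays hidden.
    cm1.attached = False
    log.append("A3: CM #1 is retracted and cut off")

    # A4: maintenance of a CM that had no abnormality changes nothing.
    cm1.pciesw_faulty = False
    log.append("A4: maintenance of CM #1 is performed")

    # A5/A6: reinsertion restarts inter-CM communication, which fails again.
    cm1.attached = True
    if not inter_cm_path_ok(cm0, cm1):
        log.append("A5: communication abnormality occurs again")
        cm1.attached = False
        log.append("A6: CM #1 is again retracted and cut off")

        # A7/A8: only a power-off and replacement of CM #0 clears the fault.
        operation_stopped = True
        cm0.pciesw_faulty = False
        log.append("A7: power off, maintenance/replacement of CM #0")
        cm1.attached = True
        log.append("A8: power of the RAID apparatus is turned on again")

    return log, inter_cm_path_ok(cm0, cm1), operation_stopped


if __name__ == "__main__":
    steps, path_ok, stopped = simulate()
    for step in steps:
        print(step)
    print("inter-CM path healthy:", path_ok, "| operation stopped:", stopped)
```

In this sketch the flag `operation_stopped` becomes true only at step A7, which corresponds to the point at which the RAID apparatus can no longer be kept in operation.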
As described above, in a case where a normal CM is erroneously specified as a suspicious CM, the power of the RAID apparatus is eventually turned off and maintenance/replacement of the CM in which the abnormality has actually occurred is performed. Accordingly, there is a problem in that the operation of the RAID apparatus (system) needs to be stopped.