There is a storage device that includes plural controller modules (CMs) that are connected by a peripheral component interconnect express (PCIe) bus and are capable of mutual communication.
FIG. 11 is a diagram that illustrates the configuration of the CMs included in a storage device in related art. The example in FIG. 11 shows two CMs 300-1 and 300-2 that are included in the storage device.
The CMs 300-1 and 300-2 are provided for redundancy, and these CMs 300-1 and 300-2 have similar configurations. Hereinafter, as the reference numerals that denote the CMs, a reference numeral 300-1 or 300-2 will be used when it is desired to identify one of the plural CMs, but a reference numeral 300 will be used to denote an arbitrary CM. Further, the CM 300-1 may be referred to as CM #0, and the CM 300-2 may be referred to as CM #1.
The CM 300 is a control device that performs various kinds of control in the storage device, such as access control to a memory device in accordance with a storage access request from a host device, which is not illustrated. The CM 300 includes a channel adapter (CA) 313, a central processing unit (CPU) 311, and a PCIe switch 312. The CA 313 is an interface controller that enables communication with the host device and the like.
The CPU 311 is a processing device that performs various kinds of control and computation. The CPU 311 is connected with the CA 313 and the PCIe switch 312 via the PCIe bus. For example, the CPU 311 of the CM 300-1 includes a port 401C and is connected with the PCIe switch 312 via the port 401C. Further, the CPU 311 of the CM 300-2 includes a port 401F and is connected with the PCIe switch 312 via the port 401F.
The PCIe switch 312 is a relay device that relays data transfer in accordance with a PCIe protocol. The PCIe switch 312 includes plural ports, and apparatuses that serve as transmission sources or transmission destinations of data are connected with those ports. In the example illustrated in FIG. 11, the PCIe switch 312 of the CM 300-1 includes two ports 401B and 401A. The PCIe switch 312 of the other CM 300-2 is connected with the port 401A. Further, the CPU 311 of the CM 300-1 is connected with the port 401B.
Similarly, the PCIe switch 312 of the CM 300-2 includes two ports 401D and 401E. The PCIe switch 312 of the other CM 300-1 is connected with the port 401D. Further, the CPU 311 of the CM 300-2 is connected with the port 401E. Hereinafter, as the reference characters that denote the ports, reference characters 401A to 401F will be used when it is desired to identify one of the plural ports, but a reference numeral 401 will be used to denote an arbitrary port.
Each of the ports 401 includes a transmission circuit Tx and a reception circuit Rx. The transmission circuit Tx included in the port 401A will be denoted by a reference character Tx-A, and the reception circuit Rx included in the port 401A will be denoted by a reference character Rx-A. Similarly, the transmission circuits Tx included in the ports 401B to 401F will be denoted by reference characters Tx-B to Tx-F, respectively. Further, the reception circuits Rx included in the ports 401B to 401F will be denoted by reference characters Rx-B to Rx-F, respectively.
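The port and circuit naming described above can be summarized in a small data model. The following Python sketch is illustrative only and is not part of the related-art device: the `Port` class, the `PORTS` and `LINKS` tables, and the `crosses_cm` helper are hypothetical names introduced here, and the link list simply restates the connections described for FIG. 11.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """One port 401x; the names below are illustrative assumptions."""
    name: str    # reference character, e.g. "401A"
    cm: str      # owning controller module, "CM#0" or "CM#1"
    device: str  # "CPU" or "PCIe switch"

    @property
    def tx(self) -> str:
        # Transmission circuit label, e.g. "Tx-A" for port 401A.
        return f"Tx-{self.name[-1]}"

    @property
    def rx(self) -> str:
        # Reception circuit label, e.g. "Rx-A" for port 401A.
        return f"Rx-{self.name[-1]}"


PORTS = {p.name: p for p in [
    Port("401A", "CM#0", "PCIe switch"),
    Port("401B", "CM#0", "PCIe switch"),
    Port("401C", "CM#0", "CPU"),
    Port("401D", "CM#1", "PCIe switch"),
    Port("401E", "CM#1", "PCIe switch"),
    Port("401F", "CM#1", "CPU"),
]}

# Point-to-point connections as described for FIG. 11.
LINKS = [("401C", "401B"), ("401A", "401D"), ("401E", "401F")]


def crosses_cm(a: str, b: str) -> bool:
    """Return True when a link joins ports that belong to different CMs."""
    return PORTS[a].cm != PORTS[b].cm


inter_cm_links = [link for link in LINKS if crosses_cm(*link)]
```

In this model, only the link between the ports 401A and 401D crosses the CM boundary, which is the path on which the failures P01 and P02 discussed in this section occur.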
Further, each of the transmission circuit Tx and the reception circuit Rx includes a buffer and performs data communication by using the buffer. That is, the buffer is used to temporarily store data in transit. Incidentally, when one or more buffers on the PCIe bus become full, no more data can be stored in those buffers, and communication processes stagnate.
For example, as denoted by a reference character P01 in FIG. 11, a case will be discussed where a failure occurs in the transmission circuit Tx-A of the port 401A of the PCIe switch 312 such that data cannot be transmitted to the reception circuit Rx-D of the port 401D, which is the transmission destination. In such a case, the buffer of the transmission circuit Tx-A soon becomes full. As a result, data can no longer be transferred from the reception circuit Rx-B of the port 401B to the transmission circuit Tx-A of the port 401A in the PCIe switch 312 of the CM #0. The buffer-full condition then spreads upstream along the data communication path until the buffer of the transmission circuit Tx-C of the CPU 311 of the CM #0 also becomes full, and the CM #0 enters a hang-up state.
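The way a full buffer spreads back toward the CPU in the P01 case can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not a model of actual PCIe flow control: the function name `simulate_stall`, the uniform buffer capacity, and the one-item-per-step transfer rate are all hypothetical. The only behavior taken from the description is that Tx-A never drains, so the stages Tx-C, Rx-B, and Tx-A fill up from the failure site backward.

```python
from collections import deque


def simulate_stall(capacity=4, max_steps=1000):
    """Toy model of the P01 case on the path Tx-C -> Rx-B -> Tx-A.

    Tx-A can never send to Rx-D (the assumed failure), so it never drains.
    Returns the order in which stages become permanently stalled and the
    step at which the last one stalls (None if max_steps is reached).
    """
    stages = ["Tx-C", "Rx-B", "Tx-A"]  # upstream to downstream
    buffers = {s: deque() for s in stages}
    stall_order = []

    for step in range(1, max_steps + 1):
        # Forward one item per hop, handling the hop nearest the failure first.
        for i in range(len(stages) - 2, -1, -1):
            src, dst = buffers[stages[i]], buffers[stages[i + 1]]
            if src and len(dst) < capacity:
                dst.append(src.popleft())

        # The CPU of CM #0 tries to queue one new item in Tx-C.
        if len(buffers["Tx-C"]) < capacity:
            buffers["Tx-C"].append(step)

        # A stage is stalled when it is full and everything after it is stalled.
        blocked_downstream = True  # the failed link after Tx-A accepts nothing
        for s in reversed(stages):
            stalled = blocked_downstream and len(buffers[s]) == capacity
            if stalled and s not in stall_order:
                stall_order.append(s)
            blocked_downstream = stalled

        if len(stall_order) == len(stages):
            return stall_order, step

    return stall_order, None


order, steps = simulate_stall()
```

Under these assumptions the stall order is Tx-A, then Rx-B, then Tx-C, which matches the sequence described above: once Tx-C is stalled, the CPU 311 of the CM #0 can issue no further data.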
As described above, for example, in a case where a failure that prevents data from being transmitted to the CM #1 as the transmission destination occurs in the CM #0, the failure occurrence site stays within a closed system, namely the CM #0. Thus, the CM to be the target of maintenance work for resolving the failure, that is, the maintenance-targeted CM, is easily identified as the CM #0.
Further, once the maintenance-targeted CM is identified, the maintenance-targeted CM is restarted (CM rebooting) or separated in order to restore the system.
However, depending on the circumstances in which a failure occurs, it may be difficult to identify the maintenance-targeted CM. For example, as denoted by a reference character P02 in FIG. 11, such a circumstance may be a failure in which the transmission circuit Tx-D of the port 401D of the PCIe switch 312 of the CM #1 transmits data to the reception circuit Rx-A of the port 401A, which is the transmission destination, but the reception circuit Rx-A cannot process the data.
In this case, because the transmission circuit Tx-D of the port 401D cannot confirm that the transmitted data has been processed, stagnation (time-out) of the communication between the CMs is eventually detected, and a determination is made that the path between the CMs 300-1 and 300-2 has a failure. However, in this case, both the reception circuit Rx-A of the port 401A and the transmission circuit Tx-D of the port 401D are possible failure sites, so it cannot be determined whether the maintenance-targeted CM is the CM #0 or the CM #1.
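The difference between the P01 and P02 cases can be expressed as a lookup from an observed symptom to the set of circuits that could explain it. The following sketch is purely illustrative: the symptom labels `cm0_hang_up` and `inter_cm_timeout`, the tables, and the function names are hypothetical and are not taken from the related-art device. A maintenance target is returned only when all suspect circuits belong to the same CM.

```python
# Which CM owns each circuit on the inter-CM path (per FIG. 11).
CIRCUIT_OWNER = {
    "Tx-A": "CM#0", "Rx-A": "CM#0",
    "Tx-D": "CM#1", "Rx-D": "CM#1",
}

# Hypothetical symptom labels mapped to the circuits that could explain them.
SUSPECTS = {
    "cm0_hang_up": {"Tx-A"},               # P01: failure site stays in CM #0
    "inter_cm_timeout": {"Tx-D", "Rx-A"},  # P02: either end of the path
}


def candidate_cms(symptom):
    """Return the set of CMs that contain at least one suspect circuit."""
    return {CIRCUIT_OWNER[circuit] for circuit in SUSPECTS[symptom]}


def maintenance_target(symptom):
    """Return the single CM to maintain, or None when it is ambiguous."""
    cms = candidate_cms(symptom)
    return next(iter(cms)) if len(cms) == 1 else None
```

With these tables, `maintenance_target("cm0_hang_up")` yields the CM #0, whereas `maintenance_target("inter_cm_timeout")` yields `None` because the suspects span both CMs, which is the identification problem described above.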
As described above, in the storage device in related art, the CM rebooting or separation is performed in order to restore the system in a case where a failure is detected in a CM. However, selecting the wrong CM as the maintenance-targeted CM may result in a system crash.
Japanese Laid-open Patent Publication No. 2008-288740, Japanese Laid-open Patent Publication No. 2000-183873, and Japanese Laid-open Patent Publication No. 9-191321 are examples of related art.