In a storage device such as a Redundant Array of Inexpensive Disks (RAID) device, Controller Modules (CMs) that control the device are generally mounted redundantly. These CMs are communicably connected to each other via a general-purpose interface such as PCI Express (hereinafter called PCIe) or a manufacturer-specific interface.
When one of the CMs fails, the CM that has failed (failed CM) is isolated (degraded) and the operation of the RAID device is continued by using another normal CM. To prevent propagation of a failure from the failed CM to other normal CMs, it is desirable to cut off the connection between CMs immediately when a failure is detected.
FIG. 5 is a diagram showing the configuration of a conventional storage device 100.
The storage device 100 shown in FIG. 5 includes two CMs 110A, 110B and a storage unit 201.
The storage unit 201 includes one or more storages (not shown) and provides storage areas of these storages to a host device (not shown) connected via CA (Communication Adapter) 130, 130.
The CMs 110A, 110B perform various kinds of control, such as access control to the storage unit 201, according to a storage access request (access control signal; hereinafter called a host I/O) from the host device. The CMs 110A, 110B have almost the same configuration as each other.
Hereinafter, when it is necessary to identify one of a plurality of CMs, the reference signs 110A, 110B are used as reference signs indicating CM, but a reference sign 110 is used when any CM is indicated.
The CM 110 includes, as shown in FIG. 5, a Central Processing Unit (CPU) 112, a Field-Programmable Gate Array (FPGA) 113, devices 121A, 121B (Device#0), a device 160 (Device#1), a device 122 (Device#2), and a disk interface 123.
The CPU 112 is a processing unit that performs various kinds of control or operation and realizes various functions such as RAID control by executing a program stored in a memory (not shown) or the like.
The disk interface 123 is, for example, a Serial Attached Small Computer System Interface (SAS) interface that is communicably connected to storages or the like in the storage unit 201. The disk interface 123 is also connected to a channel 151 and functions as an interface unit that controls communication through the channel 151. The channel 151 communicably connects the disk interface 123 of the CM 110A and the disk interface 123 of the CM 110B.
The device 122 is a switch device that functions as a bridge connecting the CA 130, the CPU 112, and the disk interface 123 and is, for example, a PCIe switch.
The CPU 112, the host device, and the storage unit 201 are communicably connected via the device 122. That is, under the control of the CPU 112, a write operation or read operation of data is performed on the storage unit 201 in response to an I/O request from the host device via the device 122 and the disk interface 123. Accordingly, data can be written/read into/from the storage unit 201 from both of the CMs 110A, 110B.
The devices 121A, 121B are each connected to a channel 152 and are interface units that control communication by the channel 152. The device 121A is included in the CM 110A and the device 121B is included in the CM 110B. These devices 121A, 121B are, for example, PCIe switches and hereinafter, these devices 121A, 121B may be called PCIe switches 121A, 121B.
The PCIe switches 121A, 121B perform data communication conforming to the standard of PCIe between the CM 110A and the CM 110B via the channel 152.
These PCIe switches 121A, 121B have the same configuration. Hereinafter, when it is necessary to identify one of a plurality of PCIe switches, the reference signs 121A, 121B are used as reference signs indicating a PCIe switch, but a reference sign 121 is used when any PCIe switch is indicated.
The PCIe switches 121A, 121B also function as bridges connecting the CPU 112, the channel 152, and the device 160.
The channel 152 communicably connects the PCIe switch 121A of the CM 110A and the PCIe switch 121B of the CM 110B.
In the CMs 110A, 110B, the PCIe switches 121A, 121B are each communicably connected to the CPU 112 of the same CM by a PCIe channel 141.
The device 160 is, for example, a Solid State Drive (SSD) and is used as a secondary cache of the CPU 112. In each of the CMs 110A, 110B, the device 160 is connected to the devices 121A, 121B and the CPU 112 accesses the device 160 via the devices 121A, 121B respectively.
The FPGA 113 is a programmable large-scale integration (LSI) circuit and realizes a function to cut off the connection between CMs when an error is detected in the CMs 110A, 110B.
The FPGA 113 includes, as shown in FIG. 5, error detection logic 114, an inter-FPGA communication controller 117, a reset register 115, and a cause register 116.
The cause register 116 is a register that records information identifying the cause of an error that has occurred, namely information indicating one of the devices 121, 122 and the CPU 112.
The reset register 115 is a register that controls a reset state of Device#0. Device#0 is reset when “1” is recorded in the reset register 115. That is, when “1” is recorded in the reset register 115, “1” is input to a reset terminal provided in the device 121 as a reset instruction and the device 121 is put into a reset state. Accordingly, the interface between CMs is disconnected. When “0” is recorded in the reset register 115, the reset state of Device#0 is canceled.
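The behavior of the reset register 115 described above can be illustrated by the following minimal sketch. The class names `Device` and `ResetRegister` and the attribute `in_reset` are hypothetical names introduced only for illustration; the actual register is realized in the FPGA 113 and not in software.

```python
class Device:
    """Hypothetical stand-in for Device#0 (the device 121)."""

    def __init__(self, name):
        self.name = name
        self.in_reset = False  # state of the reset terminal


class ResetRegister:
    """Hypothetical model of the reset register 115 (a single bit)."""

    def __init__(self, device):
        self._device = device
        self.value = 0

    def write(self, value):
        if value not in (0, 1):
            raise ValueError("the reset register holds only 0 or 1")
        self.value = value
        # The recorded value is input to the reset terminal of the device:
        # 1 puts the device into the reset state, 0 cancels the reset state.
        self._device.in_reset = (value == 1)


device_121 = Device("Device#0")
reset_115 = ResetRegister(device_121)

reset_115.write(1)
held_in_reset = device_121.in_reset       # interface between CMs disconnected

reset_115.write(0)
released = not device_121.in_reset        # reset state canceled
```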
The error detection logic 114 monitors for an error notification signal from the devices 121, 122 and the CPU 112. When an error is detected in the device 121, the error detection logic 114 is notified of a Device#0 error notification signal and when an error is detected in the device 122, the error detection logic 114 is notified of a Device#2 error notification signal. When an error is detected in the CPU 112, the error detection logic 114 is notified of a CPU error notification signal.
When the error detection logic 114 is notified of an error from one of the devices 121, 122 and the CPU 112, the error detection logic 114 sets “1” to the reset register 115 and also records information to identify the transmission source of the error notification in the cause register 116.
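The handling of an error notification by the error detection logic 114 can be sketched as follows. The names `Source`, `Fpga`, and `notify_error` are hypothetical and are introduced only for illustration. The sketch also assumes that a later notification simply overwrites the cause register, which the description above does not specify.

```python
from enum import Enum


class Source(Enum):
    """Possible transmission sources of an error notification signal."""
    DEVICE0 = "Device#0"  # device 121
    DEVICE2 = "Device#2"  # device 122
    CPU = "CPU"           # CPU 112


class Fpga:
    """Hypothetical model of the registers and error detection logic."""

    def __init__(self):
        self.reset_register = 0     # reset register 115
        self.cause_register = None  # cause register 116

    def notify_error(self, source):
        """Model of the error detection logic 114 receiving a notification."""
        if not isinstance(source, Source):
            raise TypeError("source must be a Source member")
        # Record the transmission source of the notification
        # (assumption: a later notification overwrites an earlier one).
        self.cause_register = source
        # Setting "1" places Device#0 in the reset state.
        self.reset_register = 1


fpga = Fpga()
fpga.notify_error(Source.DEVICE2)
```

In this sketch, any of the three sources leads to the same reset of Device#0; only the content of the cause register differs.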
A technique to cut off connection between CMs when an error is detected in the CM 110 of the conventional storage device 100 will be described.
When, for example, an error occurs in the CPU 112 of the CM 110A in the example shown in FIG. 5 (see reference sign A1), the CPU 112 transmits a CPU error notification signal to the error detection logic 114.
The error detection logic 114 records “1” in the reset register 115 and also records in the cause register 116 that the CPU 112 is the error cause (see reference sign A2). With “1” being recorded in the reset register 115, the device 121A is reset (see reference sign A3) and the communication between the CM 110A and the CM 110B is cut off (see reference sign A4).
When the communication between CMs is cut off as described above, the cutoff is detected as an error by the device 121B of the CM 110B (see reference sign A5). The device 121B transmits a Device#0 error notification signal to the error detection logic 114 of the FPGA 113 of the CM 110B (see reference sign A6). Also in this CM 110B, the error detection logic 114 records “1” in the reset register 115 and also records in the cause register 116 that the device 121B, which is the transmission source of the error notification, is the error cause (see reference sign A7).
Also in this CM 110B, with “1” being recorded in the reset register 115, the device 121B is reset (see reference sign A8). That is, in the CM 110B in which no error has occurred, the device 121B is reset based on an error detected by the communication between CMs being cut off.
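The sequence indicated by the reference signs A1 to A8 can be traced with the following minimal sketch, which models only the reset registers of the two CMs and the link between them. The names `ControllerModule`, `notify_error`, `link_down`, and `connect` are hypothetical and are introduced only for illustration.

```python
class ControllerModule:
    """Hypothetical model of one CM 110 with its reset register 115."""

    def __init__(self, name):
        self.name = name
        self.reset_register = 0  # 1 means Device#0 (device 121) is in reset
        self.peer = None         # CM at the other end of the channel 152
        self.events = []         # error notification signals received

    def notify_error(self, source):
        """Error detection logic 114: any notification resets Device#0."""
        if self.reset_register == 1:
            return  # Device#0 is already in the reset state
        self.events.append(source)
        self.reset_register = 1
        # Resetting Device#0 cuts off the communication between CMs,
        # which the Device#0 of the peer CM observes as an error.
        if self.peer is not None:
            self.peer.link_down()

    def link_down(self):
        """Device#0 detects the cutoff and notifies its own FPGA."""
        if self.reset_register == 0:
            self.notify_error("Device#0")


def connect(cm_x, cm_y):
    cm_x.peer, cm_y.peer = cm_y, cm_x


cm_a = ControllerModule("CM 110A")
cm_b = ControllerModule("CM 110B")
connect(cm_a, cm_b)

cm_a.notify_error("CPU")  # A1: error occurs in the CPU 112 of the CM 110A
```

After the single CPU error in the CM 110A, both `cm_a.reset_register` and `cm_b.reset_register` hold 1 in this sketch, although the only notification received in the CM 110B is the Device#0 notification caused by the cutoff.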
Patent Literature 1: Japanese Laid-open Patent Publication No. 10-187473
Patent Literature 2: Japanese Laid-open Patent Publication No. 2008-59558
In the conventional storage device 100 as described above, however, the device 121B is also reset in the CM 110B, and so the CPU 112 of the CM 110B can no longer use the device 160 connected under the device 121B (see reference sign A9). That is, there arises a problem that an error that has occurred in the CM 110A makes the device 121B in the normal CM 110B, and hence the device 160 under it, unusable.
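The problem described above can be illustrated as a reachability check over the connections inside the CM 110B. In the following minimal sketch, the function `reachable` and the dictionary `links_cm_b` are hypothetical names introduced only for illustration, and the connections are a simplification of FIG. 5 based on the description above.

```python
def reachable(links, in_reset, start, goal):
    """Return True if goal can be reached from start without passing
    through any component that is held in the reset state."""
    if start in in_reset or goal in in_reset:
        return False
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for neighbor in links.get(node, ()):
            if neighbor not in seen and neighbor not in in_reset:
                seen.add(neighbor)
                stack.append(neighbor)
    return False


# Simplified connections inside the CM 110B (assumption based on the text).
links_cm_b = {
    "CPU 112": ["device 121B", "device 122"],
    "device 121B": ["CPU 112", "device 160"],
    "device 160": ["device 121B"],
    "device 122": ["CPU 112", "disk interface 123"],
    "disk interface 123": ["device 122"],
}

# Before the reset, the CPU 112 can use the device 160 as a secondary cache.
cache_before = reachable(links_cm_b, set(), "CPU 112", "device 160")

# After the device 121B is reset (A8), the device 160 is no longer reachable.
cache_after = reachable(links_cm_b, {"device 121B"}, "CPU 112", "device 160")

# The path to the disk interface 123 via the device 122 is unaffected.
disk_after = reachable(links_cm_b, {"device 121B"}, "CPU 112", "disk interface 123")
```

In this sketch, only the path through the device 121B is lost, which corresponds to the device 160 becoming unusable while the access to the storage unit 201 via the device 122 remains available.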