A RAID (Redundant Array of Inexpensive Disks) device is known which has a plurality of controller modules (each abbreviated as a CM below), and in which these CMs are connected with each other via front-end routers (each referred to as an FRT below) and PCIe (Peripheral Component Interconnect Express) buses (each referred to as a PCIe bus below).
Multiple items of user data flow through the paths connecting the CMs in order to maintain data redundancy. If user data is garbled due to a factor such as an abnormality inside an LSI (Large Scale Integration) device and the garbled data propagates to another CM, normal data is lost and redundancy cannot be maintained, and therefore there is a concern that the system goes down.
FIG. 10 is a view illustrating a process performed when an abnormality is detected in a PCIe switch in a conventional RAID device. In the example illustrated in FIG. 10, a RAID device 100 includes four CMs 200, two service controllers (each referred to as an SVC below) 300 and four FRTs 400.
Each CM 200 includes a monitoring FPGA (Field Programmable Gate Array) 201, a CPU 202 and a PCIe switch 203.
The PCIe switch 203 includes a port 2031 and an NTB (Non Transparent Bridge) port 2032. The port 2031 relays data received by the PCIe switch 203 and transfers the data to an external device. Further, each port 2031 includes a register which stores a detected error factor and configuration information necessary to transfer data.
The PCIe switch 203 is connected with the CPU 202 via the port 2031. The CPU 202 performs various types of control on the CM 200, and corresponds to a root complex of PCIe.
Further, the PCIe switch 203 is connected with a PCIe switch 401 of the FRT 400 via the NTB port 2032.
The NTB port 2032 converts the domain (address) of data to be transmitted into a domain supported by the other party in order to transfer data between the CMs 200. In addition, the NTB port 2032 also includes the same register as that of the above port 2031.
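The domain conversion performed by the NTB port 2032 can be pictured as a window-based address translation. The following is only a minimal sketch under stated assumptions: the class name `NtbWindow`, the base addresses and the window size are hypothetical and are not taken from the RAID device described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NtbWindow:
    """Hypothetical translation window: local address range -> peer address range."""
    local_base: int
    peer_base: int
    size: int

    def translate(self, local_addr: int) -> int:
        # Keep the offset inside the window and re-base it in the peer's domain.
        offset = local_addr - self.local_base
        if not 0 <= offset < self.size:
            raise ValueError(f"address {local_addr:#x} is outside the NTB window")
        return self.peer_base + offset


# Assumed example values, chosen only for illustration.
window = NtbWindow(local_base=0x8000_0000, peer_base=0x2000_0000, size=0x1000_0000)
translated = window.translate(0x8000_1000)
print(hex(translated))  # 0x20001000
```

An address outside the window is rejected rather than forwarded, which mirrors the idea that only data mapped into the other party's domain can cross between the CMs 200.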
Each monitoring FPGA 201 includes an inter-FPGA communication control block 2011 and an error detection logic 2012. The error detection logic 2012 receives an input of an error notification signal from the PCIe switch 401 of each FRT 400.
The inter-FPGA communication control block 2011 is connected with an inter-FPGA communication control block 3011 of a monitoring FPGA 301 of each SVC 300 via an inter-FPGA communication data bus.
In this way, the monitoring FPGA 301 of each SVC 300 is connected with the monitoring FPGAs (monitoring devices) 201 of all CMs 200 via the inter-FPGA communication data buses. Further, a data bus from each CM 200 is branched into and connected with the two SVCs 300 to configure a redundant bus.
Each SVC 300 includes the monitoring FPGA 301. Each monitoring FPGA 301 includes the inter-FPGA communication control block 3011 and a power-off control function 3012. The power-off control function 3012 notifies the PCIe switch 401 of the specified FRT 400 of a power off request to perform control to power off this PCIe switch 401.
Each FRT 400 includes the PCIe switch 401. Each PCIe switch 401 includes a plurality of ports 4011, and each CM 200 is connected to each port 4011.
Each port 4011 of the PCIe switch 401 also has the same function and configuration as those of the port 2031 of each PCIe switch 203, and relays data received by the PCIe switch 401 and transfers the data to an external device.
Although the PCIe switch 401 includes a plurality of ports 4011, only one port 4011 is illustrated in FIG. 10 for ease of illustration. Each port 4011 of the PCIe switch 401 is connected with the NTB port 2032 of the PCIe switch 203 of the CM 200 via a PCIe bus. Thus, the CMs 200 communicate with each other via the FRTs 400.
The PCIe switch 401 has a function of, when detecting an abnormality which is likely to cause uncorrectable data garbling (so-called 2-bit garbling) such as a PCIe uncorrectable error, notifying an external device of this abnormality by outputting an error notification signal. This error notification signal is input to all CMs 200.
In addition, the same NTB port as the NTB port 2032 of the above PCIe switch 203 can be mounted on this PCIe switch 401, but there is a system limitation that only one NTB port can be mounted. Therefore, it is not possible to allocate NTB ports to all ports of the PCIe switch 401 which are connected with the PCIe buses.
A process performed when the PCIe switch 401 of the FRT 400 detects an abnormality in the conventional RAID device configured as described above will be described.
When the PCIe switch 401 detects an abnormality (see a reference numeral A1 in FIG. 10), the PCIe switch 401 notifies the monitoring FPGAs 201 of all CMs 200 that the abnormality has been detected by issuing error notification signals (see a reference numeral A2 in FIG. 10).
Each monitoring FPGA 201 which has received the error notification signal specifies the transmission source of the error notification signal as an error factor according to the error detection logic 2012. Each monitoring FPGA 201 notifies the monitoring FPGA 301 of the SVC 300 of a power off request for the PCIe switch 401 specified as the error factor by using an inter-FPGA communication data signal (see a reference numeral A3 in FIG. 10).
The monitoring FPGA 301 of the SVC 300 which has received the power off request issues operating power supply off control to the specified FRT 400 (see a reference numeral A4 in FIG. 10).
An operating power supply of the FRT 400 for which the operating power supply off control has been issued is powered off, and PCIe bus communication with the CM 200 is disconnected (see a reference numeral A5 in FIG. 10). Thus, the FRT 400 is separated from all CMs 200 (see a reference numeral A6 in FIG. 10).
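The sequence indicated by the reference numerals A1 to A6 can be replayed with a short simulation. This is a sketch for illustration only: the function name `handle_frt_abnormality`, the string labels and the use of a single SVC that acts once per requested FRT are assumptions, and the actual device realizes these steps in FPGA hardware.

```python
def handle_frt_abnormality(cms, frt_powered, faulty_frt):
    """Replay steps A1 to A6 for one faulty FRT and return the event log."""
    events = [("A1", f"{faulty_frt}: PCIe switch 401 detects abnormality")]
    power_off_requests = []
    for cm in cms:
        # A2: the error notification signal reaches the monitoring FPGA 201 of every CM.
        events.append(("A2", f"{faulty_frt} -> {cm}: error notification signal"))
        # Error detection logic 2012: the transmission source is taken as the error factor.
        error_factor = faulty_frt
        # A3: the CM requests power-off of the error factor over the inter-FPGA bus.
        power_off_requests.append(error_factor)
        events.append(("A3", f"{cm} -> SVC: power off request for {error_factor}"))
    # Assumption: the SVC acts once per distinct target, however many CMs asked.
    for target in dict.fromkeys(power_off_requests):
        events.append(("A4", f"SVC -> {target}: operating power supply off control"))
        frt_powered[target] = False
        events.append(("A5", f"{target}: powered off, PCIe bus communication disconnected"))
        events.append(("A6", f"{target}: separated from {', '.join(cms)}"))
    return events


cms = ["CM0", "CM1", "CM2", "CM3"]
frt_powered = {f"FRT{i}": True for i in range(4)}
events = handle_frt_abnormality(cms, frt_powered, "FRT1")
for step, text in events:
    print(step, text)
print(frt_powered)
```

After the call, only the entry for the FRT that raised the error is `False`, while the A6 event lists every CM, which reflects that the powered-off FRT is lost to all CMs 200 at once.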
Patent Literature 1: Japanese Patent Application Laid-Open No. 2005-293595
Patent Literature 2: Japanese Patent Application Laid-Open No. 11-191073
However, in such a conventional RAID device, when data is transferred from a specific CM 200 to the FRT 400 and an abnormality which causes data garbling occurs inside the PCIe switch 203 of this CM 200, it is the PCIe switch 401 of the FRT 400 receiving this data that detects the error.
The PCIe switch 401 which has detected the error issues an error notification signal to an external device. As described above, the monitoring FPGA 301 of the SVC 300 then issues operating power supply off control to the specified FRT 400, and the operating power supply of the FRT 400 for which the operating power supply off control has been issued is powered off, so that this FRT 400 is separated from all CMs 200.
Fundamentally speaking, it is desirable to minimize an influence caused by error detection by separating only the CM 200 in which abnormality has occurred. However, the above conventional RAID device has a problem that the influence eventually spreads to all CMs 200 connected with the FRT 400 including the PCIe switch 401.
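The difference between the conventional behavior and the desired behavior can be made concrete by counting which healthy CMs lose a link. The sketch below assumes that every CM 200 is linked to every FRT 400 (four of each, as in FIG. 10); the names `CM0` to `CM3`, `FRT0` to `FRT3` and all function names are hypothetical.

```python
CMS = ["CM0", "CM1", "CM2", "CM3"]
FRTS = ["FRT0", "FRT1", "FRT2", "FRT3"]
# Assumption: one PCIe link between every CM and every FRT.
ALL_LINKS = {(cm, frt) for cm in CMS for frt in FRTS}


def links_after_conventional(detecting_frt):
    """Power off the FRT that detected the error: every CM loses its link to it."""
    return {(cm, frt) for cm, frt in ALL_LINKS if frt != detecting_frt}


def links_after_desired(faulty_cm):
    """Separate only the CM in which the abnormality occurred."""
    return {(cm, frt) for cm, frt in ALL_LINKS if cm != faulty_cm}


def healthy_cms_losing_links(remaining_links, faulty_cm):
    """List the CMs other than the faulty one that lost at least one link."""
    lost = ALL_LINKS - remaining_links
    return sorted({cm for cm, _ in lost if cm != faulty_cm})


# The abnormality originates in CM0 and is detected by FRT1.
conventional = healthy_cms_losing_links(links_after_conventional("FRT1"), "CM0")
desired = healthy_cms_losing_links(links_after_desired("CM0"), "CM0")
print(conventional)  # ['CM1', 'CM2', 'CM3']
print(desired)       # []
```

Under these assumptions, powering off the detecting FRT removes a link from all three healthy CMs, whereas separating only the faulty CM leaves the links of the healthy CMs untouched.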