In a redundant system, when a component of the system fails, the system isolates the failed component in order to contain the failure. Methods of isolating the failed component include, for example, cutting off power to the failed component or cutting off a switch on a high-speed transmission channel bus.
Preferably, the firmware of the system correctly identifies the failed component when an abnormality inside the system is detected. A redundant system can continue to operate as long as one module operates, and it is therefore very important to contain the failure.
FIG. 7 is a diagram illustrating a configuration example of an information processing apparatus 100. As illustrated in FIG. 7, the information processing apparatus 100 includes a central processing unit (CPU) 200, a monitoring field programmable gate array (FPGA) 300, a non-volatile memory 400, and devices 500A to 500C.
The CPU 200 is a device that performs various controls or calculations in the information processing apparatus 100, and includes a core 210, a random access memory (RAM) 220, a high-speed interface (IF) 230, and a low-speed IF 240.
The core 210 performs various processing operations as the CPU 200. For example, the core 210 controls the device 500A via the high-speed IF 230 over a high-speed transmission channel 700a, and controls the devices 500B and 500C via the device 500A. Further, for log collection, the core 210 is connected with each of the devices 500A to 500C via the low-speed IF 240 over a low-speed transmission channel 700b.
The devices 500A to 500C are various devices constituting the information processing apparatus 100. The device 500A is, for example, a switch module. It is connected with the subordinate devices 500B and 500C through the high-speed transmission channel 700a, and is also connected with another, redundant module through the high-speed transmission channel 700a. The devices 500B and 500C are redundant devices, for example, adapters which communicate with a device such as a disk device or a host device.
When the CPU 200 (core 210) detects an abnormality in the devices 500A to 500C through the high-speed IF 230, the CPU 200 acquires a log (status information) such as a register dump from each of the devices 500A to 500C via the low-speed transmission channel 700b through the low-speed IF 240. Further, the CPU 200 stores the acquired log in a log area 220a of the RAM 220.
In addition, the CPU 200 identifies the location where the failure occurred based on the acquired register dump. For example, when the failure occurred in the device 500A, the CPU 200 disconnects the device 500A from the other redundant module in order to remove the failure from the system. In this case, the information processing apparatus 100 including the corresponding CPU 200 is isolated from the other module, and the system continues to operate using the other module.
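The flow described above (collect register dumps, identify the failure location, disconnect) can be sketched roughly as follows. This is only an illustrative model under stated assumptions: the `Device` class, the register name `ERR_STATUS`, and the rule that a nonzero value marks the failure location are hypothetical, since the register layout and the identification logic are not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical stand-in for one of the devices 500A to 500C."""
    name: str
    registers: dict = field(default_factory=dict)

    def read_register_dump(self):
        # Models a register read over the low-speed transmission channel 700b.
        return dict(self.registers)


def collect_logs(devices, log_area):
    """Store each device's register dump in the log area (modeled as a dict)."""
    for dev in devices:
        log_area[dev.name] = dev.read_register_dump()
    return log_area


def locate_failure(log_area):
    """Return the names of devices whose assumed ERR_STATUS register is nonzero."""
    return [name for name, dump in log_area.items()
            if dump.get("ERR_STATUS", 0) != 0]


def isolation_action(failed_names):
    """Map the failure location to an action; only the 500A case comes from the text."""
    if "500A" in failed_names:
        return "disconnect 500A from the other redundant module"
    return None  # other cases are not described and are left undefined here


if __name__ == "__main__":
    devices = [
        Device("500A", {"ERR_STATUS": 1}),  # assumed failing device
        Device("500B", {"ERR_STATUS": 0}),
        Device("500C", {"ERR_STATUS": 0}),
    ]
    log_area = collect_logs(devices, {})   # stands in for the log area 220a
    failed = locate_failure(log_area)
    print(failed, isolation_action(failed))
```

Note that every step after `collect_logs` depends on the register reads returning, which is the dependency the problem described with reference to FIG. 8 turns on.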
Note that the monitoring FPGA 300 is hardware that monitors and controls an LED, a power supply, reset processing, and the like in the information processing apparatus 100, and the non-volatile memory 400 is a memory that holds information and the like relating to the monitoring and control performed by the monitoring FPGA 300.
In addition, as a related technology, a technology is known in which, upon occurrence of a failure or the like, a processor transfers the content of a memory of a channel device or of an error log control circuit to a main memory (see, for example, Patent Literature 1 or 2).
Moreover, a technology is known in which a log in normal times or at the time an error is detected is accumulated in an internal buffer or the like by a logging circuit or a logic circuit constituting hardware (see, for example, Patent Literature 3 or 4).
[Patent Literature 1] Japanese Laid-open Patent Publication No. 57-6951
[Patent Literature 2] Japanese Laid-open Patent Publication No. 58-96326
[Patent Literature 3] Japanese Laid-open Patent Publication No. 2004-348306
[Patent Literature 4] Japanese Laid-open Patent Publication No. 10-207790
In the example illustrated in FIG. 7, there is a risk that the CPU 200 enters exception processing and hangs when no read response is returned from the devices 500A to 500C at the time of acquiring the register dump.
FIG. 8 is a sequence diagram illustrating an operation example of the information processing apparatus 100 when a failure occurs in the device 500C illustrated in FIG. 7. As illustrated in FIG. 8, after the information processing apparatus 100 is powered on (step T110), when a failure occurs in the device 500C (step T120), the device 500C notifies the CPU 200 of an error via the high-speed transmission channel 700a (step T130). Upon being notified of the error, the CPU 200 acquires (reads) a dump from a register (not illustrated in FIG. 7) of the device 500C via the low-speed transmission channel 700b (step T140).
Here, in the case where the device 500C is unable to respond to the read due to the failure which has occurred in the device 500C (step T150), the CPU 200 hangs while waiting for the register read to complete (step T160). In this case, since the CPU 200 is unable to collect the register dump at the time of the failure, the CPU 200 is unable to determine the failure location and fails to isolate the device 500C, which is the failed component. As a result, the CPU 200 fails to contain the failure.
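The hang in steps T140 to T160 can be illustrated with a small simulation. The sketch below is an assumption-laden model, not the behavior of any real hardware: the "CPU" is a thread that blocks on a register read, the failed device is modeled as an object whose read never receives a response, and all class and function names are hypothetical. An outside observer is added only so that the demonstration itself can terminate.

```python
import threading


class UnresponsiveDevice:
    """Hypothetical model of the device 500C after the failure (step T150)."""

    def __init__(self):
        self._response = threading.Event()  # never set while the device is failed

    def read_register(self):
        # Blocks until a read response arrives; a failed device never sends one.
        self._response.wait()
        return 0

    def release(self):
        # Demo-only hook so the thread can exit; not part of the modeled hardware.
        self._response.set()


def cpu_collect_dump(device, result):
    # Step T140: the CPU issues the register read and waits for completion.
    result["dump"] = device.read_register()
    # Not reached while the device is unresponsive, so failure
    # identification and isolation never start.
    result["isolated"] = True


def run_demo(observe_seconds=0.2):
    device = UnresponsiveDevice()
    result = {}
    cpu = threading.Thread(target=cpu_collect_dump, args=(device, result), daemon=True)
    cpu.start()
    cpu.join(timeout=observe_seconds)  # observe the modeled CPU from outside
    hung = cpu.is_alive()              # still waiting on the read (step T160)
    snapshot = dict(result)            # what had been collected while hung
    device.release()                   # let the thread finish so the demo exits
    cpu.join()
    return hung, snapshot


if __name__ == "__main__":
    print(run_demo())
```

In this model, `run_demo()` reports that the CPU thread is still blocked and that nothing has been collected, which mirrors the situation in which neither the register dump nor the isolation of the device 500C is obtained.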
Further, when the CPU 200 fails to isolate the device 500C, the failure state of the device 500C may propagate to the device 500A through the high-speed transmission channel 700a (step T170). In this case, the failure state that has propagated to the device 500A propagates further to the other module through an intermodule bus (high-speed transmission channel 700a), and both redundant modules stop operating. As a result, the system may become unable to continue operating (machine down).
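The propagation path described above can be modeled as reachability over the high-speed links. The following sketch assumes, pessimistically and purely for illustration, that a failure state can spread over every high-speed link that has not been cut; the node label `other_module` and the function `affected` are hypothetical names.

```python
from collections import deque

# High-speed links (700a) as described for FIG. 7. "other_module" is a
# hypothetical label for the redundant module on the intermodule bus.
LINKS = [("500C", "500A"), ("500B", "500A"), ("500A", "other_module")]


def affected(start, links, cut=frozenset()):
    """Return the set of nodes reachable from `start` over links not in `cut`."""
    adjacency = {}
    for a, b in links:
        if (a, b) in cut or (b, a) in cut:
            continue  # this link has been disconnected
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:  # breadth-first search from the failed device
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


if __name__ == "__main__":
    # Isolation failed: the failure can reach the other module.
    print(sorted(affected("500C", LINKS)))
    # Isolation succeeded: the link to the device 500C is cut.
    print(sorted(affected("500C", LINKS, cut={("500C", "500A")})))
```

Under this assumption, a failure starting at the device 500C reaches `other_module` when no link is cut, and stays confined to the device 500C once its link to the device 500A is cut.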
Moreover, since the CPU 200 becomes inoperative during log collection, the register dump of the device 500C at the time the phenomenon occurs cannot be collected either. Further, since the register dump cannot be collected, it is also difficult to investigate the cause after the failed module is replaced and, for example, brought back to a factory.
Note that the aforementioned related technologies do not take the above problem into consideration.