In large-scale systems that include server devices of a backbone system, such as a plurality of mainframes, and input/output (IO) devices, an IO control device is sometimes provided that connects the channels of the mainframes and the IO devices with each other by dynamic switching.
FIG. 20 is a diagram illustrating an exemplary configuration of an information processing system 100.
The information processing system 100 illustrated in FIG. 20 includes information processing devices 200-1 and 200-2, IO devices 400-1 and 400-2, and a switch device 500.
The information processing devices 200-1 and 200-2 transmit or receive data or commands to or from the IO devices 400-1 and 400-2 or a control unit 900 via the switch device 500 through channels (denoted by CHs in FIG. 20) 300-1 and 300-2.
Note that, in the example illustrated in FIG. 20, a mainframe (MF) is used as each of the information processing devices 200-1 and 200-2. Further, a console or any of various kinds of storage devices, including a magnetic disk device such as a hard disk drive (HDD), a semiconductor disk device such as a solid state drive (SSD), and a tape drive, may be used as the IO devices 400-1 and 400-2.
The switch device 500 includes external ports 600-1 to 600-4, an internal port 700, a switch unit 800, and a control unit 900.
The external ports 600-1 to 600-4 are connected to the channels 300-1 and 300-2 and the IO devices 400-1 and 400-2, respectively. The internal port 700 is provided in the control unit 900.
Note that, in FIG. 20, the external port 600-1 to which the channel 300-1 is connected is denoted by C0, and the external port 600-2 to which the channel 300-2 is connected is denoted by C1. Further, the external port 600-3 to which the IO device 400-1 is connected is denoted by D1, the external port 600-4 to which the IO device 400-2 is connected is denoted by D2, and the internal port 700 provided in the control unit 900 is denoted by FE. In the following description, the external ports 600-1 to 600-4 are referred to as ports C0, C1, D1, and D2, respectively, and the internal port 700 is referred to as a port FE.
The switch unit 800 is connected to the external ports 600-1 to 600-4 and the internal port 700, and manages statuses of the external ports 600-1 to 600-4 and controls a connection relation between arbitrary ports. Through control of a connection relation, the switch unit 800 dynamically switches a connection between the channels 300-1 and 300-2 and the IO devices 400-1 and 400-2, and performs an n-to-n connection (n is an integer of 1 or more).
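The connection management described above can be illustrated with a minimal sketch. It is not the implementation of the switch unit 800. The class name `SwitchUnit`, its methods, and the rule that a port already in use refuses a new connection are assumptions made only for illustration. The port names C0, C1, D1, and D2 follow FIG. 20.

```python
class SwitchUnit:
    """Toy model of a switch unit that tracks port status and connections."""

    def __init__(self, ports):
        # Assumption: every port starts online and unconnected.
        self.status = {p: "online" for p in ports}
        self.peer = {}  # port name -> name of the port it is connected to

    def connect(self, a, b):
        """Connect two idle, online ports; return False if either is unavailable."""
        if a == b or a not in self.status or b not in self.status:
            return False
        if self.status[a] != "online" or self.status[b] != "online":
            return False
        if a in self.peer or b in self.peer:
            return False  # assumption: a port in use refuses a new connection
        self.peer[a] = b
        self.peer[b] = a
        return True

    def release(self, a):
        """Release the connection that port `a` takes part in, if any."""
        b = self.peer.pop(a, None)
        if b is not None:
            self.peer.pop(b, None)
        return b


switch = SwitchUnit(["C0", "C1", "D1", "D2"])
first = switch.connect("C0", "D1")   # channel 300-1 reaches IO device 400-1
busy = switch.connect("C1", "D1")    # D1 is in use, so the request is refused
switch.release("C0")
second = switch.connect("C1", "D1")  # after the release, C1 can reach D1
```

Because the connection table is updated per request, the same device-side port can serve different channel-side ports at different times, which is the dynamic switching referred to above.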
The control unit 900 is connected with the external ports 600-1 to 600-4 through the internal port 700, and performs configuration control, such as switching the external ports 600-1 to 600-4 online or offline.
The control unit 900 controls the external ports 600-1 to 600-4 based on an instruction given from the information processing device 200-1 or 200-2 or the like, and the control is applied to the port designated by that instruction. For example, the information processing device 200-1 or 200-2 gives the instruction by issuing a command to the control unit 900 through the external ports 600-1 to 600-4 and the switch unit 800.
In the information processing system 100, the switch device 500 allows flexible connections to be made between the plurality of information processing devices 200-1 and 200-2 and the plurality of IO devices 400-1 and 400-2, and the number of channels and the number of connected channels at the time of IO connection can be reduced.
Further, each of the channels 300-1 and 300-2, the external ports 600-1 to 600-4, and the internal port 700 can hold a trace log in its own channel or its own port. The trace log is used for error analysis when an error occurs in the information processing system 100.
An example of an error processing procedure in the information processing system 100 having the above-described configuration will be described below with reference to FIG. 21.
FIG. 21 is a sequence diagram for describing an exemplary error processing procedure in the information processing system 100 illustrated in FIG. 20.
When the channel 300-1 issues a command to the IO device 400-1, the ports C0 and D1 are connected to each other (step S101). In other words, as the command is transmitted from the channel 300-1, the channel 300-1 is connected with the IO device 400-1 through the switch device 500.
In the connection state, for example, when the channel 300-1 detects an error such as an interface control check (ICC) in the interaction between the channel 300-1 and the IO device 400-1 (step S102), the content of a trace memory in the channel 300-1 is collected (step S103). The channel 300-1 uses the content of the trace memory as an error log for ICC analysis.
When the content of the trace memory is collected, the channel 300-1 releases (separates) an IO interface between the channel 300-1 and the IO device 400-1 (step S104).
Specifically, the channel 300-1 transmits a command instructing the port C0 to release a connection with the channel 300-1. Upon receiving the command, the port C0 releases a connection between the channel 300-1 and the port C0 (step S104a), and transmits a command instructing the port D1 which is in the connection state with the port C0 to release a connection with the IO device 400-1. Upon receiving the command from the port C0, the port D1 releases a connection between the port D1 and the IO device 400-1 (step S104b).
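The cascade of steps S104a and S104b can be sketched as follows. This is a hedged illustration, not the behavior of actual port hardware. The `Port` class, its fields, and the event strings are hypothetical names introduced only to show the order in which the two releases happen.

```python
class Port:
    """Toy external port holding one external link and one in-switch peer."""

    def __init__(self, name, attached):
        self.name = name
        self.attached = attached  # channel or IO device on the external side
        self.link_up = True       # connection to the attached channel or device
        self.peer = None          # port connected through the switch unit

    def handle_release(self, events):
        """Release the external link, then forward the release to the peer port."""
        if not self.link_up:
            return
        self.link_up = False
        events.append(f"{self.name} released {self.attached}")
        peer, self.peer = self.peer, None
        if peer is not None:
            peer.peer = None
            peer.handle_release(events)


c0 = Port("C0", "channel 300-1")
d1 = Port("D1", "IO device 400-1")
c0.peer, d1.peer = d1, c0  # connection state established in step S101

events = []
c0.handle_release(events)  # channel 300-1 instructs port C0 to release
```

In this sketch the release received by C0 comes first (step S104a) and the forwarded release at D1 comes second (step S104b), matching the order described above.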
Even when a connection between the ports C0 and D1 is released in step S104, the channel 300-1 can transmit a next frame to the IO device 400-1.
Note that, since it is difficult for the IO device 400-1 to determine whether the channel 300-1 has detected an error in step S102, the IO device 400-1 recognizes the interaction with the channel 300-1 as still being in progress. For this reason, the channel 300-1 performs a reset process of resetting the connection with the IO device 400-1 (step S105). In the reset process, the channel 300-1 instructs the IO device 400-1 to reset, and the IO device 400-1 that is given the reset instruction resets its connection with the channel 300-1 (step S105a).
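The reason the reset in step S105 is needed can be seen in a small sketch of steps S101 to S105 from the channel side. The classes `Channel` and `IODevice`, their fields, and the trace strings are hypothetical and only model the state each side believes it is in.

```python
from dataclasses import dataclass, field


@dataclass
class IODevice:
    session_open: bool = False  # the device's own view of the interaction

    def reset(self):
        # Step S105a: the device clears its side of the connection.
        self.session_open = False


@dataclass
class Channel:
    trace_memory: list = field(default_factory=list)
    error_log: list = field(default_factory=list)
    connected: bool = False

    def issue_command(self, device):
        # Step S101: issuing a command establishes the connection.
        self.connected = True
        device.session_open = True
        self.trace_memory.append("command sent")

    def handle_icc(self, device):
        # Step S102: the channel detects the ICC.
        self.trace_memory.append("ICC detected")
        # Step S103: collect the trace memory content as the error log.
        self.error_log = list(self.trace_memory)
        # Step S104: release the IO interface.
        self.connected = False
        # The device was never told about the error, so its view is stale.
        stale = device.session_open
        # Step S105: instruct the device to reset.
        device.reset()
        return stale


channel = Channel()
device = IODevice()
channel.issue_command(device)
device_view_was_stale = channel.handle_icc(device)
```

After the release and before the reset, the device side still holds an open session, which is the mismatch the reset process removes.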
As described above, when the channel 300-1 detects an ICC in the interaction with the IO device 400-1, the error process illustrated in FIG. 21 is performed. The error log collected in the error process is used to specify a suspicious point through ICC analysis, and an administrator or an operator repairs or replaces the specified suspicious point to recover from the failure.
[Patent Literature 1] Japanese Laid-open Patent Publication No. 48-071155
[Patent Literature 2] Japanese Laid-open Patent Publication No. 04-336636
[Patent Literature 3] Japanese Laid-open Patent Publication No. 2009-223702
When an error occurs in the information processing system 100 in which the channel 300 is connected with the IO device 400 through the switch device (IO control device) 500, it is preferable to perform failure recovery, that is, to specify, repair, and replace a suspicious point in a short time.
In the error process illustrated in FIG. 21, when the channel 300-1 detects an ICC, the trace log in the channel 300-1 is collected as an error log for error analysis. However, the switch device 500 has no function of recognizing that the channel 300-1 has detected an ICC. Thus, even when the channel 300-1 detects an ICC, it is difficult for the switch device 500 to recognize whether the trace log in the switch device 500 is necessary.
In other words, when the switch device 500 has a function of storing the trace logs of the IO interface of the external ports 600-1 to 600-4 in a memory, the trace logs of the external ports 600-1 to 600-4 are continuously stored in the memory by another process after the connection release process is performed.
For example, there are cases in which after the error process illustrated in FIG. 21 is performed, another external port 600-2 transmits a connection request to the external port 600-3 at the IO device 400-1 side that has released a connection regardless of an operation of the channel 300-1. Then, when the connection request is received in the external port 600-3, trace content at the time of error occurrence may be overwritten and lost.
When the error process illustrated in FIG. 21 is performed and a log on the switch device 500 side, particularly the log of the external port 600-3 connected to the IO device 400-1, is overwritten and lost as described above, only the trace log in the channel 300-1 can be used as the error log of the IO interface in the ICC analysis.
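The loss of trace content can be illustrated with a fixed-size trace memory. The following is a sketch under stated assumptions: the trace memory of the external port 600-3 (port D1) is modeled as a ring buffer that keeps only the newest entries, and the depth `TRACE_DEPTH` and the entry strings are hypothetical values chosen for illustration.

```python
from collections import deque

TRACE_DEPTH = 4  # hypothetical number of entries the port can hold

# Assumption: the port trace memory keeps only the newest TRACE_DEPTH entries.
trace_d1 = deque(maxlen=TRACE_DEPTH)

# Interaction with channel 300-1 through port C0, ending with the ICC release.
for entry in ["C0 connect", "C0 frame 1", "C0 frame 2", "C0 release after ICC"]:
    trace_d1.append(entry)
before = list(trace_d1)

# Port C1 then sends a connection request to D1, unrelated to channel 300-1.
for entry in ["C1 connect request", "C1 frame 1", "C1 frame 2", "C1 frame 3"]:
    trace_d1.append(entry)
after = list(trace_d1)
```

In this model, `before` still holds the entries from the time of the error, while `after` holds only entries from the later connection through C1, so nothing from the error remains for ICC analysis.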
However, in the past, when it was difficult to specify a suspicious point based on the trace log in the channel 300-1 and the error was not reproducible, all of the devices and cables that could be a suspicious point became replacement targets. For example, in the example illustrated in FIGS. 20 and 21, many devices and cables such as the channel 300-1, the cable between the channel 300-1 and the external port 600-1, the switch device 500, the cable between the external port 600-3 and the IO device 400-1, and the IO device 400-1 become replacement targets.
When there are many suspicious points, the number of replacement parts increases, and the part replacement time increases with the number of replacement parts, and thus the cost of parts and the working time increase. Further, it takes a long time to recover from a failure.
In practice, the demand for reducing the cost and the failure recovery time has increased. In order to reduce the number of replaced suspicious parts, the replacement working time, and the cost, and to perform the recovery in a short time, it is desirable to collect an error log that is useful for the error analysis that specifies a suspicious point.