Typically, a supercomputer has a configuration in which a large number of calculators called nodes are coupled with each other through a network called an interconnect. Communication through the interconnect is controlled by an interconnect control unit in each node. The interconnect control unit is also called an interconnect controller (ICC).
Recently, the processing performance of calculators has been significantly improved by the improved performance of central processing units (CPUs). This has led to an increase in the amount of data communicated between CPUs, and accordingly, the bandwidth desired for the interconnect has been increasing. It is difficult to obtain the desired bandwidth by electrical communication through metal wires, and thus the interconnect is increasingly achieved by optical communication, which provides a large bandwidth. The optical communication is achieved by using a conversion element, called an optical module, configured to convert between optical signals and electric signals. The optical module is roughly divided into two parts: a circuit part configured to exchange electric signals with the interconnect control unit, and an optical element part configured to convert between optical and electric signals.
A path through which nodes are coupled is called a link. Typically, one link includes a plurality of lanes as communication paths through which signals are transmitted and received. The interconnect control unit is provided with as many ports as there are links, and each port is coupled with a different node.
The interconnect control unit has a function called dynamic lane degeneracy. Dynamic lane degeneracy is a function of cutting off, when a failure is detected at a certain link, the problematic lane in the link and continuing communication by using the lanes that remain in working order. For example, consider a case in which a failure occurs at a light receiving element used by a particular link. In this case, the interconnect control unit detects an error at the particular link, such as the number of packet retransmissions exceeding a defined value. Having detected such an error, from which it is determined that continuing communication is difficult, the interconnect control unit executes lane degeneracy on the particular link. When executing the lane degeneracy, the interconnect control unit determines which lane is to be cut off by using an error counter prepared for each lane. Specifically, the interconnect control unit compares the values of the error counters of the lanes, and determines the cutoff target to be a lane for which a larger number of errors has been detected. Then, when a particular lane has been cut off, the interconnect control unit executes link re-initialization to, for example, activate the lanes other than the lane that has been cut off.
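The selection of a cutoff target described above can be sketched as follows. This is only an illustrative model, not the behavior of any actual interconnect controller: the names `Lane`, `Link`, `needs_degeneracy`, `degenerate`, and `reinitialize`, as well as the value of `RETRANSMIT_LIMIT`, are hypothetical, and the sketch assumes that exactly one lane, the active lane with the largest error count, is cut off per execution.

```python
from dataclasses import dataclass

RETRANSMIT_LIMIT = 8  # hypothetical "defined value" for packet retransmissions


@dataclass
class Lane:
    lane_id: int
    error_count: int = 0  # per-lane error counter
    active: bool = True   # False once the lane has been cut off


@dataclass
class Link:
    lanes: list
    retransmissions: int = 0  # retransmission count observed on this link

    def needs_degeneracy(self) -> bool:
        # Error condition: retransmissions exceed the defined value.
        return self.retransmissions > RETRANSMIT_LIMIT

    def degenerate(self):
        """Cut off the active lane with the most errors, then re-initialize.

        Returns the id of the lane cut off, or None if no lane is active.
        """
        candidates = [lane for lane in self.lanes if lane.active]
        if not candidates:
            return None
        # Compare the per-lane error counters and pick the largest.
        worst = max(candidates, key=lambda lane: lane.error_count)
        worst.active = False
        self.reinitialize()
        return worst.lane_id

    def reinitialize(self) -> None:
        # Assumed re-initialization: clear counters on the surviving lanes.
        self.retransmissions = 0
        for lane in self.lanes:
            if lane.active:
                lane.error_count = 0
```

For example, a four-lane link whose lane 1 has accumulated the most errors would, under these assumptions, cut off lane 1 and continue on lanes 0, 2, and 3 once `needs_degeneracy()` reports true and `degenerate()` is called.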
For example, consider a case in which a particular link includes two lanes. When an error from which it is determined that continuing communication is difficult is detected at one of the lanes while the other lane is already degenerated, the particular link has no available lane. In this case, the interconnect control unit performs processing to deactivate the particular link and cuts off the particular link from the calculation resources in the system.
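The two-lane case above can be expressed as a small check. The function name `handle_fatal_error` and the representation of lane states as a list of booleans are assumptions made for illustration only.

```python
def handle_fatal_error(lane_active, failed_lane):
    """Return the updated lane states and whether the link remains active.

    lane_active: list of booleans, one per lane (False = already degenerated).
    failed_lane: index of the lane at which the error was detected.
    """
    updated = list(lane_active)   # copy so the caller's list is not modified
    updated[failed_lane] = False  # cut off the failing lane
    link_active = any(updated)    # the link survives only if a lane remains
    return updated, link_active


# Two-lane link in which lane 1 is already degenerated and lane 0 now fails:
states, link_up = handle_fatal_error([True, False], failed_lane=0)
# No lane remains, so link_up is False and the link would be deactivated.
```

In this sketch, link deactivation is simply the condition in which no lane remains active after the cutoff; cutting the link off from the calculation resources of the system is outside its scope.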
Optical communication has a shorter history of research and development than electrical communication, and the optical module tends to have a higher failure rate than other devices that process electric signals but not optical signals. For example, the optical module has a unique failure mode called sudden death, in which light emission from a light-emitting element suddenly stops. Moreover, the amount of heat generated at the optical module has recently been increasing due to downsizing and increased density of the optical module, as well as increased communication speed in response to demand for larger interconnect transmission capacity. Heat generation is known to accelerate device failure and is thus a factor that increases the failure rate. For these reasons, the optical module tends to be more likely to fail than other devices, which is a main cause of lane degeneracy and link deactivation in the interconnect.
The following technologies are disclosed in relation to such communication failures at, for example, a link or a lane. In one conventional technology, link deactivation is avoided by reallocating physical and logical lanes when a restriction exists on the number of logical lanes or the lane width for which degeneracy is possible. In another conventional technology, the state of lane degeneracy is resolved by using an unused physical lane. In another conventional technology, a path is divided into partial paths, failure detection is performed at each partial path, and switching is performed to a path bypassing a partial path at which a failure has occurred. In another conventional technology, resources of paths are shared based on priority information provided to the paths. In another conventional technology for determining a place where a failure has occurred, a particular interval is specified on an optical path and a conduction test is performed on the specified interval by using an optical signal. In another conventional technology, a multi-stage connection network is formed to perform communication through a bypass switch when a switch has failed. A citation list includes Japanese Laid-open Patent Publication Nos. 2005-182485, 2013-200616, 2003-258851, 11-191754, and 05-111065, and International Publication Pamphlet No. WO 2008/044646.