In a large-scale cluster network, a multi-path network that provides a plurality of routes between any two nodes, such as a fat-tree network or a torus network, is used. The main objectives of using a multi-path network are (1) to obtain high performance, specifically, high throughput achieved by distributing a load over a plurality of routes, and (2) to avoid trouble, specifically, to avoid a loss of communication when a failure occurs in a route, by using an alternative route.
Recently, a technology referred to as InfiniBand has been developed as a cluster network technology. A network in which InfiniBand is used will be referred to below as an InfiniBand network.
FIG. 1 is a diagram illustrating an example of an InfiniBand network system having a fat-tree structure. In FIG. 1, switches SP1 to SP3, switches LF1 to LF3, and compute nodes N1 to N9 are connected through an InfiniBand network so as to form a fat tree.
Compute nodes N1 to N9 are computers that perform prescribed operations by communicating with each other via switches LF1 to LF3 and switches SP1 to SP3. The numbers 1 to 9 enclosed in the circles are addresses (local identifiers (LIDs)) assigned to compute nodes N1 to N9, respectively. With the InfiniBand network, an address assigned to each of compute nodes N1 to N9 is represented by a 16-bit value.
Switches LF1 to LF3 are so-called leaf switches, each of which is directly connected to some of compute nodes N1 to N9. Switches SP1 to SP3 are so-called spine switches that couple leaf switches LF1 to LF3 to each other. The numbers depicted beside each port of a switch indicate the addresses of the compute nodes to which packets having those addresses as destination addresses are to be forwarded via that port serving as an output port. For example, when a packet having a destination address of 4 arrives at switch SP1, the packet is forwarded to switch LF2 via the port coupled to leaf switch LF2. In the following description, for ease of explanation, a port serving as an output port will also be expressed simply as “an output port”, and a compute node address depicted beside a port of a switch will also be expressed as “a destination address”, meaning the destination address of a packet to be forwarded from that port to the corresponding compute node. In this way, destination addresses are associated with each of the output ports of the switches. In FIG. 1, labels “a”, “b”, and “c” indicate port numbers assigned to the ports provided for each of the leaf switches. In this case, the same port numbers “a”, “b”, and “c” are assigned to the ports of each of the leaf switches.
“(To: x, y, z)” described for each of spine switches SP1 to SP3 indicates that a packet having a destination address of x, y, or z is relayed by that spine switch. For example, a packet having a destination address of 1, 4, or 7 is forwarded via spine switch SP1, a packet having a destination address of 2, 5, or 8 is forwarded via spine switch SP2, and a packet having a destination address of 3, 6, or 9 is forwarded via spine switch SP3, as illustrated in FIG. 2. When a leaf switch receives a packet destined for a compute node directly connected to the leaf switch, the packet is forwarded directly to the compute node without being forwarded to a spine switch. For example, when leaf switch LF1 receives a packet having a destination address of 1, 2, or 3, the packet is forwarded directly to compute node N1, N2, or N3, respectively, without being forwarded to any spine switch.
In FIG. 1, when any one of the spine switches has failed and communication through it has been disabled, a route switchover is made to bypass the failed spine switch.
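The forwarding rule described for FIG. 1 can be sketched as follows. This is a minimal illustration only, assuming three compute nodes per leaf switch and addresses numbered consecutively from 1 as in FIG. 1; the function names `leaf_of` and `next_hop` are hypothetical and do not correspond to any InfiniBand interface.

```python
NODES_PER_LEAF = 3   # FIG. 1: three compute nodes per leaf switch
NUM_SPINES = 3       # FIG. 1: spine switches SP1 to SP3


def leaf_of(lid: int) -> int:
    """Return the 1-based index of the leaf switch to which a LID is attached."""
    return (lid - 1) // NODES_PER_LEAF + 1


def next_hop(leaf: int, dest_lid: int) -> str:
    """Return where leaf switch `leaf` forwards a packet destined for `dest_lid`."""
    if leaf_of(dest_lid) == leaf:
        # The destination is directly connected: no spine switch is involved.
        return f"N{dest_lid}"
    # Otherwise the spine switch is selected by the destination address:
    # 1, 4, 7 -> SP1; 2, 5, 8 -> SP2; 3, 6, 9 -> SP3.
    spine = (dest_lid - 1) % NUM_SPINES + 1
    return f"SP{spine}"


print(next_hop(1, 2))  # local delivery from LF1
print(next_hop(1, 4))  # remote destination relayed by a spine switch
```

Because the spine switch is selected from the destination address alone, packets for compute nodes N4, N5, and N6 leaving leaf switch LF1 are spread over three different spine switches.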
FIG. 2 is a schematic diagram illustrating an example of a route switchover when a spine switch has failed.
FIG. 2 illustrates a case in which spine switch SP1 has failed, as depicted by dashed lines. In this case, in leaf switches LF1, LF2, and LF3, the destination addresses (depicted in dotted rectangles) that have been associated with the ports coupled to spine switch SP1 are newly associated with other ports. In FIG. 2, the newly associated destination addresses are underlined. For example, in leaf switch LF1, destination address 4 is newly associated with the port coupled to spine switch SP2, and destination address 7 is newly associated with the port coupled to spine switch SP3. As a result, a packet having destination address 4 or 7 may be forwarded without passing through the failed spine switch SP1.
In a conventional InfiniBand network system, consecutive addresses are assigned to a group of compute nodes connected to the same leaf switch. That is, after first consecutive addresses have been assigned to a first group of compute nodes directly connected to a first leaf switch, second consecutive addresses following the first consecutive addresses are assigned to a second group of compute nodes directly connected to the next leaf switch. For example, in FIG. 1, first consecutive addresses 1 to 3 are assigned to the first group of compute nodes directly connected to first leaf switch LF1, and second consecutive addresses 4 to 6 are assigned to the second group of compute nodes directly connected to second leaf switch LF2.
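The switchover for leaf switch LF1 can be sketched as follows. The round-robin redistribution used here is an assumption chosen because it reproduces the result stated for LF1 (address 4 moves to the port coupled to SP2, address 7 to the port coupled to SP3); the description does not state which policy is actually applied, and the name `reassign_on_failure` is hypothetical.

```python
def reassign_on_failure(port_table, failed_port):
    """Move the addresses of `failed_port` to the surviving ports, round-robin.

    `port_table` maps an uplink port number to the list of destination
    addresses associated with it. The input table is not modified.
    """
    survivors = sorted(p for p in port_table if p != failed_port)
    new_table = {p: list(port_table[p]) for p in survivors}
    for i, lid in enumerate(sorted(port_table[failed_port])):
        # Assumed policy: hand out orphaned addresses to survivors in turn.
        new_table[survivors[i % len(survivors)]].append(lid)
    return new_table


# Uplink ports of leaf switch LF1 in FIG. 1: a -> SP1, b -> SP2, c -> SP3.
lf1 = {"a": [4, 7], "b": [5, 8], "c": [6, 9]}
lf1_after = reassign_on_failure(lf1, "a")
print(lf1_after)  # {'b': [5, 8, 4], 'c': [6, 9, 7]}
```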
The N-th port from the left on the upper side of each leaf switch is coupled to the N-th spine switch from the left. In FIG. 1, for example, a port having port number “a” in each leaf switch is coupled to spine switch SP1, a port having port number “b” in each leaf switch is coupled to spine switch SP2, and a port having port number “c” in each leaf switch is coupled to spine switch SP3. In other words, the N-th spine switch from the left binds the N-th ports from the left on the upper side of all the leaf switches. That is, the N-th spine switch from the left relays packets having destination addresses associated with the N-th ports from the left on the upper side of all the leaf switches.
According to the above-mentioned connection configuration and setting of communication routes, a plurality of compute nodes directly connected to a first leaf switch are able to communicate concurrently, using different communication routes, with other compute nodes each directly connected to a leaf switch other than the first leaf switch. This allows the network load to be appropriately distributed. For example, suppose that communication between compute nodes N1 and N4 (communication 1), communication between compute nodes N2 and N5 (communication 2), and communication between compute nodes N3 and N6 (communication 3) are performed concurrently. Then, communication 1 is performed via spine switch SP1, communication 2 is performed via spine switch SP2, and communication 3 is performed via spine switch SP3.
FIG. 3 is a diagram illustrating an example of an InfiniBand network system having a fat-tree structure. The network depicted in FIG. 3 is larger in size than the network in FIG. 1. In FIG. 3, labels “a” to “p” indicate port numbers assigned to the ports provided for each of the leaf switches. In this case, the same port numbers “a” to “p” are assigned to the ports of each of the leaf switches, in a manner similar to the case of FIG. 1.
For example, the network in FIG. 3 includes 512 compute nodes N1 to N512, 32 leaf switches LF1 to LF32 each of which has 32 ports, and 16 spine switches SP1 to SP16 each of which has 32 ports. “(To: n mod 16)” described for each spine switch indicates that a packet having a destination address that produces “n” as the remainder when divided by 16 is relayed by that spine switch. For example, a packet having a destination address that produces 1 as the remainder when divided by 16 passes through spine switch SP1, and a packet having a destination address that produces 2 as the remainder when divided by 16 passes through spine switch SP2. Similarly, a packet having a destination address that produces 0 as the remainder when divided by 16 passes through spine switch SP16. In the network of FIG. 3 as well, when any one of the spine switches has failed and communication through it has been disabled, a route switchover is made to bypass the failed spine switch.
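The “n mod 16” rule can be written compactly as below. This is an illustrative sketch; the function name `spine_for` is hypothetical, and the only special case is that a remainder of 0 selects spine switch SP16.

```python
NUM_SPINES = 16  # FIG. 3: spine switches SP1 to SP16


def spine_for(dest_lid: int) -> int:
    """Return the index of the spine switch that relays packets for `dest_lid`."""
    remainder = dest_lid % NUM_SPINES
    # A remainder of 0 is handled by the last spine switch, SP16.
    return NUM_SPINES if remainder == 0 else remainder


for lid in (1, 2, 17, 512):
    print(f"destination {lid} -> SP{spine_for(lid)}")
```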
FIG. 4 is a diagram illustrating an example of a route switchover when a spine switch has failed.
FIG. 4 illustrates a case in which a fault has occurred in spine switch SP1, as depicted by dashed lines in FIG. 4. In this case, in leaf switches LF1 to LF32, the destination addresses that have been associated with the ports having port number “a” and coupled to spine switch SP1 (that is, the destination addresses that produce 1 as the remainder when divided by 16) need to be reassigned to other ports having one of port numbers “b” to “p”.
Each switch has a forwarding database (FDB) that stores route information indicating correspondence between destination addresses and port numbers assigned to output ports.
FIG. 5 is a diagram illustrating a configuration example of an FDB included in an InfiniBand switch. In the FDB of FIG. 5, the letters “a” to “p” indicate port numbers, and the destination address associated with a port number is obtained by adding the value in the row labeled “OFFSET” that corresponds to the port number to the value in the column labeled “BASE” that corresponds to the port number. For example, a packet having a destination address of 1 (0+1) needs to be sent from the output port having port number “a”. Similarly, a packet having a destination address of 67 (64+3) needs to be sent from the output port having port number “c”. Although an FDB may be implemented as one-dimensional information, FIG. 5 represents the FDB in a two-dimensional form to facilitate understanding. In the following description, an output port having port number “x” will also be expressed as “port x” for ease of explanation.
When a fault has occurred in spine switch SP1 as illustrated in FIG. 4, the output ports having port number “a” and connected to spine switch SP1 are no longer used in each leaf switch. In FIG. 5, therefore, the destination addresses assigned to the ports having port number “a” need to be reassigned to ports having port numbers other than “a”. That is, the portions enclosed by the dashed-line rectangles 21, 22, 23, and 24 in FIG. 5, which correspond to output ports having port number “a”, need to be changed. In the FDB in the InfiniBand switch, however, 64 destination addresses are processed as one data block that is updated at one time. That is, the data included in a row enclosed by a heavy line in FIG. 5 is updated at one time as one data block. When a fault has occurred in spine switch SP1, therefore, all the data blocks 31, 32, 33, and 34 in the FDB need to be updated. This is also the case when a fault has occurred in a spine switch other than spine switch SP1.
For example, it takes about 400 μs to about 1000 μs to update one data block of the FDB. With a large network, this update time becomes too large to be neglected. For example, systems including 200 to 4000 compute nodes have already been constructed, and a system including more than 10,000 compute nodes, as illustrated in FIG. 6, will be constructed in the near future.
FIG. 6 is a diagram illustrating an example of an InfiniBand network system having a fat-tree structure. In FIG. 6, an InfiniBand network includes 18 spine switches each having 648 ports, 648 leaf switches each having 36 ports, and 11,664 compute nodes. When any one of the spine switches has failed in FIG. 6, updating the FDB in each leaf switch is expected to take about 47 seconds to about 118 seconds. These times are very long compared with what is acceptable in network communication.
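The BASE plus OFFSET lookup can be sketched as follows. The sketch assumes four rows of 64 offsets each, that port numbers follow the “n mod 16” rule of FIG. 3, and that address 0 is unused; it ignores the ports leading to directly connected compute nodes. The names `build_fdb`, `port_for`, and `lookup` are hypothetical.

```python
import string

BLOCK_SIZE = 64                      # offsets 0 to 63 in each row
PORTS = string.ascii_lowercase[:16]  # port numbers "a" to "p"


def port_for(lid):
    """Port chosen by the mod-16 rule; address 0 is assumed to be unused."""
    if lid == 0:
        return None
    return PORTS[(lid - 1) % 16]     # remainder 1 -> "a", ..., remainder 0 -> "p"


def build_fdb(num_rows):
    """Return the two-dimensional form {BASE: [port at OFFSET 0..63]}."""
    return {
        base: [port_for(base + offset) for offset in range(BLOCK_SIZE)]
        for base in range(0, num_rows * BLOCK_SIZE, BLOCK_SIZE)
    }


def lookup(fdb, dest_lid):
    """Split an address into BASE and OFFSET and read the output port."""
    base = (dest_lid // BLOCK_SIZE) * BLOCK_SIZE
    offset = dest_lid % BLOCK_SIZE
    return fdb[base][offset]


fdb = build_fdb(num_rows=4)          # row count is an assumption
print(lookup(fdb, 1))   # address 1 = 0 + 1
print(lookup(fdb, 67))  # address 67 = 64 + 3
```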
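Why every data block must be rewritten can be checked with a short calculation. The sketch below assumes the “n mod 16” assignment of FIG. 3 and an FDB of four 64-address blocks covering addresses 1 to 255; the function name `blocks_to_update` is hypothetical.

```python
BLOCK_SIZE = 64   # destination addresses per FDB data block
NUM_SPINES = 16   # spine switches in FIG. 3


def blocks_to_update(failed_spine, max_lid):
    """Return the indices of the data blocks holding an address routed via `failed_spine`."""
    remainder = failed_spine % NUM_SPINES  # SP16 handles remainder 0
    return sorted({
        lid // BLOCK_SIZE
        for lid in range(1, max_lid + 1)
        if lid % NUM_SPINES == remainder
    })


# Addresses 1, 17, 33, 49, 65, ... use port "a", so they fall in every block.
print(blocks_to_update(failed_spine=1, max_lid=255))  # [0, 1, 2, 3]
```

Because 64 is a multiple of 16, each block contains four addresses for each spine switch, so the failure of any single spine switch touches all of the blocks.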
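The quoted range can be reproduced with the following back-of-the-envelope calculation. It rests on assumptions not stated in the description: that every data block in every leaf switch must be rewritten, that the 648 leaf switches are updated one after another, and that the number of blocks per switch is taken as the exact quotient 11,664 / 64 rather than rounded up.

```python
COMPUTE_NODES = 11_664
LEAF_SWITCHES = 648
BLOCK_SIZE = 64          # destination addresses per FDB data block

# Assumption: exact quotient (182.25), not rounded up to whole blocks.
blocks_per_switch = COMPUTE_NODES / BLOCK_SIZE
total_blocks = blocks_per_switch * LEAF_SWITCHES


def total_update_seconds(us_per_block):
    """Total time if all blocks in all leaf switches are updated sequentially."""
    return total_blocks * us_per_block / 1_000_000


low = total_update_seconds(400)    # fastest quoted per-block update time
high = total_update_seconds(1000)  # slowest quoted per-block update time
print(f"{low:.1f} s to {high:.1f} s")
```

Rounding up to 183 whole blocks per switch gives slightly larger values, but the order of magnitude is unchanged.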
Japanese Laid-open Patent Publication No. 2005-333220 discloses technology that uses an address conversion table to simplify the update of route information.
However, when an address conversion table is used, addresses assigned to the same entry in the address conversion table are always assigned to the same port. This is problematic in that the addresses cannot be easily reassigned to different ports to distribute a load.