Field of the Invention
The present invention concerns routing in a cluster, that is to say the determination of communication routes between the nodes of the cluster, and more particularly a method of optimizing routing in a cluster comprising static communication links, as well as a computer program implementing that method.
Description of Related Technology
High Performance Computing (HPC) is being developed for university research and industry alike, in particular in technical fields such as aeronautics, energy, climatology and life sciences. Modeling and simulation make it possible in particular to reduce development costs and to bring to market more quickly innovative products that are more reliable and consume less energy. For researchers, high performance computing has become an indispensable means of investigation.
This computing is generally conducted on data processing systems called clusters. A cluster typically comprises a set of interconnected nodes. Certain nodes are used to perform computing tasks (compute nodes), others for storing data (storage nodes) and one or more others manage the cluster (administration nodes). Each node is for example a server implementing an operating system such as Linux (Linux is a trademark). The connection between the nodes is, for example, made using Ethernet or Infiniband communication links (Ethernet and Infiniband are trademarks).
FIG. 1 is a diagrammatic illustration of an example of a fat-tree type topology 100 for a cluster. This topology comprises a set of nodes, generally referenced 105. The nodes belonging to the set 110 are compute nodes here, whereas the nodes of the set 115 are service nodes (storage nodes and administration nodes). The compute nodes may be grouped together in sub-sets 120 called compute islands, the set 115 being called a service island.
The nodes are linked together by switches, for example hierarchically. In the example illustrated in FIG. 1, the nodes are connected to first level switches 125 which are themselves linked to second level switches 130 which in turn are linked to third level switches 135.
As illustrated in FIG. 2, each node generally comprises one or more microprocessors, local memories and a communication interface. More specifically, the node 200 here comprises a communication bus 202 to which there are connected:
central processing units (CPUs) or microprocessors 204;
components of random access memory (RAM) 206, comprising registers adapted to record variables and parameters created and modified during the execution of programs (as illustrated, each random access memory component may be associated with a microprocessor); and
communication interfaces 208 adapted to send and to receive data.
The node 200 furthermore possesses here internal storage means 212, such as hard disks, able in particular to contain the executable code of programs.
The communication bus allows communication and interoperability between the different elements included in the node 200 or connected to it. The microprocessors 204 control and direct the execution of the instructions of portions of software code of the program or programs. On powering up, the program or programs which are stored in a non-volatile memory, for example a hard disk, are transferred into the random access memory 206.
It is observed here that the performance of a cluster is directly linked to the quality of the routes that enable the transfer of data between the nodes and that are established via communication links. In general terms, physical communication links are established between the nodes and the switches at the time of the hardware configuration of a cluster, the communication routes themselves being determined in an initialization phase on the basis of a definition of the connections to be established between the nodes. Depending on the communication technology implemented, the configuration of the routes may be static or dynamic.
By way of illustration, the Infiniband technology enables, in a cluster, a static configuration of the routes. This configuration uses static routing tables, known as Linear Forwarding Tables (LFTs), in each switch. When this technology is implemented, a routing algorithm such as the algorithms known under the names FTree, MINHOP, UPDN and LASH may be used.
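By way of illustration only, a static routing table of this kind may be pictured as a table, held in each switch, that maps a destination address to an output port. The following is a minimal sketch under that assumption; the class name, the method names and the use of None for a missing entry are hypothetical and are not taken from the Infiniband specification.

```python
class LinearForwardingTable:
    """Illustrative static routing table: destination address -> output port."""

    def __init__(self, size):
        # One slot per destination address; None means "no route configured"
        # (a choice made for this sketch only).
        self.ports = [None] * size

    def set_route(self, destination, port):
        # Record the output port to use for packets sent to this destination.
        self.ports[destination] = port

    def lookup(self, destination):
        # Return the configured output port, or None if no route exists.
        return self.ports[destination]


# Hypothetical switch with room for eight destination addresses.
table = LinearForwardingTable(size=8)
table.set_route(destination=3, port=1)
table.set_route(destination=5, port=2)
```

Because the table is filled once, in the initialization phase, a lookup at forwarding time is a simple indexed read.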
In simplified manner, the FTree algorithm determines routes such that they are distributed as evenly as possible over the existing communication links. For these purposes, at the time of the routing of a communication network fully connected in accordance with an FTree type architecture, each node of the network is considered as having the same importance. Thus, when a route is established over a link connecting two nodes, the number of routes using that link, called the load of the link, is increased by one. When the routing algorithm seeks to establish a new route and there are several possibilities, it compares the load levels associated with the links on which those possibilities are based and chooses the one whose links have the lowest load level.
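The selection principle described above may be sketched as follows. This is a simplified illustration and not the FTree algorithm itself: the function names and link identifiers are hypothetical, and it is assumed here that candidate routes are compared by the load of their busiest link, the first candidate being retained in the event of a tie.

```python
from collections import defaultdict


def choose_route(candidates, load):
    """Return the candidate route whose busiest link carries the fewest routes.

    Assumption of this sketch: a route is a list of link identifiers and is
    rated by the highest load among its links; ties go to the first candidate.
    """
    return min(candidates, key=lambda route: max(load[link] for link in route))


def establish(route, load):
    """Increase by one the load of every link used by the new route."""
    for link in route:
        load[link] += 1


load = defaultdict(int)  # every link starts with a load of zero
candidates = [["a-s1", "s1-b"], ["a-s2", "s2-b"]]  # two hypothetical paths

first = choose_route(candidates, load)   # loads are equal: first path kept
establish(first, load)
second = choose_route(candidates, load)  # first path now loaded: second kept
```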
The routing quality may be expressed in terms of the number of routes per link.
FIG. 3, comprising FIGS. 3a to 3e, illustrates this routing principle in a switch 300 at the time of an initialization phase of a cluster comprising that switch.
The switch 300 here has four input communication links, denoted 310-1 to 310-4, linking the switch 300 to inputs 305-1 to 305-4 and two output communication links, denoted 320-1 and 320-2, linking the switch 300 to outputs 315-1 and 315-2. Prior to initialization, none of the links 310-1 to 310-4, 320-1 and 320-2 comprises any route. The load levels associated with those links are thus zero as illustrated in FIG. 3a beside each link. Then, when a route is to be established between the input 305-1 and an output of the switch 300, the link 310-1 (the only one able to be used) is selected as well as the link 320-1 (as the load levels associated with the links 320-1 and 320-2 are, here, equal to zero, the first link is selected). The load levels associated with the links 310-1 and 320-1 are then incremented by one to indicate that those links are implementing an additional route, as illustrated in FIG. 3b. 
In the same way, when a route is to be established between the input 305-2 and an output of the switch 300, the link 310-2 (the only one able to be used) is selected as well as the link 320-2 (as the load level associated with the link 320-1 is equal to one and the load level associated with the link 320-2 is equal to zero, the latter link is selected). The load levels associated with the links 310-2 and 320-2 are then incremented by one to indicate that those links are implementing an additional route, as illustrated in FIG. 3c. In similar manner, when a route is to be established between the input 305-3 and an output of the switch 300, the link 310-3 (the only one able to be used) is selected as well as the link 320-1 (as the load levels associated with the links 320-1 and 320-2 are equal, the first link is selected). The load levels associated with the links 310-3 and 320-1 are then incremented by one to indicate that those links are implementing an additional route, as illustrated in FIG. 3d. 
Lastly, when a route is to be established between the input 305-4 and an output of the switch 300, the link 310-4 (the only one able to be used) is selected as well as the link 320-2 (as the load level associated with the link 320-1 is equal to two and the load level associated with the link 320-2 is equal to one, the latter link is selected). The load levels associated with the links 310-4 and 320-2 are then incremented by one to indicate that those links are implementing an additional route, as illustrated in FIG. 3e. When all the routes between the nodes have been established, the static routing tables for the switches are updated.
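The sequence illustrated in FIGS. 3a to 3e may be reproduced with the short sketch below. The function name is hypothetical; the link identifiers are the reference numerals used above, and the rule applied (lowest load, the first link being retained when loads are equal) is the one described for the switch 300.

```python
def route_through_switch(input_links, output_links):
    """Assign each input link to the least loaded output link, in order."""
    load = {link: 0 for link in input_links + output_links}  # FIG. 3a
    routes = []
    for in_link in input_links:
        # min() keeps the first link when several share the lowest load.
        out_link = min(output_links, key=lambda link: load[link])
        load[in_link] += 1
        load[out_link] += 1
        routes.append((in_link, out_link))
    return routes, load


inputs = ["310-1", "310-2", "310-3", "310-4"]
outputs = ["320-1", "320-2"]
routes, load = route_through_switch(inputs, outputs)

# Number of routes on the busiest link, one possible measure of routing quality.
max_load = max(load.values())
```

With these inputs, the routes alternate between the links 320-1 and 320-2, each input link ends with a load of one and each output link with a load of two, which corresponds to the state of FIG. 3e.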
However, although these routing algorithms give good results, they are not optimal.