Field of the Invention
The present invention relates to high performance computing (HPC), and more particularly to routing communication loads in an HPC network.
Description of the Related Art
HPC relates to the use of large numbers of processors to complete a large computational task in a short period of time. HPC generally implies one of two architectural approaches. The first approach, modeled on distributed computing, utilizes a large number of discrete computers distributed across a network, each devoting some or all of its available processing time to solving a common problem. In this approach, each individual computer receives and completes many small tasks and reports the results to a central server, which integrates the task results from all of the clients into the overall solution.
A different approach utilizes a large number of dedicated processors placed in close proximity to one another, such as in a computing cluster. Placing the processors in a cluster considerably reduces the time spent moving data and makes it possible for the processors to work together rather than on separate tasks as in the former approach. In the computing cluster, the different processors communicate data with one another not over mere switched network links, but over more freeform mesh or hypercube fabric communicative substrates.
The communication fabric utilized in HPC is highly scalable and operates at low latency with very high transfer rates. The HPC communicative fabric differs from a traditional networking topology, such as Ethernet, in that the HPC communicative fabric generally does not offer intelligence on the switch side to determine the routing of data. Rather, the HPC communicative fabric provides many possible routes between different processors. As such, each node in the HPC network, via a fabric or subnet manager, discovers and is aware of the multiple routes to each other node in the fabric.
Nevertheless, congestion can arise on some paths in the HPC communicative fabric while other paths remain underutilized, because all nodes use the same route determination algorithm, which selects the shortest path to the end destination. Consequently, when the HPC fabric is not able to handle a communication load, packets are delayed and must wait for resources to be released, causing a temporary but extensive performance deficiency condition in the HPC fabric. Current attempts to address the congestion that arises in the HPC fabric involve variations of adaptive routing utilizing an intelligent switch, or payload distribution utilizing multiple different paths. Even so, congestion can still occur under either scheme.
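By way of illustration only, the following minimal sketch shows how identical shortest-path route determination at every node can concentrate traffic on one path while an equally short path sits idle. The toy topology, the node names (a1, L1, S1 and so on), the tie-breaking rule, and the helper functions are all hypothetical and are not taken from any particular fabric or subnet manager.

```python
from collections import Counter, deque

# Hypothetical toy fabric: three source nodes behind switch L1, three
# destination nodes behind switch L2, and two equal-length routes between
# L1 and L2 (through S1 or through S2).
FABRIC = {
    "a1": ["L1"], "a2": ["L1"], "a3": ["L1"],
    "b1": ["L2"], "b2": ["L2"], "b3": ["L2"],
    "L1": ["a1", "a2", "a3", "S1", "S2"],
    "L2": ["b1", "b2", "b3", "S1", "S2"],
    "S1": ["L1", "L2"],
    "S2": ["L1", "L2"],
}


def shortest_path(fabric, src, dst):
    """Breadth-first shortest path; ties are broken by sorted neighbor name.

    The tie-break is an assumption standing in for "every node runs the
    same deterministic route determination algorithm".
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in sorted(fabric[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if dst not in parent:
        raise ValueError(f"no route from {src} to {dst}")
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def link_loads(fabric, flows):
    """Count how many flows cross each undirected link."""
    loads = Counter()
    for src, dst in flows:
        path = shortest_path(fabric, src, dst)
        for u, v in zip(path, path[1:]):
            loads[frozenset((u, v))] += 1
    return loads


FLOWS = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
LOADS = link_loads(FABRIC, FLOWS)

if __name__ == "__main__":
    for link, count in sorted(LOADS.items(), key=lambda kv: sorted(kv[0])):
        print(sorted(link), count)
```

In this sketch every flow selects the route through S1, so the links adjoining S1 carry all of the traffic and the links adjoining S2 carry none, even though both routes are the same length.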