The term Data Centers (DC) generally refers to facilities used to house large computer systems (often contained on racks that house the equipment) and their associated components, all connected by an enormous amount of structured cabling. Cloud Data Centers (CDC) is a term used to refer to large, generally off-premise facilities that similarly store an entity's data.
Network switches are computer networking apparatus that link network devices for communication/processing purposes. In other words, a switch is a telecommunication device that is capable of receiving a message from any device connected to it, and transmitting the message to a specific device for which the message is to be relayed. A network switch is also commonly referred to as a multi-port network bridge that processes and routes data. Here, by port, we are referring to an interface (outlet for a cable or plug) between the switch and the computer/server/CPU to which it is attached.
Today, DCs and CDCs generally implement data center networking using a set of layer two switches. Layer two switches process and route data at layer 2, the data link layer, which is the protocol layer that transfers data between nodes (e.g. servers) on the same local area network or adjacent nodes in a wide area network. A key problem to solve, however, is how to build a large capacity computer network that is able to carry a very large aggregate bandwidth (hundreds of TB), that contains a very large number of ports (thousands), that requires minimal structure and space (i.e. minimizing the need for a large room to house numerous cabinets with racks of cards), that is easily scalable, and that may assist in minimizing power consumption.
The traditional network topology implementation is based on totally independent switches organized in a hierarchical tree structure as shown in FIG. 1. Core switches 2 are very high speed, low port count switches with a very large switching capacity. The second layer is implemented using aggregation switches 4, medium capacity switches with a larger number of ports, while the third layer is implemented using lower speed, large port count (e.g. forty/forty-eight), low capacity edge switches 6. Typically the edge switches are layer 2, whereas the aggregation switches are layer 2 and/or 3, and the core switch is typically layer 3. This implementation provides any server 8 to any server connectivity in a maximum of six hop links in the example provided (three hops up to the core switch 2 and three down to the destination server 8). Such a hierarchical structure is also usually duplicated for redundancy and reliability purposes. For example, with reference to FIG. 1, without duplication, if the right-most edge switch 6 fails, then there is no connectivity to the right-most servers 8. At the least, core switch 2 is duplicated, since the failure of the core switch 2 would generate a total data center connectivity failure. This method has significant limitations in addressing the challenges of the future CDC. For instance, the fact that each switch is completely self-contained adds complexity, significant floor-space utilization, complex cabling, manual switch configuration/provisioning that is prone to human error, and increased energy costs.
Many attempts have been made, however, to improve switching scalability, reliability, capacity and latency in data centers. For instance, efforts have been made to implement more complex switching solutions by using a unified control plane (e.g. the QFabric System switch from Juniper Networks; see, for instance, http://www.juniper.net/us/en/productservices/switching/qfabric-system/), but such a system still uses and maintains the traditional hierarchical architecture. In addition, given the exponential increase in the number of system users and in the data to be stored, accessed and processed, processing power has become the most important factor when determining the performance requirements of a computer network system. While server performance has continually improved, a single server is not powerful enough to meet these needs. This is why the use of parallel processing has become of paramount importance. As a result, traffic flows that were predominantly north-south have now become primarily east-west, in many cases up to 80%. Despite this change in traffic flows, network architectures have not evolved to be optimal for this model. It is therefore still the topology of the communication network (which interconnects the computing nodes (servers)) that determines the speed of interactions between CPUs during parallel processing communication.
This need for increased east-west traffic communications led to the creation of newer, flatter network architectures, e.g. toroidal/torus networks. A torus interconnect system is a network topology for connecting network nodes (servers) in a mesh-like manner in parallel computer systems. A torus topology can have nodes arranged in 2, 3, or more (N) dimensions that can be visualized as an array wherein processors/servers are connected to their nearest neighbor processors/servers, and wherein processors/servers on opposite edges of the array are connected. In this way, each node has 2N connections in an N-dimensional torus configuration (FIG. 2 provides an example of a 3-D torus interconnect). Because each node in a torus topology is connected to adjacent ones via short cabling, there is low network latency during parallel processing. Indeed, a torus topology provides access to any node (server) with a minimum number of hops. For example, a four-dimensional torus implementing a 3×3×3×4 structure (108 nodes) requires on average 2.5 hops in order to provide any to any connectivity. FIG. 4 provides an example of a 6×6 2-D torus, showing the minimum number of hops required to go from corner node 1.6 to all other 35 nodes. As shown, the number of hops required to reach any destination from node 1.6 can be plotted as a bell curve with the peak at 3 hops (10 nodes) and tails generally at 5 hops (4 nodes) and 1 hop (4 nodes), respectively. Unfortunately, large torus network implementations have not been practical for commercial deployment in DCs or CDCs because large implementations can take months to build, cabling can be complex (2N connections for each node), and they can be costly and cumbersome to modify if expansion is necessary. However, where the need for processing power has outweighed the commercial drawbacks, the implementation of torus topologies in supercomputers has been very successful.
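The neighbor and hop-count properties described above can be illustrated with a minimal sketch. It assumes nodes are addressed by zero-based coordinate tuples (an illustrative convention, not the node labelling used in the figures), and the function names are hypothetical. The count of 2N neighbors holds when every dimension contains at least three nodes.

```python
from collections import Counter
from itertools import product


def torus_neighbors(node, dims):
    """Return the nearest neighbors of `node` in a torus with side lengths `dims`."""
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coords = list(node)
            coords[axis] = (coords[axis] + step) % size  # wrap around at the edges
            neighbors.add(tuple(coords))
    neighbors.discard(node)  # only matters for dimensions of size 1
    return neighbors


def min_hops(src, dst, dims):
    """Minimum hop count: the shorter way around each ring, summed over dimensions."""
    return sum(min((d - s) % k, (s - d) % k) for s, d, k in zip(src, dst, dims))


def hop_histogram(src, dims):
    """Count how many other nodes lie at each minimum hop distance from `src`."""
    all_nodes = product(*(range(k) for k in dims))
    hist = Counter(min_hops(src, dst, dims) for dst in all_nodes)
    del hist[0]  # drop the source node itself
    return dict(sorted(hist.items()))


if __name__ == "__main__":
    print(len(torus_neighbors((0, 5), (6, 6))))  # 2N = 4 neighbors in a 2-D torus
    print(hop_histogram((0, 5), (6, 6)))
```

For a 6×6 torus the histogram has 10 nodes at 3 hops and 4 nodes at each of 1 hop and 5 hops, which agrees with the distribution described for FIG. 4.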
In this respect, IBM's Blue Gene supercomputer provides an example of a 3-D torus interconnect network wherein 64 cabinets house 65,536 nodes (131,072 CPUs) to provide petaflops processing power (see FIG. 3 for an illustration), while Fujitsu's PRIMEHPC FX10 supercomputer system is an example of a 6-D torus interconnect housed in 1,024 racks comprising 98,304 nodes. While the above examples dealt with a torus topology, they are equally applicable to other flat network topologies.
The present invention deals more specifically with the important issue of data packet traversal and routing from node to node in torus or higher radix network structures. In this respect, it is the routing that determines the actual path that packets of data take to go from source to destination in the network. For the purposes herein, latency refers to the time it takes for a packet to reach the destination in the network, and is generally measured from when the head of the packet arrives at the input of the source node to when it arrives at the input of the destination node. Hop count refers to the number of links or nodes traversed between the source and the destination, and represents an approximation for determining latency. Throughput is the data rate that the network accepts per input port/node, measured in bits/sec.
A useful goal when routing is to distribute the traffic evenly among the nodes (load balancing) so as to avoid the development of hotspots (a pathway or node region where usage/demand has exceeded a desired or acceptable threshold) and to minimize contention (when two or more nodes attempt to transmit a message or packet across the same wire or path at the same time), thereby improving network latency and throughput. The route chosen therefore affects the number of hops from node to node, and may thereby also affect energy consumption when the route is not optimized.
The topology under which a network operates also clearly affects latency because topology impacts the average minimum hop count and the distance between nodes. For instance, in a torus, not only are there several paths that a packet can take to reach a destination (i.e. in a torus there is “path diversity”), but there are also multiple minimum length paths between any source and destination pair. As an example, FIG. 5 shows examples of three minimal routes that a packet can take to go from node S 11 to node D 12 (3 hops) in a 2-D torus mesh, while a longer fourth route of 5 hops is also shown. Path computation through routing is done statically, based on topology only, whereas source routing is done dynamically, based on packet source-destination pairs on a hop-by-hop basis.
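Path diversity can be quantified with a short sketch that counts the minimum length paths between two nodes. It assumes zero-based coordinate tuples and a hypothetical function name. The count is the number of ways to interleave the per-dimension hops, doubled for each dimension in which the destination lies exactly half-way around the ring (either direction is then equally short).

```python
from math import factorial


def count_minimal_routes(src, dst, dims):
    """Count the distinct minimum length paths from `src` to `dst` in a torus."""
    offsets = [min((d - s) % k, (s - d) % k) for s, d, k in zip(src, dst, dims)]

    # Multinomial coefficient: interleavings of the hops taken in each dimension.
    routes = factorial(sum(offsets))
    for off in offsets:
        routes //= factorial(off)

    # A destination exactly half-way around a ring can be reached in either direction.
    for off, k in zip(offsets, dims):
        if k > 2 and 2 * off == k:
            routes *= 2
    return routes


if __name__ == "__main__":
    # Two hops in one dimension and one hop in the other: 3 hops in total.
    print(count_minimal_routes((0, 0), (2, 1), (6, 6)))
```

If the source and destination of FIG. 5 are separated by two hops in one dimension and one hop in the other (an assumption about the figure), this count is consistent with the three minimal 3-hop routes described above.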
Routing methodologies that exploit path diversity have better fault tolerance and better load balancing in the network. Routing methodologies do not always achieve such goals, however, and can generally be divided into three classes: deterministic, oblivious and adaptive. Deterministic routing refers to the fact that the route(s) between a given pair of nodes is determined in advance without regard to the current state of the network (i.e. without regard to network traffic). Dimension Order Routing (DOR) is an example of deterministic routing, wherein all messages from node A to node B will always traverse the same path. Specifically, a message traverses dimension-by-dimension (X-Y routing), thereby reaching the coordinate matching its destination in one dimension before switching to the next dimension. As an example, FIG. 6 can be used to show DOR, wherein a packet first travels along a first dimension (X) as far as required, from node 1 to 5 to 9, and then travels along the second dimension (Y) to destination node 10. Although such routing is generally easy to implement and deadlock free (deadlock refers to a situation where an endless cycle exists along the pathway from source to destination), it does not exploit path diversity and therefore balances load poorly.
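Dimension order routing as described above can be sketched as follows. The sketch assumes zero-based coordinate tuples rather than the node numbers of FIG. 6, and the function name is hypothetical. Each dimension is fully resolved, taking the shorter way around the ring, before the next dimension is considered.

```python
def dor_route(src, dst, dims):
    """Return the list of nodes visited under dimension order routing."""
    path = [tuple(src)]
    current = list(src)
    for axis, size in enumerate(dims):
        forward = (dst[axis] - current[axis]) % size
        backward = (current[axis] - dst[axis]) % size
        step = 1 if forward <= backward else -1  # shorter way around the ring
        while current[axis] != dst[axis]:
            current[axis] = (current[axis] + step) % size
            path.append(tuple(current))
    return path


if __name__ == "__main__":
    # Two hops along the first dimension, then one hop along the second.
    print(dor_route((0, 0), (2, 1), (4, 4)))
```

Because the function has no random or traffic-dependent input, every call with the same source and destination yields the same path, which is the defining property of deterministic routing.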
Routing algorithms that are “oblivious” are those wherein routing decisions are made randomly without regard to the current state of the network (deterministic routing is a subset of oblivious routing). Although this means oblivious routing can be simple to implement, it is also unable to adapt to traffic and network circumstances. An example of a well-known oblivious routing method is the Valiant algorithm (known to persons skilled in the art). In this method, a packet sent from node A to node B is first sent from A to a randomly chosen intermediate node X (one hop away), and then from X to B. With reference again to FIG. 6, a Valiant algorithm could randomly choose node 2 as an intermediate node from source node 1 to destination node 10, meaning the path 1-2-3-7-11-10, for instance, could be used for routing purposes. This generally randomizes any traffic pattern, and because all patterns appear to be uniformly random, the network load is fairly balanced. In fact, the Valiant algorithm has been thought to be able to generally balance load for any traffic pattern on almost any topology. One problem, however, is that Valiant routes are generally non-minimal (minimal routes are paths that require the smallest number of hops between source and destination), which often results in a significant hop count increase, and which further increases network latency and may potentially increase energy consumption. The effect depends on the state of the network, however. In a network where congestion is minimal, non-minimal routing can significantly increase latency and perhaps power consumption as additional nodes are traversed; on the other hand, in a network experiencing congestion, non-minimal routing may actually assist in avoiding congested nodes or hotspots, and thereby actually result in lower latency.
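A minimal sketch of Valiant-style routing follows. It makes two assumptions that are not taken from the text: the intermediate node is drawn uniformly from all nodes of the torus, as in the classic formulation of the algorithm, and each of the two legs is routed with dimension order routing. Coordinates are zero-based tuples and all names are hypothetical.

```python
import random


def dor_route(src, dst, dims):
    """Dimension order route from `src` to `dst`, as a list of visited nodes."""
    path, current = [tuple(src)], list(src)
    for axis, size in enumerate(dims):
        forward = (dst[axis] - current[axis]) % size
        backward = (current[axis] - dst[axis]) % size
        step = 1 if forward <= backward else -1
        while current[axis] != dst[axis]:
            current[axis] = (current[axis] + step) % size
            path.append(tuple(current))
    return path


def valiant_route(src, dst, dims, rng=random):
    """Route via a randomly chosen intermediate node, ignoring network state."""
    intermediate = tuple(rng.randrange(k) for k in dims)
    first_leg = dor_route(src, intermediate, dims)
    second_leg = dor_route(intermediate, dst, dims)
    # Drop the duplicated intermediate node where the two legs join.
    return first_leg + second_leg[1:], intermediate


if __name__ == "__main__":
    path, via = valiant_route((0, 0), (2, 1), (4, 4))
    print("via", via, "hops", len(path) - 1, path)
```

Running the example several times shows hop counts at or above the 3-hop minimum for this source and destination pair, which illustrates why Valiant routes are generally non-minimal.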
If minimal routes are desired or necessary though, the Valiant algorithm can be modified to restrict its random decisions to minimal routes/shortest paths by specifying that the intermediate node must lie within a minimal quadrant. As an example, with reference to FIG. 6, the minimal quadrant for node 1 (having a destination of node 10) would encompass nodes 2, 5, and 0, and would lead to a usual hop count of 3. The Valiant algorithm provides some level of path selection randomization that will reduce the probability of hot spot development. There is a further conundrum, however, with achieving such efficiencies: Valiant routing is only deadlock free when used in conjunction with DOR, which itself fails at load balancing and hot spot avoidance.
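The minimal-quadrant restriction can be sketched as below. The sketch assumes that the minimal quadrant is the box of nodes spanned between source and destination along the shorter direction of each dimension, and that both legs use dimension order routing. Coordinates are zero-based tuples and all names are hypothetical.

```python
import random


def _direction(s, d, k):
    """Return (step, distance) for the shorter way from s to d on a ring of k nodes."""
    forward, backward = (d - s) % k, (s - d) % k
    return (1, forward) if forward <= backward else (-1, backward)


def _dor(src, dst, dims):
    """Dimension order route from `src` to `dst`, as a list of visited nodes."""
    path, cur = [tuple(src)], list(src)
    for axis, k in enumerate(dims):
        step, dist = _direction(cur[axis], dst[axis], k)
        for _ in range(dist):
            cur[axis] = (cur[axis] + step) % k
            path.append(tuple(cur))
    return path


def minimal_valiant_route(src, dst, dims, rng=random):
    """Valiant-style route whose intermediate node lies in the minimal quadrant."""
    intermediate = []
    for s, d, k in zip(src, dst, dims):
        step, dist = _direction(s, d, k)
        # Move between 0 and `dist` hops toward the destination in this dimension.
        intermediate.append((s + step * rng.randint(0, dist)) % k)
    intermediate = tuple(intermediate)
    # The first leg ends where the second begins, so drop the duplicate node.
    route = _dor(src, intermediate, dims)[:-1] + _dor(intermediate, dst, dims)
    return route, intermediate


if __name__ == "__main__":
    path, via = minimal_valiant_route((0, 0), (2, 1), (6, 6))
    print("via", via, "hops", len(path) - 1, path)
```

Because the intermediate node never moves away from the destination in any dimension, every route produced has the minimum hop count (3 in the example) while the choice among minimal routes remains random.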
Adaptive routing algorithms are those wherein routing decisions are based on the state of the network or network traffic. They generally involve flow control mechanisms, and in this respect buffer occupancies are often used. Adaptive routing can employ global node information (which is costly performance-wise), or can use information from just local nodes, including, for instance, queue occupancy to gauge network congestion. The problem with using information solely from local nodes is that this can sometimes lead to suboptimal choices. Adaptive routing can also be restricted to minimal paths, or it can be fully adaptive (i.e. no restrictions on taking the shortest path) by employing non-minimal paths, with the potential for livelock (i.e. a situation similar to deadlock where the packet is not progressing toward its destination; often the result of resource starvation). This can sometimes be overcome by allowing a certain number of misroutes per packet, and by giving higher priority to packets misrouted many times. Another problem with adaptive routing is that it may cause problems with preserving data packet ordering: the packets need to arrive at the destination in the same order, or else packet reordering mechanisms need to be implemented.
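A minimal sketch of adaptive routing restricted to minimal paths, using only local information, is given below. It assumes each node can read the queue occupancy of its immediate neighbors; the `queue_occupancy` mapping, the coordinate scheme and the function names are hypothetical.

```python
def productive_hops(cur, dst, dims):
    """Neighbors of `cur` that are one hop closer to `dst` (minimal directions only)."""
    options = []
    for axis, k in enumerate(dims):
        forward = (dst[axis] - cur[axis]) % k
        backward = (cur[axis] - dst[axis]) % k
        if forward == 0:
            continue  # already aligned with the destination in this dimension
        steps = []
        if forward <= backward:
            steps.append(1)
        if backward <= forward:
            steps.append(-1)
        for step in steps:
            nxt = list(cur)
            nxt[axis] = (nxt[axis] + step) % k
            options.append(tuple(nxt))
    return options


def adaptive_next_hop(cur, dst, dims, queue_occupancy):
    """Choose the productive neighbor with the lowest locally observed queue occupancy."""
    options = productive_hops(cur, dst, dims)
    if not options:
        return None  # the packet has reached its destination
    return min(options, key=lambda node: queue_occupancy.get(node, 0))


if __name__ == "__main__":
    occupancy = {(1, 0): 5, (0, 1): 1}  # hypothetical neighbor queue depths
    print(adaptive_next_hop((0, 0), (2, 1), (6, 6), occupancy))
```

In the example the packet is steered to the less occupied neighbor even though both choices lie on minimal paths. Since the decision uses only neighbor state, it can still be suboptimal with respect to congestion further along the route, as noted above.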
Lastly, it is important to mention that routing can be implemented by source tables or local tables. With source tables, the entire route is specified at the source, and can be embedded into the packet header. Latency is thereby minimized since the route does not need to be looked up or routed hop by hop at each node. Source tables can also be made to specify multiple routes per destination to be able to manage faults, and, when routes are selected randomly (i.e. oblivious routing), to increase load balancing. With local node tables, on the other hand, smaller routing tables are employed. However, the next step a packet is to take is determined at each node, and this adds to per-hop latency.
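The difference between the two table types can be sketched as follows. The node names, table contents and packet layout are all hypothetical and chosen only for illustration. With a source table the full route travels in the packet header; with local tables each node looks up only the next hop.

```python
import random

# Source table: several complete routes per (source, destination) pair.
SOURCE_TABLE = {("A", "D"): [["B", "C", "D"], ["E", "F", "D"]]}

# Local tables: each node stores only the next hop toward each destination.
LOCAL_TABLES = {"A": {"D": "B"}, "B": {"D": "C"}, "C": {"D": "D"}}


def make_packet(src, dst, payload, source_table, rng=random):
    """Pick one of the stored routes at random and embed it in the packet header."""
    route = rng.choice(source_table[(src, dst)])
    return {"dst": dst, "route": list(route), "payload": payload}


def forward_source_routed(packet):
    """Read the next hop from the header; no table lookup is needed at the node."""
    return packet["route"].pop(0) if packet["route"] else None


def forward_local(node, packet, local_tables):
    """Look up the next hop in the table held by the current node."""
    return local_tables[node][packet["dst"]]


if __name__ == "__main__":
    packet = make_packet("A", "D", "data", SOURCE_TABLE)
    print("source routed next hop:", forward_source_routed(packet))
    print("locally routed next hop:", forward_local("A", packet, LOCAL_TABLES))
```

Storing more than one route per destination in the source table is what allows a faulty route to be avoided, or the choice of route to be randomized for load balancing, at the cost of larger tables at the source.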
The present invention seeks to overcome deficiencies in the prior art and improve upon known methods of routing packets in torus or higher radix network topologies.