Technical Field of the Invention
The present invention generally relates to a method of data delivery across a network and in particular to a method of minimising the effects of congestion in multi-path networks and to a multi-path network implementing the method. The method and multi-path network are suitable for use in, but not limited to, multi-processor networks such as storage networks, data centres and high performance computing. In particular, the present invention is suited for use in bridges, switches, routers, hubs and similar devices including Ethernet devices adapted for the distribution of standard IEEE 802 data frames or data frames meeting future Ethernet standards.
Protocol Layers
Conceptually, an Ethernet network is decomposed into a number of virtual layers in order to separate functionality. The most common and formally standardised model used is the Open Systems Interconnect (OSI) reference model. A useful article that describes in detail the OSI reference model is “OSI Reference Model - The ISO Model of Architecture for Open Systems Interconnection” by Hubert Zimmermann, IEEE Transactions on Communications, Vol. COM-28, No. 4, April 1980. The OSI reference model comprises seven layers of network system functionality, as follows:
1. Physical Layer is responsible for physical channel access. It consists of those elements involved in transmission and reception of signals, typically line drivers and receivers, signal encoders/decoders and clocks.
2. Data Link Layer provides services allowing direct communication between end-station devices over the underlying physical medium. This layer provides Framing, separating the device messages into discrete transmissions or frames for the physical layer, encapsulating the higher layer packet protocols. It provides Addressing to identify source and destination devices. It provides Error Detection to ensure that corrupted data is not propagated to higher layers.
3. Network Layer is responsible for network-wide communication, routing packets over the network between end-stations. It must accommodate multiple Data Link technologies and topologies using a variety of protocols, the most common being the Internet Protocol (IP).
4. Transport Layer is responsible for end-to-end communication, shielding the upper layers from issues caused during transmission, such as dropped data, errors and mis-ordering caused by the underlying medium. This layer provides the application with an error-free, sequenced, guaranteed delivery message service, managing the process to process data delivery between end stations. Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) are the most commonly recognised Transport Layer protocols.
5. Session Layer is responsible for establishing communications sessions between applications, dealing with authentication and access control.
6. Presentation Layer ensures that different data representations used by machines are resolved.
7. Application Layer provides generic functions that allow user applications to communicate over the network.
For the purposes of this document we need not consider operations above the Transport Layer as the method described herein should, if well implemented, shield higher layers from issues arising in and below its scope.
Network Interconnections
A device that implements network services at the Data Link Layer and above is called a station. The Physical Layer is excluded from this definition as it is not addressable by a protocol. There are two types of station:
1. End Stations are the ultimate source and destination of network data communication across the network.
2. Intermediate Stations forward network data generated by end stations between source and destination.
An intermediate station which forwards completely at the Data Link Layer is commonly called a Bridge; a station which forwards at the Network Layer is commonly called a Router.
Network stations attached to an Ethernet network exchange data in short sequences of bytes called packets or Protocol Data Units (PDU). PDUs consist of a header describing the PDU's destination and a body containing the payload data. In the OSI model the PDU has a distinct name at each protocol layer: at the Physical Layer the PDU is called a stream, at the Data Link Layer a frame, at the Network Layer a packet and at the Transport Layer a segment or message.
PDUs are encapsulated before being transmitted over the physical Ethernet hardware. Each encapsulation contains information for a particular OSI layer: the Ethernet stream encapsulates a frame, which in turn encapsulates a packet, which encapsulates a message, and so on. This encapsulation, containing headers and payload, is finally transmitted over the network fabric and routed to the destination.
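The nesting described above can be sketched as follows. This is a minimal illustration only: the class names, field names and address values are hypothetical and are not taken from any Ethernet or TCP/IP standard, and real headers carry many more fields.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Transport Layer PDU (hypothetical, simplified header)."""
    dst_port: int
    payload: bytes


@dataclass
class Packet:
    """Network Layer PDU wrapping a segment."""
    dst_ip: str
    payload: Segment


@dataclass
class Frame:
    """Data Link Layer PDU wrapping a packet."""
    dst_mac: str
    payload: Packet


def encapsulate(data: bytes, dst_port: int, dst_ip: str, dst_mac: str) -> Frame:
    """Wrap application data layer by layer, innermost PDU first."""
    segment = Segment(dst_port, data)
    packet = Packet(dst_ip, segment)
    return Frame(dst_mac, packet)


def decapsulate(frame: Frame) -> bytes:
    """Strip each header in turn to recover the application data."""
    return frame.payload.payload.payload


# Illustrative values only (documentation address ranges).
frame = encapsulate(b"hello", 80, "192.0.2.1", "02:00:00:00:00:01")
```

Each layer in this sketch sees only its own header and treats everything inside as opaque payload, which is the separation of functionality the layered model is intended to provide.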
At the Transport Layer, an associated standard, the Transmission Control Protocol (TCP), in addition to providing a simplified interface to applications by hiding the underlying PDU structure, is responsible for rearranging out-of-order PDUs and retransmitting lost data. TCP has been devised to be a reliable data stream delivery service; as such it is optimised for accurate data delivery rather than performance. TCP can often suffer from relatively long delays while waiting for out-of-order PDUs and, in extreme cases, for data retransmission. This reduces overall application performance and makes TCP unsuitable for use where a maximum PDU transmission delay (jitter) needs to be guaranteed, in file systems or media delivery, for example. Furthermore, at the lowest layer of the TCP/IP hierarchy, in the network access layer where PDUs are transmitted over the network, a fully compliant IEEE 802.1D standard MAC bridge joining separate networks together requires that order is preserved for source and destination pairs. PDU duplication is another cause of reduced performance in Ethernet networks. A unicast PDU whose destination route has not been learned by a network bridge will be flooded out to all routes from the bridge and will be buffered on multiple outbound ports at the same time. Network reconfiguration affecting the preferred route from a bridge to the destination can cause one copy of a PDU to be sent from a buffer after another copy has already been sent out of the previous preferred route, with both copies arriving at the destination. Again, the higher level TCP protocol will handle this but not without degrading overall performance.
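The flooding behaviour that gives rise to duplicate PDUs can be illustrated with a toy learning bridge. This is a simplified sketch, not an IEEE 802.1D implementation: the class name, method name and port numbers are hypothetical, and address ageing, buffering and spanning tree state are omitted.

```python
class LearningBridge:
    """Toy bridge that learns source addresses and floods unknown unicasts."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.table = {}  # learned address -> port on which it was last seen

    def forward(self, in_port, src, dst):
        """Return the list of ports on which the frame is sent out."""
        # Learn (or refresh) the port on which the source can be reached.
        self.table[src] = in_port
        if dst in self.table:
            # Destination already learned: a single outbound port is used.
            return [self.table[dst]]
        # Destination unknown: flood on every port except the arrival port.
        return [p for p in self.ports if p != in_port]


bridge = LearningBridge([1, 2, 3])
first = bridge.forward(1, "A", "B")   # B not yet learned, so the frame is flooded
reply = bridge.forward(2, "B", "A")   # A was learned on port 1
second = bridge.forward(1, "A", "B")  # B is now known on port 2
```

In this sketch the first frame is copied to two outbound ports at once. If the network offers more than one path, each copy may reach the destination, which is the duplication hazard described above.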
Disordering and duplication should not occur during normal operation. These features of multi-path Ethernet networks are constrained by the Rapid Spanning Tree Protocol (RSTP) as defined by the IEEE 802.1D standard. The RSTP maintains a preferred route between bridges by disabling alternative routes, removing multiple paths and loops, leaving a single path that ensures in-order frame delivery.
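The effect of disabling alternative routes can be sketched with a breadth-first spanning tree over a small bridged topology. This is an illustration of the outcome only and is not the RSTP algorithm, which elects a root bridge and selects ports using bridge identifiers and path costs exchanged between bridges. The bridge names and the function name are hypothetical.

```python
from collections import deque


def spanning_tree(links, root):
    """Return the set of links kept active by a breadth-first tree from root."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    visited = {root}
    active = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(adjacency[node]):
            if neighbour not in visited:
                visited.add(neighbour)
                # Keep the first link found to each bridge; any other link
                # to an already reached bridge would form a loop.
                active.add(frozenset((node, neighbour)))
                queue.append(neighbour)
    return active


# Four bridges; B1, B2 and B3 form a loop.
links = [("B1", "B2"), ("B2", "B3"), ("B3", "B1"), ("B3", "B4")]
active = spanning_tree(links, "B1")
disabled = {frozenset(link) for link in links} - active
```

With this topology the link between B2 and B3 is left disabled, so exactly one path remains between any pair of bridges and the bandwidth of the disabled link goes unused.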
An RSTP-enforced, single-path Ethernet network, often referred to as Static Routing, performs well under light network traffic load and for symmetric network traffic patterns. However, it starts to fail as the network traffic load increases and as the network-connected devices increase in number and performance. This is particularly the case where communications between sources and destinations are not well ordered spatially. Many PDUs being sent concurrently across the network for different destinations will have to use the same route within the network. For some network patterns this can be particularly unfortunate for the performance of the system as a whole due to the saturation of this single route and the congestion it ultimately suffers from.
Some of the internal links of the network will remain unused whilst others will be required to transport more than one connection, slowing down the PDUs considerably. Thus, static routes on random traffic patterns not only reduce the total network bandwidth, they can leave some egress ports starved of nearly all output bandwidth while other egress ports are relatively unaffected. If the network is supporting a single large application then this can have the effect of slowing the application performance down to the rate of the slowest egress port.
As an alternative to static routing, networks may also employ so-called Adaptive Routing, which allows more of the possible paths across the network to be used, theoretically improving the total bandwidth when all the ports into the network are busy taking data. However, adaptive routing still has problems; it does not necessarily find all the unused links. It is likely to find all the links in the early stages of the network, where choosing any link takes the PDUs closer to the destination. However, later stages, closer to the destination, are still likely to see collisions between different PDUs. This is because the PDUs usually do not have an alternative route as the data nears the destination and must choose a specific link in order to reach the destination. To make the problem worse, it is a requirement of most networks that all the PDUs must arrive in the original order they were sent. Adaptive routing opens up the possibility of one PDU overtaking another, misordering the data or causing duplication.
With the expansion of Ethernet networks, congestion has become a major issue, increasingly impacting networks and preventing many from ever reaching their designed performance goals. The network becomes clogged with data as an ever-increasing number of users, applications and storage devices exchange information. Congestion causes extreme degradation of data centre servers, resulting in under-utilisation of a company's expensive computing resources, often by as much as 50%. This condition will get much worse as networks get faster, with more connected devices distributed over larger geographical areas. The result will be even more wasted resource, time, money and opportunity.
Congestion can arise at any point in a multi-path network when data from more than one intermediate station converges on a single link for onward transmission. This style of communication is common in HPC and data center applications running on server clusters; it is also present when applications use network attached storage. In this latter context congestion also introduces another recognised issue, that of jitter, in which the message delivery period becomes unpredictable. Congestion is an application performance killer; in a simple network, delay and jitter prevent a system reaching peak performance levels. In complex networks, congestion can also necessitate the lengthier retransmission of data because intermediate stations between the endpoints simply discard or drop blocked traffic, reducing performance further. In practice, congestion spreads from the originating hot-spot until it backs up over the entire network, resulting in un-associated routes being affected by a point of congestion in another part of the network. This is illustrated in the simple network diagram of FIG. 1.
FIG. 1 illustrates schematically a simplified conventional network. The rectangles on the left and right represent ingress and egress ports 2 respectively. The circles represent network crossbars 1 and the lines represent the interconnecting links, over which PDUs will traverse the network. In this example each network crossbar 1 has only three input ports and three output ports 2. Typically network crossbars have many more ports than this and this mechanism works equally well with higher arity crossbars. An example of a conventional network crossbar 1 is illustrated in FIG. 2.
It can be seen in FIG. 1 that the total available bandwidth remains constant across the network, there being nine links available for transporting data at all stages across the network. As can be seen, the network of FIG. 1 has three stages of switching. Two stages of switching would be the minimum to complete a path from any ingress port to any egress port using the network crossbars 1. However, this would result in poor performance for some traffic patterns. For example, if ingress port A was transmitting to egress port R while port B was sending to port S and port C was sending to port T then all three traffic flows would have to share a single link.
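The two-stage contention in this example can be sketched numerically. The port numbering is an assumption made for illustration, since FIG. 1 is not reproduced here: ports are numbered 0 to 8 with three ports per crossbar, ingress ports A, B and C are taken to be ports 0, 1 and 2 on one first-stage crossbar, and egress ports R, S and T are taken to be ports 6, 7 and 8 on one second-stage crossbar.

```python
from collections import Counter

ARITY = 3  # ports per side of each crossbar, as in FIG. 1


def two_stage_link(ingress, egress):
    """Return the only inter-stage link a flow can use in a two-stage network.

    With a single link between each first-stage crossbar and each
    second-stage crossbar, the link is fixed by the two crossbars involved.
    """
    return (ingress // ARITY, egress // ARITY)


# Assumed placement: A->R, B->S, C->T as (ingress port, egress port).
flows = [(0, 6), (1, 7), (2, 8)]
load = Counter(two_stage_link(i, e) for i, e in flows)
```

Under these assumptions all three flows map to the same inter-stage link, so each flow can obtain at most one third of that link's bandwidth.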
When a third stage of switching is added, a set of routes becomes available that enables all the connections, for a random set of connections from the ingress ports to the egress ports, to operate at full bandwidth. The problem is working out the non-contending set of routes required to make this possible. In the illustrated example, a simple approach would be to choose a random route out of the first switching stage that was on a link not being used by another traffic flow. This form of adaptive routing usually improves the expected total throughput for a saturated network traffic pattern, but it is not controlled and could still easily result in some idle links and some over-committed links between the second and third stages of switching.
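The simple approach described above can be sketched as follows, assuming a three-stage network of arity-3 crossbars with ports numbered 0 to 8 and one link between each crossbar and every crossbar of the next stage. The function names, the port numbering and the example traffic pattern are illustrative assumptions rather than details taken from FIG. 1.

```python
import random
from collections import Counter

ARITY = 3  # ports per side of each crossbar


def adaptive_routes(flows, rng):
    """Pick, for each flow, a random unused link out of the first stage.

    Returns a mapping from (ingress, egress) to the chosen middle crossbar.
    Assumes each ingress port sources at most one flow.
    """
    used = set()  # (first-stage crossbar, middle crossbar) links already taken
    routes = {}
    for ingress, egress in flows:
        first = ingress // ARITY
        free = [m for m in range(ARITY) if (first, m) not in used]
        middle = rng.choice(free)
        used.add((first, middle))
        routes[(ingress, egress)] = middle
    return routes


def downstream_load(routes):
    """Count flows on each link between the second and third stages."""
    return Counter(
        (middle, egress // ARITY) for (_, egress), middle in routes.items()
    )


# One flow per ingress port, one per egress port (an assumed pattern).
flows = [(0, 0), (1, 3), (2, 6), (3, 1), (4, 4), (5, 7), (6, 2), (7, 5), (8, 8)]
routes = adaptive_routes(flows, random.Random(0))
load = downstream_load(routes)
```

The choice out of the first stage never collides by construction, but nothing coordinates the choices made by different first-stage crossbars. Inspecting `load` for different seeds can therefore show some links between the second and third stages carrying more than one flow while others carry none.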
Adaptive routing is effective if the traffic pattern is continually changing. Even if the initial adaptive guess is wrong, provided the network crossbars have a reasonable amount of buffering, the next adaptive choice of output is likely to be better. The continually changing output choice is going to provide new data to fill the under-utilised links and temporary output blocking can be held in the input buffers of the network crossbars until the network traffic pattern changes to allow it to proceed.
Adaptive routing is less good at coping with a random set of communications that remain busy but unchanging for a long period. Here the initial guess is critical to the final total bandwidth of the network. If the initial guess is wrong then all the data could be serialized along only 3 of the 9 possible links delivering only ⅓ of the total network bandwidth. If the network has higher arity crossbars the problem can be much worse. If the crossbars have an arity of 64 then it can be as bad as only delivering 1/64 of the total network bandwidth. It is very common for network traffic patterns to be random in their connections but remain constant in the flow of data. For example a TCP/IP stream established between a client and server providing a full duplex data connection can have a very high bandwidth requirement that might be sustained for a long duration. The data stream is split into many separate PDUs, and these are sent one after another from the same ingress to egress ports of the network. Another example is an RDMA stream. A large block of data, perhaps hundreds of megabytes, is sent from one ingress port to another egress port again split into many separate PDUs.
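The fractions quoted above follow from a simple ratio. The sketch below assumes, as the text does not state explicitly, that a network built from crossbars of arity $k$ in the manner of FIG. 1 has $k^2$ links at each stage, and that in the worst case only $k$ of those links carry data:

```latex
\[
\frac{B_{\mathrm{worst}}}{B_{\mathrm{total}}} = \frac{k}{k^{2}} = \frac{1}{k},
\qquad k = 3 \;\Rightarrow\; \tfrac{1}{3},
\qquad k = 64 \;\Rightarrow\; \tfrac{1}{64}.
\]
```

Here $B_{\mathrm{worst}}$ and $B_{\mathrm{total}}$ are illustrative symbols for the worst-case delivered bandwidth and the total network bandwidth respectively.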
When a network becomes congested, blocked traffic is simply thrown away by the switches in an attempt to reduce the immediate network load, in the hope that the congested point will eventually clear. The TCP/IP layer in the sending device will retransmit the data after a timeout. This is disastrous for system performance: at best it greatly increases latency and significantly reduces throughput. If the congestion does not clear quickly, an entire network can completely collapse and become incapable of transmitting any traffic.
Congestion will get much worse as networks continue to become larger, faster and denser, with more connected end stations distributed over larger geographical areas. Removing congestion or at least minimising the effects of congestion allows full, sustained use of data center services enabling companies to operate more efficiently and cost effectively.
With the move to 10 Gb Ethernet, devices will connect to the network at the same speed as the interconnecting fabric. This, in turn, will remove the extra network capacity that up until now has helped reduce congestion in previous network generations.
Many higher-level protocols have been devised to try to remove the effects of congestion. They all rely on trying to control the total output bandwidth of the sources sending data into the network with the intention of bringing the input bandwidth close to but not exceeding the congestion threshold. Intermediate network stations achieve this by data flow classification and upstream notification. The inspection of the data flow and subsequent messaging to rate limit the source all takes time, adding latency and complexity. All attempt to manage congestion rather than attempting to prevent it in the first place.
To date none of the congestion management techniques are particularly successful and all ultimately rely on preventing a network from ever achieving sustained peak levels of operation. Localised endpoint congestion may occur before the steady state conditions these techniques rely on have been established and some traffic patterns are inherently so unstable with rapidly changing conditions that the traffic management algorithms are never given a chance to stabilise.
The problem with all congestion management techniques is that congestion has to be occurring before remedial action can be taken. Management at this point can benefit if the network traffic is of a single type and the data rate is constant and predictable, however the benefit is often reduced in the more complex environment of the data center where services run more diverse applications with dynamically changing data flows. In high performance networks, congestion hot-spots appear rapidly and move around the network at an incredible rate. This increases the probability of over-constraining the wrong part of the network, as the point of congestion may have moved by the time notification and subsequent action have been applied.
Once congestion is identified by a management technique, data is restricted or rate-limited at the source, preventing saturation. This limits the overall system's capabilities, preventing a service from running at sustained peak performance for fear of causing congestion.
Description of Related Art
In US 2007/0064716 a method of controlling data unit handling is described in which congestion management measures may be selectively disabled.
However, this offers no benefits in terms of preventing congestion and may indeed add to congestion problems.
In US 2006/0203730 a method of reducing end station latency in response to network congestion is described. This document proposes that in response to a congestion indicator, the introduction of new frames to a queue is prevented i.e. frames are dropped. However, as mentioned earlier this has the disadvantage that where the dropped frames form part of a large group of frames being communicated across the network, in order to ensure the frames arrive at their end station in the correct order, duplicate copies of the frames must be issued.
The present invention seeks to overcome the problems encountered with conventional networks and in particular seeks to provide a method of minimising the effects of congestion in a multi-path network and of improving the bandwidth of the network.