1. Technical Field of the Invention
The present invention generally relates to a method of data delivery across a network and in particular to a method of minimising the effects of congestion in multi-path networks which use dynamic routing and to a multi-path network implementing the method. The method and multi-path network are suitable for use in, but not limited to, multi-processor networks such as storage networks, data centres and high performance computing. In particular, the present invention is suited for use in bridges, switches, routers, hubs and similar devices including Ethernet devices adapted for the distribution of standard IEEE 802 data frames or data frames meeting future Ethernet standards.
Protocol Layers
Conceptually, an Ethernet network is decomposed into a number of virtual layers in order to separate functionality. The most common and formally standardised model used is the Open Systems Interconnect (OSI) reference model. A useful article that describes in detail the OSI reference model is “OSI Reference Model—The ISO Model of Architecture for Open Systems Interconnection” by Hubert Zimmermann, IEEE Transactions on Communications, Vol. COM-28, No. 4, April 1980. The OSI reference model comprises seven layers of network system functionality, as follows:
1. Physical Layer is responsible for physical channel access. It consists of those elements involved in transmission and reception of signals, typically line drivers and receivers, signal encoders/decoders and clocks.
2. Data Link Layer provides services allowing direct communication between end-station devices over the underlying physical medium. This layer provides Framing, separating the device messages into discrete transmissions or frames for the physical layer, encapsulating the higher layer packet protocols. It provides Addressing to identify source and destination devices. It provides Error Detection to ensure that corrupted data is not propagated to higher layers.
3. Network Layer is responsible for network-wide communication, routing packets over the network between end-stations. It must accommodate multiple Data Link technologies and topologies using a variety of protocols, the most common being the Internet Protocol (IP).
4. Transport Layer is responsible for end-to-end communication, shielding the upper layers from issues caused during transmission, such as dropped data, errors and mis-ordering caused by the underlying medium. This layer provides the application with an error-free, sequenced, guaranteed delivery message service, managing the process to process data delivery between end stations. Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) are the most commonly recognised Transport Layer protocols.
5. Session Layer is responsible for establishing communications sessions between applications, dealing with authentication and access control.
6. Presentation Layer ensures that different data representations used by machines are resolved.
7. Application Layer provides generic functions that allow user applications to communicate over the network.
For the purposes of this document we need not consider operations above the Transport Layer as the method described herein should, if well implemented, shield higher layers from issues arising in and below its scope.
Network Interconnections
A device that implements network services at the Data Link Layer and above is called a station. The Physical Layer is excluded from this definition as it is not addressable by a protocol. There are two types of station:
1. End Stations are the ultimate source and destination of network data communication across the network.
2. Intermediate Stations forward network data generated by end stations between source and destination.
An intermediate station which forwards completely at the Data Link Layer is commonly called a Bridge; a station which forwards at the Network Layer is commonly called a Router.
Network stations attached to an Ethernet network exchange data in short sequences of bytes called packets or Protocol Data Units (PDU). PDUs consist of a header describing the PDU's destination and a body containing the payload data. In the OSI model the PDU has a distinct name at each protocol layer. A Physical Layer PDU is called a stream; at the Data Link Layer the PDU is a frame, at the Network Layer the PDU is a packet and at the Transport Layer the PDU is called a segment or message.
PDUs are encapsulated before being transmitted over the physical Ethernet hardware. Each encapsulation contains information for a particular OSI Layer: the Ethernet stream encapsulates a frame, which in turn encapsulates a packet, which encapsulates a message, and so on. This encapsulation, containing headers and payload, is finally transmitted over the network fabric and routed to the destination.
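The nesting described above can be sketched in a few lines of Python. This is only an illustration: the class names, field names and example addresses are assumptions chosen for readability and do not correspond to real Ethernet, IP or TCP header layouts.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Transport Layer PDU (hypothetical, heavily simplified)."""
    dst_port: int
    payload: bytes


@dataclass
class Packet:
    """Network Layer PDU wrapping a segment."""
    dst_ip: str
    segment: Segment


@dataclass
class Frame:
    """Data Link Layer PDU wrapping a packet."""
    dst_mac: str
    packet: Packet


def encapsulate(data: bytes, dst_port: int, dst_ip: str, dst_mac: str) -> Frame:
    # Each layer adds its own header around the PDU of the layer above.
    return Frame(dst_mac, Packet(dst_ip, Segment(dst_port, data)))


def decapsulate(frame: Frame) -> bytes:
    # The receiver strips the headers in the reverse order.
    return frame.packet.segment.payload


frame = encapsulate(b"hello", 80, "10.0.0.2", "aa:bb:cc:dd:ee:ff")
```

A bridge would inspect only `frame.dst_mac`, whereas a router would also look at `frame.packet.dst_ip`.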
At the Transport Layer, an associated standard, the Transmission Control Protocol (TCP), provides a simplified interface to applications by hiding the underlying PDU structure and is also responsible for rearranging out-of-order PDUs and retransmitting lost data. TCP has been devised to be a reliable data stream delivery service; as such it is optimised for accurate data delivery rather than performance. TCP can often suffer from relatively long delays while waiting for out-of-order PDUs and, in extreme cases, for data retransmission. These delays reduce overall application performance and make TCP unsuitable for use where a maximum PDU transmission delay (jitter) needs to be guaranteed, for example in file systems or media delivery.
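The cost of reordering at the receiver can be illustrated with a minimal reorder buffer. The sketch below is not TCP; it assumes PDUs carry a simple integer sequence number starting at 0, and the function name `deliver_in_order` is hypothetical.

```python
def deliver_in_order(arrivals):
    """Return payloads in sequence order from (seq, payload) arrivals.

    PDUs that arrive early are held back until every earlier sequence
    number has been seen, which is where the waiting delay comes from.
    """
    pending = {}      # early arrivals, keyed by sequence number
    next_seq = 0      # next sequence number the application expects
    delivered = []
    for seq, payload in arrivals:
        if seq < next_seq or seq in pending:
            continue  # already delivered or already buffered: discard
        pending[seq] = payload
        # Release every PDU that is now contiguous with what was delivered.
        while next_seq in pending:
            delivered.append(pending.pop(next_seq))
            next_seq += 1
    return delivered


in_order = deliver_in_order([(1, "b"), (0, "a"), (2, "c")])
stalled = deliver_in_order([(1, "b"), (2, "c")])
```

In the second call PDU 0 never arrives, so nothing is released to the application: the later PDUs sit in the buffer until the missing one is retransmitted.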
Furthermore, at the lowest layer of the TCP/IP hierarchy, in the network access layer where PDUs are transmitted over the network, a fully compliant IEEE 802.1D standard MAC bridge joining separate networks together requires that order is preserved for source and destination pairs.
PDU duplication is another cause of reduced performance in Ethernet networks. A unicast PDU whose destination route has not been learned by a network bridge will be flooded out to all routes from the bridge and will be buffered on multiple outbound ports at the same time. Network reconfiguration affecting the preferred route from a bridge to the destination can cause a duplicate PDU to be sent from a buffer after a duplicate PDU has already been sent out of the previous preferred route, both arriving at the destination. Again, the higher level TCP protocol will handle this but not without degrading overall performance.
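The flooding behaviour described above can be sketched as a simple learning bridge. The class and method names are hypothetical, and the model ignores address ageing, VLANs and buffering.

```python
class LearningBridge:
    """Minimal model of a bridge that learns source addresses per port."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.table = {}  # learned MAC address -> port

    def receive(self, in_port, src, dst):
        """Return the list of ports the frame is sent out of."""
        self.table[src] = in_port          # learn where the sender lives
        if dst in self.table:
            out = self.table[dst]
            # Known destination: forward on one port (or filter if local).
            return [] if out == in_port else [out]
        # Unknown destination: flood to every port except the ingress port.
        return [p for p in self.ports if p != in_port]


bridge = LearningBridge([1, 2, 3])
first = bridge.receive(1, "A", "B")    # B not yet learned
reply = bridge.receive(2, "B", "A")    # A was learned from the first frame
second = bridge.receive(1, "A", "B")   # B was learned from the reply
```

Only the first frame is flooded; once the reply has been seen, traffic in both directions uses a single port.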
Disordering and duplication should not occur during normal operation. These features of multi-path Ethernet networks are constrained by the Rapid Spanning Tree Protocol (RSTP) as defined by the IEEE 802.1D standard. The RSTP maintains a preferred route between bridges by disabling alternative routes, removing multiple paths and loops, leaving a single path that ensures in-order frame delivery.
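The effect of disabling alternative routes can be sketched as follows. This is not the RSTP algorithm itself, which elects a root bridge and compares path costs by exchanging BPDUs; the sketch assumes a known root and equal link costs and simply keeps a breadth-first tree. The function name and the example topology are hypothetical.

```python
from collections import deque


def spanning_tree(links, root):
    """Split links into (active, disabled) sets using a BFS tree from root."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    visited = {root}
    active = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nbr in sorted(adj.get(node, ())):
            if nbr not in visited:
                visited.add(nbr)
                active.add(frozenset((node, nbr)))  # link stays forwarding
                queue.append(nbr)

    # Every link not in the tree would form a loop, so it is disabled.
    disabled = {frozenset(link) for link in links} - active
    return active, disabled


links = [("A", "B"), ("B", "C"), ("A", "C")]
active, disabled = spanning_tree(links, "A")
```

With three bridges in a triangle, the B-C link is disabled, so traffic between B and C must pass through A even though a direct link exists.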
An RSTP-enforced, single-path Ethernet network performs well under light network traffic load; however, it starts to fail as the network traffic load increases and as the network-connected devices increase in number and performance. Many PDUs being sent concurrently across the network for different destinations will have to use the same route within the network. For some network patterns this can be particularly unfortunate for the performance of the system as a whole due to the saturation of this single route and the congestion it ultimately suffers from.
With the expansion of Ethernet networks, congestion has become a major issue, increasingly impacting networks and preventing many from ever reaching their designed performance goals. The network becomes clogged with data as an ever-increasing number of users, applications and storage devices exchange information. Congestion causes extreme degradation of data centre servers, resulting in under-utilisation of a company's expensive computing resources, often by as much as 50%. This condition will get much worse as networks get faster, with more connected devices distributed over larger geographical areas. The result will be even more wasted resource, time, money and opportunity.
Endpoint congestion can be caused when many end-stations communicate with a single end-station. This many-to-one style of communication is common in HPC and data centre applications running on server clusters; it is also present when applications use network attached storage. In this latter context congestion also introduces another recognised issue, that of jitter, in which the message delivery period becomes unpredictable. Congestion is an application performance killer; in a simple network, delay and jitter prevent a system reaching peak performance levels. In complex networks, congestion can also necessitate the lengthier retransmission of data because intermediate stations between the endpoints simply discard or drop blocked traffic, reducing performance further. In practice, congestion spreads from the originating hot-spot until it backs up over the entire network, resulting in unassociated routes being affected by a point of congestion in another part of the network. This is illustrated in the simple network diagram of FIG. 1.
Initially the route from A1 to B1 becomes blocked due to the server attached to B1 becoming blocked. Switch B is then blocked by subsequent data to or from ports attached to it, which cannot be delivered until the route to B1 is clear.
Very soon after Switch B congests, other connected switches become blocked as they are unable to progress their traffic through Switch B. Switch A congests and now all workstations cannot use the network effectively, even to share traffic with the storage array devices attached to Switch C. Only when B1 clears can traffic flow again, unblocking Switches B and A. The larger the network and the more intermediate stations present, the greater the likelihood of congestion occurring and the more widespread and lasting the effect.
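The way a single blocked port holds up unrelated traffic can be sketched with a first-in, first-out queue. The FIFO model is an assumption made for illustration, as are the function name and the port names B2 and C1; FIG. 1 may show a different arrangement.

```python
from collections import deque


def drain(queue, blocked_ports):
    """Forward frames from a FIFO queue until the head frame is blocked.

    Each queue entry is the destination port of one frame. Frames behind
    a blocked head frame cannot be sent, even if their own destination
    port is free.
    """
    sent = []
    while queue and queue[0] not in blocked_ports:
        sent.append(queue.popleft())
    return sent


queue = deque(["C1", "B1", "C1", "B2"])
sent_while_blocked = drain(queue, {"B1"})   # B1 is the congested port
stuck = list(queue)
sent_after_clear = drain(queue, set())      # B1 has cleared
```

While B1 is blocked only the first frame gets through; the frames for C1 and B2 wait behind the frame for B1 and are released only once B1 clears.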
When a network becomes congested, blocked traffic is simply thrown away by the switches in an attempt to reduce the immediate network load, hoping that the congested point will eventually clear. The TCP/IP layer in the sending device will retransmit the data after a timeout. This is disastrous for system performance: at best it greatly increases latency and significantly reduces throughput. If the congestion does not clear quickly, an entire network can completely collapse and become incapable of transmitting any traffic.
Congestion will get much worse as networks continue to become larger, faster and denser, with more connected end stations distributed over larger geographical areas. Removing congestion, or at least minimising the effects of congestion, allows full, sustained use of data centre services, enabling companies to operate more efficiently and cost effectively.
With the move to 10 Gb Ethernet, devices will connect to the network at the same speed as the interconnecting fabric. This, in turn, will remove the extra network capacity that up until now has helped reduce congestion in previous network generations.
Many higher-level protocols have been devised to try to remove endpoint congestion. They all rely on trying to control the total output bandwidth of the sources sending data into the network with the intention of bringing the input bandwidth close to but not exceeding the congestion threshold. Intermediate network stations achieve this by data flow classification and upstream notification. The inspection of the data flow and subsequent messaging to rate limit the source all takes time, adding latency and complexity. All attempt to manage congestion rather than attempting to prevent it in the first place.
To date none of the congestion management techniques are particularly successful and all ultimately rely on preventing a network from ever achieving sustained peak levels of operation. Localised endpoint congestion may occur before the steady state conditions these techniques rely on have been established and some traffic patterns are inherently so unstable with rapidly changing conditions that the traffic management algorithms are never given a chance to stabilise.
The problem with all congestion management techniques is that congestion has to be occurring before remedial action can be taken. Management at this point can benefit if the network traffic is of a single type and the data rate is constant and predictable; however, the benefit is often reduced in the more complex environment of the data centre where services run more diverse applications with dynamically changing data flows. In high performance networks, congestion hot-spots appear rapidly and move around the network at an incredible rate. This increases the probability of over-constraining the wrong part of the network, as the point of congestion may have moved by the time notification and subsequent action have been applied.
Once congestion is identified by a management technique, data is restricted or rate-limited at the source, preventing saturation. This limits the overall system's capabilities, preventing a service from running at sustained peak performance for fear of causing congestion.
2. Description of Related Art
In US 2007/0064716 a method of controlling data unit handling is described in which congestion management measures may be selectively disabled. However, this offers no benefits in terms of preventing congestion and may indeed add to congestion problems.
In US 2006/0203730 a method of reducing end station latency in response to network congestion is described. This document proposes that, in response to a congestion indicator, the introduction of new frames to a queue is prevented, i.e. frames are dropped. However, as mentioned earlier, this has the disadvantage that where the dropped frames form part of a large group of frames being communicated across the network, duplicate copies of the frames must be issued in order to ensure the frames arrive at their end station in the correct order.
The present invention seeks to overcome the problems encountered with conventional networks and in particular seeks to provide a method of minimising the effects of congestion in a multi-path network and of improving the bandwidth of the network.