Technical Field of the Invention
The present invention generally relates to a multi-path network which is adapted to manage faults arising in the network and to a method of data delivery across such a network. The multi-path network and method are suitable for use in, but not limited to, multi-processor networks such as storage networks, data centres and high performance computing. In particular, the present invention is suited for use in bridges, switches, routers, hubs and similar devices including Ethernet devices adapted for the distribution of standard IEEE 802 data frames or data frames meeting future Ethernet standards.
Protocol Layers
Conceptually, an Ethernet network is decomposed into a number of virtual layers in order to separate functionality. The most common and formally standardised model used is the Open Systems Interconnect (OSI) reference model. A useful article that describes in detail the OSI reference model is “OSI Reference Model—The ISO Model of Architecture for Open Systems Interconnection” by Hubert Zimmermann, IEEE Transactions on Communications, Vol. COM-28, No. 4, April 1980. The OSI reference model comprises seven layers of network system functionality, as follows:
1. Physical Layer is responsible for physical channel access. It consists of those elements involved in transmission and reception of signals, typically line drivers and receivers, signal encoders/decoders and clocks.
2. Data Link Layer provides services allowing direct communication between end-station devices over the underlying physical medium. This layer provides Framing, separating the device messages into discrete transmissions or frames for the physical layer, encapsulating the higher layer packet protocols. It provides Addressing to identify source and destination devices. It provides Error Detection to ensure that corrupted data is not propagated to higher layers.
3. Network Layer is responsible for network-wide communication, routing packets over the network between end-stations. It must accommodate multiple Data Link technologies and topologies using a variety of protocols, the most common being the Internet Protocol (IP).
4. Transport Layer is responsible for end-to-end communication, shielding the upper layers from issues caused during transmission, such as dropped data, errors and mis-ordering caused by the underlying medium. This layer provides the application with an error-free, sequenced, guaranteed delivery message service, managing the process to process data delivery between end stations. Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) are the most commonly recognised Transport Layer protocols.
5. Session Layer is responsible for establishing communications sessions between applications, dealing with authentication and access control.
6. Presentation Layer ensures that different data representations used by machines are resolved.
7. Application Layer provides generic functions that allow user applications to communicate over the network.
For the purposes of this document we need not consider operations above the Transport Layer as the method described herein should, if well implemented, shield higher layers from issues arising in and below its scope.
Large data networks can be constructed from many tens of thousands of components and some level of failure is inevitable. Although network protocols are designed to be tolerant to failures, components that introduce errors can easily destroy the performance of the network even though the failing components represent a tiny percentage of the total network hardware. Cracked solder joints or damaged connectors can sometimes very significantly increase the error rate of a network connection without completely breaking the connection. In some ways these connections with very high error rates are worse than completely broken connections as they may only present intermittent problems that are not conspicuous when a network topology is evaluated or when diagnostic programs are run and engineering resources are available to repair the network.
Transport layer network protocols, such as TCP, introduce reliability to an otherwise unreliable network infrastructure. These protocols achieve their robustness through checking codes such as cyclic redundancy checks (CRC), timeouts and retries. However, the overhead of detecting an error and then responding through a request to resend the data is very significant and becomes more significant as the bandwidth of the transport medium increases. If errors occur in the retried data then the loss of performance can be crippling.
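By way of illustration only, the detect-and-retry behaviour described above can be sketched as follows. This is a minimal sketch and not the mechanism of any particular protocol: the CRC-32 from Python's zlib module stands in for whichever checking code a protocol uses, and the names make_frame, check_frame, send_with_retries and flaky_channel are hypothetical helpers introduced here. The faulty connection is modelled as a channel that corrupts a fixed number of transmissions.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (illustrative framing only)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the received CRC."""
    payload, received_crc = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received_crc


def flaky_channel(corrupt_first_n: int):
    """Return a channel function that flips one bit in the first N frames sent."""
    state = {"sent": 0}

    def channel(frame: bytes) -> bytes:
        state["sent"] += 1
        if state["sent"] <= corrupt_first_n:
            return bytes([frame[0] ^ 0x01]) + frame[1:]
        return frame

    return channel


def send_with_retries(payload: bytes, channel, max_retries: int = 3):
    """Send until the CRC check passes; return (payload, attempts used)."""
    for attempt in range(1, max_retries + 2):
        received = channel(make_frame(payload))
        if check_frame(received):
            return received[:-4], attempt
    raise RuntimeError("retries exhausted")


# Two corrupted transmissions cost two full resends before the data gets through.
data, attempts = send_with_retries(b"hello", flaky_channel(2))
```

Each failed check costs a complete retransmission, which is why a connection that repeatedly corrupts data is so damaging to performance.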
Very occasional errors can be acceptable provided the error rate is low enough to make the retry overhead tiny. Having detected an error within the network it should be possible to prevent that error from re-occurring. All too often a broken or partially working connection repeatedly introduces the same error over and over again, causing many hundreds of thousands of retries where only one should have occurred.
Most network systems have error monitoring. This usually involves a controlling management processor either polling or being interrupted by the network hardware and then noting that an error has been detected in a portion of the network. A new set of routes is then calculated for the network as a whole that routes traffic around the offending network connection until it can be repaired.
For Ethernet networks, routes are calculated by an additional protocol defined in the IEEE 802.1D standard. The Rapid Spanning Tree Protocol (RSTP), and the Spanning Tree Protocol (STP) it supersedes, operate at the Data Link Layer. Their intended purpose is to remove multiple active paths between network stations, avoiding loops, which create a number of problems.
If an error or sequence of errors is detected on a link then a management agent could decide to assign a very high cost to using the link. Changing the cost function would re-invoke the RSTP and the very high cost value would discourage the inclusion of the link by the RSTP. Alternatively, the management agent could disable the link, again invoking the RSTP and this time preventing inclusion of the link in the new routing. Using the RSTP has some problems. It can take many milliseconds for the RSTP to re-evaluate the network. For a very large network this could be tens or hundreds of milliseconds. Also, while a network is being reconfigured, packets can arrive out of order, be duplicated or be lost by the network. Again, for a very large network, this could be extremely disruptive, causing many retries of different conversations.
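The effect of assigning a very high cost to a faulty link can be illustrated with a simplified least-cost path calculation. The sketch below is not the RSTP itself, which exchanges messages between bridges to build a spanning tree; it is an ordinary Dijkstra calculation over an assumed four-node topology, and the node names, link costs and the function least_cost_paths are hypothetical.

```python
import heapq


def least_cost_paths(links, root):
    """Return {node: (cost, path)} for the cheapest path from root to each node.

    links maps an undirected link, given as a (node_a, node_b) tuple, to its cost.
    """
    adjacency = {}
    for (a, b), cost in links.items():
        adjacency.setdefault(a, []).append((b, cost))
        adjacency.setdefault(b, []).append((a, cost))

    best = {}
    heap = [(0, root, [root])]
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node in best:
            continue  # a cheaper path to this node was already settled
        best[node] = (cost, path)
        for neighbour, link_cost in adjacency.get(node, []):
            if neighbour not in best:
                heapq.heappush(heap, (cost + link_cost, neighbour, path + [neighbour]))
    return best


# Assumed topology: two routes from A to D, via B or via C.
links = {("A", "B"): 1, ("B", "D"): 1, ("A", "C"): 1, ("C", "D"): 2}
before = least_cost_paths(links, "A")["D"]   # (2, ['A', 'B', 'D'])

# A management agent penalises the faulty B-D link with a very high cost.
penalised = dict(links)
penalised[("B", "D")] = 1000
after = least_cost_paths(penalised, "A")["D"]  # (3, ['A', 'C', 'D'])
```

The penalised link remains usable as a last resort, whereas disabling the link would correspond to removing its entry from the link table altogether. In either case the whole calculation is repeated, which is the source of the reconfiguration delay noted above.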
A device that implements network services at the Data Link Layer and above is called a station. The Physical Layer is excluded from this definition as it is not addressable by a protocol. There are two types of station:
1. End Stations are the ultimate source and destination of network data communication across the network.
2. Intermediate Stations forward network data generated by end stations between source and destination.
An intermediate station which forwards completely at the Data Link Layer is commonly called a Bridge; a station which forwards at the Network Layer is commonly called a Router.
Network data is fragmented into pieces as defined by the protocol. This combined, layer specific Protocol Data Unit (PDU), which generally consists of a header and a body containing the payload data, is then passed down the protocol stack. At the Ethernet Physical Layer the PDU is often called a stream; at the Ethernet Data Link Layer the PDU is often called a frame; at the Ethernet Network Layer the PDU is often called a packet; and at the Transport Layer the PDU is often called a segment or message.
PDUs are encapsulated before being transmitted over the physical Ethernet hardware. Each encapsulation contains information for a particular OSI Layer: the Ethernet stream encapsulates a frame, which in turn encapsulates a packet, which encapsulates a message, and so on. This encapsulation, containing headers and payload, is finally transmitted over the network fabric and routed to the destination.
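The nesting of PDUs can be sketched as follows. This is an illustration under simplifying assumptions: real headers are structured binary fields, whereas here each header is a short placeholder label preceded by a one-byte length, and the functions encapsulate and decapsulate are hypothetical.

```python
def encapsulate(header: bytes, body: bytes) -> bytes:
    """Wrap a body in a header; the header is prefixed with its one-byte length."""
    return len(header).to_bytes(1, "big") + header + body


def decapsulate(pdu: bytes):
    """Split a PDU back into (header, body)."""
    header_len = pdu[0]
    return pdu[1:1 + header_len], pdu[1 + header_len:]


# Going down the stack: each layer wraps the PDU handed to it by the layer above.
message = b"hello"
segment = encapsulate(b"TCP", message)   # Transport Layer
packet = encapsulate(b"IP", segment)     # Network Layer
frame = encapsulate(b"ETH", packet)      # Data Link Layer

# Going up the stack at the destination: each layer strips its own header.
headers = []
pdu = frame
for _ in range(3):
    header, pdu = decapsulate(pdu)
    headers.append(header)
# headers is now [b"ETH", b"IP", b"TCP"] and pdu is the original message.
```

Each layer reads only its own header and treats everything inside as opaque payload.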
Some networks use adaptive routing which is an effective method for improving the total throughput of a busy network. Adaptive routing takes advantage of multiple routes that can exist from ingress to egress ports on the network. Having multiple routes allows data moving through the network to avoid congestion hot spots. Multiple routes also increase the fault tolerance of the network, allowing an internal network fabric link to be disabled while still providing a connection from the ingress to the egress ports.
FIG. 1 illustrates schematically a simplified conventional multi-path network. The rectangles on the left and right represent ingress and egress ports 2 respectively. The circles represent network crossbars 1 and the lines represent the interconnecting links, over which PDUs will traverse the network. In this example each network crossbar 1 has only three input ports and three output ports 2. Typically network crossbars have many more ports than this and this mechanism works equally well with higher arity crossbars. An example of a conventional network crossbar 1 is illustrated in FIG. 2.
In the illustrated example, a simple approach to adaptive routing would be to choose a random route out of the first switching stage that was on a link not being used by another traffic flow. This form of adaptive routing usually improves the expected total throughput for a saturated network traffic pattern but it is not controlled and could still easily result in some idle links and some over committed links between the second and third stages of switching.
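The simple approach described above can be sketched as follows. This is a minimal sketch, assuming that the output links of a first-stage crossbar are numbered from zero and that the set of links already carrying another traffic flow is known; the function choose_output_link is hypothetical, and returning None when every link is busy is an assumption made here, not a behaviour taken from the description.

```python
import random


def choose_output_link(links_in_use, num_links, rng=random):
    """Pick a random output link that no other traffic flow is using.

    Returns the chosen link index, or None if every link is busy
    (an assumption of this sketch).
    """
    free_links = [link for link in range(num_links) if link not in links_in_use]
    if not free_links:
        return None
    return rng.choice(free_links)


# A three-output crossbar, as in the illustrated example, with links 0 and 2 busy.
chosen = choose_output_link({0, 2}, num_links=3)
```

The choice looks only at the links of the local crossbar, so it has no knowledge of how heavily the links between the second and third stages are loaded.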
DESCRIPTION OF RELATED ART
In WO 2007/022183 a method for managing nodes on a fault tolerant network is described. The method requires a switch to terminate traffic on a network channel and a network manager to reroute the traffic on a different channel. The re-routing by the network manager described therein, especially where the network has a large number of nodes and links, will experience the same problems that were described earlier, namely the appreciable delay experienced during re-evaluation of the network by the manager.
The present invention seeks to overcome the problems encountered with conventional multi-path networks and in particular seeks to provide a network which is tolerant to faults.