Ever since the introduction of the microprocessor, computer systems have been getting faster and faster. In approximate accordance with Moore's law (based on Intel® Corporation co-founder Gordon Moore's 1965 publication predicting that the number of transistors on integrated circuits would double every two years), processor speed has increased at a fairly even rate for nearly three decades. At the same time, the size of both memory and non-volatile storage has also steadily increased, such that many of today's personal computers are more powerful than supercomputers from just 10-15 years ago. In addition, the speed of network communications has likewise seen astronomical increases.
Increases in processor speeds, memory, storage, and network bandwidth technologies have resulted in the build-out and deployment of networks with ever increasing capacities. More recently, the introduction of cloud-based services, such as those provided by Amazon (e.g., Amazon Elastic Compute Cloud (EC2) and Simple Storage Service (S3)) and Microsoft (e.g., Azure and Office 365), has resulted in additional build-out of public network infrastructure, in addition to the deployment of massive data centers employing private network infrastructure to support these services. Additionally, the new generation (i.e., 4G) of mobile network data services is expected to significantly impact the utilization of land-line networks in the near future. The result of these and other considerations is that the utilization of computer networks is expected to continue to grow at a high rate for the foreseeable future.
FIG. 1 depicts a conventional computer network architecture 100 employing a plurality of switches 102 labeled 1-36 communicatively coupled to one another via links 104. A source computer 106 is coupled to switch 14 via an Internet Service Provider (ISP) network 108. On the other side of the network, a destination computer 110 comprising an e-mail server is connected to switch 20 via an e-mail service provider (ESP) network 112.
Each of switches 1-36 includes a routing or forwarding table that is used to route/forward packets to a next hop based on various criteria, which typically include the destination address. Under various routing protocols, such as the Internet Protocol (IP), data is partitioned into multiple packets that are routed along a path between a source endpoint and a destination endpoint, such as depicted by source computer 106 and destination computer 110. In general, the path traversed by a given packet may be somewhat arbitrary, which is part of why the Internet is so robust. Since packets between endpoints can travel along different paths, when a network switch goes down or is taken offline, the routing tables of the other switches are updated to route packets along paths that do not include that switch.
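The next-hop lookup described above can be sketched as a longest-prefix-match table, as used in conventional IP forwarding. This is a minimal illustrative model; the class name, prefixes, and next-hop labels are assumptions, not taken from the figure.

```python
import ipaddress

# Hypothetical sketch of a per-switch forwarding table: each entry maps a
# destination prefix to a next hop, and lookup selects the most specific
# (longest) matching prefix. Entries here are illustrative only.
class ForwardingTable:
    def __init__(self):
        self.entries = []  # list of (network, next_hop)

    def add_route(self, prefix, next_hop):
        self.entries.append((ipaddress.ip_network(prefix), next_hop))

    def next_hop(self, dst):
        addr = ipaddress.ip_address(dst)
        # Keep every entry whose prefix contains the destination address.
        matches = [(net, hop) for net, hop in self.entries if addr in net]
        if not matches:
            return None  # no route: the packet would be dropped
        # Longest-prefix match: the most specific route wins.
        return max(matches, key=lambda m: m[0].prefixlen)[1]

table = ForwardingTable()
table.add_route("10.0.0.0/8", "switch_15")
table.add_route("10.1.0.0/16", "switch_22")  # more specific route
print(table.next_hop("10.1.2.3"))  # -> switch_22 (longest match wins)
print(table.next_hop("10.9.9.9"))  # -> switch_15 (only the /8 matches)
```

When a switch goes down, routes through it would simply be removed or replaced, changing the result of subsequent lookups.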
FIG. 1 further illustrates an exemplary routing path that includes hops between switches 14, 15, 22, 23, 16, 17, 18, and 19. Of course, subsequent packets may be routed along different routes, depending on the classification of the traffic being transmitted between source computer 106 and destination computer 110, as well as real-time network operating conditions and the traffic handled by the various network switches.
In computer networks, packets may be lost for various reasons, including bit errors, congestion, and switch failures. When a packet is lost on the path between a source and its destination, it typically must be retransmitted from the source. This has two problems. First, since the path from source to destination is reasonably long, it takes a long time for the source to find out that a packet has been lost. Second, the progress the packet made before being dropped is wasted bandwidth.
On lossy links with many bit errors, hop-by-hop reliability is sometimes employed. This means that when a first switch S1 is forwarding to a neighbor switch S2, S1 and S2 run a reliable protocol in which S1 holds onto each packet until it is acknowledged by S2 as being successfully received without errors, retransmitting packets that were lost or dropped. Examples of such protocols include High-Level Data Link Control (HDLC) and the Digital Data Communications Message Protocol (DDCMP). This requires more complex and expensive switches, since S1 needs additional buffers to hold onto packets until receiving an acknowledgement from S2.
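The hop-by-hop scheme above can be sketched as a stop-and-wait exchange in the spirit of HDLC/DDCMP: S1 keeps a packet buffered and retransmits until S2 acknowledges it. The loss rate, retry limit, and function names are assumptions for illustration, not details from an actual link protocol.

```python
import random

random.seed(1)  # fixed seed so the toy run is repeatable

def lossy_link(packet, loss_rate):
    """Deliver the packet to S2 unless the link corrupts/drops it."""
    return None if random.random() < loss_rate else packet

def send_reliably(packet, loss_rate=0.3, max_tries=10):
    """S1 buffers the packet and retransmits until S2 ACKs receipt."""
    for attempt in range(1, max_tries + 1):
        if lossy_link(packet, loss_rate) is not None:
            # S2 received it without errors and ACKs; S1 frees its buffer.
            return attempt
    raise RuntimeError("link declared down after max retries")

attempts = [send_reliably(f"pkt{i}") for i in range(5)]
print(attempts)  # transmissions per packet; >1 indicates a retransmission
```

The buffering cost mentioned in the text is visible here: S1 must hold every in-flight packet until its acknowledgement arrives.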
Another approach to avoiding packet loss due to bit errors on links is to use error-correcting codes, so that a packet can be reconstructed provided there are not too many bit errors. This approach incurs significant overhead, both in extra checksum bits and in computation, and there is still the possibility of more errors than the error-correcting code can handle.
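Both the overhead and the correction limit can be seen in a toy Hamming(7,4) code, chosen here purely as an illustration (the text does not name a specific code): 4 data bits carry 3 extra parity bits, a single flipped bit can be located and corrected, and two flips exceed what the code can handle.

```python
# Toy Hamming(7,4) encoder/decoder. Parity bits sit at positions 1, 2, 4
# (powers of two); the codeword layout is [p1, p2, d0, p3, d1, d2, d3].
def encode(d):  # d: list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def decode(c):  # c: 7-bit codeword; corrects at most one flipped bit
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based error position, 0 if none
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1  # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
corrupted = encode(data)
corrupted[4] ^= 1            # a single bit error on the link
print(decode(corrupted))     # -> [1, 0, 1, 1]: the error is corrected
```

The 3-bits-per-4 overhead and the single-error limit make concrete the trade-off described in the text.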
Network congestion is currently addressed in one of two ways: dropping packets or implementing backpressure on incoming ports (typically on a per service-class basis). As previously noted, the problem with dropping packets is that a packet that has already traveled several hops toward the destination must be retransmitted from the source, so that amount of bandwidth has been wasted. Also, the end-to-end delay before the source discovers the packet has been dropped can be long, because information about whether a packet has been received must be communicated from the destination. As a result, a source will typically employ a timeout and resend a packet if no ACK has been received before the timeout expires.
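Both costs of source retransmission can be sketched in a toy simulation: a packet crosses a multi-hop path where each link may drop it, and on any drop the source resends from the start after a timeout. The hop count and loss rate are illustrative assumptions.

```python
import random

random.seed(7)  # fixed seed for a repeatable toy run

def deliver(hops, loss_rate):
    """Count (link_traversals, timeouts) to get one packet through
    an end-to-end path with source retransmission on loss."""
    traversals, timeouts = 0, 0
    while True:
        for hop in range(1, hops + 1):
            traversals += 1
            if random.random() < loss_rate:
                break  # dropped mid-path: all progress so far is wasted
        else:
            return traversals, timeouts  # reached the destination
        timeouts += 1  # the source learns of the loss only via a timeout

used, waits = deliver(hops=8, loss_rate=0.1)
print(used, waits)  # traversals exceed 8 whenever any drop occurred
```

Every timeout adds end-to-end delay, and every traversal beyond the minimum of 8 is the wasted bandwidth the text describes.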
The problem with backpressure as it is traditionally done (e.g., InfiniBand or Data Center Bridging) is that congestion can spread: a single slow resource (e.g., a destination) can have its packets occupy all the buffers in a switch, and since the switch is not allowed to drop those packets, it must refuse to receive any more packets (for that class), even though those new packets may not be traveling toward the congested resource. This in turn can cause buffers to fill in adjacent switches, and so on.
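The congestion-spreading effect can be modeled with a minimal sketch: a switch keeps a shared buffer pool per traffic class and, because lossless operation forbids drops, refuses all arrivals of a class once its pool is full, regardless of destination. Buffer sizes and labels are assumptions for illustration.

```python
# Toy model of class-based backpressure in a lossless switch.
class Switch:
    def __init__(self, buffers_per_class=4):
        self.limit = buffers_per_class
        self.buffers = {}  # traffic class -> list of (dst, payload)

    def accept(self, cls, dst, payload):
        q = self.buffers.setdefault(cls, [])
        if len(q) >= self.limit:
            return False  # must refuse: lossless classes cannot be dropped
        q.append((dst, payload))
        return True

sw = Switch(buffers_per_class=4)
# Packets for a slow destination D stop draining and fill class 0.
for i in range(4):
    assert sw.accept(0, "D", f"pkt{i}")
# A packet for an *uncongested* destination E in the same class is now
# also refused: congestion has spread beyond D's own traffic.
print(sw.accept(0, "E", "pktE"))  # -> False
print(sw.accept(1, "E", "pktE"))  # -> True (other classes are unaffected)
```

Upstream switches receiving these refusals would in turn hold packets until their own buffers fill, propagating the congestion as the text describes.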