Ethernet™ is a link-layer (Layer 2) protocol defined by IEEE standard 802.3. Ethernet networks have conventionally been regarded as an unreliable communication medium, giving no guarantee that a packet injected into the network will arrive at its intended destination. Transmitters in traditional Ethernet networks may send packets faster than receivers are able to accept them, and when a receiver runs out of available buffer space, it silently drops the packets that exceed its capacity. Reliability, when required, was provided by upper-layer protocols, such as the Transmission Control Protocol (TCP). By contrast, other types of networks, such as InfiniBand™ networks, were designed to incorporate flow control at the link level, which enables a receiving node to convey feedback to a corresponding transmitting node in order to communicate buffer availability, and thus support reliable link-layer transmission.
More recently, priority flow control (PFC) mechanisms have been developed to provide reliable link-layer transmission in Ethernet networks. Such mechanisms are described, for example, in a white paper entitled “Priority Flow Control: Build Reliable Layer 2 Infrastructure” (Cisco Systems, Inc., San Jose, Calif., 2009). They are based on IEEE 802.3x PAUSE control frames, as defined in Annex 31B of the IEEE 802.3 specification. A receiver that anticipates a buffer overflow can send a medium access control (MAC) frame containing a PAUSE request to the sender, and the sender responds by suspending transmission of new packets until the receiver is ready to accept them again.
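By way of illustration only, the PAUSE request described above can be sketched in Python as follows. The field values used (the reserved multicast destination address 01-80-C2-00-00-01, the MAC Control EtherType 0x8808, the PAUSE opcode 0x0001, and a pause time expressed in quanta of 512 bit times) follow Annex 31B; the function names, the example source address, and the omission of the frame check sequence are assumptions made for this sketch and do not come from any particular implementation.

```python
import struct

PAUSE_DST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC Control multicast address
MAC_CONTROL_ETHERTYPE = 0x8808                 # EtherType for MAC Control frames
PAUSE_OPCODE = 0x0001                          # opcode of the PAUSE operation


def build_pause_frame(src_mac: bytes, pause_quanta: int) -> bytes:
    """Build a PAUSE frame (without FCS). Hypothetical helper for illustration."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    if not 0 <= pause_quanta <= 0xFFFF:
        raise ValueError("pause_quanta must fit in 16 bits")
    header = PAUSE_DST_MAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    payload = struct.pack("!HH", PAUSE_OPCODE, pause_quanta)
    # Pad with zeros to the 60-byte minimum frame size (64 bytes including FCS).
    return (header + payload).ljust(60, b"\x00")


def pause_duration_seconds(pause_quanta: int, link_bps: float) -> float:
    """One pause quantum is 512 bit times at the speed of the link."""
    return pause_quanta * 512 / link_bps


# Example: ask the sender to pause for the maximum time on a 10 Gb/s link.
frame = build_pause_frame(bytes.fromhex("020000000001"), 0xFFFF)
max_pause = pause_duration_seconds(0xFFFF, 10e9)
```

A pause time of zero in such a frame tells the sender that it may resume transmission immediately, which is how a receiver signals that it is ready to accept packets again before the requested time has elapsed.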
The IEEE 802.1Qbb standard for Priority-based Flow Control extends the basic IEEE 802.3x PAUSE semantics to multiple classes of service, with the possibility of independent flow control for each class. For this purpose, PFC uses class of service (CoS) values provided by the IEEE 802.1p standard, which are inserted in the virtual local area network (VLAN) tag of Ethernet frames (as defined by the IEEE 802.1Q standard). The three-bit priority code point (PCP) field of the VLAN tag can be used to specify eight different classes of service for such purposes, which the 802.1Q standard recommends be defined as follows, in order from lowest priority (0) to highest (7):
TABLE I
ETHERNET CLASSES OF SERVICE

PCP   Priority   Acronym   Traffic Types
1     0          BK        Background
0     1          BE        Best Effort
2     2          EE        Excellent Effort
3     3          CA        Critical Applications
4     4          VI        Video, <100 ms latency and jitter
5     5          VO        Voice, <10 ms latency and jitter
6     6          IC        Internetwork Control
7     7          NC        Network Control
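As a minimal sketch of how the PCP field and per-class flow control fit together, the following Python fragment extracts the PCP from the VLAN tag of a frame and builds a PFC frame that pauses only selected classes. The tag layout (TPID 0x8100 followed by a 16-bit field holding the 3-bit PCP, a 1-bit DEI, and a 12-bit VLAN ID) and the PFC frame layout (opcode 0x0101, a class-enable vector, and eight 16-bit pause times) are taken from the IEEE 802.1Q and 802.1Qbb definitions. The function names, the dictionary built from Table I, and the assumption that classes are indexed by PCP value are illustrative choices for this sketch only.

```python
import struct

TPID_8021Q = 0x8100                              # marks an 802.1Q VLAN tag
MAC_CONTROL_DST = bytes.fromhex("0180c2000001")  # reserved MAC Control address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101

# PCP value -> (priority rank, acronym, traffic type), copied from Table I.
PCP_CLASSES = {
    1: (0, "BK", "Background"),
    0: (1, "BE", "Best Effort"),
    2: (2, "EE", "Excellent Effort"),
    3: (3, "CA", "Critical Applications"),
    4: (4, "VI", "Video"),
    5: (5, "VO", "Voice"),
    6: (6, "IC", "Internetwork Control"),
    7: (7, "NC", "Network Control"),
}


def parse_vlan_tag(frame: bytes):
    """Return (pcp, dei, vid) for a single-tagged frame, or None if untagged."""
    if len(frame) < 16:
        return None
    tpid, tci = struct.unpack_from("!HH", frame, 12)
    if tpid != TPID_8021Q:
        return None
    return tci >> 13, (tci >> 12) & 0x1, tci & 0x0FFF


def build_pfc_frame(src_mac: bytes, pause_quanta_by_class: dict) -> bytes:
    """Build a PFC frame (without FCS) that pauses only the listed classes."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    enable_vector = 0
    times = [0] * 8
    for cls, quanta in pause_quanta_by_class.items():
        if not 0 <= cls <= 7 or not 0 <= quanta <= 0xFFFF:
            raise ValueError("class must be 0..7 and quanta 0..65535")
        enable_vector |= 1 << cls   # bit n enables the timer for class n
        times[cls] = quanta
    frame = (MAC_CONTROL_DST + src_mac
             + struct.pack("!HHH", MAC_CONTROL_ETHERTYPE, PFC_OPCODE, enable_vector)
             + struct.pack("!8H", *times))
    # Pad with zeros to the 60-byte minimum frame size (64 bytes including FCS).
    return frame.ljust(60, b"\x00")


# Example: a tagged frame carrying PCP 5 (voice) on VLAN 100.
tagged = bytes(12) + struct.pack("!HH", TPID_8021Q, (5 << 13) | 100) + b"\x08\x00"
pcp, dei, vid = parse_vlan_tag(tagged)
rank, acronym, traffic_type = PCP_CLASSES[pcp]
# Pause class 3 only; traffic in the other seven classes keeps flowing.
pfc = build_pfc_frame(bytes.fromhex("020000000001"), {3: 0xFFFF})
```

The point of the per-class timers is that a congested class can be paused without stalling the rest of the link, which a plain 802.3x PAUSE frame cannot do.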
Ethernet Layer-2 networks are commonly integrated as subnets of Layer-3 Internet Protocol (IP) networks. A subnet (short for subnetwork) is a logical subdivision of a Layer-3 network. Network ports of nodes within a given subnet share the same Layer-3 network address prefix. For example, in IP networks, the ports in each subnet share the same most-significant bit-group in their IP address. Typically, the logical subdivision of a Layer-3 network into subnets reflects the underlying physical division of the network into Layer-2 local area networks. The subnets are connected to one another by routers, which forward packets on the basis of their Layer-3 (IP) destination addresses, while within a given subnet packets are forwarded among ports by Layer-2 switches or bridges. These Layer-2 devices operate in accordance with the applicable Layer-2 protocol and forward packets within the subnet according to the Layer-2 destination address, such as the Ethernet MAC address.
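The shared-prefix property of a subnet can be illustrated with a short Python sketch based on the standard `ipaddress` module. The helper name `same_subnet`, the example addresses, and the prefix lengths are assumptions chosen for illustration.

```python
import ipaddress


def same_subnet(addr_a: str, addr_b: str, prefix_len: int) -> bool:
    """True if both addresses share the same most-significant prefix_len bits."""
    net_a = ipaddress.ip_interface(f"{addr_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{addr_b}/{prefix_len}").network
    return net_a == net_b


# With a 24-bit prefix, these two ports are in the same subnet, so traffic
# between them would be forwarded by Layer-2 switches on the MAC address.
local = same_subnet("192.168.1.10", "192.168.1.200", 24)

# These two differ within the first 24 bits, so a router would be needed.
remote = same_subnet("192.168.1.10", "192.168.2.10", 24)
```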
Routing protocols are used to distribute routing information among routers, so as to enable each router to determine the port through which it should forward a packet having any given Layer-3 destination address. In IP networks, the routing information is generally developed and distributed by and among the routers themselves. A number of routing protocols are commonly used to exchange routing information among IP routers, such as Open Shortest Path First (OSPF) and the Border Gateway Protocol (BGP).
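Once routing information has been distributed, the forwarding decision itself reduces to a table lookup. The following is a minimal sketch, assuming a router that holds a list of (prefix, port) entries and, as IP routers conventionally do, selects the most specific matching prefix. The table contents, port names, and the linear search are illustrative assumptions; production routers use specialized data structures for this lookup.

```python
import ipaddress


def lookup_port(routes, dst: str):
    """Return the output port of the longest prefix matching dst, or None."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, port in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, port)
    return None if best is None else best[1]


# Hypothetical routing table: a default route plus two more specific prefixes.
routes = [
    ("0.0.0.0/0", "port0"),
    ("10.0.0.0/8", "port1"),
    ("10.1.0.0/16", "port2"),
]
chosen = lookup_port(routes, "10.1.2.3")  # matches all three; /16 is most specific
```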
Remote direct memory access (RDMA) protocols enable direct memory access over a network from the memory of one computer to another without directly involving the computer operating systems. In InfiniBand networks, RDMA read and write operations are an integral part of the transport-layer protocol. These operations provide high-throughput, low-latency data transfers, which are carried out by the network interface controller (generally referred to in InfiniBand parlance as a host channel adapter, or HCA) under application-level control. RDMA over Converged Ethernet (RoCE) provides similar capabilities over an Ethernet network, but as such supports communication only between hosts in the same Ethernet (Layer 2) broadcast domain, i.e., with a range no greater than a single IP subnet. The Internet Wide Area RDMA Protocol (iWARP) overcomes this limitation by providing RDMA service over a connection-oriented transport protocol, typically TCP, but has not gained wide acceptance.