In data center environments, whether a service-provider data center, an enterprise data center, or a massively scalable data center (“MSDC”), what is desired is any-to-any communication among a large number of servers, which may be physical and/or virtual. IP (v4 or v6) has become the de facto standard for these environments. With scalability and mobility (if VMs are used) being the two main problems, typical architectures proposed for these environments involve servers attached to top-of-rack (“ToR”) switches, which are interconnected via a set of spine switches. The specific topology may be a single-tier or multi-tier fat-tree, or may resemble the traditional three-level access-aggregation-core framework.
There is common consensus that in environments where east-west server-to-server traffic is going to dominate, flood/broadcast traffic emanating from Address Resolution Protocol/Neighbor Discovery Protocol (“ARP/ND”) should be terminated at the ToR switches, with other means used to solve the problem of communicating the host address space to all the ToRs. (See, e.g., http://tools.ietf.org/html/draft-shah-armd-arp-reduction-01; R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, A. Vahdat, “PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric”, ACM SIGCOMM, (August 2009); Aled Edwards, Anna Fischer, and Antonio Lain. 2009. Diverter: a new approach to networking within virtualized infrastructures. In Proceedings of the 1st ACM workshop on Research on enterprise networking (WREN '09). ACM, New York, N.Y., USA; C. Kim, M. Caesar, and J. Rexford. Floodless in SEATTLE: a scalable ethernet architecture for large enterprises. In SIGCOMM, 2008.)
This could be done using a centralized coordinator or directory service, as described in the PortLand article cited above, which is incorporated in its entirety. Other described centralized coordinators or directory services may also be found in http://tools.ietf.org/html/draft-malc-armd-moose-00I-Interfaces (SVIs). Alternatively, Interior Border Gateway Protocol (“iBGP”) with Route Reflectors (“RR”) may be employed.
Essentially, all these approaches reduce the flood/broadcast traffic due to ARP, not only in the fabric but, more importantly, toward the end-hosts. No matter which approach is taken, when hosts behind a ToR try to communicate with remote hosts, whether in the same or different subnets, there is an inherent startup delay: the hosts cannot begin communicating until the appropriate remote host entry is installed in the forwarding information base (“FIB”) hardware table. The delay arises because the appropriate host entry (/32 for IPv4 and /128 for IPv6) has not yet been communicated from the remote ToR to the local ToRs (whether using iBGP or a directory-service-like approach).
In addition, if the table sizes at the ToR cannot accommodate the host entries for all hosts in the data center, not all entries communicated from remote ToRs can be blindly installed in the hardware FIB tables. In that case, only entries for active flows are maintained at the ToR, which requires some software intervention to implement a form of conversational L3 learning. This further adds to the startup delay.
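By way of illustration only, conversational L3 learning of the kind described above may be sketched as follows. The sketch assumes a software table holding every host entry learned from remote ToRs and a capacity-limited hardware table that is populated only when a flow toward a host is first seen. The class name `ConversationalFib`, its methods, and the oldest-first eviction policy are hypothetical and are not taken from any particular implementation.

```python
class ConversationalFib:
    """Toy model: full host table in software, bounded table in hardware."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed hardware FIB size, in host entries
        self.rib = {}             # every host entry learned from remote ToRs
        self.fib = {}             # subset installed in hardware (active flows)

    def learn(self, host, next_hop):
        # Entries from remote ToRs are recorded in software only.
        self.rib[host] = next_hop

    def lookup(self, host):
        if host in self.fib:
            return ("hit", self.fib[host])
        next_hop = self.rib.get(host)
        if next_hop is None:
            return ("unknown", None)
        if len(self.fib) >= self.capacity:
            # Assumed policy: evict the oldest installed entry.
            oldest = next(iter(self.fib))
            del self.fib[oldest]
        # First packet of the conversation triggers the hardware install;
        # this software step is the source of the added startup delay.
        self.fib[host] = next_hop
        return ("miss-installed", next_hop)
```

In this sketch, the first lookup for a known host takes the software path and installs the entry, and only subsequent lookups hit the hardware table.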
Traditionally, for hosts in the same subnet, gratuitous ARP requests allow all hosts in that subnet to learn about each other, so hosts in the same subnet can talk to each other without any startup delay. With ARPs being terminated at the ToR, this may be somewhat compromised. Moreover, for hosts in different subnets, a hit on the subnet prefix entry typically points to the GLEAN adjacency. The packet may be punted to software, which subsequently triggers an ARP request for the destination in the directly attached subnet; on resolution, a host /32 entry is installed in hardware.
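The glean behavior described above may be sketched, under simplifying assumptions, as a longest-prefix-match table in which a subnet prefix points to a glean marker. The names `Fib`, `forward`, and `GLEAN`, the returned action strings, and the use of a dictionary to stand in for the ARP request and reply are all hypothetical and serve only to illustrate the sequence of events.

```python
import ipaddress

GLEAN = "GLEAN"  # hypothetical marker for the glean adjacency


class Fib:
    """Toy longest-prefix-match table; not a model of any real switch ASIC."""

    def __init__(self):
        self.routes = {}  # ip_network -> adjacency

    def install(self, prefix, adjacency):
        self.routes[ipaddress.ip_network(prefix)] = adjacency

    def lookup(self, dst):
        addr = ipaddress.ip_address(dst)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        # The most specific (longest) matching prefix wins.
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


def forward(fib, dst, arp_table):
    """Return the action taken for one IPv4 packet addressed to dst."""
    adjacency = fib.lookup(dst)
    if adjacency is None:
        return "drop"
    if adjacency == GLEAN:
        # Subnet prefix hit: the packet is punted to software, which
        # resolves the destination (arp_table stands in for ARP here).
        next_hop = arp_table.get(dst)
        if next_hop is None:
            return "punt-unresolved"
        # On resolution, a host /32 entry is installed in the table.
        fib.install(dst + "/32", next_hop)
        return "punt-resolved"
    return "hw-forward:" + adjacency
```

In this sketch, the first packet to a destination in the directly attached subnet is punted, and later packets match the more specific /32 entry and are forwarded without software involvement.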
While software is performing this ARP resolution for the destination, packets for the flow may be buffered with a tail-drop policy. With the drive toward higher-bandwidth pipes to the servers, most of the packets may be dropped because the queue cannot accommodate these large packet bursts. Moreover, even if the queue could accommodate them, packets could arrive out of order: once the entry is installed in hardware, packets hitting the hardware entry are likely to reach the destination host sooner than the buffered packets, which are software switched. In these cases, software may be better off dropping these packets rather than software switching them, which may also place an unnecessary burden on the ToR CPUs.
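The effect of a tail-drop policy on a large burst may be illustrated with the following minimal sketch. The class name `TailDropQueue`, the queue depth of 64 packets, and the burst size of 1000 packets are arbitrary assumptions chosen for illustration and do not reflect any particular device.

```python
from collections import deque


class TailDropQueue:
    """Bounded FIFO that discards arrivals once it is full (tail drop)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.packets = deque()
        self.dropped = 0

    def enqueue(self, pkt):
        if len(self.packets) >= self.capacity:
            # Queue is full: the newly arriving packet is discarded.
            self.dropped += 1
            return False
        self.packets.append(pkt)
        return True


# Assumed figures: a 64-packet queue receiving a 1000-packet burst
# while ARP resolution is still pending.
queue = TailDropQueue(capacity=64)
for seq in range(1000):
    queue.enqueue(seq)
```

With these assumed figures, only the first 64 packets are buffered and the remaining 936 are dropped, which is consistent with the observation that most packets of a large burst may be lost during resolution.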
There is a need for a solution for this “slow-start” problem in data center environments. The present disclosure presents such a solution using IPv4 as an example. It should be noted that similar embodiments would also be effective with IPv6.