Applications hosted in today's data centers suffer from internal fragmentation of resources, rigidity, and bandwidth constraints imposed by the architecture of the network connecting the data center's servers. The current conventional architecture statically maps web services to Ethernet VLANs, each constrained in size to a few hundred servers owing to control plane overheads. The IP routers used to span traffic across VLANs and the load balancers used to spray requests within a VLAN across servers are realized via expensive customized hardware and proprietary software. Expensive IP router ports thus become the bottleneck for any-rack-to-any-rack server connectivity, constraining the traffic for many data intensive applications (e.g., data mining, map/reduce computations, distributed file systems, blob stores). Further, the conventional architecture concentrates traffic in a few pieces of hardware that must be frequently upgraded and replaced to keep pace with demand, an approach that directly contradicts the prevailing philosophy in the rest of the data center, which is to scale out (adding more cheap components) rather than scale up (adding more power and complexity to a small number of expensive components). This concentration of traffic into a small number of network routers and switches also puts the network at risk of failures and outages, because the failure of even a few of these components can overcome the redundancy built into the system and lead to an outage. Commodity switching hardware is now becoming available with very high port speeds at very low port cost, making this the right time to redesign the data center networking infrastructure.
A data center comprises both server and networking components, where the distance between the components is typically less than a millisecond of speed of light propagation time (i.e., crossing a handful of switches at 1 Gbps speeds or greater). The server portion of the infrastructure is now far down the road of commoditization: high-end enterprise-class servers have been replaced by large numbers of low cost PCs. Innovation in distributed computing and systems management software has enabled the unreliability of individual servers to be masked by the aggregated reliability of the system as a whole. The running theme is “scaling out instead of scaling up,” driven by the economics of PC commoditization. Commodity parts are characterized by wide availability, standardization, and drivers for increased capabilities (e.g., 1 Gbps and 10 Gbps technology already available, 100 Gbps technology now emerging).
The network portion of the data center infrastructure presents the next frontier for commoditization. The increase in the number of servers that need to be interconnected has stretched the limits of enterprise networking solutions so much that current architectures resemble a spectrum of patches and workarounds for protocols that were originally intended to work in enterprise networks orders of magnitude smaller.
Some challenges and requirements of conventional data centers will now be explained with reference to FIG. 1, which shows a conventional architecture 100 for a data center, taken from a recommended source. See “Cisco systems: Data center: Load balancing data center services, 2004”, which is hereby incorporated by reference in its entirety. Multiple applications run inside the data center, but typically each application is hosted on its own set of (potentially virtual) server machines 102 with a single organization owning and controlling the activity inside the data center. Requests from the Internet 104 are typically sent to a publicly visible and routable Virtual IP address (VIP), and there are one or more VIPs associated with each application running in the data center.
Requests arriving from the Internet are IP (layer 3) routed through border routers (BR) and access routers (AR) to a layer 2 domain based on the destination VIP address. The VIP is configured onto the two load balancers (LB) connected to the top switches (S), and complex mechanisms are used to ensure that if one load balancer fails, the other picks up the traffic. See “Virtual router redundancy protocol (VRRP)” by E. R. Hinden, which is hereby incorporated by reference in its entirety. For each VIP, the load balancers are configured with a list of Direct IP addresses (DIPs), which are the private and internal addresses of physical servers 102 in the racks below the load balancers. This list of DIPs defines the pool of servers that can handle requests to that VIP, and the load balancer spreads requests across the DIPs in the pool.
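The VIP-to-DIP mapping described above can be illustrated with a minimal sketch. The class name `LoadBalancer`, its methods, the example addresses, and the round-robin policy are all assumptions made for illustration; the conventional architecture only requires that requests be spread across the DIPs in the pool, and real load balancers may use other policies.

```python
from itertools import cycle


class LoadBalancer:
    """Hypothetical model: each VIP is configured with a pool of DIPs."""

    def __init__(self):
        self._pools = {}    # VIP -> list of DIPs configured for that VIP
        self._cursors = {}  # VIP -> endless round-robin iterator over the pool

    def configure_vip(self, vip, dips):
        """Associate a publicly routable VIP with its private server addresses."""
        pool = list(dips)
        if not pool:
            raise ValueError("a VIP needs at least one DIP in its pool")
        self._pools[vip] = pool
        self._cursors[vip] = cycle(pool)

    def route(self, vip):
        """Pick the DIP that handles the next request sent to this VIP."""
        if vip not in self._cursors:
            raise KeyError(f"no pool configured for VIP {vip}")
        # Round-robin is an assumed policy, chosen only for simplicity.
        return next(self._cursors[vip])


# Example addresses are placeholders from documentation/private ranges.
lb = LoadBalancer()
lb.configure_vip("203.0.113.10", ["10.0.0.1", "10.0.0.2", "10.0.0.3"])
chosen = [lb.route("203.0.113.10") for _ in range(6)]
print(chosen)
```

In this sketch, six consecutive requests to the VIP visit each of the three DIPs twice, which mirrors how the pool of DIPs defines the set of servers able to serve that VIP.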
As the number of servers required in the data center grows, additional pairs of switches and associated racks are added to the layer 2 domain, as shown in the figure. Layer 2 subnets are constrained in size to a few hundred servers owing to the overheads of broadcast and control plane traffic, so VLANs are configured on the layer 2 switches to divide up the domain into multiple layer 2 subnets, one subnet per VLAN. When the layer 2 domain eventually hits limits associated with large Ethernet domains (e.g., VLAN exhaustion, broadcast/control traffic) at a size of about 4,000 servers, additional layer 2 domains are created and connected to other pairs of access routers.
The conventional approach has the following problems:
Fragmentation of resources: Popular load balancing techniques, such as destination NAT (or half-NAT) and direct server return, require that all DIPs in a VIP's pool be in the same layer 2 domain. See “Load Balancing Servers, Firewalls, and Caches” by C. Kopparapu, which is hereby incorporated by reference in its entirety. This constraint means that if an application grows and requires more servers, it cannot use available servers in other layer 2 domains—ultimately resulting in fragmentation and under-utilization of resources. Load balancing via Source NAT (or full-NAT) does allow servers to be spread across layer 2 domains, but then the servers never see the client IP, which is often unacceptable because servers need to log the client IP for regulatory compliance and data mining.
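The fragmentation effect can be made concrete with a small sketch. The domain names, the server counts, and the helper `usable_free_servers` are hypothetical; the only rule taken from the text is that, under half-NAT or direct server return, a VIP's pool can draw only on servers in its own layer 2 domain.

```python
def usable_free_servers(app_domain, free_by_domain):
    """Free servers an application can actually use when all DIPs
    in its VIP pool must share one layer 2 domain (assumed model)."""
    return free_by_domain.get(app_domain, 0)


# Hypothetical inventory of idle servers in each layer 2 domain.
free_by_domain = {"domain-A": 5, "domain-B": 300}

needed = 50                                    # extra servers the application wants
usable = usable_free_servers("domain-A", free_by_domain)
stranded = sum(free_by_domain.values()) - usable
can_grow = usable >= needed

print(usable, stranded, can_grow)
```

Under these assumed numbers the data center holds 305 idle servers, yet the application in domain-A can reach only 5 of them, so its request for 50 cannot be met even though capacity exists elsewhere.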
Poor server to server connectivity: The hierarchical nature of the network means that for servers in different layer 2 domains to communicate, traffic must go through the layer 3 portion of the network. Since layer 3 ports are significantly more expensive than layer 2 ports, these links are typically oversubscribed (e.g., the capacity of the links between access routers and border routers is less than the sum of the output capacity of the servers connected to the access routers). The result is that the bandwidth available between servers in different parts of the data center can be quite limited. This creates a serious global optimization problem as all servers belonging to all applications must be placed with great care to ensure the sum of their traffic does not saturate any of the network links, and achieving this level of coordination between applications is difficult in practice. The lack of sufficient capacity between servers also fragments the pool of servers. For example, when an application running in the data center needs more servers to handle its workload, unused servers located elsewhere in the data center cannot be placed into service if there is insufficient capacity between them and the existing application servers.
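Oversubscription can be quantified as the ratio of aggregate server output capacity to uplink capacity. The following sketch uses hypothetical figures (2,000 servers at 1 Gbps behind 40 Gbps of total uplink capacity); neither the numbers nor the function name come from the conventional architecture described here.

```python
def oversubscription_ratio(num_servers, server_gbps, uplink_gbps_total):
    """Aggregate server output capacity divided by uplink capacity."""
    if uplink_gbps_total <= 0:
        raise ValueError("uplink capacity must be positive")
    return (num_servers * server_gbps) / uplink_gbps_total


# Assumed example values, for illustration only.
num_servers = 2000
server_gbps = 1.0
uplink_gbps_total = 40.0

ratio = oversubscription_ratio(num_servers, server_gbps, uplink_gbps_total)

# Worst case: every server sends across the uplinks at the same time,
# so each one receives an equal share of the uplink capacity.
worst_case_gbps_per_server = server_gbps / ratio

print(ratio, worst_case_gbps_per_server)
```

With these assumed values the uplinks are oversubscribed 50 to 1, so a server with a 1 Gbps interface may obtain as little as 0.02 Gbps (20 Mbps) toward servers in another part of the data center when all servers transmit at once.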
Proprietary hardware that scales up, not out: The load balancers in the conventional architecture are used in pairs in a 1+1 resiliency configuration. When the load becomes too great for the load balancers, operators replace the existing load balancers with a new pair having more capacity, because it is impossible to add a single load balancer to obtain more capacity.