Large-scale datacenters typically comprise organized clusters of hardware machines running collections of standard software packages, such as web servers, database servers, and the like. For fault tolerance and management reasons, the machines in a datacenter are typically divided into multiple clusters, each of which is independently monitored and managed by a framework that coordinates resources for software applications. In one embodiment, the framework may be a Windows Azure™ Fabric Controller that provisions, supports, monitors, and commands the virtual machines (VMs) and physical servers that make up the datacenter.
In existing datacenters, each tenant is deployed to a single cluster for its entire lifecycle, which allows the tenant's deployment to be managed by a single framework. This configuration may limit the tenant's growth, however, because expansion is restricted to the machines within that single cluster. The tight coupling between tenants and clusters requires datacenter operators to maintain the capacity of each cluster at a level that will satisfy the potential future requirements of the tenants deployed on that cluster. Often, this results in clusters operating at a low current utilization rate in anticipation of possible future needs. Even when excess capacity is maintained, it only improves the likelihood that a tenant's future needs will be supported. There is no guarantee that a tenant's scale request will fall within the reserved capacity and, therefore, at times a tenant may be unable to obtain the capacity it requires.
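By way of illustration only, the single-cluster limitation described above can be sketched in a few lines of Python. The `Cluster` class, its fields, the tenant names, and the machine counts are all hypothetical and chosen for the example; they do not correspond to any actual framework interface. The sketch shows a cluster held at low utilization that still rejects a scale request once its reserved headroom is exhausted, even though a second cluster sits idle.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical model of one cluster managed by a single framework."""
    name: str
    total_machines: int
    # Maps tenant name -> number of machines currently allocated to it.
    allocated: dict = field(default_factory=dict)

    def free_machines(self) -> int:
        return self.total_machines - sum(self.allocated.values())

    def utilization(self) -> float:
        return sum(self.allocated.values()) / self.total_machines

    def scale(self, tenant: str, extra: int) -> bool:
        """Grant a scale request only if this one cluster can absorb it."""
        if extra > self.free_machines():
            return False  # the tenant cannot grow beyond its own cluster
        self.allocated[tenant] = self.allocated.get(tenant, 0) + extra
        return True


# Cluster A is kept half empty in anticipation of future growth.
cluster_a = Cluster("A", total_machines=100,
                    allocated={"tenant1": 30, "tenant2": 20})
# Cluster B is idle, but tenant1 is bound to cluster A and cannot use it.
cluster_b = Cluster("B", total_machines=100)

print(cluster_a.utilization())          # 0.5
print(cluster_a.scale("tenant1", 40))   # True: fits within the headroom
print(cluster_a.scale("tenant1", 20))   # False: only 10 machines remain
```

In this sketch the second request fails although cluster B has 100 free machines, which is the stranded-capacity situation that the tight coupling between tenants and clusters produces.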
Limiting a service to one cluster also creates a single point of failure for that service. If the framework controlling that cluster fails, then the entire cluster will fail and all services supported on the cluster will be unavailable.