An organization (e.g., a large company) may have a data center with hundreds or thousands of servers to run its application programs. These applications may include web servers, web searching, web crawling, social networking, customer relationship management (“CRM”), enterprise resource planning (“ERP”), accounting, human resource management, and so on.
A data center typically has a management and deployment (“M&D”) system that controls the overall allocation of computing resources (e.g., servers and disk space) in the data center. To help with the management, an M&D system may logically organize the servers of the data center into groups of servers that are logically related in some way. Each logical grouping of servers may be referred to as an environment. Each server in an environment may be assigned a specific functionality, referred to as a machine function. For example, a web search service may need multiple environments that each support a sub-service such as searching or crawling. Each environment may have multiple servers that each support a single machine function needed to support the sub-service of the environment. For example, an environment that supports searching may have some servers with machine functions that support retrieving results and others that support ranking results. The combination of an environment and machine function is referred to as a primary tenant.
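By way of illustration only, the relationship between servers, environments, machine functions, and primary tenants may be sketched as follows. The server names, the machine function names, and the (server, environment, machine function) tuple layout are hypothetical and are not part of any particular M&D system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimaryTenant:
    """An (environment, machine function) pair, as defined in the text."""
    environment: str
    machine_function: str


def group_by_primary_tenant(assignments):
    """Map each primary tenant to the sorted list of servers assigned to it.

    `assignments` is an iterable of (server, environment, machine_function)
    tuples; this layout is an assumption made for illustration.
    """
    tenants = defaultdict(list)
    for server, environment, machine_function in assignments:
        tenants[PrimaryTenant(environment, machine_function)].append(server)
    return {tenant: sorted(servers) for tenant, servers in tenants.items()}


# Hypothetical servers for a web search service with two environments.
ASSIGNMENTS = [
    ("srv-01", "search", "retrieve-results"),
    ("srv-02", "search", "retrieve-results"),
    ("srv-03", "search", "rank-results"),
    ("srv-04", "crawl", "fetch-pages"),
]

tenants = group_by_primary_tenant(ASSIGNMENTS)
```

In this sketch, the two servers that retrieve results in the searching environment belong to one primary tenant, while the server that ranks results in the same environment belongs to a different primary tenant.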
To ensure that each primary tenant has sufficient computing resources, an organization may allocate more than enough servers to meet the anticipated peak demand. As a result, the central processing unit (“CPU”) utilization and the disk space utilization of the servers may be relatively low. Attempts have been made to allow other applications to run on these servers, referred to as co-location of applications, so that the computing resources are not wasted. These co-located applications, which are typically batch jobs, are referred to as secondary tenants. They are secondary in the sense that the primary tenant is given a higher priority so that its processing can be performed in a timely manner. For example, an organization may have data analytics applications that run as secondary tenants. Each primary tenant executing at a server may use the local file system of that server, and the secondary tenants may use a distributed file system. The use of a local file system by the primary tenants helps improve their performance because their data is stored locally. The use of a distributed file system by the secondary tenants helps ensure that their data will be accessible even if a secondary tenant is moved to a different server.
In addition to an M&D system, a data center may provide a distributed file system that also supports security of the data and replication of the data to help ensure its reliability. One such file system is the Hadoop Distributed File System (“HDFS”), which supports storing data on a local storage device (e.g., disk) of each server. The HDFS includes a global Name Node (“NN”) running on a dedicated server and a Data Node (“DN”) running on each server. The NN manages the file system namespace, selects the DNs on which each block of a file is to be stored, and maintains a mapping of blocks to DNs. The HDFS replicates each block (e.g., 256 MB) three times by default. It tries to place a first replica on the server that created the block, a second replica on another server in the rack that contains the server that created the block, and a third replica on a server in a different rack. To store a block, a client sends a request to the NN, the NN returns a list of servers on which replicas of the block are to be stored, and the client requests the DN of each server in the list to store a replica. To access a block, a client sends a request to the NN, the NN returns a list of the servers that store replicas of the block, and the client asks the DN of each server in the list in turn to provide access to the replica that it stores, until access is successfully provided. Each DN manages the blocks on its local storage according to the NN's commands and accesses the blocks on behalf of clients. The HDFS recreates lost replicas while trying to avoid overloading the data center. A replica may be lost for various reasons, such as a failure at a server that stores a replica or the reimaging of a disk that stores a replica. A disk can be reimaged for a variety of reasons. For example, a disk may be reimaged when an environment is to be redeployed or restarted from scratch, when the M&D system conducts resiliency testing, or when a disk has undergone maintenance.
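The replica placement and block access described above may be sketched, under simplifying assumptions, as follows. The sketch is not the HDFS implementation and does not use any HDFS API; the function names, the `rack_of` topology table, and the `fetch` callback are hypothetical and are introduced only for illustration.

```python
import random


def choose_replica_servers(writer, rack_of, rng=random):
    """Pick up to three servers to hold the replicas of one block.

    `rack_of` maps a server name to its rack name (a hypothetical table).
    The order follows the placement described in the text: the writing
    server, another server in the same rack, then a server in another rack.
    """
    writer_rack = rack_of[writer]
    same_rack = sorted(s for s, r in rack_of.items()
                       if r == writer_rack and s != writer)
    other_rack = sorted(s for s, r in rack_of.items() if r != writer_rack)

    chosen = [writer]
    if same_rack:
        chosen.append(rng.choice(same_rack))
    if other_rack:
        chosen.append(rng.choice(other_rack))
    return chosen


def read_block(block_id, replica_servers, fetch):
    """Ask each replica holder in turn until one read succeeds.

    `fetch(server, block_id)` is a hypothetical callback standing in for a
    request to the Data Node on `server`; it raises OSError on failure.
    """
    for server in replica_servers:
        try:
            return fetch(server, block_id)
        except OSError:
            continue  # this replica is unavailable; try the next holder
    raise OSError(f"no replica of block {block_id} could be read")
```

In this sketch, a read fails only when every server in the list is unavailable, which is why the loss of a replica to a server failure or a disk reimaging does not by itself make a block inaccessible.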
To help ensure a robust computing environment, an M&D system may collect various types of performance statistics on a per-server basis. For example, an M&D system may collect average CPU utilization information for each server on a periodic basis (e.g., every two minutes). As another example, an M&D system may track each reimaging of a disk on a per-server basis.
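As a minimal sketch of such per-server bookkeeping, the following keeps CPU utilization samples and reimaging times keyed by server. The class name, the method names, and the use of seconds and percentages as units are assumptions made for illustration and do not reflect the interface of any particular M&D system.

```python
from collections import defaultdict
from statistics import mean


class ServerStats:
    """Per-server CPU utilization samples and disk reimaging times."""

    def __init__(self):
        # server -> list of (timestamp in seconds, CPU utilization in percent)
        self.cpu_samples = defaultdict(list)
        # server -> list of reimaging timestamps in seconds
        self.reimage_times = defaultdict(list)

    def record_cpu(self, server, timestamp_s, utilization_pct):
        self.cpu_samples[server].append((timestamp_s, utilization_pct))

    def record_reimage(self, server, timestamp_s):
        self.reimage_times[server].append(timestamp_s)

    def average_cpu(self, server):
        """Return the mean utilization for `server`, or None if no samples."""
        samples = self.cpu_samples.get(server, [])
        if not samples:
            return None
        return mean(pct for _, pct in samples)

    def reimage_count(self, server):
        return len(self.reimage_times.get(server, []))
```

For example, samples recorded every 120 seconds correspond to the two-minute collection period mentioned above.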