Cloud-based data centers may be employed by an enterprise in a variety of settings for running service applications and maintaining data for business and operational functions. For example, a data center within a networked system may support operation of a variety of differing service applications (e.g., web applications, email services, and search engine services). These networked systems typically include a large number of nodes distributed throughout one or more data centers, in which each node is associated with a physical machine. For example, FIG. 1 illustrates a system 100 having a number of computing nodes 110 (nodes 1 through N). Each computing node 110 may execute one or more Virtual Machines (“VMs”), such as virtual machines A through D illustrated in FIG. 1. A virtual machine assignment platform 150 may assign new virtual machines 160 to be executed by a particular computing node 110 (e.g., virtual machine X might be assigned to be executed by computing node 1). Due partly to the substantial number of computing nodes 110 that may be included within a large-scale system, detecting anomalies within various nodes 110 can be a time-consuming and costly process.
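The assignment behavior described above might be sketched as follows. This is an illustrative example only, not part of the original disclosure; the class names, the least-loaded placement policy, and the VM-count load metric are all assumptions made for clarity.

```python
# Illustrative sketch (assumed, not from the disclosure): a minimal
# assignment platform that places a new virtual machine on the computing
# node currently running the fewest VMs.

from dataclasses import dataclass, field

@dataclass
class ComputingNode:
    node_id: int
    vms: list = field(default_factory=list)  # names of VMs on this node

class VirtualMachineAssignmentPlatform:
    """Hypothetical platform that assigns new VMs to computing nodes."""

    def __init__(self, nodes):
        self.nodes = nodes

    def assign(self, vm_name: str) -> int:
        # Least-loaded placement: pick the node with the fewest VMs.
        target = min(self.nodes, key=lambda n: len(n.vms))
        target.vms.append(vm_name)
        return target.node_id

# Mirror FIG. 1: several nodes, some already hosting VMs A through D.
nodes = [ComputingNode(1, ["A", "B"]), ComputingNode(2, ["C"]), ComputingNode(3, [])]
platform = VirtualMachineAssignmentPlatform(nodes)
print(platform.assign("X"))  # node 3 is empty, so VM X lands there: prints 3
```

A real platform would of course weigh richer signals (CPU, memory, affinity constraints) rather than a bare VM count; the point here is only the shape of the assignment step.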
For example, the computing nodes 110 in the system 100 may be susceptible to hardware errors, software failures, misconfigurations, or bugs affecting the software installed on the nodes 110. Therefore, it may be necessary to inspect the software and/or hardware to fix errors (e.g., a disk failure) within the nodes 110. Generally, undetected software and/or hardware errors, or other anomalies, within the nodes 110 may adversely affect the functionality offered to component programs (e.g., tenants) of a customer's service application residing on the nodes 110.
At the present time, data-center administrators are limited to an individualized process that employs manual efforts directed toward reviewing hardware and software issues individually on each node 110 in a piecemeal fashion. Moreover, the system 100 may represent a dynamic platform that is constantly evolving as it operates. For example, there may be a large number of nodes 110 running various combinations of component programs. As such, the configuration of these nodes can vary for each component-program combination. Further, the configurations may be progressively updated as new services and/or features are added to virtual machines and/or node hardware is replaced.
Conventional techniques that attempt to detect misconfigurations are reactive in nature. For example, conventional techniques are typically invoked only upon a failure issue being detected. At that point, an administrator within the hosting service might be tasked with manually diagnosing the issue and ascertaining a solution. This can make it difficult for a data center to achieve reliability beyond the “four nines” (that is, 99.99% reliability).
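For context, the “four nines” figure corresponds to a strict downtime budget. The back-of-the-envelope arithmetic below is an illustrative addition (not part of the original text), assuming downtime is measured against a 365.25-day year:

```python
# Allowed downtime for an availability target of N "nines".
# 99.99% availability leaves only about 52.6 minutes of downtime per year.

def allowed_downtime_minutes(nines: int) -> float:
    minutes_per_year = 365.25 * 24 * 60          # 525,960 minutes
    unavailability = 10 ** (-nines)              # e.g., 4 nines -> 0.0001
    return minutes_per_year * unavailability

print(round(allowed_downtime_minutes(4), 1))     # prints 52.6
```

With manual, reactive diagnosis, a single incident can easily consume that entire annual budget, which is why exceeding four nines is so difficult under the conventional approach.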
What is needed are systems and methods to accurately and efficiently detect anomalies within computing nodes and thereby improve data center reliability.