In a cloud infrastructure, many core components (e.g., operating system components, agents running in the host environment of each physical machine in the cloud, or the like) are often updated independently and continuously to fix or enhance features of the cloud infrastructure. Deploying a problematic component broadly to the cloud (e.g., rolling out a new build or component to thousands or millions of computing devices) may cause downtime for many virtual machines, which could severely impact customers and lead to significant profit loss. The cloud infrastructure often has a variety of configurations, in both hardware and software, and the initial impact of a failure due to a deployment may be hidden because the cloud as a whole appears healthy even though specific configurations are severely impacted. In a conventional system, such a failure is often detected only days after the deployment, by which time the impact radius may already be wide.
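To illustrate how an aggregate health signal can hide a configuration-specific failure, the following is a minimal sketch rather than a description of any particular system. The function name `failure_rates`, the configuration labels "A" and "B", and all node counts are hypothetical values chosen only for illustration.

```python
from collections import defaultdict


def failure_rates(records):
    """Return (overall_rate, per_config_rates) for (config, failed) pairs."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for config, failed in records:
        totals[config] += 1
        if failed:
            failures[config] += 1
    overall = sum(failures.values()) / sum(totals.values())
    per_config = {c: failures[c] / totals[c] for c in totals}
    return overall, per_config


# Hypothetical fleet (assumed numbers): 9,900 nodes on configuration "A"
# with 10 failures, and 100 nodes on configuration "B" with 60 failures.
records = (
    [("A", True)] * 10
    + [("A", False)] * 9890
    + [("B", True)] * 60
    + [("B", False)] * 40
)

overall, per_config = failure_rates(records)
print(f"overall failure rate: {overall:.2%}")
for config, rate in sorted(per_config.items()):
    print(f"configuration {config}: {rate:.2%}")
```

With these assumed numbers, fewer than 1% of nodes fail overall, while more than half of the nodes on configuration "B" fail, so a fleet-wide metric alone would not reveal the problem.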
There are challenges in detecting failures and correlating them to specific causes in the cloud infrastructure. Because infrastructure components are highly coupled, a failure may be caused by defects in multiple deployed components. There may also be multiple sources of failures, such as deployment failures, settings changes, workloads, or hardware issues. Thus, failure signals may be noisy. Furthermore, the latency of failures may vary, making failures difficult to pinpoint. Immediate failures may happen seconds or minutes after a deployment, whereas non-immediate failures may happen hours or days after the deployment.