As the number of datacenter devices increases, the deployment of hardware and software for cloud infrastructure and services becomes more complicated. When operating systems and software components will be installed and configured across many physical and virtual machines, it can be very difficult for a datacenter operator to detect when one or more of many components fails, identify which components have failed, and analyze the failure to determine the appropriate resolution.
Typically, existing datacenter components expose multiple error-detection mechanisms that are not tied together and that do not work well across multiple servers or operating system instances. As a result, in current configurations, it is up to the operator to manually correlate and interpret error logs and events on multiple systems and/or devices to diagnose failures. This can be especially difficult to identify the source of timing-dependent or transient issues, such as temporary network glitches, security intrusions, deployment failures etc., across a large number of datacenter devices and distributed services. When a datacenter service provider configures their cloud service, it can take multiple days or weeks to install the service successfully.