Systems management is a difficult task. It generally includes enterprise-wide administration of distributed computer systems, and may involve one or more of the following tasks: managing hardware inventories, server availability monitoring and metrics, software inventory and installation, anti-virus and anti-malware management, user activity monitoring, capacity monitoring, security management, storage management, and network capacity and utilization monitoring.
Software failures have become commonplace as computer systems grow larger and more complex. Detection and diagnosis of software errors is a major administration cost, especially in server farm or data center environments where problems often go undetected.
Many software failures are caused by configuration errors. Such errors can arise from a variety of causes, including administrator mistakes, disk corruption, software bugs and malware. Since software configurations are persistent, configuration errors cannot be easily fixed by simply rebooting. Typical solutions involve prolonged manual troubleshooting sessions, or re-imaging the problematic machines at the risk of losing data. Managing software configurations in a large data center with tens to hundreds of thousands of machines is even more costly, due to the large number of computers and their diverse applications and workloads.
While several approaches have attempted to automate configuration error troubleshooting, they all rely on administrators or other users to detect the symptoms of errors in the first place. However, such manual detection is unreliable. To list a few examples, inexperienced users may not correlate the application failures they are experiencing with configuration errors; in a data center environment, administrators cannot afford to simultaneously monitor thousands of machines; in the worst case, a user may never notice anything on the surface, while her machine's underlying security policy leaves doors wide open to attackers. As a result, a user may detect a configuration error only after a long delay, when severe damage has already been done, making it impossible to recover the badly corrupted machine state.
Improved techniques are needed for troubleshooting configuration errors.