Traditional systems management is largely ad hoc. Application developers lack structured guidance and a framework for thinking about how to manage their applications and achieve high reliability. The expected behaviors of applications are largely reverse engineered. Moreover, operating systems do not provide a holistic system for developers to leverage.
The complexity of systems is becoming too great for operators to understand. Significant time is spent tracking down dependencies and latent errors. Where instrumentation is not available, operators need to take a process dump, since this is the only way to determine what requests are executing and the current state.
Systems today do a poor job of alerting users to potential and actual problems. Users cannot readily tell which applications are installed, and do not know whether the system and applications have the right files and versions, are configured correctly for how they are used, are configured securely for the environment, and are operating optimally without running out of resources. Moreover, applications are not easily debugged across multiple machines because there is no common application and transaction context.
Operators also cannot easily determine application dependencies, whether on files, components, configuration settings, security settings, or devices such as storage area networks and routers. The system can neither warn users that a change may break other applications nor use this information to help identify the root cause of a problem.
Reactive monitoring is the most common approach today: alerts tell the user that a failure occurred, but not what caused it. Advanced scripts and providers can produce more informative and actionable alerts, but they lack an infrastructure for performing root cause analysis, so additional diagnostics are often needed for troubleshooting. A further problem with reactive monitoring is that an alert often comes too late: the application is already unavailable to users. Monitoring can help by triggering failover or by taking a server offline through a load-balancing device. However, the system should be sufficiently intelligent to detect potential problems in an application before they become failures.
Other problems can be detected only by looking across multiple machines and clients. Examples include distributed intrusion detection and degradations in application performance. Had administrators been able to see a deviation from expected performance, trace the root cause to configuration changes using captured snapshots, and solve the problem before users complained, many massive network performance problems and much downtime could have been avoided.
Conventionally, problems with distributed applications are determined only by looking at historical data or trends from a user perspective. Administrators often do not know whether their replication backlog is a problem, and must first run the service and log operational metrics to establish a baseline with warning and critical thresholds.
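By way of illustration only, the conventional baselining practice described above might be sketched as follows. The function names, the sample values, and the choice of two and three standard deviations above the mean for the warning and critical thresholds are assumptions made for this example; they are not taken from any particular monitoring product.

```python
from statistics import mean, pstdev


def derive_thresholds(samples, warn_sigmas=2.0, crit_sigmas=3.0):
    """Derive a baseline and thresholds from logged operational metrics.

    Hypothetical helper: the sigma multipliers are illustrative defaults,
    not values prescribed by any standard.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to establish a baseline")
    baseline = mean(samples)
    spread = pstdev(samples)  # population standard deviation of the log
    return {
        "baseline": baseline,
        "warning": baseline + warn_sigmas * spread,
        "critical": baseline + crit_sigmas * spread,
    }


def classify(value, thresholds):
    """Compare a new observation against the derived thresholds."""
    if value >= thresholds["critical"]:
        return "critical"
    if value >= thresholds["warning"]:
        return "warning"
    return "ok"


# Example: replication backlog sizes logged while the service was running.
logged_backlog = [10, 12, 11, 13, 9, 11, 12, 10]
thresholds = derive_thresholds(logged_backlog)
status = classify(20, thresholds)
```

The sketch makes the limitation plain: no threshold exists until the service has already been run long enough to collect the samples, which is the shortcoming noted above.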
What is needed is an improved mechanism for management infrastructure.