One of the major challenges for administrators of distributed information technology environments is maintaining service level objectives, or SLOs. To accomplish this goal, it is important that the administrator pinpoint the potential root cause of a system fault as soon as user transaction performance begins to degrade, so that corrective action can be taken before an SLO violation occurs, such as when the user perceives major service degradation and/or disruption.
Existing methods for application-level root cause prediction primarily focus on either predicting performance from historical data or solving queuing network models, with the goal of either provisioning the system appropriately or limiting traffic access in order to satisfy the SLO for incoming traffic. These methods, however, do not take into account system failures, which can result in unexpected performance values and thus jeopardize capacity planning estimations and the limits imposed on the arriving traffic.
Existing methods for network failure prediction are typically based on either transmission control protocol (TCP) related data or management information base (MIB) variables, both of which present highly non-stationary characteristics, making it difficult to accurately model the dynamics of the system without large amounts of information and processing.
Accordingly, there exists a need for techniques for predicting the root cause of performance degradation and failure in a distributed computing system while avoiding the complexity and inefficiency found in conventional techniques for predicting network degradation and failure.