The invention relates generally to the field of fault detection and localization in complex systems. More specifically, embodiments of the invention relate to methods and systems for automatically modeling transaction flow dynamics in distributed transaction systems for fault detection and localization.
Today, numerous Internet services such as Amazon, eBay and Google have changed the traditional business model. With the abundance of Internet services, there is an unprecedented need to ensure their operational availability and reliability. Even minutes of service downtime can lead to severe revenue loss and user dissatisfaction.
An information system for an Internet service is typically large, dynamic, and distributed and can comprise thousands of individual hardware and software components. A single failure in one component, whether hardware or software related, can cause an entire system to be unavailable. Studies have shown that the time taken to detect, localize, and isolate faults contributes to a large portion of the time to recover from a failure.
Transaction systems that serve user requests, such as Internet services and others, receive large numbers of transaction requests from users every day. These requests flow through sets of components according to specific application software logic. With such a large volume of user visits, it is unrealistic to monitor and analyze each individual user request.
Data from software log files, system audit events, network traffic statistics, etc., can be collected from system components and used for fault analysis. Since operational systems are dynamic, this data is the observable manifestation of their internal states. Given the distributed nature of information systems, evidence of fault occurrence is often scattered among the monitored data.
Advanced monitoring and management tools that help system administrators interpret monitoring data are available. IBM Tivoli, HP OpenView, and the EMC InCharge suite are some of the products in the growing market of system management software. Most current tools support some form of data preprocessing and enable users to view the data with visualization functions. These tools are useful for a system administrator since it is impracticable to manually scan a large amount of monitoring data. However, these tools employ simple rule-based correlation with little embedded intelligence for reasoning.
Rule-based tools generate alerts based on violations of predetermined threshold values. Rule-based systems are therefore stateless and do not manage dynamic data analysis well. The lack of intelligence results from the difficulty in characterizing the dynamic behavior of complex systems. Characterization is inherently system-dependent in that it is difficult to generalize across systems with different architectures and functionality.
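The stateless nature of threshold rules can be illustrated with a minimal Python sketch. It is not taken from any of the products named above; the names `ThresholdRule`, `check_rules`, `cpu_percent` and `queue_length`, and the limit values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical rule: alert when `metric` rises above `limit`."""
    metric: str
    limit: float


def check_rules(sample, rules):
    """Return one alert string per rule violated by a single sample.

    The function keeps no history: each sample is judged in isolation,
    which is what makes this style of rule checking stateless.
    """
    alerts = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is not None and value > rule.limit:
            alerts.append(f"{rule.metric}={value} exceeds {rule.limit}")
    return alerts


# Assumed example rules and samples (illustrative values only).
RULES = [ThresholdRule("cpu_percent", 90.0), ThresholdRule("queue_length", 500.0)]

print(check_rules({"cpu_percent": 95.0, "queue_length": 120.0}, RULES))
print(check_rules({"cpu_percent": 40.0, "queue_length": 120.0}, RULES))
```

Because no state is carried from one sample to the next, a metric that drifts steadily upward while staying below its limit would never raise an alert in this sketch, which is one way to read the observation that such rules do not handle dynamic data well.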
Detection and diagnosis of faults in complex information systems is a formidable task. Current approaches for fault diagnosis use event correlation, which collects and correlates events to locate faults based on known dependencies between faults and symptoms. Due to the diversity of runtime environments, many faults experienced in an interconnected system are not well understood. As a result, it is difficult to obtain precise fault-symptom dependencies.
One attempt at understanding relationships between system faults and symptoms was performed by the Berkeley/Stanford Recovery-Oriented Computing (ROC) group. JBoss middleware was modified to monitor traces in J2EE (Java 2 Enterprise Edition) platforms. JBoss is an open source, J2EE-based application server implemented in pure Java. J2EE is a programming platform for developing and running distributed multi-tier architecture applications, based largely on modular components running on an application server. Two methods were developed to collect traces for fault detection and diagnosis. However, with the huge volume of user visits, monitoring, collecting and analyzing the trace of every user request was problematic. Most methods of collecting user request traces result in a large monitoring overhead.
It is a major challenge for system administrators to detect and isolate faults effectively in large and complex systems. The challenge is how to correlate the collected data effectively across a distributed system for observation, fault detection and identification. It is therefore desirable to develop a method and system that considers the mass characteristics of user requests in complex systems and has self-cognition capability to aid in fault analysis.