The present invention relates generally to detecting faults, and more particularly to detecting faults of a computer system based on user requests.
Over the past few decades, Internet services have become extremely popular. On-line searching, shopping, and transactions have become part of people's lives. Behind popular web sites are typically large, dynamic, and distributed systems that may consist of many components such as servers, software, and networking and storage equipment.
While the components themselves are often complicated, the dynamic interaction between these components introduces another level of complexity. Additionally, new software and hardware components are added to these systems as new functionalities are added.
Further, Internet services may receive a large number of user requests on a daily basis. These requests behave like probes into the system. In particular, these requests often test various parts of the system in a brute-force manner by causing the system parts to work together to service the request. Each request is conventionally serviced by a sequence of components (e.g., an Enterprise JavaBean, a Servlet, etc.) of the system. A fault or bug in the system could affect the operation of the sequence of components used to service the user requests.
Detection and diagnosis of faults in such a system have traditionally been, and continue to be, a formidable challenge. One approach to fault detection is based on event correlation. Event correlation typically involves monitoring networks and other systems in order to identify patterns of events that might signify a fault or risk to the system. Most event correlation systems (and other root cause analysis techniques) are based on static dependency models describing the relationships among the hardware and software components in the system. These dependency models may be used to determine which components might be responsible for a given problem. One limitation of traditional dependency models is the difficulty of generating and maintaining an accurate model of a constantly evolving Internet service. Another limitation is that it is often difficult to construct the mapping between faults and their symptoms (i.e., patterns of events) in a large and complex system. In general, such a mapping is often system-dependent and cannot easily be generalized across different systems.
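By way of illustration only, the use of a static dependency model to narrow down the components that might be responsible for a problem can be sketched as follows. The component names, the DEPENDS_ON table, and the helper functions are hypothetical and are not taken from any particular event correlation system. The sketch assumes that a fault lies in a component on which every symptomatic component depends, which is only one of several possible heuristics.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical static dependency model: each component maps to the
# components it directly depends on. All names are illustrative only.
DEPENDS_ON: Dict[str, List[str]] = {
    "web_server": ["servlet"],
    "servlet": ["ejb"],
    "ejb": ["database"],
    "report_job": ["database"],
    "database": [],
}


def reachable(component: str, model: Dict[str, List[str]]) -> Set[str]:
    """Return the component plus everything it transitively depends on."""
    seen: Set[str] = set()
    stack = [component]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        # Components missing from the model are treated as having no dependencies.
        stack.extend(model.get(current, []))
    return seen


def candidate_causes(symptomatic: Iterable[str],
                     model: Dict[str, List[str]]) -> Set[str]:
    """Return the components shared by the dependency closure of every
    symptomatic component (assumed heuristic: one common cause)."""
    closures = [reachable(c, model) for c in symptomatic]
    if not closures:
        return set()
    return set.intersection(*closures)


if __name__ == "__main__":
    # Symptoms observed on two otherwise unrelated components.
    print(candidate_causes(["web_server", "report_job"], DEPENDS_ON))
```

In this sketch the table must be edited by hand whenever a component is added or a relationship changes, which reflects the maintenance difficulty noted above for constantly evolving services.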