The present invention relates to representing a worsening system state transition as a causal sequence and using the causal sequence at runtime for problem determination, avoidance, and recovery from problems in complex systems.
Within the past two decades the development of raw computing power coupled with the proliferation of computer devices has grown at exponential rates. This growth along with the advent of the Internet has led to a new age of accessibility—to other people, other systems, and to information. This boom has also led to some complexity in the systems. The simultaneous explosion of information and integration of technology into everyday life has brought on new demands for how people manage and maintain computer systems.
Systems today are highly complex, comprising numerous components (servers, virtual machines, CPUs) from different vendors operating in a geographically distributed environment. A clustered Enterprise Application Server environment and a Pervasive Computing environment are examples of such complex systems. These systems are also dynamic: new components can join to provide additional functions while the entire system is running and, conversely, components of the system can leave at runtime.
Additionally, the complexity of these systems and the way they work together has created, and will continue to create, a shortage of skilled IT workers to manage all of the systems. The problem is expected to increase exponentially, just as the dependence on technology has. As access to information becomes omnipresent through PCs, hand-held and wireless devices, the stability of current infrastructure, systems, and data is at an increasingly greater risk of suffering outages and general disrepair.
One new model of computing, termed “autonomic computing,” shifts the fundamental definition of the technology age from one of computing to one defined by data. The term “autonomic” comes from an analogy to the autonomic nervous system in the human body, which adjusts to many situations automatically without any external help. Similarly, the way to handle the problem of managing a complex IT infrastructure is to create computer systems and software that can respond to changes in the IT (and ultimately, the business) environment, so that the systems can adapt, heal, and protect themselves. In an autonomic environment, components work together, communicating with each other and with high-level management tools. They can manage or control themselves and each other.
Self-healing technologies are one of the pillars of autonomic computing and on demand computing. Self-healing requires detecting problematic operations (either proactively through predictions or otherwise) and then initiating corrective action without disrupting system applications. The first step in this direction is problem determination. Self-healing systems are typically rule driven: rules define what the system should do to diagnose and correct a problem. However, most problem determination and mitigation solutions today assume that the system is entirely deterministic and hence use automation to fix problems based on rules developed at design time.
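By way of illustration only, a rule-driven approach of the kind described above might be sketched as follows. The sketch is not taken from any particular product; the names Rule, evaluate, and RULES, the metric keys, the thresholds, and the action strings are all hypothetical and are chosen only to show rules that are fixed at design time.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A snapshot of observed system metrics (hypothetical keys and units).
Metrics = Dict[str, float]


@dataclass
class Rule:
    name: str
    condition: Callable[[Metrics], bool]  # diagnosis: does the symptom match?
    action: str                           # correction: label of the action to take


def evaluate(rules: List[Rule], metrics: Metrics) -> List[str]:
    """Return the corrective action of every rule whose condition matches."""
    return [rule.action for rule in rules if rule.condition(metrics)]


# Hypothetical rules, written once at design time with fixed thresholds.
RULES = [
    Rule("heap_pressure",
         lambda m: m.get("heap_used_pct", 0.0) > 90.0,
         "restart_server"),
    Rule("slow_response",
         lambda m: m.get("response_ms", 0.0) > 2000.0,
         "add_cluster_member"),
]

if __name__ == "__main__":
    # Only the heap rule matches this snapshot.
    print(evaluate(RULES, {"heap_used_pct": 95.0, "response_ms": 150.0}))
```

Because the thresholds in such a sketch are fixed in advance, a component whose behavior varies from what the rule author anticipated (for example, heap usage that holds at 89 percent while steadily degrading service) matches no rule, and the rules themselves must be edited to accommodate the variation.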
Traditionally, problem determination in complex systems is reactive in nature, typically performed by gathering and then inspecting log and/or trace files. The log/trace files contain raw data that is analyzed to extract meaning. However, these log/trace files do not have a way to capture any particular variations of a component's behavior. Therefore, in a traditional diagnostic process, the rules are modified and/or the components are re-instrumented to accommodate the behavior variations.