The present invention relates in general to distributed computing environments having a plurality of processing nodes, and more particularly, to a technique for referencing failure information representative of multiple related failure conditions occurring within the distributed computing environment at the same or different nodes of the plurality of nodes of the environment.
A distributed system is often difficult to manage due to complicated and dynamic component interdependencies. Managers are used in a distributed system and are responsible for obtaining information about the activities and current state of components within the system, making decisions according to an overall management policy, and performing control actions to change the behavior of the components. Generally, managers perform five functions within a distributed system, namely configuration, performance, accounting, security, and fault management.
None of these five functions is particularly suited for diagnosing faults occurring in complex distributed systems. Diagnosing faults using manual management is time consuming and requires intimate knowledge of the distributed system. Also, it is difficult to isolate faults in a distributed environment because a resource limitation on one system may cause a performance degradation on another system, which is not apparent unless one is very familiar with the architecture of the distributed application and how the components work together.
In distributed computing environments, many software components are exploited in an interdependent fashion to provide function to the end-user. End-users are often not aware of the interdependencies of the various components; they only know that the environment provides some expected function. The components may be distributed amongst the various compute nodes of the distributed computing environment. In cases where a component experiences a failure, this failure can ripple throughout the distributed computing environment, causing further failures on those components that rely upon the failed component for a specific function. This ripple effect continues, with components affecting the function of those components that rely upon them, until ultimately the end-user is denied the expected function.
The challenge in this environment is to trace the failure condition from its symptom (in this case, the denial of the expected function) to as close to the root cause of the problem (in this case, the original failed component) as possible in an acceptable period of time. Complicating this effort is the fact that multiple failure conditions may exist in the distributed computing environment at the same time. To properly identify the root cause, the failure conditions related to the failure symptom in question must be identified, and information pertaining to those failure conditions must be collected. Unrelated failure conditions should be eliminated from the analysis, since repair of these conditions would not lead to a repair of the failure symptom in question. Identifying these related failures has heretofore required an intimate knowledge of the distributed computing environment, its implementation, and the interdependencies of its components. Even with this level of knowledge, problem determination efforts are non-deterministic efforts, based on the "best guess" of the problem investigator as to where the root cause of the failure condition in question may reside. The larger and more complex the distributed computing environment, the more components introduced into the environment, the more difficult it becomes to reliably "guess" where the source of the failure may reside. The knowledge necessary to undertake the problem determination effort resides only with the distributed computing environment manufacturer, making it difficult for distributed computing environment administrators to effectively identify and resolve failures.
Briefly summarized, the present invention comprises in one aspect a method for referencing failure information in a distributed computing environment having a plurality of nodes. The method includes: creating a failure report by recording information on a failure condition upon detection of the failure condition at a node of the distributed computing environment; and assigning an identifier to the failure report and storing the failure report at the node, wherein the identifier uniquely identifies the failure report including the node within the distributed computing environment creating the failure report, and where within storage associated with the node the failure report is located.
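As a minimal illustrative sketch of this aspect, and not the claimed implementation, the following Python fragment records a failure report in storage associated with a node and returns an identifier that names both the node and the report's location within that node's storage. The names `FailureId`, `NodeStore`, `record_failure`, and `fetch`, and the use of an in-memory list to stand in for the node's storage, are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class FailureId:
    """Hypothetical identifier: the creating node plus the storage location."""
    node: str    # node that created the failure report
    offset: int  # location of the report within that node's storage


@dataclass
class NodeStore:
    """Hypothetical stand-in for the storage associated with one node."""
    node: str
    reports: List[Dict[str, str]] = field(default_factory=list)

    def record_failure(self, description: str) -> FailureId:
        # Create the failure report by recording information on the condition.
        self.reports.append({"description": description})
        # The identifier names the node and the report's position in its storage.
        return FailureId(node=self.node, offset=len(self.reports) - 1)

    def fetch(self, fid: FailureId) -> Dict[str, str]:
        # A report can only be read from the node that stored it.
        if fid.node != self.node:
            raise KeyError(f"report {fid} is stored on another node")
        return self.reports[fid.offset]


store = NodeStore(node="node-a")
fid = store.record_failure("disk quota exceeded")
```

Because the identifier carries the node name and the storage location, a holder of `fid` needs no further knowledge of the environment to find the report.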
In another aspect, the present invention comprises a method for referencing failure information in a distributed computing environment having a plurality of nodes. This method includes: creating a first program failure report upon detection of a first program failure condition at a first node; assigning a first identifier to the first program failure report which uniquely identifies the first program failure report including the node within the distributed computing environment creating the first program failure report and where within storage associated with that node the first program failure report is located; creating a second program failure report upon detecting a second program failure condition at a second node which is related to the first program failure condition, wherein the second program failure report is created by recording information on the second program failure condition at the second node, and wherein the second node and the first node may comprise the same node or different nodes within the distributed computing environment; and assigning a second identifier to the second program failure report which uniquely identifies the second program failure report including the second node within the distributed computing environment creating the second program failure report, where within storage associated with the second node the second program failure report is located, and the first identifier for the first program failure report on the first program failure condition related to the second program failure condition.
Systems and at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform the above-summarized methods for referencing failure information in a distributed computing environment are also described and claimed herein.
To restate, presented is a technique for referencing failure information within a distributed computing environment. Persistent storage is employed which is accessible to all components of the environment. Reports of failures detected by system components, recorded to the persistent storage, preferably describe the nature of the failure condition, possible causes of the condition, and recommended actions to take in response to the condition. An identifier token is assigned which uniquely identifies a specific failure report for the failure condition, including the location where the record resides within the distributed computing environment and the location within the persistent storage of that node where the record resides. Using this identifier, the failure report can be located from any location within the distributed computing environment and retrieved for use in problem determination and resolution analysis. This identifier is passed between related components of the environment as part of a component's response information. Should a component experience a failure due to another component's failure, the identifier is obtained from the first component's response information and included within the information recorded as part of the second component's failure report.
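The passing of the identifier between related components can be sketched as follows. This is a hedged illustration under stated assumptions, not the disclosed implementation: the `storage` dictionary stands in for the persistent storage, the identifier is modeled as a `(node, index)` tuple, and the components `database_query` and `web_request` and the helper `record` are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical token: (node name, index in that node's storage).
FailureId = Tuple[str, int]


@dataclass
class FailureReport:
    description: str
    related: Optional[FailureId] = None  # identifier of the failure that caused this one


# Hypothetical persistent storage, keyed by node name.
storage: Dict[str, List[FailureReport]] = {"node-a": [], "node-b": []}


def record(node: str, description: str,
           related: Optional[FailureId] = None) -> FailureId:
    """Store a report on the given node and return its identifier."""
    storage[node].append(FailureReport(description, related))
    return (node, len(storage[node]) - 1)


def database_query() -> dict:
    """Component on node-a: fails and returns its identifier in the response."""
    fid = record("node-a", "database connection refused")
    return {"ok": False, "failure_id": fid}


def web_request() -> dict:
    """Component on node-b: relies on database_query and fails when it fails."""
    response = database_query()
    if not response["ok"]:
        # Include the first component's identifier in this component's own report.
        fid = record("node-b", "cannot render page",
                     related=response["failure_id"])
        return {"ok": False, "failure_id": fid}
    return {"ok": True, "failure_id": None}


result = web_request()
```

After the call, the report stored on `node-b` cites the report stored on `node-a`, so the dependent failure is explicitly tied to the failure that triggered it.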
In accordance with the principles of the present invention, the previous need to guess where the distributed computing environment problem determination should begin to search for failure records is eliminated. The unique failure identifier provided to the end-user application will permit problem determination efforts to locate the failure reports regardless of where they reside within the distributed computing environment. The invention removes the need to identify failure reports related to the condition being investigated. The failure identifier references a failure report which in turn references one (or more) other failure report(s) associated with it. The related report will cite another related report, etc. The need to identify the failure reports which relate to the failure is thus removed, since each failure report explicitly cites the next related failure report.
In addition, an intimate understanding of the implementation and interdependencies of the distributed computing environment is no longer necessary to trace a failure condition. The present invention places the capability for performing problem determination and resolution back into the hands of the distributed computing environment administrator, instead of requiring the intervention of the distributed computing environment manufacturer. Guessing where the problem determination efforts should proceed from a specific point is no longer an issue, since the failure report will cite the next related problem, and hence where the investigation should next proceed. When no related link is reported, problem determination efforts begin at that point. It is no longer necessary to separate problem symptoms from root causes. The failure report for a problem symptom will specifically cite a report for its cause, or at the very least a next link in the list of related failures which will ultimately lead to the root cause. In accordance with the principles of the present invention, the problem symptom becomes a useful starting point for problem determination efforts, whereas in previous systems it only clouded the effort. If the problem symptom is not also the root cause of the problem, it will contain a link to a chained list of problems and eventually lead problem determination efforts to the root cause.
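Tracing from a symptom to a root cause along the chained reports can be illustrated with the short Python sketch below. It is offered only as an example under simplifying assumptions: the `reports` table, the `(node, index)` identifier format, the sample descriptions, and the function `trace_to_root` are all hypothetical, and each report here cites at most one related report even though a report may in general reference more than one.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical token: (node name, index in that node's storage).
FailureId = Tuple[str, int]

# Hypothetical stored reports; each cites at most one related (causing) report.
reports: Dict[FailureId, dict] = {
    ("node-c", 0): {"description": "end-user request denied",
                    "related": ("node-b", 4)},
    ("node-b", 4): {"description": "service unavailable",
                    "related": ("node-a", 2)},
    ("node-a", 2): {"description": "file system full",
                    "related": None},
}


def trace_to_root(start: FailureId) -> List[FailureId]:
    """Follow related-report links until a report cites no further link."""
    chain: List[FailureId] = []
    seen = set()
    current: Optional[FailureId] = start
    while current is not None and current not in seen:
        seen.add(current)      # guard against accidental cycles
        chain.append(current)
        current = reports[current]["related"]
    return chain


chain = trace_to_root(("node-c", 0))
root = chain[-1]  # the report with no related link: where investigation begins
```

The investigator supplies only the identifier returned with the symptom; the last entry in `chain` is the report that cites no further link.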