According to recent trends, the information technology (IT) systems of companies are becoming ever larger and more complex. For example, in some businesses, the IT system is no longer just infrastructure for the business, but needs to act in partnership with the business to increase its value and competitiveness. Furthermore, the rapid growth of IT systems is not limited to very large companies; even mid-sized companies can now have hundreds of servers. In addition, the rapid growth of server virtualization technology is accelerating this trend.
Despite the recent trends of massive growth in data centers and other IT systems, the administrators of IT organizations are still required to efficiently manage these large and complex IT systems to keep them running properly. When a problem occurs, the administrators need to recognize that there is a problem, analyze the problem, and then resolve the problem as soon as possible.
Typically, monitoring the health of an IT system and analyzing any problems that may arise is carried out using some form of availability and performance management software. This software usually includes the ability to discover devices in the IT system, identify their connections, and sometimes also identify locations where problems are occurring. Through use of such management software, administrators are relieved of a number of tedious operation tasks that they previously had to perform manually. However, as mentioned above, IT systems themselves are growing rapidly, while IT budgets are typically becoming more restricted. As a result, each administrator is responsible for managing a very large area of the IT system, and the size of these systems can make it difficult to determine the actual location and “root cause” of a problem that might occur. For example, some vendors provide root cause analysis products, but these products fail to provide any mechanism for determining the time range of events to be input to the analysis engine. Consequently, the calculation is inefficient and the accuracy of the analysis is inadequate. Therefore, an ongoing need exists for a solution to assist administrators in finding the root cause of failures, defects or other occurrences in an IT system environment.
Root Cause Analysis is a technology for locating a node in an information system which is the root cause of an error in the information system environment. For example, in an information system having a topology made up of a number of different nodes, such as servers, switches, storage systems, and the like, if one of those nodes should cause a failure, error or other occurrence in the system, the failure will affect any other nodes connected to that node in the system topology, and error event messages may be issued to the administrator from a number of different nodes in the IT system. Thus, in some cases it can be very difficult for an administrator to determine which node in the system is the actual root cause of the errors.
A root cause analysis engine analyzes the plural error event messages and their relationships to each other, and then outputs a calculated root cause as a result of the analysis. Currently, two well-known root cause analysis technologies are widely used. One is known as Smarts Codebook Correlation Technology. The other utilizes expert system analysis, also referred to as a rule deduction engine, examples of which include the Rete algorithm and Hitachi's ES/Kernel.
Smarts Codebook Correlation Technology (CCT)
CCT generates a codebook automatically based on both a Behavior Model and a Topology. Problems can be readily output by inputting a group of events as symptoms to the codebook. However, CCT fails to provide any mechanism for determining the time range of events to be input to the codebook. Thus, there is no means for determining which generated events belong together in time. If the input range of events is incorrect, then the results that are produced may also be incorrect. For example, when one error occurred a day ago and another error occurs today, it is often realistic to conclude that the two errors are unrelated. However, CCT analysis is typically carried out including past events whenever an event occurs, and thus the same events must be processed repeatedly, which can reduce the accuracy of the analysis and greatly increase the cost of calculating the root cause of an event.
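By way of illustration only, the effect of the input time range on a codebook lookup can be sketched as follows. This is a minimal sketch and not the Smarts implementation: the codebook contents, the event names, the 10-minute window, and the functions `decode` and `symptoms_in_window` are all hypothetical, and the nearest-match rule (smallest set difference) is an assumed stand-in for the codebook decoding step.

```python
from datetime import datetime, timedelta

# Hypothetical codebook: each candidate problem maps to the symptoms it would cause.
CODEBOOK = {
    "switch1_down": {"server1_unreachable", "server2_unreachable", "storage1_path_lost"},
    "server1_nic_fault": {"server1_unreachable"},
    "storage1_ctrl_fault": {"storage1_path_lost", "storage1_io_error"},
}


def decode(observed):
    """Return the problem whose symptom set differs least from the observed set."""
    # The size of the symmetric difference counts symptoms that are
    # expected but missing, plus symptoms that are observed but unexpected.
    return min(sorted(CODEBOOK), key=lambda p: len(CODEBOOK[p] ^ observed))


def symptoms_in_window(events, now, window):
    """Keep only the names of events that occurred within `window` before `now`."""
    return {name for name, ts in events if now - ts <= window}


now = datetime(2024, 1, 2, 12, 0)
events = [
    ("server2_unreachable", now - timedelta(days=1)),    # yesterday's incident
    ("storage1_path_lost", now - timedelta(days=1)),     # yesterday's incident
    ("server1_unreachable", now - timedelta(minutes=2)),  # today's incident
]

# Without a time range, yesterday's events are mixed with today's.
all_symptoms = {name for name, _ in events}
unbounded_result = decode(all_symptoms)

# With an (assumed) 10-minute window, only today's event is considered.
recent_symptoms = symptoms_in_window(events, now, timedelta(minutes=10))
windowed_result = decode(recent_symptoms)

print(unbounded_result)  # switch1_down
print(windowed_result)   # server1_nic_fault
```

In this sketch the unbounded input blames the switch because the stale events happen to complete its symptom set, whereas the windowed input points to the single server, which illustrates how an incorrect input range can change the calculated root cause.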
Traditional Expert System
The “Rete Matching Algorithm” is an example of a traditional expert system. This kind of expert system acts as a rule-based matching algorithm. As discussed by B. Schneier in “The Rete Matching Algorithm”, incorporated herein by reference below, the Rete algorithm was created in the late 1970s to speed up comparisons for pattern matching. Prior to the Rete algorithm, studies showed that older systems spent as much as 90% of their time performing pattern matching. These systems would iterate through the pattern matching process, taking each rule in turn, looking through the data memory to determine whether the conditions for that rule were satisfied, and then proceeding to the next rule. Since then, methods have been found to index data elements and rule conditions, which speeds up program execution but still requires iterating through a series of rules and data elements. The Rete algorithm eliminates a large part of this iterative step, and hence is a substantial improvement over competing algorithms.
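The iterative matching process that preceded the Rete algorithm can be sketched as follows. This is an illustrative sketch only: the fact layout, the rule names, and the function `naive_match` are hypothetical, and each rule is simplified to a list of single-fact conditions.

```python
def naive_match(rules, memory):
    """Test every rule against the whole data memory.

    Returns the names of the satisfied rules and the number of
    condition tests that were performed.
    """
    tests = 0
    conflict_set = []
    for name, conditions in rules:           # take each rule in turn
        satisfied = True
        for cond in conditions:              # every condition of the rule
            found = False
            for fact in memory:              # look through the data memory
                tests += 1
                if cond(fact):
                    found = True
                    break
            if not found:
                satisfied = False
                break
        if satisfied:
            conflict_set.append(name)
    return conflict_set, tests


def is_server_error(fact):
    return fact["node"].startswith("server") and fact["status"] == "error"


def is_switch_error(fact):
    return fact["node"].startswith("switch") and fact["status"] == "error"


memory = [
    {"node": "server1", "status": "error"},
    {"node": "switch1", "status": "error"},
]
rules = [
    ("server_error", [is_server_error]),
    ("switch_and_server_error", [is_server_error, is_switch_error]),
]

matched, test_count = naive_match(rules, memory)
print(matched)
print(test_count)
```

Note that the condition `is_server_error` is evaluated once for each rule that uses it, and the whole procedure must be repeated whenever the data memory changes.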
The Rete matching algorithm avoids iterating through the data elements by storing the current contents of the conflict set in memory, and only adding and deleting items from the conflict set as data elements are added to and deleted from the memory. For example, in a conventional iterative pattern matching system, when two almost identical rules are added, the entire iterative process is carried out for each of the rules. However, in the Rete algorithm, the conditions that almost identical rules have in common are tested only once, because the rules share nodes in Rete's tree-structured sorting network. The Rete pattern compiler builds a network of individual sub-conditions. It first looks at each element of a production rule individually, and builds a chain of nodes that tests for each attribute individually. Then, it looks at comparisons between elements, and connects the chains of nodes with new nodes. Finally, terminator nodes are added to signal that all the conditions for the production rule have been satisfied. Additional production rules are grafted onto the same network. If they have no test in common, they do not interact at all.
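The two ideas described above, namely sharing tests between rules and updating the conflict set only when data elements are added or deleted, can be sketched as follows. This is a greatly simplified sketch and not a full Rete network: it has no nodes for comparisons between elements, every new fact is still checked against every shared test, and the class `TinyNetwork` and all rule, test, and fact names are hypothetical.

```python
class TinyNetwork:
    """Shared single-fact tests plus an incrementally maintained conflict set."""

    def __init__(self):
        self.tests = {}      # test name -> predicate, shared by all rules
        self.matches = {}    # test name -> ids of facts that pass the test
        self.rules = {}      # rule name -> names of the tests it requires
        self.facts = {}      # fact id -> fact
        self.conflict_set = set()

    def add_test(self, name, predicate):
        # A test is created once, however many rules refer to it.
        if name not in self.tests:
            self.tests[name] = predicate
            self.matches[name] = {
                fid for fid, fact in self.facts.items() if predicate(fact)
            }

    def add_rule(self, name, test_names):
        self.rules[name] = tuple(test_names)
        self._refresh(self.rules[name])

    def add_fact(self, fact_id, fact):
        self.facts[fact_id] = fact
        touched = [t for t, pred in self.tests.items() if pred(fact)]
        for t in touched:
            self.matches[t].add(fact_id)
        self._refresh(touched)

    def remove_fact(self, fact_id):
        del self.facts[fact_id]
        touched = [t for t, ids in self.matches.items() if fact_id in ids]
        for t in touched:
            self.matches[t].discard(fact_id)
        self._refresh(touched)

    def _refresh(self, touched):
        # Only rules that use a touched test can enter or leave the conflict set.
        for rule, needed in self.rules.items():
            if any(t in touched for t in needed):
                if all(self.matches[t] for t in needed):
                    self.conflict_set.add(rule)
                else:
                    self.conflict_set.discard(rule)


net = TinyNetwork()
net.add_test("server_error", lambda f: f["kind"] == "server" and f["status"] == "error")
net.add_test("switch_error", lambda f: f["kind"] == "switch" and f["status"] == "error")
net.add_rule("server_problem", ["server_error"])
net.add_rule("network_problem", ["server_error", "switch_error"])  # shares a test

net.add_fact(1, {"kind": "server", "status": "error"})
after_server = set(net.conflict_set)

net.add_fact(2, {"kind": "switch", "status": "error"})
after_switch = set(net.conflict_set)

net.remove_fact(1)
after_removal = set(net.conflict_set)

print(after_server, after_switch, after_removal)
```

Here the two rules share the `server_error` test, so each fact is checked against that test once rather than once per rule, and the stored match sets allow the conflict set to be updated without rescanning the data memory.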
Related art includes U.S. Pat. No. 4,727,487, entitled “Resource allocation method in a computer system”, to Masui et al.; U.S. Pat. No. 4,761,746, entitled “Dynamic reconstruction method for discrimination network”, to Tano et al.; U.S. Pat. No. 4,868,763, entitled “Knowledge-based system having plural processors”, to Masui et al.; U.S. Pat. No. 5,146,537, entitled “Method for judging whether conditions are satisfied by using a network having a plurality of nodes representing the conditions”, to Tano et al.; U.S. Pat. No. 5,353,385, entitled “Inference method and apparatus for use with knowledge base system and knowledge base system support method and apparatus using the inference method and apparatus”, to Tano et al.; U.S. Pat. No. 7,107,185, entitled “Apparatus and method for event correlation and problem reporting”, to Yemini et al.; U.S. Pat. No. 7,254,515, entitled “Method and apparatus for system management using codebook correlation with symptom exclusion”, to Ohsie et al.; Schneier, B., “The Rete Matching Algorithm”, Dr. Dobb's Journal, Dec. 5, 2002; and Forgy, C. L., “Rete: A fast algorithm for the many pattern/many object pattern matching problem”, ARTIFICIAL INTELLIGENCE, Vol. 19, no. 1, 1982, pp. 17-37, the entire disclosures of which are incorporated herein by reference.