Information Technology (IT) systems, methods and computer program products, including, for example, computer networks, have grown increasingly complex with the use of distributed client/server applications, heterogeneous platforms and/or multiple protocols all on a single physical backbone. The control of traffic on networks is likewise generally moving from centralized information systems departments to distributed work groups. The growing utilization of computer networks is not only causing a move to new, high speed technologies, but is, at the same time, making the operation of computer networks more critical to day-to-day business operations. Furthermore, as computer systems become more distributed and, thereby, more inter-related, the number of different components of a system that may result in problems typically increases. For example, application integration, including integration across heterogeneous systems, has increased the complexity of systems and the interdependence of systems while also increasing reliance on such systems, for example, for mission critical applications.
This increase in the complexity of systems may make problem determination and/or resolution more complex. In conventional systems, components, such as applications, middleware, hardware devices and the like, generate data that represents the status of the component. This component status data will, typically, be consumed by some management function utilized to monitor the system and/or for problem analysis/resolution. The management function may, for example, be a user reading a log file or it may be a management application that is consuming the data for analysis and/or display. In conventional systems, components and component owners are responsible for determining what data is provided, in terms of format, completeness and/or order of the data as well as the meaning of the data.
Such an ad hoc approach to component status information may be convenient for the component developer; however, it may increase the complexity of the management function. For example, the management function may need some context for a status message from the component. In particular, the management function will, typically, need to know what a data message from a component represents, the format of the data, the meaning of the data and what data is available. For example, the management function may need to know that a particular message (e.g., message “123”) from a particular component (e.g., component “ABC”) has a certain number of fields (e.g., three fields) and what data is in each of the fields (e.g., a first field is a timestamp, a second field is a queue name and a third field is a value for the queue name). Typically, the management system can derive no data other than the data provided by the component. Furthermore, this approach assumes that the consumer of the data knows not only the significance of the data fields, but also the format of the fields (e.g., that the timestamp is in the mm/dd/yy format).
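By way of illustration only, the burden placed on the consumer of such a message may be sketched as follows. The pipe-delimited wire format, the table name MESSAGE_LAYOUTS and the function name parse_status are assumptions introduced for this sketch and are not taken from any particular component; only the message “123”, component “ABC”, three-field layout and mm/dd/yy timestamp follow the example above.

```python
from datetime import datetime

# Hypothetical out-of-band knowledge the consumer must hold for each
# (component, message id) pair. None of it is carried in the message itself.
MESSAGE_LAYOUTS = {
    ("ABC", "123"): {
        "fields": ("timestamp", "queue_name", "queue_value"),
        "timestamp_format": "%m/%d/%y",  # mm/dd/yy
    },
}


def parse_status(component, message_id, raw, delimiter="|"):
    """Interpret a raw status message using the layout known for its source."""
    layout = MESSAGE_LAYOUTS.get((component, message_id))
    if layout is None:
        # Without prior knowledge of the layout, the data cannot be interpreted.
        raise KeyError(f"no layout known for {component} message {message_id}")

    values = raw.split(delimiter)
    if len(values) != len(layout["fields"]):
        raise ValueError("unexpected number of fields")

    record = dict(zip(layout["fields"], values))
    # The consumer must also know how the timestamp is formatted.
    record["timestamp"] = datetime.strptime(
        record["timestamp"], layout["timestamp_format"]
    )
    return record


record = parse_status("ABC", "123", "03/15/04|ORDER.QUEUE|42")
```

In this sketch, every new component or message type requires a new entry in the layout table before its data can be used, which reflects the added complexity described above.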
Furthermore, an error message may be reported by a component other than the component that has the problem. Thus, a management function may need to know not only of the existence of the components, but also the relationships between the components that are managed. Without such knowledge, the management function may not recognize that the source of the problem is not the component reporting the error.
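For purposes of illustration, knowledge of component relationships may be represented as a dependency map that a management function walks to find components, other than the reporting component, that could be the source of a problem. The component names and the function name candidate_sources below are hypothetical and serve only as a minimal sketch of the idea.

```python
# Hypothetical dependency map: each component lists the components it relies on.
DEPENDS_ON = {
    "web_app": ["message_queue", "database"],
    "message_queue": ["disk_array"],
    "database": ["disk_array"],
    "disk_array": [],
}


def candidate_sources(reporter, depends_on):
    """Return the reporter plus everything it depends on, directly or
    transitively, in breadth-first order."""
    seen = [reporter]
    pending = [reporter]
    while pending:
        current = pending.pop(0)
        for dependency in depends_on.get(current, []):
            if dependency not in seen:
                seen.append(dependency)
                pending.append(dependency)
    return seen


candidates = candidate_sources("web_app", DEPENDS_ON)
```

Without such a map, only the reporting component ("web_app" in this sketch) would be considered, even where the fault lies in a component on which it depends.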
One difficulty that may arise from the use of differing component status formats is in the analysis of problems for differing components or from different versions of a component. Knowledge bases have conventionally been used to map component status data, such as error log messages, that are reported by components to symptoms and eventually to fixes for problems. For example, there are symptom databases utilized by International Business Machines Corporation, Armonk, N.Y., that map WebSphere error log messages to symptoms and fixes. These databases typically work on the assumption that if a specified error message (e.g., message “123”) is seen from a specified component (e.g., component “XYZ”), then a particular symptom is occurring (e.g., the performance is slow) and a predefined remedy (e.g., increase the parameter “buffsize” to 10) will likely fix the problem.
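The assumption underlying such a symptom database may be sketched, by way of example only, as a lookup keyed on the component and message identifier. The structure below is a hypothetical simplification and does not represent the schema of any actual symptom database; the names SYMPTOM_DB and diagnose are introduced solely for this sketch.

```python
# Hypothetical symptom table keyed on (component, message id).
SYMPTOM_DB = {
    ("XYZ", "123"): {
        "symptom": "performance is slow",
        "remedy": "increase the parameter 'buffsize' to 10",
    },
}


def diagnose(component, message_id):
    """Return the known symptom and remedy, or None if the pair is unknown."""
    return SYMPTOM_DB.get((component, message_id))


known = diagnose("XYZ", "123")
unknown = diagnose("XYZ", "124")  # e.g., a renumbered message in a new version
```

Because the lookup depends on an exact match, a different component, or a different version of the same component that reports the same condition under another identifier or format, yields no result in this sketch.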
Furthermore, the use of differing component status formats for differing components or for different versions of a component may make it difficult for IT specialists to write correlation rules to obtain status information about the system from data provided by two different components. For example, if a first vendor of a monitoring tool includes a certain status information field when reporting the amount of occupied memory of a device and a second vendor does not include the same field, or includes the field but calls it something different or formats it differently, the information provided by the devices may be difficult to use. In other words, the IT specialist may have to convert the fields to a common format before the information in the fields can be used efficiently. Due to time constraints, IT specialists typically cannot afford to write development and validation rules that consider all conditions.
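The conversion to a common format described above may be illustrated with the following minimal sketch. The two vendor report layouts, the field names (mem_used_kb, OccupiedMemory, host, DeviceName) and the function names are assumptions made for illustration and do not correspond to any actual monitoring tool.

```python
# Hypothetical unit multipliers for the second vendor's text format.
UNIT_FACTORS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def normalize_vendor_a(report):
    """First vendor (hypothetical): numeric field in kilobytes."""
    return {
        "device": report["host"],
        "occupied_memory_bytes": report["mem_used_kb"] * 1024,
    }


def normalize_vendor_b(report):
    """Second vendor (hypothetical): differently named text field with a unit."""
    amount, unit = report["OccupiedMemory"].split()
    return {
        "device": report["DeviceName"],
        "occupied_memory_bytes": int(float(amount) * UNIT_FACTORS[unit.upper()]),
    }


def devices_over(reports, threshold_bytes):
    """A single correlation rule applied to the normalized reports."""
    return [r["device"] for r in reports if r["occupied_memory_bytes"] > threshold_bytes]


reports = [
    normalize_vendor_a({"host": "dev1", "mem_used_kb": 2048}),
    normalize_vendor_b({"DeviceName": "dev2", "OccupiedMemory": "3 MB"}),
]
over_limit = devices_over(reports, 2 * 1024 ** 2)
```

In this sketch, one conversion routine is needed per vendor format before a single rule can be written against the data, which illustrates the effort an IT specialist may have to expend for each additional format.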