FIG. 4 shows an example of an application error 400 that occurs in a computer infrastructure. It is known that the application error 400 may have a number of different causes 410 and 410′, which lead to different causation sequences 420, as shown in FIG. 4. It will be seen, for example, that by looking back in time it is at least theoretically possible to analyse the causation sequences 420, identify the original cause 410 of the error, and eliminate the other initial causes 410′ of the application error. The extreme complexity of modern computer infrastructures, however, makes this analysis a time-consuming task. A network administrator may have to review a large number of entries in a system log to exclude possible causes of the error.
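By way of illustration only, the backward analysis of the causation sequences 420 can be sketched as a walk over a graph of recorded cause-and-effect links. The sketch below is a minimal example and does not form part of any system described herein; the event names and the `caused_by` mapping are hypothetical, and in practice the links would have to be reconstructed from log entries.

```python
from collections import deque


def candidate_root_causes(caused_by, error):
    """Walk backwards from `error` and return events with no recorded cause.

    `caused_by` maps an event to the events that may have led to it
    (a hypothetical structure, assumed to be derived from log entries).
    """
    seen = {error}
    queue = deque([error])
    roots = set()
    while queue:
        event = queue.popleft()
        parents = caused_by.get(event, [])
        if not parents:
            # Nothing earlier is recorded, so this is a candidate initial cause.
            roots.add(event)
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return roots


# Hypothetical causation sequences leading to one application error.
caused_by = {
    "application_error": ["db_timeout", "stale_cache"],
    "db_timeout": ["router_failure"],
    "stale_cache": ["feed_delay"],
}
roots = candidate_root_causes(caused_by, "application_error")

# Candidates that the log rules out are eliminated, leaving the original cause.
ruled_out = {"feed_delay"}
remaining = roots - ruled_out
```

Even in this small example every candidate must be checked against the log before it can be eliminated, which is the step that becomes time-consuming in a large infrastructure.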
Apparatuses and methods for analysing a computer infrastructure to identify such possible causes of errors are known in the art. In these prior art apparatuses and methods, a structure of the computer infrastructure needs to be analysed. The analysing system identifies different types of messages that are typically sent from devices, such as routers or peripheral devices, within the computer infrastructure. The analysing system allows the structure of the computer infrastructure to be identified, even if the infrastructure is highly complicated and changeable. The analysing system is suitable for distributed application architectures.
It is possible to use the information about the infrastructure obtained through the analysing system to identify a malfunction or an error within part of the computer infrastructure by analysing an error message or the lack of an expected message. Sometimes the analysing system will enable the possible source of the malfunction within the computer infrastructure to be diagnosed or reported to an administrator of the computer infrastructure. In other cases it is not possible, or it is difficult, to diagnose or identify the malfunction within the part of the computer infrastructure. The network administrator may need to physically send a person to investigate, diagnose or identify the malfunction within the part of the computer infrastructure. Some of the prior art systems may require a detailed knowledge of the structure of the computer infrastructure. In particular, the addition of users and/or new peripheral devices to the computer infrastructure, and/or their removal from it, will require a reprogramming of the analysing system. The reprogramming of the analysing system may need to be carried out on a regular basis as users and/or new peripheral devices are added to the computer infrastructure. The reprogramming of the analysing system is time-consuming as well as being liable to error.
Many institutions (for example financial institutions) rely upon and use extensive computer infrastructures that receive, process and accumulate a large amount of time-critical data from external sources. Examples of the external sources include, but are not limited to, information from the Bloomberg and the Thomson Reuters information providers. This data from external sources is distributed to the users of the computer infrastructure. The distribution of the data to the users of the computer infrastructure results in a large amount of data traffic in the computer infrastructure. Effective data distribution within the computer infrastructure is often critical for the operation of the financial institution. If, for example, one of the routers within the computer infrastructure malfunctions or breaks down, it is possible that one or more of the users of the computer infrastructure would not receive the data at all, or one or more of the users of the computer infrastructure would not receive the data in a timely manner. The ineffective data distribution within the computer infrastructure may lead to erroneous investment decisions being made. There is therefore a need to provide a system that can analyse and monitor data distribution malfunctions within a computer infrastructure.
Several prior art documents are known which address similar problems within computer infrastructures.
U.S. Pat. No. 7,451,210 (IBM) discloses a method for predicting the occurrence of future critical events in a computer cluster having a series of nodes. The method records system performance parameters, such as temperature, central processing unit utilisation time, processor number, user time, idle time, and input/output time, at predetermined intervals of time. The method also records the occurrence of past critical events, such as hardware or software errors or node failures, in the computer cluster. Time-series models and rule-based classification schemes are used to associate various system performance parameters with the occurrence of critical events and fed into a Bayesian network to predict the occurrence of future critical events in the computer cluster.
U.S. Pat. No. 7,280,988 (Netuitive) teaches a monitoring system for a computer infrastructure. The monitoring system of U.S. Pat. No. 7,280,988 includes a baseline model that automatically captures and models normal system behaviour of the computer infrastructure. The monitoring system further includes a correlation model that employs a multivariate autoregression analysis to detect abnormal system behaviour of the computer infrastructure, and an alarm service that processes and scores a variety of alerts to determine an alarm status and to implement an appropriate response action for the computer infrastructure when a threshold value is reached. The baseline model decomposes input variables into a number of components representing relatively predictable behaviours so that the erratic component of the computer infrastructure may be isolated for further processing. Modelling and continually updating the components of the computer infrastructure separately permits an accurate identification of the erratic component of the input variable, which typically reflects abnormal patterns when they occur.
The baseline model of the Netuitive monitoring system is updated on an ongoing basis, which allows the model to adapt to changes in the normal operational pattern of the computer infrastructure. The Netuitive monitoring system does not maintain a large database of historical analysis and does not enable a periodic re-evaluation of the historical data. The Netuitive monitoring system is able to establish abnormal patterns and is able to present a list of events related to the abnormal patterns.
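Purely as an illustration of the general idea of separating a predictable component from an erratic component, and not as a description of the Netuitive system itself, the following minimal sketch treats the predictable behaviour as a repeating pattern of known period and takes the erratic component to be whatever remains. The function name `split_erratic`, the fixed `period` and the sample values are assumptions made for this example; no multivariate autoregression or alarm scoring is attempted.

```python
def split_erratic(values, period):
    """Split `values` into a repeating baseline and an erratic remainder.

    The baseline for each sample is the mean of all samples that share its
    position within the period (assumes len(values) >= period).
    """
    slots = [[] for _ in range(period)]
    for i, value in enumerate(values):
        slots[i % period].append(value)
    slot_mean = [sum(slot) / len(slot) for slot in slots]
    baseline = [slot_mean[i % period] for i in range(len(values))]
    erratic = [value - base for value, base in zip(values, baseline)]
    return baseline, erratic


# Hypothetical load samples alternating between two levels, with one outlier.
values = [10, 20, 10, 20, 10, 20, 10, 50]
baseline, erratic = split_erratic(values, period=2)

# The sample with the largest erratic component is the most abnormal one.
most_abnormal = max(range(len(values)), key=lambda i: abs(erratic[i]))
```

Because the baseline here is recomputed from all available samples, an outlier also shifts the baseline of its own slot; a system that updates its baseline continually would have to limit that effect.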
US patent application US 2006/0020924 (Lu and Chang) discloses a system, a method and a computer program product for monitoring performance of groupings of a computer infrastructure and applications using statistical analysis. The method, system and computer program monitor managed unit groupings of executing software applications and execution infrastructure to detect deviations in performance of the computer infrastructure. Logic acquires time-series data from at least one managed unit grouping of the executing software applications and the execution infrastructure. Other logic derives a statistical description of expected behaviour from an initial set of acquired data. The logic derives a statistical description of operating behaviour from the acquired data that corresponds to a defined moving window of time slots. The logic compares the statistical description of expected behaviour with the description of operating behaviour and the logic reports predictive triggers. The logic identifies instances in which the statistical description of the operating behaviour deviates from the statistical description of the expected behaviour of the computer infrastructure so as to indicate a statistically significant probability that an operating anomaly exists within the at least one managed unit grouping corresponding to the acquired time-series data.
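The comparison of expected behaviour with operating behaviour over a moving window can be illustrated by the following minimal sketch. It is an assumption-laden example and not the method of US 2006/0020924: the statistical description is reduced to a mean and a standard deviation, the deviation test is a simple z-score on the window mean, and the function name, the threshold and the sample values are all hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def window_anomalies(samples, baseline_size, window_size, z_threshold=3.0):
    """Return start indices of moving windows that deviate from the baseline.

    Expected behaviour is described by the mean and standard deviation of the
    first `baseline_size` samples; operating behaviour by each window's mean.
    """
    initial = samples[:baseline_size]
    mu, sigma = mean(initial), stdev(initial)
    if sigma == 0:
        raise ValueError("baseline has no variation; z-score is undefined")
    flagged = []
    for start in range(baseline_size, len(samples) - window_size + 1):
        window = samples[start:start + window_size]
        # z-score of the window mean under the baseline description.
        z = (mean(window) - mu) / (sigma / sqrt(window_size))
        if abs(z) > z_threshold:
            flagged.append(start)
    return flagged


# Hypothetical response-time samples: steady at first, then a sharp rise.
samples = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 10, 11, 30, 31, 29]
flagged = window_anomalies(samples, baseline_size=8, window_size=3)
```

In this example the first window to be flagged is the one that first contains a raised sample, so the trigger is reported while the rise is still under way.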