Present-day computer infrastructures are very complex, and large numbers of processes, often called threads, run as part of applications on a plurality of devices in the computer infrastructure. The processes organise data processing on the devices, including the fetching, processing and storage of data. Many industry sectors and large numbers of companies rely on these complex computer infrastructures for their operation. Therefore, failures of one or more of the devices, of the applications running on the devices, or even of the whole of the computer infrastructure, could cause a great deal of damage and financial loss.
Time-critical and/or other important data, for example financial and business information, is received from external sources and processed or stored on the computer infrastructure. The malfunctioning of at least one of the devices, or even of the computer infrastructure itself, should be minimised or at least detected as soon as possible. Because current internal computer infrastructures, especially in connection with other external networks such as the World Wide Web, have reached such great complexity, a minor problem in the functioning of one of the devices may impact the performance of the whole computer infrastructure and may cause a system crash of others of the devices or even of the whole computer infrastructure.
Currently, IT administrators conduct much of the forensic examination only after problems on the devices in the computer infrastructure, or in the applications running on the devices, have already occurred. The IT administrators do so by examining protocols or log files of past or recently running processes of the affected devices or applications. One of the devices within the computer infrastructure may malfunction such that one or more of the devices of the computer infrastructure does not receive or process all of the necessary data, or is not able to process the data in a timely manner. This issue may cause erroneous or ineffective running of the applications within the computer infrastructure and may cause wrong decisions to be made within the company. This is particularly true if the decisions are made automatically by at least some of the devices of the computer infrastructure, such as, but not limited to, automated investment decisions made by computer devices of banks and financial institutions.
Computer analysis software for analysing such possible malfunctions or causes of errors within the computer infrastructure is known in the art. An analysing system of the prior art identifies, for example, different types of messages in connection with the running processes on the devices, such as servers, gateways or peripheral devices. Based on analysing the different types of messages, the analysing system may estimate a recent functional status of the specific devices or of the computer infrastructure itself and thus identify the source of the malfunction.
Normally, the analysing system will directly provide to the IT administrator a diagnosis or a report of the possible source of the malfunction within the computer infrastructure, e.g. the affected application or device. Depending on the complexity of the computer infrastructure, it is very often difficult or impossible to diagnose or identify the malfunction within even parts of the computer infrastructure. The IT administrator may need to physically investigate, diagnose or identify the malfunction within at least parts of the computer infrastructure. In complex situations, the IT administrator may have to investigate many different running processes of the possibly affected parts of the computer infrastructure.
An example of an analysing system is the Splunk enterprise software, which enables users to search, monitor and analyse data generated within the computer infrastructure. Splunk captures, indexes and correlates real-time data in a searchable repository. U.S. Patent Application Publication No. 2007/0118491, now issued as U.S. Pat. No. 7,937,344 on May 3, 2011 (Baum et al., assigned to Splunk), describes such a system in more detail.
Co-pending U.S. patent application Ser. No. 12/965,226 (Dodson), published as U.S. Patent Application Publication No. US 2011/0145400 A1 (now U.S. Pat. No. 8,543,689, issued Sep. 24, 2013), discloses an apparatus comprising a plurality of devices connected to the computer infrastructure. An analytics engine is connected to the computer infrastructure and analyses system message data within the computer infrastructure to create a unified multi-dimensional model of the computer infrastructure. The analytics engine is able to create a background model of repetitive operational behaviour occurring within the computer infrastructure. The analytics engine is further able to determine unexpected operational behaviour occurring within the computer infrastructure that may be indicative of a possible malfunction within the computer infrastructure.
U.S. Pat. No. 7,451,210, issued Nov. 10, 2008 (IBM), discloses a method for predicting the occurrence of future critical events in a computer cluster having a series of nodes. The method records system performance parameters, such as temperature, central processing unit utilisation time, processor number, user time, idle time, and input/output time, at predetermined intervals of time. The method also records the occurrence of past critical events, such as hardware or software errors or node failures, in the computer cluster. Time-series models and rule-based classification schemes are used to associate various system performance parameters with the occurrence of critical events, and their results are fed into a Bayesian network to predict the occurrence of future critical events in the computer cluster.
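By way of illustration only, the association of recorded performance parameters with past critical events may be sketched as follows. The sketch is not the method of U.S. Pat. No. 7,451,210: it substitutes a naive Bayes classifier, which is the simplest form of Bayesian network, and omits the time-series models entirely. The parameter names (temp_c, cpu_util), the rule thresholds and the function names are hypothetical.

```python
from collections import Counter

def rules(sample):
    """Hypothetical rule-based classification of one recording interval.

    Returns a tuple of booleans, one per rule. Thresholds are assumptions.
    """
    return (sample["temp_c"] > 70, sample["cpu_util"] > 0.9)

def fit(history):
    """Count how often each rule outcome preceded a critical event.

    history: list of (sample, event_followed) pairs, event_followed a bool.
    """
    prior = Counter()   # event_followed -> number of intervals
    fired = Counter()   # (event_followed, rule index, rule outcome) -> count
    for sample, event in history:
        prior[event] += 1
        for i, outcome in enumerate(rules(sample)):
            fired[(event, i, outcome)] += 1
    return prior, fired

def p_event(model, sample):
    """Estimate the probability that a critical event follows `sample`."""
    prior, fired = model
    total = sum(prior.values())
    score = {}
    for event in (True, False):
        # Laplace smoothing keeps unseen combinations from zeroing the product.
        p = (prior[event] + 1) / (total + 2)
        for i, outcome in enumerate(rules(sample)):
            p *= (fired[(event, i, outcome)] + 1) / (prior[event] + 2)
        score[event] = p
    return score[True] / (score[True] + score[False])

if __name__ == "__main__":
    hot_busy = {"temp_c": 85, "cpu_util": 0.97}
    cool_idle = {"temp_c": 40, "cpu_util": 0.10}
    hot_idle = {"temp_c": 85, "cpu_util": 0.10}
    history = [(hot_busy, True)] * 3 + [(cool_idle, False)] * 5 + [(hot_idle, False)]
    model = fit(history)
    print(round(p_event(model, hot_busy), 3))
    print(round(p_event(model, cool_idle), 3))
```

In this sketch a node that is both hot and heavily loaded receives a high estimated probability of a subsequent critical event, whereas a cool, idle node receives a low one.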
U.S. Pat. No. 7,280,988, issued Oct. 9, 2007 (Netuitive), teaches a monitoring system for a computer infrastructure. The monitoring system of U.S. Pat. No. 7,280,988 includes a baseline model that automatically captures and models normal system behaviour of the computer infrastructure. The monitoring system further includes a correlation model that employs a multivariate autoregression analysis to detect abnormal system behaviour of the computer infrastructure, and an alarm service that processes and scores a variety of alerts to determine an alarm status and to implement an appropriate response action for the computer infrastructure when a threshold value is reached. The baseline model decomposes input variables into a number of components representing relatively predictable behaviours so that the erratic component may be isolated for further processing. Modelling and continually updating the components separately permits an accurate identification of the erratic component of the input variables, which typically reflects abnormal patterns when they occur.
The baseline model of the Netuitive monitoring system is updated on an on-going basis, which allows the model to adapt to changes in the normal operational pattern of the computer infrastructure. The Netuitive monitoring system does not maintain a large database of historical analysis and does not enable a periodic re-evaluation of the historical data. The Netuitive monitoring system is able to establish an abnormal pattern and is able to present a list of events related to the abnormal pattern.
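By way of illustration only, a continually updated baseline with an isolated erratic component and a scored alarm may be sketched as follows. The sketch is not the Netuitive implementation: it replaces the multivariate autoregression analysis with a single-variable, per-slot exponentially smoothed baseline, and the class name, function name and all parameters (period, alpha, alert_limit, alarm_threshold) are hypothetical.

```python
class SeasonalBaseline:
    """Per-slot baseline of one input variable, updated on every observation."""

    def __init__(self, period, alpha=0.2):
        self.period = period          # number of time slots in one cycle
        self.alpha = alpha            # smoothing weight given to new values
        self.level = [None] * period  # predictable component per slot

    def residual(self, t, value):
        """Return the erratic component of `value` at step t, then adapt."""
        slot = t % self.period
        expected = self.level[slot]
        if expected is None:
            # First observation of this slot: nothing to compare against yet.
            self.level[slot] = value
            return 0.0
        # Adapt the baseline so it follows changes in the normal pattern.
        self.level[slot] = (1 - self.alpha) * expected + self.alpha * value
        return value - expected

def alarm_status(residuals, alert_limit, alarm_threshold):
    """Score alerts (large residuals) and compare the score to a threshold."""
    score = sum(1 for r in residuals if abs(r) > alert_limit)
    return score >= alarm_threshold

if __name__ == "__main__":
    baseline = SeasonalBaseline(period=2, alpha=0.5)
    values = [10, 20, 10, 20, 30, 20]
    residuals = [baseline.residual(t, v) for t, v in enumerate(values)]
    print(residuals)
    print(alarm_status(residuals, alert_limit=5, alarm_threshold=1))
```

Because the baseline stores only the current level per slot, the sketch also reflects the limitation noted above: no historical data is retained for later re-evaluation.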
U.S. Patent Application Publication No. 2006/0020924 (U.S. patent application Ser. No. 11/152,966 filed Jun. 15, 2005, Lu and Chang) discloses a system, a method and a computer program product for monitoring performance of groupings of a computer infrastructure and applications using statistical analysis. The method, system and computer program product monitor managed unit groupings of executing software applications and execution infrastructure to detect deviations in performance of the computer infrastructure. Logic acquires time-series data from at least one managed unit grouping of the executing software applications and the execution infrastructure. Other logic derives a statistical description of expected behaviour from an initial set of acquired data. The logic derives a statistical description of operating behaviour from the acquired data that corresponds to a defined moving window of time slots. The logic compares the statistical description of expected behaviour with the statistical description of operating behaviour and reports predictive triggers. The logic identifies instances in which the statistical description of the operating behaviour deviates from the statistical description of the expected behaviour of the computer infrastructure to indicate a statistically significant probability that an operating anomaly exists within the at least one managed unit grouping corresponding to the acquired time-series data.
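By way of illustration only, the comparison of expected behaviour with operating behaviour over a moving window may be sketched as follows. The sketch assumes that the statistical description is a mean and standard deviation and that a deviation is judged by a z-score of the window mean; the cited publication is not stated here to use this particular test, and the function names and the z_threshold parameter are hypothetical.

```python
from statistics import mean, stdev

def expected_description(initial):
    """Statistical description of expected behaviour from an initial data set."""
    return mean(initial), stdev(initial)

def predictive_triggers(series, baseline_len, window, z_threshold=3.0):
    """Return indices at which the moving window deviates from expectation.

    series:       time-series data for one managed unit grouping
    baseline_len: number of leading samples treated as the initial data set
    window:       number of time slots in the moving window
    """
    mu, sigma = expected_description(series[:baseline_len])
    if sigma == 0:
        raise ValueError("initial data set has no variance; z-score undefined")
    std_err = sigma / window ** 0.5  # standard error of a window mean
    triggers = []
    for end in range(baseline_len + window, len(series) + 1):
        operating = mean(series[end - window:end])
        z = (operating - mu) / std_err
        if abs(z) > z_threshold:
            triggers.append(end - 1)  # index of the newest slot in the window
    return triggers

if __name__ == "__main__":
    normal = [10, 11, 9, 10, 10, 11, 9, 10]
    later = [10, 10, 11, 9, 20, 21, 22, 20]
    print(predictive_triggers(normal + later, baseline_len=8, window=4))
```

In this sketch a trigger is reported for every window that contains at least one of the elevated samples, because even a single such sample shifts the window mean by several standard errors.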