The present disclosure relates generally to a method, an apparatus and a computer program for analysing events in a computer system.
Computer systems continuously generate events as effects of internal actions within hardware or software, and as effects of external stimuli from users, neighbouring computer systems, etc. These events are often referred to as logs, traps or messages, and there exists an ecosystem of solutions that use these information sources to solve various computer-related problems such as troubleshooting, detecting failures and detecting security violations. Monitoring of information technology infrastructure, such as servers, switches, routers, workstations, business systems and transaction systems, without excluding other types of similar units, typically consists of the following approaches:
Network monitoring: Analyses data flow to determine network load, latency, path, etc. May warn if too high or too low a load is identified.
Health monitoring: Collects information from the source, which may for example include CPU, disk space and memory usage, and may warn if usage approaches 100% (or “too much”). It may also warn if a system cannot be reached, by continuously polling the infrastructure.
Intrusion Detection: May for example analyse data flows and logs and match these against a signature database consisting of signatures of known attack patterns of hackers, viruses and/or trojans. If a match is found, a warning is triggered.
Intrusion Prevention: Takes the same approach as Intrusion Detection, but in addition to warning it may also drop the offending packet.
Log Management: Collects logs from infrastructure and generates reports from user input values. It enables warnings based on configured thresholds and levels. Some solutions also parse the logs into more “human-readable” strings.
Security Information and Event Management: Analyses logs, traffic flows, vulnerability information and other security-related information sources, typically by performing “event correlation”. Often has search functionality similar to log management, but potentially more limited.
Activity and performance monitoring: Analyses events from applications (typically web applications) for application-specific performance problems and user activity.
Anti-fraud: Analyses events from financial systems and/or web systems in order to detect whether a session is fraudulent.
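By way of illustration only, the threshold-based warning of health monitoring and the signature matching of intrusion detection described above may be sketched as follows. This is a minimal sketch and not taken from any particular product: the threshold value, the signature names and patterns, and the function names check_health and match_signatures are all hypothetical assumptions introduced for this example.

```python
import re

# Hypothetical definitions; real products ship their own thresholds and signatures.
DISK_USAGE_WARN_PERCENT = 90.0
SIGNATURES = {
    "sql-injection": re.compile(r"union\s+select", re.IGNORECASE),
    "path-traversal": re.compile(r"\.\./\.\./"),
}


def check_health(host: str, disk_usage_percent: float) -> list[str]:
    """Health monitoring: warn when a polled metric reaches a fixed threshold."""
    if disk_usage_percent >= DISK_USAGE_WARN_PERCENT:
        return [f"{host}: disk usage {disk_usage_percent:.0f}% exceeds threshold"]
    return []


def match_signatures(log_line: str) -> list[str]:
    """Intrusion detection: return the names of all signatures found in a log line."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(log_line)]


if __name__ == "__main__":
    print(check_health("web01", 95.0))
    print(match_signatures("GET /item?id=1 UNION SELECT password FROM users"))
    print(match_signatures("GET /index.html"))
```

In both functions a warning is produced only for situations that have been defined in advance, here a fixed threshold and a fixed set of patterns.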
There are collective problems with these approaches.
Typically they are niche, attempting to solve a very specific problem such as detecting that a host is out of disk space or that a specific security attack is occurring. This means that users need to adopt multiple solutions to cover their “issues”, and there may be gaps between those solutions which cause blind spots.
Typically they carry, to some degree, a definition-based approach, whereby they need input from the user or vendor on the structure of the data and on what to look for. They are therefore limited to finding pre-defined situations and are very reliant on good users and good default vendor “definitions”. The act of defining “bad” situations creates a great degree of bias in the detection capability, which typically results in both false positives and false negatives.
Typically the definition-based approach also brings a varying degree of manual maintenance for the solution to continue performing as specified. This is mainly because such solutions rely on events having certain defined structures, because finding events requires defining search filters, and because triggering alerts requires defining “if-then-else”-type rules. All these definitions need to be updated over time, sometimes as frequently as every day.
Typically they are “effect-based”, meaning that they detect a situation that is the “effect” of a problem and not the actual cause. They mostly have difficulty describing the events leading up to the “effect”, which often results in the solution not assisting the actual process of “fixing” the issue.
Since they rely on definitions of “bad”, false positives are typically a problem. Users of these solutions therefore often try to “limit” the false positives by 1) reducing the data input (fewer logs) and 2) “tuning” thresholds and rules so that they do not trigger. Both of these measures result in a decreased detection capability, limiting the value of the solution in case of incidents.
Infrastructure monitoring best practices such as ISO 27001 (ISO/IEC 27001:2005 - Information technology - Security techniques - Information security management systems - Requirements) and PCI DSS (Payment Card Industry Data Security Standard) require continuous log analysis, which is very difficult and expensive using the above products since it means manually searching the logs.
Some of the solutions are not real-time oriented at all (e.g. search-based solutions) and are thus poorly suited to creating an understanding of critical situations as they happen or are underway.
The systems that do have real-time functionality are typically very rule-set based, and work by memorising some sequences of events according to a user-defined rule set. The result of this detection process is typically very basic and “simple”, since the rule set needs to be “understandable” to the user.
Solutions for monitoring information systems today are not satisfactory, since they fail to detect many incidents, bring little or limited value once incidents happen, and have large continuous costs associated with them.