This invention solves the problem of detecting, investigating and remediating security violations in IT (Information Technology) infrastructure, and in particular in cloud computing-based IT infrastructure. As an example, a security violation may involve the modification of a critical software component, such as an executable file, its replacement by one of unknown provenance, or an unexpected modification of a critical configuration file. Alternatively, a process that was started from an executable file that is deemed authentic may exhibit suspicious behavior, such as attempting to read or modify files it does not normally require, attempting to connect to known malicious websites and/or DNS servers, or spawning new processes. Such behaviors may indicate that the program has a vulnerability that is being exploited.
The problem with detecting such security violations is that it is very hard to reason about all of the security risks that a cyber-security threat (attack, infection, etc.) poses to a computer system, in particular about changes to files, the behaviors of processes within a single system, and the behavior of that system on the network. This becomes even harder as one expands the scope to a large set of systems, e.g., a data center or cloud. Existing security mechanisms such as anti-virus (AV) systems and intrusion detection/prevention systems (IDS/IPS) generally suffer from high false negative and/or false positive rates, and this lack of precision limits their effectiveness. In particular, AV systems and signature-based IDS/IPS systems rely on malware and attack signatures and are thus often unable to catch attacks that exploit zero-day vulnerabilities, multi-stage attacks, and advanced persistent threats (APT) that leverage multiple steps, each one of which may appear, in isolation, to be benign. Behavior analysis-based systems were proposed to address these limitations; however, those systems often end up with relatively high false positive rates because they have only limited access to the global information required to build accurate models of system and process behavior.
As an alternative to detecting malicious behavior within a single computer system, a number of approaches aim to lock down systems so as to prevent violations in the first place. These typically implement Mandatory Access Control (MAC) policies, for example by defining which subjects (processes) may access which objects (files, sockets, and the like). Example realizations of MAC policies include LINUX security modules (LSM) like SELINUX, TOMOYO LINUX and APPARMOR and operating systems like TRUSTED SOLARIS and TRUSTED AIX. A similar approach is taken by SYSTRACE (for LINUX), which aims to restrict a process's access to system calls. Note that these terms may be trademarks of the respective owners. For instance, LINUX is a trademark of Linus Torvalds and is a well-known operating system. All of these systems have in common that they confine the behavior of processes by comparing their behavior at run time against a predefined or learned profile (policy). Policies can be derived during a learning phase that observes the processes running in a given system. However, those approaches usually suffer from one or more of the following drawbacks.
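The profile-based confinement described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory profile that maps each subject (here, an executable path) to the set of (object, action) pairs it may use. The names `Access`, `PROFILE`, `is_allowed` and `learn_profile`, as well as the example paths, are illustrative assumptions and do not reflect the policy language of SELINUX, APPARMOR or any other system named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One observed access attempt (hypothetical event format)."""
    subject: str  # process, identified here by its executable path
    obj: str      # object being accessed: file, socket, and the like
    action: str   # e.g. "read", "write", "connect"


# Hypothetical predefined profile: subject -> allowed (object, action) pairs.
PROFILE = {
    "/usr/sbin/httpd": {
        ("/etc/httpd/httpd.conf", "read"),
        ("/var/log/httpd/access.log", "write"),
    },
}


def is_allowed(access, profile=PROFILE):
    """Return True if the access matches the subject's profile.

    Unknown subjects have an empty profile, so everything they do is denied.
    """
    allowed = profile.get(access.subject, set())
    return (access.obj, access.action) in allowed


def learn_profile(observations):
    """Derive a profile from accesses observed during a learning phase."""
    profile = {}
    for a in observations:
        profile.setdefault(a.subject, set()).add((a.obj, a.action))
    return profile
```

A profile learned this way covers only the accesses that happened to occur during the observation window, so any valid behavior that was not exercised in that window is later reported as a violation.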
First, defining these policies has proven to be time consuming and complex. The learning approach is limited by the amount of time it runs and by the spectrum of valid behaviors that a program or process exhibits during this time. Additionally, modern IT systems have become much more agile and dynamic, making it harder for MAC-based systems to adapt. Second, those approaches work on each system independently and thus do not have a global view of the IT infrastructure. Moreover, these systems maintain only an in-host view of events and do not inspect network traffic, whether from a single system or from a collection of systems.
For instance, if multiple suspicious activities appear in multiple systems, this might be an indication of widespread malicious activity and thus could be used to increase the confidence level in determining an out-of-profile behavior, both at the operating system level and at the network level. Sharing a learned profile between different servers at different granularities for similar systems is also very inconvenient. Additionally, the approaches described above are not easily applicable to cloud-based IT infrastructure, where information from cloud Operations Support Systems (OSS) and/or Business Support Systems (BSS), as well as from common hypervisors and virtual and physical networks, provides additional valuable context and data about security-related events.
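The idea that the same suspicious activity seen on several systems should raise confidence can be sketched as follows. This is only an illustration under stated assumptions: the event format (`host`, `behavior`), the function name `correlate`, the base confidence of 0.3 and the scoring rule (each additional host is treated as independent evidence) are all hypothetical and are not taken from the approaches discussed here.

```python
from collections import defaultdict


def correlate(events, base_confidence=0.3, max_confidence=0.95):
    """Score out-of-profile behaviors by how many distinct hosts report them.

    events: iterable of (host, behavior) tuples, one per suspicious observation.
    Returns a dict mapping behavior -> confidence in [0, max_confidence].
    """
    hosts_by_behavior = defaultdict(set)
    for host, behavior in events:
        # A set ignores repeated reports from the same host.
        hosts_by_behavior[behavior].add(host)

    scores = {}
    for behavior, hosts in hosts_by_behavior.items():
        n = len(hosts)
        # Assumed scoring rule: treat each host as independent evidence,
        # so the remaining doubt shrinks as (1 - base_confidence) ** n.
        confidence = 1 - (1 - base_confidence) ** n
        scores[behavior] = min(max_confidence, confidence)
    return scores
```

A system limited to an in-host view can only ever compute the single-host case of such a score, which is why a per-system approach cannot draw on this kind of evidence.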
Furthermore, those approaches lack mechanisms to close the loop between the detection of policy violations and the refinement of the policies, and none of them provides an intuitive way of investigating incidents or violations.