Information security is an active field of academic and industrial pursuit. With news of data breaches by hackers, and of data theft or exfiltration by rogue insiders, now a commonplace occurrence, it is unsurprising that many academic and professional institutions are focusing their efforts on developing tools and practices for securing their computing and network environments. These efforts are largely aimed at making computing networks and infrastructure more secure against exploitative attacks from global hackers as well as against accidental or intentional data theft attempts from the inside.
There are many ways of detecting security attacks on an IT infrastructure in the prior art. U.S. Pat. No. 9,094,288 to Nucci discloses a method for profiling network traffic including obtaining a signature library with multiple signatures. Each signature represents a data characteristic associated with a corresponding application executing in the network. Then based on a predetermined criterion, a group behavioral model associated with the signature library is generated. The group behavioral model represents a common behavior of multiple historical flows identified from the network traffic. The signatures correlate to a subset of the plurality of historical flows. Then a flow in the network traffic is selected for inclusion in a target flow set, where the flow matches the group behavioral model. This match is without correlation to any corresponding application of the signatures. The target flow set is analyzed to generate a new signature which is then added to the signature library.
U.S. Pat. No. 8,448,234 to Mondaeev teaches a method of determining whether a data stream includes unauthorized data. The data stream is analyzed using a hardware filter to detect the presence of one or more sets of patterns in the data stream. Based on this analysis by the hardware filter, it is determined whether a packet in the data stream belongs to one of the data flows to be further inspected. If it does, a set of rules is applied to the packet to produce a rule match. If the rule match indicates that the packet potentially includes unauthorized data, the packet is then analyzed using software to determine whether it in fact includes unauthorized data.
U.S. Patent Publication No. 2012/0233222 to Roesch teaches a system that includes a sensor and a processor. The sensor is configured to passively read data in packets as the packets are in motion on the network. The processor operating with the sensor is configured to read the data from the sensor and to originate real-time map profiles of files and file data. The processor then performs correlation and inference from the read data from the sensor.
U.S. Patent Publication No. 2015/0163121 to Mahaffey discloses a system where data is collected from a set of devices. The data is then associated with the devices, mobile application programs (apps), web applications, users, or a combination of these. Then a norm is established using the collected data. The norm is then compared with the data collected from a specific device. If there is a deviation outside of a threshold deviation between the norm and the data collected from the particular device, a response is initiated.
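The norm-and-threshold comparison described above can be illustrated with a minimal sketch. It is not taken from the Mahaffey publication: the choice of metric (bytes sent per device), the use of a mean and standard deviation as the norm, the threshold of three standard deviations, and the names establish_norm and deviates are all assumptions made for illustration only.

```python
from statistics import mean, pstdev

def establish_norm(samples):
    """Summarize a metric collected from a set of devices as (mean, std dev).

    Assumption: the norm is modeled as a simple mean and population
    standard deviation; the cited publication does not prescribe this.
    """
    return mean(samples), pstdev(samples)

def deviates(value, norm, threshold=3.0):
    """Return True if value lies more than `threshold` std devs from the norm."""
    mu, sigma = norm
    if sigma == 0:
        # All devices reported the same value; any difference is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical data: bytes sent per hour by five devices.
fleet_bytes_sent = [100, 110, 90, 105, 95]
norm = establish_norm(fleet_bytes_sent)

for device_value in (102, 500):
    if deviates(device_value, norm):
        print(device_value, "deviates from the norm; a response would be initiated")
    else:
        print(device_value, "is within the norm")
```

In this sketch, a reading of 102 falls within the threshold while a reading of 500 does not, so only the latter would initiate a response.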
The non-patent reference “A Hybrid Model for Network Security Systems: Integrating Intrusion Detection System with Survivability” by Bhaskar, dated September 2008, proposes a holistic approach to network security with a hybrid model that includes an Intrusion Detection System (IDS) to detect network attacks and a survivability model to assess the impacts of undetected attacks. A neural network-based IDS is proposed, where the learning mechanism for the neural network is evolved using a genetic algorithm. The case where an attack evades the IDS and takes the system into a compromised state is then discussed. A stochastic model is then proposed, which allows one to perform a cost/benefit analysis for systems security. This integrated approach allows systems managers to make more informed decisions regarding both intrusion detection and system protection.
The non-patent reference “Network packet payload analysis for intrusion detection” by Mrdovic, dated 2006, explores the possibility of detecting intrusions into computer networks using network packet payload analysis. Various issues with IDS are explained in the paper. An integrated approach to IDS building is suggested, and improvements to the anomaly detection process are recommended. Prevailing methods for network intrusion detection, which are based on packet metadata (headers), are also compared with the approach proposed in the paper. The reasoning behind packet payload analysis for intrusion detection is also presented. Modeling of normal and anomalous HTTP payloads using artificial neural networks is suggested as the best approach in the paper.
One shortcoming of the prior art teachings is that they do not apply the techniques of signature-based or anomaly-based intrusion detection to the area of data exfiltration. While there have been numerous attempts at binary analysis and packet analysis for malware/virus detection and for identifying new attack vectors, none have been in the areas of Data Loss Prevention (DLP) or data exfiltration. Also, most of the present techniques require complex sandboxing and n-gram analysis for analyzing content.
There has not been a successful attempt at building a hybrid data surveillance system that uses a holistic approach with supervised and unsupervised machine learning for analyzing user behavior by examining the entirety of data. The prevailing techniques do not employ an effective clustering scheme for data packets in a conceptualized hypercube and its centroid. As a part of such analysis, there is also a need for identifying the file standards associated with data packets to corroborate that the packets conform to the purported file standards. Further, there is a need for performing Deep Packet Inspection (DPI) as a part of such a packet analysis for the entirety of data. Further still, there is a need for analyzing the drift of the centroid of data packets on various dimensions of analysis in response to various events in the organization.
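To make the notions of a centroid and its drift concrete, the following is a minimal sketch under stated assumptions. It treats each data packet as a point in an n-dimensional space and compares the centroid of a baseline set of packets with that of a later set. The two dimensions used here (packet size and payload entropy), the sample values, and the function names centroid and drift are hypothetical and are not drawn from any of the references discussed above.

```python
import math

def centroid(points):
    """Return the per-dimension mean of a non-empty list of equal-length points."""
    n = len(points)
    dims = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dims)]

def drift(before, after):
    """Return the per-dimension movement of the centroid from `before` to `after`."""
    return [a - b for a, b in zip(after, before)]

# Hypothetical packets as (packet size in KB, payload entropy in bits/byte).
baseline_packets = [[1.0, 2.0], [3.0, 4.0]]
later_packets = [[2.0, 2.0], [4.0, 6.0]]

c_before = centroid(baseline_packets)
c_after = centroid(later_packets)

per_dimension_drift = drift(c_before, c_after)
total_drift = math.dist(c_before, c_after)  # Euclidean distance between centroids

print("centroid before:", c_before)
print("centroid after: ", c_after)
print("drift per dimension:", per_dimension_drift)
print("total drift:", total_drift)
```

Reporting the drift per dimension, rather than only as a single distance, shows along which dimension of analysis the population of packets has moved after a given event.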