Huge amounts of data are generated by many sources. “Data” refers to a collection of organized information that results from experience, observation, measurement, streaming, computation, sensing or experiment, to other information within a computer system, or to a set of premises that may consist of numbers, characters, images, or measurements of observations.
Data is structured in known formats. Non-structured data can be transformed to structured formats. When data is transferred or received continuously or intermittently in a time-dependent fashion, the data is said to be “streamed” in a data stream. “Packet-oriented” data refers to a collection of basic units of structured information in a data stream. In communication networks, packet-oriented data includes headers and payload. “Connection-oriented” data refers to a collection of packet-oriented data.
Static and dynamic “high dimensional big” data (HDBD) is common in a variety of fields. Exemplarily, such fields include finance, energy, transportation, communication networking (e.g. protocols such as TCP/IP, UDP, HTTP, HTTPS, ICMP, SMTP, DNS, FTPS, SCADA, wireless and Wi-Fi) and streaming, process control and predictive analytics, social networking, imaging, e-mails, governmental databases, industrial data, healthcare and aviation. HDBD is a collection of “multidimensional data points” (MDPs). An MDP, also referred to as “sample”, “sampled data”, “point”, “vector of observations”, or “vector of measurements”, is one unit of data from the original (source, raw) HDBD that has the same structure as the original data. An MDP may be expressed by Boolean, integer, floating-point, binary or real values. HDBD datasets (or databases) include MDPs that may be either static or may accumulate constantly (dynamic). MDPs may include (or may be described by) hundreds or thousands of parameters (or “features”).
The term “feature” refers to an individual measurable property of phenomena being observed. A feature may also be “computed”, i.e. be an aggregation of different features to derive an average, a standard deviation, etc. “Feature” is also normally used to denote a piece of information relevant for solving a computational task related to a certain application. More specifically, “features” may refer to specific structures ranging from simple structures to more complex structures such as objects. The feature concept is very general and the choice of features in a particular application may be highly dependent on the specific problem at hand. Features are usually numeric, but may be structural (e.g. strings, also called identifiers).
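As a minimal illustration of a “computed” feature, the following Python sketch aggregates raw measurements into an average and a standard deviation. The function name `computed_features` and the sample values are hypothetical and are used only for illustration.

```python
from statistics import mean, pstdev

def computed_features(values):
    """Aggregate raw measurements into two computed features."""
    return {"mean": mean(values), "std": pstdev(values)}

# Hypothetical packet sizes (in bytes) observed on one connection.
sizes = [60, 60, 1500, 1500]
features = computed_features(sizes)
```

The population standard deviation (`pstdev`) is used here as an assumption; a sample standard deviation would serve equally well as a computed feature.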
In an example of communication networks in which each network connection can be described by tens, hundreds and even thousands of parameters, the straightforward features are the different fields in the protocols in different network layers. The extraction of features from the metadata and from the payload of a connection leads to a significant increase in dimensionality. “Metadata” is “data about data” of any sort in any medium. An item of metadata may describe an individual MDP or content item, or a collection of data including multiple content items and hierarchical levels, for example a database schema.
In another example of intelligence applications, a person under surveillance may be described by tens, hundreds and even thousands of features, for example by information about the person's phone calls, location, e-mail activities, financial activities, etc.
HDBD, with all its measured or streamed features and available sources of information (e.g. databases), may be classified as heterogeneous HDBD or simply as “heterogeneous data”. The term “heterogeneous” means that the data includes MDPs assembled from numbers and characters having different meanings, different scales and possibly different origins or sources. Heterogeneous data may change constantly with time, in which case it is referred to as “heterogeneous dynamic” data.
In known art, HDBD is difficult to understand, to draw conclusions from, or to search for anomalies that deviate from a “normal” behavior. In this description, the terms “anomaly”, “abnormality”, “malfunction”, “operational malfunction”, “outlier”, “deviation”, “peculiarity” and “intrusion” may be used interchangeably. “Anomaly detection” refers to a process that identifies in a given dataset patterns that do not conform to established or expected normal behavior. The detected anomaly patterns often translate into critical and actionable information in many different application domains, such as cyber protection, operational malfunctions, performance monitoring, financial transactions, industrial data, healthcare, aviation, monitoring or process control. It is therefore clear that anomaly detection has huge practical commercial, security and safety implications, to name a few.
Known machine-learning-based anomaly detection methods usually include two sequential steps: training and detection. The training step identifies the normal behavior in training data, defines a distance (affinity or metric) and provides some normal characteristic (profile) of the training data. The affinity may be used to compute the deviation of a newly arrived MDP (“NAMDP”) from the normal data profile. The detection step computes the affinities for the NAMDP and classifies the NAMDP as either normal or abnormal.
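The two-step structure described above can be sketched in Python as follows. This is a deliberately simple stand-in, not any particular known method: the profile is assumed to be a per-feature mean and standard deviation, the distance is assumed to be a scaled Euclidean distance, and the threshold is assumed to be the largest distance seen in training. All function names and data values are hypothetical.

```python
import math
from statistics import mean, pstdev

def distance(mdp, profile):
    """Scaled Euclidean distance of an MDP from the profile center."""
    return math.sqrt(sum(((x - m) / s) ** 2
                         for x, m, s in zip(mdp, profile["means"], profile["stds"])))

def train(training_mdps):
    """Training step: build a normal profile from the training MDPs."""
    columns = list(zip(*training_mdps))
    profile = {
        "means": [mean(c) for c in columns],
        # Replace a zero spread by 1.0 so the distance stays finite.
        "stds": [pstdev(c) or 1.0 for c in columns],
    }
    # Assumed threshold: the largest distance among the training MDPs.
    profile["threshold"] = max(distance(p, profile) for p in training_mdps)
    return profile

def detect(namdp, profile):
    """Detection step: classify a newly arrived MDP (NAMDP)."""
    return "abnormal" if distance(namdp, profile) > profile["threshold"] else "normal"

# Hypothetical two-feature training data.
training = [(1.0, 10.0), (1.2, 11.0), (0.9, 9.5), (1.1, 10.5)]
profile = train(training)
```

With this profile, a NAMDP close to the training data, such as `(1.05, 10.2)`, is classified as normal, while a distant one, such as `(5.0, 30.0)`, is classified as abnormal.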
Diffusion maps (DM) are known and described by R. R. Coifman and S. Lafon in “Diffusion maps”, Applied and Computational Harmonic Analysis, 21(1), 5-30, 2006. The DM process described therein embeds data into a lower-dimension space such that the Euclidean distance between MDPs in the embedded space approximates the diffusion distance in the original (source) feature space. The dimension of the diffusion space is determined by the underlying geometric structure of the data and by the accuracy of the diffusion distance approximation.
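A basic diffusion-map embedding can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions, not the full procedure of the cited reference: it assumes a Gaussian kernel with a hand-picked scale `epsilon`, a fixed number of retained coordinates, and no density normalization, and the resulting coordinates match the diffusion distance only up to a constant scale factor. The function name, parameters and data are hypothetical.

```python
import numpy as np

def diffusion_map(points, epsilon, n_components=2, t=1):
    """Embed points using the leading non-trivial diffusion coordinates."""
    X = np.asarray(points, dtype=float)
    # Pairwise squared Euclidean distances between all MDPs.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / epsilon)            # Gaussian kernel (assumed)
    d = K.sum(axis=1)                    # row sums (degrees)
    # Symmetric matrix with the same eigenvalues as the Markov matrix D^-1 K.
    A = K / np.sqrt(np.outer(d, d))
    w, v = np.linalg.eigh(A)
    order = np.argsort(w)[::-1]          # sort eigenvalues in descending order
    w, v = w[order], v[:, order]
    psi = v / np.sqrt(d)[:, None]        # right eigenvectors of D^-1 K
    # Skip the trivial pair (eigenvalue 1, constant eigenvector).
    return psi[:, 1:n_components + 1] * (w[1:n_components + 1] ** t)

# Hypothetical one-dimensional data forming two groups.
points = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
embedding = diffusion_map(points, epsilon=50.0, n_components=1)
```

In this example the single diffusion coordinate takes one sign on the first group and the opposite sign on the second, so the two groups are separated in the embedded space.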
Out-of-sample extension (OOSE) is also known. One way to compute OOSE efficiently is based on Interpolative Decomposition (ID), described in H. Cheng et al., “On the compression of low rank matrices”, SIAM Journal on Scientific Computing, 26(4), 1389-1404, 2005. ID is a deterministic algorithm. A faster randomized ID (RID) version appears in P. Martinsson et al., “A randomized algorithm for the decomposition of matrices”, Applied and Computational Harmonic Analysis, 30(1), 47-68, 2011. RID can be accelerated by using either the Farthest Point Sampling (FPS) algorithm (see T. F. Gonzalez, “Clustering to minimize the maximum inter-cluster distance”, Theoretical Computer Science, 38, 293-306, 1985), or the Weighted Farthest Point Sampling (WFPS) algorithm, described in Y. Eldar et al., “The farthest point strategy for progressive image sampling”, IEEE Trans. Image Processing, 6, 1315, 1997. The WFPS algorithm can be accelerated by using the Improved Fast Gauss Transform (IFGT) described in C. Yang et al., “Improved fast Gauss transform and efficient kernel density estimation”, Computer Vision, 2003, Proceedings, Ninth IEEE International Conference, 664-671, 2003, that uses the Fast Multipole Method (FMM) described in L. Greengard and V. Rokhlin, “A fast algorithm for particle simulations”, Journal of Computational Physics, 73(2), 325-348, 1987.
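Of the algorithms listed above, farthest point sampling is the simplest to illustrate. The following Python sketch shows the greedy selection rule only: each new point is the one farthest from all points selected so far. It does not show how FPS is combined with RID. The function name, the `start` parameter and the data are hypothetical, and `k` is assumed not to exceed the number of points.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices, each farthest from those already picked."""
    selected = [start]
    # Distance from every point to its nearest selected point.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # The next sample is the point farthest from the selected set.
        idx = max(range(len(points)), key=nearest.__getitem__)
        selected.append(idx)
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return selected

# Hypothetical two-dimensional points.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 0)]
chosen = farthest_point_sampling(points, 3)
# chosen == [0, 3, 4]
```

Starting from index 0, the sketch first selects the point `(10, 10)` and then `(10, 0)`, which spreads the samples across the dataset rather than concentrating them near the starting point.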
Anomaly detection in HDBD is critical and in extensive use in a wide variety of areas. For example, anomaly detection is used to identify malicious activities and operational malfunctions such as network intrusions or financial fraud, customer behavioral changes, and manufacturing flaws in energy facilities. In financial activities, anomaly detection is used to detect fraud and money laundering, to support risk management in financial transactions, and to identify abnormal user activities. Anomaly detection in these areas may also be used to detect suspicious terrorist activities.
Another area is customer behavioral analysis and measurement, practiced for example in marketing, social media and e-commerce. In these areas, attempts are made to predict behavioral intention based on past customer attitudes and social norms. These predictions, in turn, will eventually drive targeted advertisements and online sales. Anomaly detection in this field would relate to monitoring changes in consumer behavior, which may avoid substantial market losses.
Yet another area involves critical infrastructure systems or process control. In this area, many sensors continuously collect or sense several measurements in a predetermined time unit. When these sensors are connected through a communication network, the area is related to “Industrial Internet” and “Internet of Things”. Fusion (combination, unification) of these measurements leads to the construction of an HDBD dataset. Here, anomaly detection may be used exemplarily for fault detection in critical infrastructure or for inspection and monitoring, and enables predictive analytics. While monitoring critical infrastructure resources, anomalies originating from cyber threats, operational malfunction or both can be detected simultaneously.
In an illustrative example of anomaly detection use, an entity such as a network, device, appliance, service, system, subsystem, apparatus, equipment, resource, behavioral profile, inspection machine, performance or the like is monitored per time unit. Assume further that major activities in incoming streamed HDBD obtained through the monitoring are recorded, i.e. a long series of numbers and/or characters is recorded in each time unit. The numbers or characters represent different features that characterize activities in or of the entity. Often, such HDBD has to be analyzed to find specific trends (abnormalities) that deviate from “normal” behavior. An intrusion detection system (“IDS”), also referred to as an anomaly detection system or “ADS”, is a typical example of a system that performs such analysis. Malfunction is another typical example of an abnormality in a system.
An IDS attempts to detect all types of malicious network traffic and malicious computer uses (“attacks”) which cannot be detected by conventional protection means such as rule-based firewalls and signature-based IDS. These attacks may include network attacks against vulnerable services, data-driven attacks on applications, host-based attacks such as privilege escalation, unauthorized logins and access to sensitive files, malware (viruses, Trojan horses, backdoors and worms) and other sophisticated attacks that exploit every vulnerability in the data, system, device, protocol, web-client, resource and the like. A “protocol” (also called communication protocol) in the field of telecommunications is a set of standard rules for data representation, signaling, authentication and error detection required to send information over a communication channel. The communication protocols for digital computer network communication have many features intended to ensure reliable interchange of data over an imperfect communication channel. In essence, a communication protocol is a set of rules that allows the system to work properly. Communication protocols such as TCP/IP and UDP have a clear structure. SCADA protocols also have a clear structure.
A network IDS (NIDS) tries to detect malicious activities such as denial of service (DoS), distributed DoS (DDoS), port scans or even attempts to crack into computers by monitoring network traffic while minimizing the rate of false alarms and missed detections. A NIDS operates by scanning all the incoming packets while trying to find suspicious patterns. If, for example, a large number of requests for TCP connections to a very large number of different ports is observed, one can assume that someone is performing a port scan on some of the computers in the network.
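The port-scan heuristic in the example above can be sketched in Python as follows. The sketch assumes that connection requests are available as (source address, destination port) pairs and that requests are grouped by source. The grouping, the threshold value and all names are hypothetical choices made for illustration.

```python
from collections import defaultdict

def find_port_scanners(connection_requests, port_threshold=100):
    """Flag sources that requested TCP connections to many distinct ports."""
    ports_by_source = defaultdict(set)
    for src_ip, dst_port in connection_requests:
        ports_by_source[src_ip].add(dst_port)
    # A source is suspicious when its distinct-port count reaches the threshold.
    return sorted(src for src, ports in ports_by_source.items()
                  if len(ports) >= port_threshold)

# Hypothetical traffic: one source probes 200 ports, another reuses one port.
requests = [("10.0.0.5", port) for port in range(1, 201)]
requests += [("10.0.0.7", 443)] * 50
suspects = find_port_scanners(requests)
# suspects == ["10.0.0.5"]
```

A fixed rule of this kind depends on a manually chosen threshold and detects only the pattern it encodes.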
A NIDS is not limited to inspecting incoming network traffic only. Often, valuable information about an ongoing intrusion can be learned from outgoing or local traffic as well.
Some attacks may even be staged from inside the monitored network or network segment (“internal attacks”), and are therefore not regarded as incoming traffic at all. However, they are considered major threats that have to be treated. Internal attacks can be either intentional or unintentional.
Similar problems in identifying abnormalities in data are encountered in many applications unrelated to networks, as mentioned above. One example relates to the control or monitoring of a process that requires detection of any unusual occurrences in real-time. Another example is the real-time (online) detection of operational malfunctions in SCADA protocols. Analysis of SCADA protocols can discover either malware insertion or operational malfunction or both.
Many of the current methods used to extract useful intelligence from HDBD require extensive computational resources, are time-consuming, and, when used for anomaly detection, fail to detect anomalies before they become operational. Therefore, there is a need for, and it would be advantageous to have, anomaly detection methods and systems that require less computational effort and are faster. There is also a need for anomaly detection methods and systems that can detect unknown anomalies representing unknown attacks or malfunctions. In other words, there is a need for methods and systems that perform automatic or “un-supervised” anomaly detection, defined as detection that does not require rules, signatures, patterns, domain expertise or semantic understanding of the input data. In addition, the number of false alarms should be as low as possible.