Huge amounts of data are generated by many sources. “Data” refers to a collection of organized information that results from experience, observation, measurement, streaming, computing, sensing or experiment, to other information within a computer system, or to a set of premises. Data may consist of numbers, characters, images, or measurements of observations.
Static and dynamic “high dimensional big data” (HDBD) are common in a variety of fields. Examples of such fields include finance, energy, transportation, communication networking (e.g. protocols such as TCP/IP, UDP, HTTP, HTTPS, ICMP, SMTP, DNS, FTPS, SCADA, wireless and Wi-Fi) and streaming, process control and predictive analytics, social networking, imaging, e-mails, governmental databases, industrial data, healthcare and aviation. HDBD is a collection of multi-dimensional data points (MDDPs).
An MDDP, also referred to as a “sample”, “point”, “observation” or “measurement”, is one unit of data from the original (source, raw) HDBD. An MDDP may be expressed as a combination of numeric, Boolean, integer, floating-point, binary or real values. HDBD datasets (or databases) include MDDPs that may be either static or dynamic (accumulating constantly). MDDPs may include (or may be described by) hundreds or thousands of parameters (or “features”).
The terms “parameter” and “feature” refer to an individual measurable property of the phenomena being observed. A feature may also be “computed”, i.e. be an aggregation of different features to derive an average, a median, a standard deviation, etc. “Feature” is also normally used to denote a piece of information relevant for solving a computational task related to a certain application. More specifically, “features” may refer to specific structures ranging from simple structures to more complex structures such as objects. The feature concept is very general, and the choice of features in a particular application may be highly dependent on the specific problem at hand. Features can be described in a numerical (3.14), Boolean (yes, no), ordinal (never, sometimes, always), or categorical (A, B, O) manner.
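To make the distinction between raw and “computed” features concrete, the following minimal Python sketch builds one MDDP with mixed feature types and derives aggregate features from it. All field names and values (`packet_sizes`, `encrypted`, `alert_frequency`, `protocol`) are hypothetical and chosen only for illustration; they are not taken from any particular dataset.

```python
from statistics import mean, median, pstdev

# One hypothetical MDDP with raw features of mixed types (illustrative names).
mddp = {
    "packet_sizes": [60, 1500, 40, 400],  # numerical raw measurements
    "encrypted": True,                    # Boolean feature
    "alert_frequency": "sometimes",       # ordinal: never < sometimes < always
    "protocol": "TCP",                    # categorical feature
}

def add_computed_features(point):
    """Return a copy of the point with aggregate ("computed") features added."""
    sizes = point["packet_sizes"]
    enriched = dict(point)
    enriched["size_mean"] = mean(sizes)      # average
    enriched["size_median"] = median(sizes)  # median
    enriched["size_std"] = pstdev(sizes)     # population standard deviation
    return enriched

enriched = add_computed_features(mddp)
print(enriched["size_mean"], enriched["size_median"], round(enriched["size_std"], 2))
```

The raw point is left unchanged, so the same MDDP can be enriched with different aggregates for different applications.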
HDBD, with all its measured or streamed features and available sources of information (e.g. databases), may be classified as heterogeneous HDBD or simply as “heterogeneous data”. The term “heterogeneous” means that the data includes MDDPs assembled from numbers and characters having different meanings, different scales and possibly different origins or sources. Heterogeneous data may change constantly with time, in which case it is referred to as “heterogeneous dynamic” data.
In this description, the terms “anomaly”, “abnormality”, “malfunction”, “operational malfunction”, “outlier”, “deviation”, “peculiarity” and “intrusion” may be used interchangeably. “Anomaly detection” refers to a process that identifies patterns in a given dataset that do not conform to established or expected normal behavior. The detected anomaly patterns often translate into critical and actionable information in many different application domains, such as cyber protection, operational malfunctions, performance monitoring, financial transactions, industrial data, healthcare, aviation, monitoring or process control. Anomaly detection therefore has major practical implications for commerce, security and safety, to name a few.
Known machine-learning-based anomaly detection methods usually include two sequential steps: training and detection. The training step identifies the normal behavior in training data, defines a distance (affinity or metric) and provides a normal characteristic (profile) of the training data. “Training data” is data of a finite size, used as a source for learning the behavior and the properties of the data. The affinity may be used to compute the deviation of a newly arrived MDDP (“NAMDDP”) from the normal data profile. The detection step computes the affinities for the NAMDDP and classifies the NAMDDP as either normal or abnormal.
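The two steps can be sketched in a few lines of Python. This is a deliberately simple illustration, not any specific known method: it assumes numerical features only, uses per-feature mean and standard deviation as the normal profile, uses a standardized Euclidean distance as the affinity, and assumes a threshold rule (1.5 times the largest deviation seen in training). The function names `train`, `deviation` and `detect` are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev

def deviation(point, profile):
    """Affinity: standardized Euclidean distance of a point from the normal profile."""
    means, stds = profile
    return sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(point, means, stds)))

def train(training_data):
    """Training step: learn the normal profile and a decision threshold."""
    columns = list(zip(*training_data))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # guard against zero spread
    profile = (means, stds)
    # Assumed rule: largest training deviation plus a 50% margin.
    threshold = 1.5 * max(deviation(p, profile) for p in training_data)
    return profile, threshold

def detect(namddp, profile, threshold):
    """Detection step: classify a newly arrived MDDP as normal or abnormal."""
    return "abnormal" if deviation(namddp, profile) > threshold else "normal"

# Finite training data assumed to contain only normal behavior.
training_data = [(10.0, 0.50), (11.0, 0.52), (9.0, 0.48), (10.5, 0.51), (9.5, 0.49)]
profile, threshold = train(training_data)

print(detect((10.2, 0.50), profile, threshold))  # close to the profile
print(detect((25.0, 0.90), profile, threshold))  # far from the profile
```

Dividing by the per-feature standard deviation keeps a feature with a large numerical range from dominating the distance, which matters when features have different scales.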
Anomaly detection in HDBD is critical and in extensive use in a wide variety of areas. For example, anomaly detection is used to identify malicious activities and operational malfunctions (for example network intrusions or financial fraud), customer behavioral changes, and manufacturing flaws in energy facilities. In financial activities, anomaly detection is used for fraud detection, money laundering detection and risk management in financial transactions, and to identify abnormal user activities. Anomaly detection in these areas may also be used to detect suspicious terrorist activities.
Another area is customer behavioral analysis and measurement, practiced for example in marketing, social media and e-commerce. In these areas, attempts are made to predict behavioral intention based on past customer attitudes and social norms. These predictions, in turn, will eventually drive targeted advertisements and online sales. Anomaly detection in this field would relate to monitoring changes in consumer behavior, which may help avoid substantial market losses.
Yet another area involves critical infrastructure systems or process control. In this area, many sensors continuously collect or sense several measurements in a predetermined time unit. When these sensors are connected through a communication network, the area is related to the “Industrial Internet” and the “Internet of Things”. Fusion of these measurements leads to the construction of an HDBD dataset. Here, anomaly detection may be used, for example, for fault detection in critical infrastructure or for inspection and monitoring, and it enables predictive analytics. While monitoring critical infrastructure resources, anomalies originating from cyber threats, operational malfunctions or both can be detected simultaneously.
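As a rough illustration of how fusing sensor measurements yields an HDBD dataset, the Python sketch below merges per-sensor readings into one MDDP per time unit. The sensor names, the dictionary layout and the use of `None` for a missing reading are assumptions made for this example only.

```python
def fuse(sensor_streams):
    """Merge per-sensor {time_unit: value} streams into one MDDP per time unit.

    Returns a list of (time_unit, features) pairs, where features holds one
    value per sensor in sorted sensor-name order. A sensor that reported
    nothing in a time unit contributes None.
    """
    sensors = sorted(sensor_streams)
    time_units = sorted(set().union(*(s.keys() for s in sensor_streams.values())))
    return [
        (t, [sensor_streams[name].get(t) for name in sensors])
        for t in time_units
    ]

# Hypothetical readings from two sensors, keyed by time unit.
streams = {
    "pressure": {0: 1.0, 1: 1.1},
    "temperature": {0: 20.0, 1: 20.5, 2: 21.0},
}

dataset = fuse(streams)
for time_unit, features in dataset:
    print(time_unit, features)
```

With hundreds or thousands of sensors, each fused row becomes a high-dimensional point, and the dataset grows by one row per time unit as new measurements arrive.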
In an illustrative example of anomaly detection use, an entity such as a network, device, appliance, service, system, subsystem, apparatus, equipment, resource, behavioral profile, inspection machine, performance or the like is monitored. Assume further that major activities in incoming streamed HDBD obtained through the monitoring are recorded, i.e. a long series of numbers and/or characters is recorded and associated with time stamps indicating the time of recordation. The numbers or characters represent different features that characterize activities in or of the entity. Often, such HDBD has to be analyzed to find specific trends (abnormalities) that deviate from “normal” behavior. An intrusion detection system (“IDS”), also referred to as an anomaly detection system or “ADS”, is a typical example of a system that performs such analysis. Malfunction is another typical example of an abnormality in a system.
Similar problems in identifying abnormalities in data are encountered in many applications unrelated to networks. One example relates to the control or monitoring of a process that requires detection of any unusual occurrences in real time. Another example is the real-time (online) detection of operational malfunctions in SCADA protocols. Analysis of SCADA protocols can discover malware insertion, operational malfunction, or both.
To achieve online anomaly detection, some systems may use signatures and rules of intrusions, which are developed and assembled manually after a new anomaly is exposed and distributed. This approach may be problematic, because these systems detect only already-known intrusions (“yesterday's” attacks and anomalous malfunctions) but fail to detect new attacks (“zero-day” attacks). In addition, they do not cover a wide range of high-quality, new, sophisticated emerging attacks that exploit many network vulnerabilities.
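The limitation of signature-based detection can be seen in a toy Python sketch. The signature set, the request strings and the function name `signature_match` are all hypothetical; real systems use far richer rule languages, but the failure mode is the same: a payload that matches no stored signature passes unnoticed.

```python
# Hypothetical signature database, assembled manually from previously seen attacks.
KNOWN_SIGNATURES = {"' OR 1=1 --", "../../etc/passwd", "<script>alert("}

def signature_match(payload):
    """Return True if the payload contains any known attack signature."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)

known_attack = "GET /download?file=../../etc/passwd"
novel_attack = "GET /download?file=..%2f..%2fetc%2fpasswd"  # same intent, unseen encoding
benign = "GET /index.html"

for request in (known_attack, novel_attack, benign):
    print(request, "->", "blocked" if signature_match(request) else "allowed")
```

The novel request is treated exactly like the benign one until someone adds a matching signature by hand, which can only happen after the attack has been observed.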
Many of the current methods used to extract useful intelligence from HDBD require extensive computational resources, are time consuming, and, when used for anomaly detection, fail to detect anomalies before they become operational. Therefore, there is a need for, and it would be advantageous to have, anomaly detection methods and systems that require less computational effort and are faster. There is also a need for anomaly detection methods and systems that can detect unknown anomalies representing unknown attacks or malfunctions. In other words, there is a need for methods and systems that perform automatic or “unsupervised” anomaly detection, defined as detection that does not require rules, signatures, patterns, domain expertise or semantic understanding of the input data. In addition, the number of false alarms should be as low as possible.