Huge amounts of data are generated by many sources. “Data” refers to a collection of organized information, the result of experience, observation or experiment, other information within a computer system, or a set of premises that may consist of numbers, characters, images, or measurements of observations. Data usually comes from the conversion of physical quantities, obtained from observations or measurements, into symbols (also called sampling). Data also refers to a collection of numbers, characters, images or outputs from devices that convert physical quantities into symbols. Such data is typically further processed by a human or input into a computer, stored and processed there, or transmitted to another human or computer.
Data is structured in known formats. When data is transferred or received continuously or intermittently in a time-dependent fashion, the data is said to be “streamed” in a data stream. “Packet-oriented” data refers to a collection of basic units of structured information in a data stream. In communication networks, packet-oriented data contains headers and a payload. “Connection-oriented” data refers to a collection of packet-oriented data.
In many cases, the data is high-dimensional (also called multi-dimensional), with a data dimension n (or “N”) > 3. If source (“original” or “raw”) data is described, for example, by 25 measured parameters (“features”) that are sampled (recorded, measured) in every predetermined time interval (e.g. every minute), then the data is of dimension n=25. Multi-dimensional data is a collection of data points. A “data point” (also referred to herein as “sample”, “sampled data”, “point”, “vector of observations”, “vector of measurements”) is one unit of data of the original (source) multi-dimensional data that has the same structure as the original data. A data point may be expressed by Boolean, integer and real values.
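For illustration only, the notion of a data point of dimension n=25 sampled once per minute can be sketched in Python as follows. The names `DataPoint`, `data_dimension` and `make_example_points`, and all feature values, are hypothetical placeholders chosen for this sketch and are not part of the described data.

```python
from dataclasses import dataclass
from typing import List, Union

# A feature may be a Boolean, an integer or a real value.
Feature = Union[bool, int, float]


@dataclass
class DataPoint:
    """One sample: the features recorded in a single time interval."""
    minute: int              # index of the sampling interval (assumed: minutes)
    features: List[Feature]  # the n measured parameters


def data_dimension(points: List[DataPoint]) -> int:
    """Return n, requiring every data point to have the same structure."""
    lengths = {len(p.features) for p in points}
    if len(lengths) != 1:
        raise ValueError("all data points must have the same number of features")
    return lengths.pop()


def make_example_points(num_minutes: int = 3, n: int = 25) -> List[DataPoint]:
    """Build placeholder data points; the values carry no real meaning."""
    points = []
    for minute in range(num_minutes):
        # One Boolean, one integer, and n - 2 real-valued features.
        features: List[Feature] = [minute % 2 == 0, minute * 10]
        features += [0.5 * k for k in range(n - 2)]
        points.append(DataPoint(minute=minute, features=features))
    return points
```

Calling `data_dimension(make_example_points())` on this placeholder data yields the dimension of the collection, since every data point shares the structure of the source data.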
In this invention, “features” refers to the individual measurable properties of the phenomena being observed. Features are usually numeric, but may be structural such as strings. “Feature” is also normally used to denote a piece of information which is relevant for solving the computational task related to a certain application. More specifically, features can refer to specific structures ranging from simple structures to more complex structures such as objects. The feature concept is very general and the choice of features in a particular application may be highly dependent on the specific problem at hand.
In particular, high-dimensional data, with all its measured features and available sources of information (e.g. databases), may be classified as heterogeneous high-dimensional data or simply as “heterogeneous data”. The term “heterogeneous” means the data includes data points assembled from numbers and characters having different meanings, different scales and possibly different origins or sources. The process of finding similar areas that identify common (similar) trends is called clustering, and these areas are called clusters. Heterogeneous data may change constantly in time, in which case the data is called “heterogeneous dynamic data”.
In known art, it is difficult to understand high-dimensional data, to draw conclusions from it, or to find anomalies in it that deviate from a “normal” behavior. Throughout this invention, the terms “anomaly”, “abnormality” and “intrusion” are used interchangeably. Similarly, the terms “cluster” and “manifold” are also used interchangeably.
Network Intrusion Detection
Assume for example that an entity such as a network, device, appliance, service, system, subsystem, apparatus, equipment, resource, behavioral profile, inspection machine, performance or the like is monitored per time unit. Assume further that major activities in incoming streamed multi-dimensional data obtained through the monitoring are recorded, i.e. a long series of numbers and/or characters are recorded in each time unit. The numbers or characters represent different features that characterize the activities in or of the entity. Often, such multi-dimensional data has to be analyzed to find specific trends (abnormalities) that deviate from “normal” behavior. An intrusion detection system (“IDS”, also referred to as anomaly detection system or “ADS”) is a typical example of a system that performs such analysis.
An intrusion detection system attempts to detect all types of malicious network traffic and malicious computer uses (“attacks”) which cannot be detected by conventional protection means such as firewalls. These attacks may include network attacks against vulnerable services, data-driven attacks on applications, host-based attacks such as privilege escalation, unauthorized logins and access to sensitive files, malware (viruses, Trojan horses, and worms) and other sophisticated attacks that exploit every vulnerability in the data, system, device, protocol, web-client, resource and the like. A “protocol” (also called communication protocol) in the field of telecommunications is a set of standard rules for data representation, signaling, authentication and error detection required to send information over a communication channel. The communication protocols for digital computer network communication have many features intended to ensure reliable interchange of data over an imperfect communication channel. In essence, a communication protocol is a set of rules that allows the system to work properly. Communication protocols such as TCP/IP and UDP have a clear structure.
A network intrusion detection system (NIDS) tries to detect malicious activities such as DoS, distributed DoS (DDoS), port scans or even attempts to break into computers by monitoring network traffic while minimizing the rate of false alarms and missed detections. A NIDS operates by scanning all the incoming packets while trying to find suspicious patterns. If, for example, a large number of requests for TCP connections to a very large number of different ports is observed, one can assume that someone is performing a port scan on some of the computers in the network.
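The port-scan example above can be sketched in Python as a simple counting heuristic. This is a minimal illustration under stated assumptions, not a description of any particular NIDS: the function name `find_port_scan_suspects`, the input format and the threshold of 100 distinct ports are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Assumed input format: (source address, destination port) per TCP request.
ConnectionRequest = Tuple[str, int]


def find_port_scan_suspects(
    requests: Iterable[ConnectionRequest],
    port_threshold: int = 100,  # hypothetical cut-off, chosen for illustration
) -> List[str]:
    """Return sources that requested connections to many different ports."""
    ports_by_source: Dict[str, Set[int]] = defaultdict(set)
    for source, dest_port in requests:
        ports_by_source[source].add(dest_port)
    # A source is suspicious when its number of distinct ports reaches the threshold.
    return sorted(
        source
        for source, ports in ports_by_source.items()
        if len(ports) >= port_threshold
    )
```

A fixed threshold of this kind is easy to evade (for example, by scanning slowly) and may flag legitimate traffic, which illustrates why minimizing both false alarms and missed detections is difficult.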
A NIDS is not limited to inspecting only incoming network traffic. Often, valuable information about an ongoing intrusion can be learned from outgoing or local traffic as well. Some attacks may even be staged from inside the monitored network or network segment (“internal attacks”), and are therefore not regarded as incoming traffic at all. However, they are considered major threats that have to be treated. Internal attacks can be either intentional or unintentional.
A NIDS has to handle large networks by processing and analyzing packets from and to many (hundreds or thousands of) network devices. In these networks, a human operator is assigned to the task. The operator has to decide whether the network functions properly or whether some immediate action needs to be taken. However, the operator is incapable of understanding, compiling and processing such huge amounts of data or of making fast decisions based on them. This problem can be viewed as a data mining problem: finding patterns that deviate from normal behavior in an ocean of numbers and information that changes constantly. The operator cannot handle malicious attacks and malicious usage of networks because: these attacks can develop and evolve slowly; more and more protocols in network environments are encrypted; analysis of the payload is impractical due to encryption and privacy violation; there are rapid changes in protocols and there is an avalanche of new protocols and applications (many per year); network applications become more and more masqueraded and thus difficult to identify; identification of unauthorized applications becomes more difficult; there is a growing number of hidden attacks and applications under HTTP and P2P protocols that try to “hijack” systems, and the like. All of these make it more difficult to detect malicious uses of network systems.
IDS and NIDS have become integral components in security systems. The challenge is to perform online IDS and NIDS without missed detections and false alarms. Throughout the rest of this disclosure, “online” is used among other things to mean an algorithm that can efficiently process the arrival of new samples from high bandwidth networks in real-time. To achieve online intrusion detection, most systems use signatures of intrusions, which are developed and assembled manually after a new intrusion is exposed and are then distributed to the IDS clients. This approach is problematic because these systems detect only already-known intrusions (yesterday's attacks) but fail to detect new attacks (zero day attacks). In addition, they do not cover a wide range of high quality, new, sophisticated emerging attacks that exploit many network vulnerabilities.
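The limitation of signature-based detection can be illustrated with a minimal Python sketch. The signature names and byte patterns below are invented placeholders, and `match_signatures` is a hypothetical helper; real systems use far richer signature formats.

```python
from typing import Dict, List

# Invented placeholder signatures; they do not correspond to any real attack.
HYPOTHETICAL_SIGNATURES: Dict[str, bytes] = {
    "example-worm": b"WORM-MARKER-01",
    "example-trojan": b"TROJAN-MARKER-02",
}


def match_signatures(payload: bytes, signatures: Dict[str, bytes]) -> List[str]:
    """Return the names of all known signatures found in the payload."""
    return sorted(
        name for name, pattern in signatures.items() if pattern in payload
    )
```

A payload that contains a known pattern is reported, whereas a payload from an attack for which no signature has yet been assembled yields an empty result and therefore passes undetected.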
Similar problems of identifying abnormalities in data are encountered in many applications unrelated to networks, e.g. in the control or monitoring of a process that requires detection of any unusual occurrences in real-time. One example is the real-time (online) detection of mastitis in dairy farming. Mastitis is expressed by abnormal somatic cell counts (SCC), and its detection may be significantly aided by detection of abnormal SCC or of other abnormal milk parameters during milking. Automatic mastitis detection using different statistical methods is reviewed by David Cavero Pintado, PhD Dissertation, Christian Albrechts University, Kiel, Germany, 2006. In the past, statistical methods were applied to data which included number of milkings, electrical conductivity, milk yield, milk flow rate and SCC. However, such statistical methods fail to provide adequate warning of the appearance or occurrence of mastitis.