The general concept of classifying data points has been used in a myriad of contexts and applications. In a signature recognition application, a group of data points must be classified in order to identify a particular pattern. A signature recognition system using data classification techniques can identify a particular human face from a crowd, regulate the flow of inventory in a manufacturing system, or perform medical diagnosis from patient data. In computer technology, classification of data points can be used for intrusion detection and computer security. An intrusion can be defined as any set of activities aimed at breaking the security of a computer network system. Intrusions may take many forms: external attacks, internal misuses, network-based attacks, information gathering, denial of service, etc. Information security against intrusions mainly involves intrusion prevention, detection, diagnosis, response, and system recovery stages.
Intrusion detection is an essential part of protecting computer systems from internal or external attacks. It detects the intrusive activities when they occur in the system. Intrusion detection systems are in great demand with the rapid growth of computer networks, the World Wide Web, and the consolidation of corporate business/integrated teams on information technology (“IT”) systems. The need for a reliable way to detect intrusion is compounded by the facts that security is still commonly an afterthought in systems design, that false-positive (false alarm) and false-negative (missed attack) rates remain high for certain intrusion types, and that attacks have become more complex, more significant in their impact and more difficult to defend against.
The main components of an intrusion detection system are the data collector, the analysis engine, and the system administrator involved in making final decisions. The core component is the analysis engine that is based on some intrusion detection algorithm. The intrusion detection algorithm collects incoming data points and compares them with the information and historical data from a computer system that comprise the patterns and profiles of normal activities and known intrusive activities. Then, based on these known patterns and profiles, the intrusion warning level of the current event is determined. The higher the intrusion warning level, the higher the possibility that the current activity of concern is from an intrusive scenario. Intrusion detection systems have been developed and implemented using various algorithms from a wide range of areas such as statistics, machine learning, and data mining.
An important criterion for evaluating an intrusion detection system is detection performance, which includes the false positive rate, detection probabilities and detection ranges for various intrusion types. Some other criteria are the speed of detection and granularity of data processing (e.g. real-time or batch-mode).
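As an illustration of the detection performance criteria above, the following minimal sketch computes the false positive rate and the detection probability from labeled outcomes. The function name `detection_metrics` and the sample labels are assumptions made for illustration only; labels follow the convention of 0 for normal activity and 1 for intrusive activity.

```python
def detection_metrics(actual, predicted):
    """Compute basic detection performance figures.

    actual, predicted: equal-length sequences of 0 (normal) or 1 (intrusive).
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # attacks detected
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarms
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # missed attacks
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # normal, not flagged

    return {
        # Fraction of normal events wrongly flagged as intrusive.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # Fraction of intrusive events correctly flagged.
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        # Fraction of intrusive events that were missed.
        "false_negative_rate": fn / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels: four normal events and two intrusive events.
actual = [0, 0, 0, 0, 1, 1]
predicted = [0, 1, 0, 0, 1, 0]
metrics = detection_metrics(actual, predicted)
```

Computing these figures separately for each intrusion type would give the per-type detection probabilities mentioned above.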
In general, each record of the intrusion detection data is a (p+1)-tuple consisting of the attribute variable vector X in p dimensions and the target variable XT. Each attribute variable is numeric or nominal, and represents a certain attribute of the events occurring in the computer systems, such as user identification (ID), time stamp, or service name. The target variable XT can be a binary variable with value 0 or 1, where 0 represents normal activity and 1 represents intrusive activity. XT can also be a multi-category nominal variable with such categories as NORMAL, SYNFLOOD, IPSWEEP, MAILBOMB, and so on. For training data, XT is known for each record and is determined from historical data, i.e. from records whose attribute values have already been found to be intrusive or non-intrusive, which allows XT to be assigned. In detection or classification, XT is determined from the attribute variables; the attributes are therefore also called predictor variables.
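The record layout described above can be sketched as a small data structure. This is only one possible representation; the class name `AuditRecord`, its field names, and the sample values are hypothetical and are not taken from any particular system.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

# An attribute is either numeric (e.g. time stamp) or nominal (e.g. user ID).
AttributeValue = Union[float, str]


@dataclass(frozen=True)
class AuditRecord:
    # X: the p attribute (predictor) variables.
    attributes: Tuple[AttributeValue, ...]
    # XT: 0/1 for binary targets, or a category name such as "MAILBOMB".
    # None means the target is unknown and must be determined by detection.
    target: Optional[Union[int, str]] = None

    @property
    def p(self) -> int:
        """Number of attribute dimensions."""
        return len(self.attributes)

    def as_tuple(self) -> tuple:
        """Return the record as a (p+1)-tuple: attributes followed by XT."""
        return self.attributes + (self.target,)


# Hypothetical training records: (user ID, time stamp, service name) plus XT.
normal = AuditRecord(("alice", 1012.5, "telnet"), 0)
attack = AuditRecord(("bob", 1013.0, "smtp"), "MAILBOMB")
# A record awaiting classification has no target yet.
incoming = AuditRecord(("carol", 1014.2, "ftp"))
```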
Existing intrusion detection systems focus on two kinds of activity data from an information system: network traffic data and computer audit data. Network traffic data contain records about data packets communicated between host machines, and capture activities over networks. The attributes of the network traffic data may include destination and source addresses, type of network service, outcome, possible error message, and others. Computer audit data record activities on individual host machines with attributes such as process ID, command type, and user ID. Regardless of the type of data used, the data may have nominal attributes such as event type, user ID, process ID, command type, and remote IP address, and numeric variables such as the time stamp, central processing unit (CPU) load, and the service frequencies. Feature selection methods such as frequency counting are often applied to the raw data to produce the input for detection algorithms. Data from computer systems have several features that intrusion detection algorithms must address: large volumes, changing patterns, and complex attribute variables.
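Frequency counting, mentioned above as a way of turning raw data into input for a detection algorithm, can be sketched as follows. The function name `frequency_features`, the event names, and the fixed vocabulary are assumptions for illustration; how events are grouped into windows is left to the surrounding system.

```python
from collections import Counter


def frequency_features(events, vocabulary):
    """Count how often each known event type occurs in a window of events.

    events: iterable of nominal event types observed in one window.
    vocabulary: ordered list of event types that defines the vector layout.
    Returns a numeric vector with one count per vocabulary entry.
    """
    counts = Counter(events)
    # Event types absent from the window get a count of 0.
    return [counts[name] for name in vocabulary]


# Hypothetical window of audit events and a hypothetical vocabulary.
window = ["login", "exec", "exec", "open", "exec"]
vocabulary = ["login", "exec", "open", "close"]
features = frequency_features(window, vocabulary)
```

The result is a numeric vector of fixed length, which is one way nominal event streams can be presented to algorithms that expect numeric attributes.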
One such feature is large volume. Intrusion detection systems generally have to process very large amounts of data from practical systems. The data from a computer or network can easily contain millions of records over a short period of time. In addition, the dimensions of each record can extend into the hundreds. Intrusion detection algorithms must be scalable and efficient in handling such data, particularly for real-time systems.
Another feature is changing patterns. The amount of data increases tremendously with the rapid expansion of computer networks and applications. The profiles of normal and intrusive activities change over time and new patterns appear constantly. Thus, a practical intrusion detection system has to adapt, modify and add new entries to its underlying model over time.
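One way to picture a model that adapts over time is a profile that is updated one observation at a time, without storing or reprocessing past data. The sketch below uses Welford's online algorithm for a running mean and variance; it is an illustrative choice, not a method described in this document, and the class name `RunningProfile` is hypothetical.

```python
class RunningProfile:
    """Incrementally maintained mean and variance of one numeric attribute."""

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the profile (Welford's algorithm)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two observations are seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


# Hypothetical stream of a numeric attribute, e.g. a service frequency.
profile = RunningProfile()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    profile.update(value)
```

Each update costs constant time and memory, which is the property that matters when records arrive continuously.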
A further feature is the complexity of attribute variables. Various types of attribute variables, including numerical, ordinal and nominal variables, appear in the data. Numeric variables such as the time stamp and the intensity or frequency of certain services are very common in intrusion detection input data, as are nominal variables such as user ID, port ID or command name. The relationship among these attributes may be very complicated. Some attributes may be highly correlated with other attributes. In addition, much noise exists in such data. The data result not only from intrusive activities, but also from normal activities. The distribution model for normal and intrusive activities may be unclear. All these features of the data require robust intrusion detection algorithms capable of handling various types of variables.
There are two major types of intrusion detection approaches in practical use: anomaly detection, and signature recognition (also called pattern matching). Anomaly detection attempts to learn the normal behavior of a subject in a computer system and to build a profile for it. A subject may be a user, a host machine or a network. Activities are classified as attacks if they deviate significantly from the normal profile. The techniques used in anomaly detection include logic-based profiling, artificial neural networks, regression, computer immunology, Markov chains, Bayesian networks, hidden Markov models and statistics-based profiling.
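A very simple form of statistics-based profiling can be sketched as follows: a normal profile is built as the mean and standard deviation of one numeric attribute, and an observation is flagged when it lies too many standard deviations from the mean. The function names, the threshold of 3.0, and the sample values are assumptions chosen for illustration; practical profiles are usually multivariate and more elaborate.

```python
import statistics


def build_profile(normal_values):
    """Build a normal profile (mean, sample standard deviation).

    normal_values: at least two observations known to be normal.
    """
    return statistics.mean(normal_values), statistics.stdev(normal_values)


def is_anomalous(value, profile, threshold=3.0):
    """Flag a value that deviates significantly from the normal profile."""
    mean, std = profile
    if std == 0:
        # No variation was observed in training: any change is a deviation.
        return value != mean
    return abs(value - mean) / std > threshold


# Hypothetical per-minute login counts observed during normal operation.
normal_counts = [10, 12, 11, 9, 10, 11, 10, 12]
profile = build_profile(normal_counts)

typical = is_anomalous(11, profile)   # close to the normal mean
unusual = is_anomalous(40, profile)   # far outside the normal range
```

A value flagged this way is only a deviation from the profile; as the text goes on to note, such a deviation may be a behavioral irregularity rather than an intrusion.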
A weakness of anomaly detection is that false positives are often given if the anomalies are caused by behavioral irregularities instead of intrusions. Signature recognition is better at handling irregularities but cannot detect novel intrusions. Hence, anomaly detection and signature recognition techniques are often used together to complement one another.
The signature recognition method attempts to recognize the signatures (patterns) of normal and intrusive activities that can be discovered from training data or human experience. Signature recognition algorithm types include string matching, state transition, Petri nets, rule-based systems, expert systems, decision trees, association rules, neural networks and genetic algorithms. The signatures are matched against incoming data to determine the nature of the activities. Signature recognition techniques fall into two types: programmed (manual) systems and self-learning (automatic) systems.
For programmed systems, the information related to the patterns and models of normal and intrusive activities in a computer system must be collected and then fed to the systems manually. Self-learning systems, when presented with training examples, automatically learn what constitutes normal and intrusive behavior. These systems then distinguish attacks from normal activities using this information.
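The matching step of a string-matching style of signature recognition can be sketched as a search for known event sequences inside the incoming event stream. The function name `match_signatures`, the signature labels, and the event names below are hypothetical, and the example corresponds to a programmed system in which the signatures are supplied by hand.

```python
def match_signatures(events, signatures):
    """Return the labels of all signatures found in the event stream.

    events: sequence of observed event names, in order of occurrence.
    signatures: dict mapping a label to the event sequence that defines it.
    A signature matches if its sequence occurs as a contiguous run of events.
    """
    events = list(events)
    matches = []
    for label, pattern in signatures.items():
        pattern = list(pattern)
        n = len(pattern)
        if n == 0:
            continue  # an empty signature carries no information
        # Slide a window of the signature's length over the event stream.
        for i in range(len(events) - n + 1):
            if events[i:i + n] == pattern:
                matches.append(label)
                break
    return matches


# Hypothetical hand-written signatures of intrusive activity.
signatures = {
    "PASSWORD_GUESS": ["login_fail", "login_fail", "login_fail"],
    "PRIV_ESCALATION": ["su", "chmod"],
}
stream = ["login_fail", "login_fail", "login_fail", "login_ok", "ls"]
found = match_signatures(stream, signatures)
```

Because only listed sequences can match, a sketch of this kind cannot flag an intrusion whose signature is not already known.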
As discussed above, intrusion detection systems generally process a very large volume of data from information systems. The profiles of normal and intrusive activities change over time and new patterns appear constantly. Thus, a practical intrusion detection system has to adapt, modify and add new entries to its underlying model over time. Moreover, such data includes a lot of noise. All these features make it difficult to manually program all the normal and intrusive patterns into a system.
Despite the popularity of the above systems, they have many weaknesses. Specifically, genetic algorithms and neural networks do not scale to large data sets because they manipulate large populations of genes or neurons and have a high computation cost. Association rules analysis is good at handling nominal variables, but is incapable of handling numeric values. Scalability is also a serious problem for association rules analysis if there are many different items in the data or the data contain many records. The Bayesian network used in eBayes TCP handles only nominal variables, needs a great deal of prior knowledge of the system when it builds the model, needs user configuration when it applies batch-mode adaptation of the model, and has a high computation cost for modeling a complex system. The decision tree is a very popular data mining technique and a promising tool for intrusion detection applications, but no decision tree algorithm offers both incremental learning and scalability. Both therefore remain issues for decision trees with regard to computation and storage cost. Thus far, none of the known algorithms fully meets these requirements.