1. Field of the Invention
The present invention relates generally to the field of computer systems software and computer network security. More specifically, it relates to software for examining user and group activity in a computer network and for training a model for use in detecting potential security violations in the network.
2. Discussion of Related Art
Computer network security is an important issue for all types of organizations and enterprises. Computer break-ins and misuse have become commonplace. The number, as well as the sophistication, of attacks on computer systems is on the rise. Often, network intruders have easily overcome the password authentication mechanism designed to protect the system. With an increased understanding of how systems work, intruders have become skilled at determining their weaknesses and exploiting them to obtain unauthorized privileges. Intruders also use patterns of intrusion that are often difficult to trace and identify. They use several levels of indirection before breaking into target systems and rarely indulge in sudden bursts of suspicious or anomalous activity. If an account on a target system is compromised, intruders can carefully cover their tracks so as not to arouse suspicion. Furthermore, threats like viruses and worms do not need human supervision and are capable of replicating and traveling to connected computer systems. Once they are unleashed at one computer, it is almost impossible, by the time they are discovered, to trace their origin or the extent of infection.
As the number of users within a particular entity grows, the risks from unauthorized intrusions into computer systems or into certain sensitive components of a large computer system increase. In order to maintain a reliable and secure computer network, regardless of network size, exposure to potential network intrusions must be reduced as much as possible. Network intrusions can originate from legitimate users within an entity attempting to access secure portions of the network or can originate from illegitimate users outside an entity attempting to break into the entity's network, often referred to as “hackers.” Intrusions from either of these two groups of users can be damaging to an organization's computer network. Most attempted security violations are internal; that is, they are attempted by employees of an enterprise or organization.
One approach to detecting computer network intrusions is calculating “features” based on various factors, such as command sequences, user activity, machine usage loads, resource violations, files accessed, data transferred, terminal activity, and network activity, among others. Features are then used as input to a model or expert system which determines whether a possible intrusion or violation has occurred. The use of features is well-known in various fields in computer science including the field of computer network security, especially in conjunction with an expert system which evaluates the feature values. Features used in present computer security systems are generally rule-based features. Such features lead to computer security systems that are inflexible, highly complex, and require frequent upgrading and maintenance.
Expert systems that use such features generally use thresholds (e.g., “if-then-else” clauses, “case” statements, etc.) to determine whether there was a violation. Thus, a human expert with extensive knowledge of the computer network domain has to accurately determine and assign such thresholds for the system to be effective. These thresholds and other rules are typically not modified often and do not reflect day-to-day fluctuations based on changing user behavior. Such rules are typically entered by an individual with extensive domain knowledge of the particular system. In short, such systems lack the robustness needed to detect increasingly sophisticated lines of attack in a computer system. A reliable computer system must be able to accurately determine when a possible intrusion is occurring and who the intruder is, and do so by taking into account trends in user activity.
As mentioned above, rule-based features can also be used as input to a model instead of an expert system. However, a model that can accept only rule-based features and cannot be trained to adjust to trends and changing needs in a computer network generally suffers from the same drawbacks as the expert system configuration. A model is generally used in conjunction with a features generator and accepts as input a features list. However, models presently used in computer network intrusion detection systems are not trained to take into account changing requirements and user trends in a computer network. Thus, such models also lead to computer security systems that are inflexible, complex, and require frequent upgrading and maintenance.
FIG. 1 is a block diagram depicting certain components in a security system in a computer network as is presently known in the art. A features/expert systems component 10 of a complete network security system (not shown) has three general components: user activity 12, expert system 14, and alert messages 16. User activity 12 contains “raw” data, typically in the form of aggregated log files, and is raw in that it is typically unmodified or has not gone through significant preprocessing. User activity 12 has records of actions taken by users on the network that the organization or enterprise wants to monitor.
Expert system 14, also referred to as a “rule-based” engine, accepts input data from user activity files 12, which act as features in present security systems. As mentioned above, the expert system, a term well-understood in the field of computer science, processes the input features and determines, based on its rules, whether a violation has occurred or whether there is anomalous activity. In two simple examples, expert system 14 can contain a rule instructing it to issue an alert message if a user attempts to log on using an incorrect password more than five consecutive times or if a user attempts to write to a restricted file more than once.
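The two example rules above can be sketched as fixed-threshold checks. This is only an illustration of the rule-based approach being criticized, not an implementation taken from any particular system; the `Event` record, the action names, and the function `check_rules` are hypothetical names introduced here.

```python
from dataclasses import dataclass

# Thresholds taken from the two example rules in the text.
MAX_FAILED_LOGINS = 5       # alert on more than five consecutive failures
MAX_RESTRICTED_WRITES = 1   # alert on more than one restricted write


@dataclass
class Event:
    user: str
    action: str  # hypothetical values: "login_failed", "login_ok", "write_restricted"


def check_rules(events):
    """Return (user, reason) alerts for a time-ordered list of events."""
    consecutive_failures = {}
    restricted_writes = {}
    alerts = []
    for e in events:
        if e.action == "login_failed":
            consecutive_failures[e.user] = consecutive_failures.get(e.user, 0) + 1
            # Fire once, at the moment the threshold is first exceeded.
            if consecutive_failures[e.user] == MAX_FAILED_LOGINS + 1:
                alerts.append((e.user, "failed_logins"))
        elif e.action == "login_ok":
            # A successful logon breaks the run of consecutive failures.
            consecutive_failures[e.user] = 0
        elif e.action == "write_restricted":
            restricted_writes[e.user] = restricted_writes.get(e.user, 0) + 1
            if restricted_writes[e.user] == MAX_RESTRICTED_WRITES + 1:
                alerts.append((e.user, "restricted_writes"))
    return alerts
```

Note that both thresholds are hard-coded constants: if the restricted file is later opened to more users, someone has to edit the rule by hand, which is the maintenance burden discussed in this section.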
Alert message 16 is issued if a rule threshold is exceeded to inform a network security analyst that a possible intrusion may be occurring. Typically, alert message 16 contains a score and a reason for the alert, i.e., which rules or thresholds were violated by a user. As stated above, these thresholds can be outdated or moot if circumstances change in the system. For example, circumstances can change and the restricted file mentioned above can be made accessible to a larger group of users. In this case an expert would have to modify the rules in expert system 14.
As mentioned above, the feature and expert system components as shown in FIG. 1 and conventional models used in conjunction with these components have significant drawbacks. One is the cumbersome and overly complex set of rules and thresholds that must be entered to “cover” all the possible security violations. Another is the knowledge an expert must have in order to update or modify the rule base and the model to reflect changing circumstances in the organization. Related to this is the difficulty in locating an expert to assist in programming and maintaining all components in the system.
Therefore, it would be desirable to utilize, in place of a traditional expert system, a features list generator that can automatically update itself to reflect changes in the current behavior of users and user groups. It would also be desirable to derive a training process for a model used in conjunction with a features generator to generate a score reflective of changing user behavior. It would also be desirable to have the training process or algorithm accurately read anomalous user behavior. Furthermore, it would be desirable to have such a features generator be self-sufficient and flexible in that it is not dependent on changes entered by an expert and is not a rigid rule-based system.
To achieve the foregoing, methods, apparatus, and computer-readable media are disclosed which provide computer network intrusion detection. In one aspect of the present invention, a method of artificially creating anomalous data for creating an artificial set of features reflecting anomalous behavior for a particular activity is described. A feature is selected from a features list. Normal-feature values associated with the feature are retrieved. A distribution of users of normal feature values and an expected distribution of users of anomalous feature values are then defined. Anomalous-behavior feature values are then produced. Advantageously, a network intrusion detection system can use a neural-network model that utilizes the artificially created anomalous-behavior feature values to detect potential intrusions into the computer network.
In one embodiment, a normal-behavior histogram indicating a distribution of users is defined. In another embodiment, it is determined whether the activity corresponding to anomalous feature values is performed more or less frequently than normal. In yet another embodiment, an anomalous-behavior histogram indicating an expected distribution of users is defined. In yet another embodiment, the anomalous-behavior histogram is sampled. In yet another embodiment, numerous anomalous-behavior feature values for each feature in the list of features are produced, thereby creating a set of numerous anomalous-behavior feature values. In yet another embodiment, an anomalous features list is derived from a set of numerous anomalous-behavior feature values.
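As a rough sketch of the sequence just described for a single feature (build a normal-behavior histogram, decide whether anomalous activity is more or less frequent than normal, define an anomalous-behavior histogram, then sample it), the following may help. The text does not specify the shape of the anomalous-behavior histogram; the weighting used here (more weight toward the anomalous tail and where normal users are rare) is purely an assumption for illustration, as are all function names and the equal-width binning.

```python
import random


def normal_histogram(values, num_bins, lo, hi):
    """Count users per equal-width bin on [lo, hi]."""
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(idx, num_bins - 1))] += 1
    return counts


def anomalous_histogram(normal_counts, more_frequent=True):
    """ASSUMED shape: favor the tail where anomalies are expected and
    down-weight bins where many normal users already sit."""
    n = len(normal_counts)
    weights = []
    for i, count in enumerate(normal_counts):
        tail = (i + 1) if more_frequent else (n - i)
        weights.append(tail / (1 + count))
    return weights


def sample_anomalous_values(weights, lo, hi, k, seed=0):
    """Draw k feature values: pick a bin by weight, then a point inside it."""
    rng = random.Random(seed)
    width = (hi - lo) / len(weights)
    bins = rng.choices(range(len(weights)), weights=weights, k=k)
    return [lo + (b + rng.random()) * width for b in bins]
```

Repeating this for every feature in the features list would yield a set of artificial anomalous-behavior feature values, from which an anomalous features list could be assembled.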
In another aspect of the present invention, a method of training a model for use in a computer network intrusion detection system is described. Anomalous feature values are defined and normal feature values are retrieved. A ratio of anomalous feature values and normal feature values is determined. A particular amount of anomalous feature values and a particular amount of normal feature values are used as input to the model according to the ratio. By inputting the feature values based on the ratio, the model utilizes the particular amount of anomalous feature values and the particular amount of normal feature values to derive a score for a user activity.
In one embodiment, the model is trained using a neural network algorithm. In another embodiment, a probability factor for use in determining the ratio of anomalous feature values and normal feature values is derived. In another embodiment, an anomalous feature data list from numerous anomalous feature values is randomly selected. Similarly, a normal feature data list from numerous normal feature values is randomly selected. In yet another embodiment, a desired score is assigned for the selected feature data list used as input to the model.
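The selection step described in these embodiments can be sketched as follows: a probability factor decides, for each training example, whether a feature data list is drawn at random from the anomalous pool or from the normal pool, and a desired score is attached to the selected list. The function name, the seed parameter, and the specific desired scores (1.0 for anomalous, 0.0 for normal) are assumptions made for illustration; the text does not fix these values, and the neural network training itself is not shown.

```python
import random

# ASSUMED target scores; the text only says a desired score is assigned.
ANOMALOUS_SCORE = 1.0
NORMAL_SCORE = 0.0


def build_training_set(normal_lists, anomalous_lists, p_anomalous, n, seed=0):
    """Return n (feature_list, desired_score) pairs mixed by a probability factor.

    p_anomalous is the probability that any given example is drawn from the
    anomalous pool, so the expected anomalous-to-normal ratio is
    p_anomalous : (1 - p_anomalous).
    """
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        if rng.random() < p_anomalous:
            examples.append((rng.choice(anomalous_lists), ANOMALOUS_SCORE))
        else:
            examples.append((rng.choice(normal_lists), NORMAL_SCORE))
    return examples
```

Each returned pair could then be presented to a neural network trainer as one input vector and its target output.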
In another aspect of the present invention, a computer network intrusion detection system for detecting possible violations in a computer network is described. The system includes user activity files containing records relating to activities performed by users on the system and historical data files containing user historical data and user group or peer historical data. A feature generator generates a features list and accepts as input the user historical data and the peer historical data. A model is trained to process the features list and output a final score indicative of whether a user activity is a potential intrusion or violation in the computer system.
In one embodiment, the user historical data contains a series of user historical means and user historical standard deviations, and the peer historical data contains a series of peer historical means and peer historical standard deviations. In another embodiment, the features generator accepts as input the user historical means and the user historical standard deviations. In yet another embodiment, the computer network intrusion detection system contains a set of features reflecting anomalous behavior. In yet another embodiment, the computer network intrusion detection system has an anomalous feature data store for storing sets of anomalous feature values. In yet another embodiment, the network intrusion detection system also includes a data selector for selecting either normal feature data or anomalous feature data based on a predetermined ratio, and a neural network training component that accepts as input either the normal feature data or the anomalous feature data as determined by the data selector.
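One plausible way a features generator could combine current activity with historical means and standard deviations is to express each activity count as a number of standard deviations from the user's own history and from the peer group's history. This summary does not state the formula used, so the normalized deviation below is an assumption, and the function names and dictionary layout are hypothetical.

```python
def deviation(value, mean, std, eps=1e-9):
    """Normalized deviation of value from a historical mean (ASSUMED formula).

    eps guards against division by zero when the history shows no variation.
    """
    return (value - mean) / max(std, eps)


def features_for_activity(current, user_hist, peer_hist):
    """Build a flat features list for one user.

    current:   {activity_name: current_value}
    user_hist: {activity_name: (user_mean, user_std)}
    peer_hist: {activity_name: (peer_mean, peer_std)}
    Two features are emitted per activity: deviation from the user's own
    history, then deviation from the peer group's history.
    """
    features = []
    for name in sorted(current):
        u_mean, u_std = user_hist[name]
        p_mean, p_std = peer_hist[name]
        features.append(deviation(current[name], u_mean, u_std))
        features.append(deviation(current[name], p_mean, p_std))
    return features
```

For example, a user with 12 logons today, a personal history of mean 4 and standard deviation 2, and a peer history of mean 6 and standard deviation 3 would yield the features 4.0 and 2.0 for that activity.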