1. Field of the Invention
The present invention relates generally to the field of computer systems software and computer network security. More specifically, it relates to software for detecting intrusions and security violations in a computer system using statistical pattern analysis techniques.
2. Discussion of Related Art
Computer network security has been an important issue for all types of organizations and corporations for many years. Computer break-ins and misuse have become common. The number, as well as the sophistication, of attacks on computer systems is on the rise. Often, network intruders have easily overcome the password authentication mechanism designed to protect the system. With an increased understanding of how systems work, intruders have become skilled at determining their weaknesses and exploiting them to obtain unauthorized privileges. Intruders also use patterns of intrusion that are often difficult to trace and identify. They use several levels of indirection before breaking into target systems and rarely indulge in sudden bursts of suspicious or anomalous activity. If an account on a target system is compromised, intruders may carefully cover their tracks so as not to arouse suspicion. Furthermore, threats like viruses and worms do not need human supervision and are capable of replicating and traveling to connected computer systems. Once they are unleashed at one computer, it is almost impossible, by the time they are discovered, to trace their origin or the extent of infection.
As the number of users within a particular entity grows, the risks from unauthorized intrusions into computer systems or into certain sensitive components of a large computer system increase. In order to maintain a reliable and secure computer network, regardless of network size, exposure to potential network intrusions must be reduced as much as possible. Network intrusions can originate from legitimate users within an entity attempting to access secure portions of the network or can originate from "hackers" or illegitimate users outside an entity attempting to break into the entity's network. Intrusions from either of these two groups of users can be damaging to an organization's computer network.
One approach to detecting computer network intrusions is analyzing command sequences input by users or intruders in a computer system. The goal is to determine when a possible intrusion is occurring and who the intruder is. This approach is referred to broadly as intrusion detection using pattern matching. Sequences of commands (typically operating system or non-application specific commands) and program or file names entered by each user are compared to anomalous command patterns derived through historical and other empirical data. By performing this matching or comparing, security programs can generally detect anomalous command sequences that can lead to detection of a possible intrusion.
FIG. 1 is a block diagram of a security system of a computer network as is presently known in the art. A network security system 10 is shown having four general components: an input sequence 12; a set of templates of suspect command sequences 14; a match component 16; and an output score 18. Input sequence 12 is a list of commands and program names entered in a computer system (not shown) in a particular order over a specific duration of time. The commands entered by a user that are typically external to a specific user application (e.g., a word processing program or database program) can be broadly classified as operating system level commands. The duration of time during which an input sequence is monitored can vary widely depending on the size of the network and the volume of traffic. Typical durations can be from 15 minutes to eight hours.
Template set 14 is a group of particular command sequences determined to be anomalous or suspicious for the given computer system. These suspect command sequences are typically determined empirically by network security specialists for the particular computer network within an organization or company. They are sequences of commands and program names that have proved in the past to be harmful to the network or are in some way indicative of a potential network intrusion. Thus, each command sequence is a template for an anomalous or harmful command sequence. Input sequence 12 and a command sequence template from template set 14 are routed to match component 16.
Component 16 typically uses some type of metric, for example a neural network, to perform a comparison between the input sequence and the next selected command sequence template. Once the match is performed between the two sequences, score 18 is output reflecting the closeness of the input sequence to the selected command sequence template. For example, a low score could indicate that the input sequence is not close to the template and a high score could indicate that the two are very similar or close. Thus, by examining score 18, computer security system 10 can determine whether an input sequence from a network user or hacker is a potential intrusion because the input sequence closely resembles a known anomalous command sequence.
Many computer network security systems presently in use and as shown in FIG. 1 have some significant drawbacks. One is often an overly complicated and inefficient matching metric or technique used to compare the two command sequences. The definition of "closeness" with these metrics is typically complicated and difficult to implement. Another drawback is also related to the matching metric used in matching component 16. Typically, matching metrics presently employed for intrusion detection in network security systems end their analysis after focusing only on the command sequences themselves. They do not take into account other information that may be available to define the closeness or similarity of the command sequences, which might lead to a better analysis.
Tools are therefore necessary to monitor systems, to detect break-ins, and to respond actively to attacks in real time. Most break-ins prevalent today exploit well-known security holes in system software. One solution to these problems is to study the characteristics of intrusions, to extrapolate from these the intrusion characteristics of the future, and to devise means of representing intrusions in a computer so that break-ins can be detected in real time.
Therefore, it would be desirable to use command sequence pattern matching for detecting network intrusion that has matching metrics that are efficient and simple to maintain and understand. It would be desirable if such matching metrics took advantage of relevant and useful information external to the immediate command sequence being analyzed, such as statistical data illustrative of the relationship between the command sequence and other users on the network. It would also be beneficial if such metrics provided a definition of closeness between two command sequences that is easy to interpret and manipulate by a network intrusion program.
To achieve the foregoing, methods, apparatus, and computer-readable media are disclosed which provide computer network intrusion detection. In one aspect of the invention, a method of detecting an intrusion in a computer network is disclosed. A sequence of user commands and program names and a template sequence of known harmful commands and program names from a set of such templates are retrieved. A closeness factor indicative of the similarity between the user command sequence and the template sequence is derived from comparing the two sequences. The user command sequence is compared to each template sequence in the set of templates thereby creating multiple closeness factors. The closeness factors are examined to determine which sequence template is most similar to the user command sequence. A frequency feature associated with the user command sequence and the most similar template sequence is calculated. It is then determined whether the user command sequence is a potential intrusion into restricted portions of the computer network by examining output from a modeler using the frequency feature as one input. Advantageously, network intrusions can be detected using matching metrics that are efficient and simple to maintain and understand.
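The template-selection step of this method can be illustrated with a minimal sketch. It assumes that a higher closeness factor means greater similarity. The function names `most_similar_template` and `overlap` are hypothetical, and `overlap` is only a placeholder closeness measure chosen for illustration, not one of the matching techniques of the invention.

```python
def overlap(seq_a, seq_b):
    """Placeholder closeness: fraction of distinct commands shared by both sequences."""
    set_a, set_b = set(seq_a), set(seq_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def most_similar_template(user_seq, templates, closeness):
    """Compare user_seq to every template; return (index of best template, all factors)."""
    if not templates:
        raise ValueError("template set is empty")
    # One closeness factor per template in the set.
    factors = [closeness(user_seq, template) for template in templates]
    # Assumption: the largest factor marks the most similar template.
    best_index = max(range(len(templates)), key=factors.__getitem__)
    return best_index, factors


if __name__ == "__main__":
    templates = [["rm", "-rf"], ["ls", "cat", "passwd"]]
    user_seq = ["ls", "cat", "passwd", "vi"]
    print(most_similar_template(user_seq, templates, overlap))
```

Any closeness function with the same two-argument signature could be passed in place of `overlap`.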
In one embodiment, the user command sequence is obtained by chronologically logging commands and program names entered in the computer network thereby creating a command log, and then arranging the command log according to individual users on the computer network. The user command sequence is identified from the command log using a predetermined time period. In another embodiment, the frequency of the user command sequence occurring in a command stream created by a network user from a general population of network users is determined. Another frequency value of how often the most similar sequence template occurs in a command stream created by all network users in the general population of network users is determined. The two frequency values are used to calculate a frequency feature.
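The logging and frequency steps of these embodiments might be sketched as follows. The log record layout `(timestamp, user, command)`, the fixed-window interpretation of the predetermined time period, and all function names are assumptions made for illustration. How the two frequency values are combined into the frequency feature is not specified in this section, so the sketch stops at computing a single frequency value.

```python
from collections import defaultdict


def group_by_user(log):
    """Arrange a command log of (timestamp, user, command) records per user, in time order."""
    per_user = defaultdict(list)
    for timestamp, user, command in sorted(log):
        per_user[user].append((timestamp, command))
    return per_user


def window_sequences(entries, period):
    """Split one user's (timestamp, command) entries into sequences of `period` time units."""
    windows = defaultdict(list)
    for timestamp, command in entries:
        windows[timestamp // period].append(command)
    return [windows[key] for key in sorted(windows)]


def relative_frequency(stream, seq):
    """Fraction of positions in `stream` at which `seq` occurs contiguously."""
    positions = len(stream) - len(seq) + 1
    if not seq or positions <= 0:
        return 0.0
    hits = sum(1 for i in range(positions) if stream[i:i + len(seq)] == seq)
    return hits / positions


if __name__ == "__main__":
    log = [(0, "alice", "ls"), (10, "bob", "cd"), (5, "alice", "cat"), (905, "alice", "rm")]
    per_user = group_by_user(log)
    # A 900-second (15-minute) period, the low end of the durations mentioned above.
    print(window_sequences(per_user["alice"], 900))
```

The same `relative_frequency` helper can be applied once to a single user's command stream and once to the stream of the general population to obtain the two frequency values.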
In another aspect of the present invention, a method of matching two command sequences in a network intrusion detection system is described. A user sequence having multiple user commands is retrieved, along with a template sequence having multiple template commands. The shorter of the two sequences is transformed to match the length of the longer sequence using unique, reserved characters. A similarity factor is derived from the number of matches between the user commands and the template commands by performing a series of comparisons between the user sequence and the template sequence. Similarity factors between the user sequence and each one of the template sequences are stored. The similarity between the user sequence and each one of the template sequences is determined by examining the similarity factors, thereby reducing the complexity of the matching component of the computer network intrusion system. Advantageously, this method performs better than the prior art in that it is less complex and easier to maintain. In one embodiment, the similarity factor is derived by shifting either the user commands in the user sequence or the template commands in the template sequence before performing each comparison.
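A minimal sketch of this padding-and-shifting comparison follows. Several details are assumptions not stated in this section: the shift is circular, the template is the sequence that is shifted, the best match count over all shifts is kept, and the result is divided by the padded length. The reserved symbols `PAD_USER` and `PAD_TEMPLATE` and the function names are hypothetical.

```python
# Hypothetical reserved symbols; they differ from each other and from any real
# command, so padding can never produce a match.
PAD_USER = "\x00USER_PAD"
PAD_TEMPLATE = "\x00TEMPLATE_PAD"


def pad_to_equal_length(user_seq, template_seq):
    """Extend the shorter sequence with its reserved symbol to the longer length."""
    length = max(len(user_seq), len(template_seq))
    padded_user = list(user_seq) + [PAD_USER] * (length - len(user_seq))
    padded_template = list(template_seq) + [PAD_TEMPLATE] * (length - len(template_seq))
    return padded_user, padded_template


def correlation_similarity(user_seq, template_seq):
    """Best fraction of position-wise matches over all circular shifts of the template."""
    user, template = pad_to_equal_length(user_seq, template_seq)
    length = len(user)
    if length == 0:
        return 0.0
    best = 0
    for shift in range(length):
        # Assumption: a circular shift of the template before each comparison.
        shifted = template[shift:] + template[:shift]
        matches = sum(1 for u, t in zip(user, shifted) if u == t)
        best = max(best, matches)
    return best / length


if __name__ == "__main__":
    print(correlation_similarity(["ls", "cd", "cat", "rm"], ["cat", "rm"]))
```

Under these assumptions the similarity factor lies between 0 and 1, which makes factors for templates of different lengths directly comparable.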
In another aspect of the invention, another method of matching two command sequences in a network intrusion detection system is described. A user sequence having multiple user commands is retrieved, along with a template sequence having multiple template commands. A user substring and a template substring are created. The user substring has user commands found in the template sequence and the template substring has stored commands found in the user sequence. The number of alterations needed to reorder either the user substring or the template substring to have the same order as one another is saved. The number of alterations needed to make the two substrings the same is indicative of the similarity between the user sequence and each one of the template sequences from the set of template sequences.
In one embodiment, an alteration is an inversion in which adjacent user commands or template commands are inverted until the order of commands in the two substrings is the same. In another embodiment, the number of alterations is normalized by dividing the number of alterations by the number of alterations that would be needed to make the two substrings the same if the commands in the substrings were in complete opposite order.
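The substring construction and normalized inversion count described above can be sketched as follows. The sketch assumes that repeated commands are reduced to their first occurrence, so that each substring is a permutation of the shared commands; the function names are hypothetical. It relies on the standard fact that the minimum number of adjacent inversions needed to reorder one permutation into another equals the number of out-of-order pairs, and that n commands in complete opposite order require n(n-1)/2 inversions.

```python
def common_substrings(user_seq, template_seq):
    """Return the commands shared by both sequences, in each sequence's own order."""
    user_set, template_set = set(user_seq), set(template_seq)
    # Assumption: keep only the first occurrence of a repeated command.
    user_sub = [c for c in dict.fromkeys(user_seq) if c in template_set]
    template_sub = [c for c in dict.fromkeys(template_seq) if c in user_set]
    return user_sub, template_sub


def count_inversions(user_sub, template_sub):
    """Adjacent inversions needed to put user_sub into the order of template_sub."""
    rank = {command: i for i, command in enumerate(template_sub)}
    order = [rank[command] for command in user_sub]
    # Each out-of-order pair costs exactly one adjacent inversion.
    return sum(
        1
        for i in range(len(order))
        for j in range(i + 1, len(order))
        if order[i] > order[j]
    )


def normalized_inversions(user_seq, template_seq):
    """Inversion count divided by the count for completely opposite order (0.0 to 1.0)."""
    user_sub, template_sub = common_substrings(user_seq, template_seq)
    n = len(user_sub)
    if n < 2:
        return 0.0  # fewer than two shared commands: nothing to reorder
    return count_inversions(user_sub, template_sub) / (n * (n - 1) / 2)


if __name__ == "__main__":
    print(normalized_inversions(["ls", "cd", "cat", "rm", "vi"], ["rm", "cat", "ls"]))
```

A value of 0.0 means the shared commands appear in the same order in both sequences, and 1.0 means they appear in complete opposite order. A caller would likely want to treat the case of very few shared commands separately, since it also yields 0.0 here.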
In another aspect of the invention, a system for detecting an intrusion in a computer network is described. An input sequence extractor retrieves a user input sequence, and a sequence template extractor retrieves a sequence template from a template set. A match component compares the user input sequence and the sequence template to derive a closeness factor. The closeness factor indicates a degree of similarity between the user input sequence and the sequence template. A features builder calculates a frequency feature associated with the user input sequence and a sequence template most similar to the user input sequence. A modeler uses the frequency feature as one input and output from the modeler can be examined to determine whether the user input sequence is a potential intrusion.
In one embodiment of the invention, the user input extractor has a command log containing commands and program names entered in the computer network and arranged chronologically and according to individual users on the computer network. The user input extractor also contains a sequence identifier that identifies the user input sequence from the command log using a given time period. In another embodiment of the invention, the sequence template extractor also has a command log that contains, in a chronological manner, commands and program names entered in the computer network. The extractor also has a command sequence identifier for identifying a command sequence determined to be suspicious from the command log, and a sequence template extractor that creates the sequence template from the command sequence. In yet another embodiment, the match component has a permutation matching component that compares the user input sequence and a sequence template from the sequence template set. In yet another embodiment, the match component has a correlation matching component that compares the user input sequence and a sequence template from the sequence template set.