1. Field of the Invention
The present invention relates to a method and system of data mining and, more particularly, to a method and system of data mining for identifying the occurrence of unusual events.
2. Description of the Related Art
Organizations collect huge volumes of data from their daily operations. This wealth of data is often under-utilized. Data mining is a known technology used to discover patterns and relationships in data. It involves the process of analyzing large amounts of data and applying advanced statistical analysis and modeling techniques to the data to find useful patterns and relationships. These patterns and relationships are used to discover key facts that can drive decision making. This helps companies reap rewards from their data warehouse investments, by transforming data into actionable knowledge and by revealing relationships, trends, and answers to specific questions that cannot be easily answered using traditional query and reporting tools.
Data mining, also known generically as “knowledge discovery,” is a relatively young, interdisciplinary field that cross-fertilizes ideas from several research areas, including machine learning, statistics, databases, and data visualization. With its origins in academia about ten years ago, the field has recently captured the imagination of the business world and is making important strides by creating knowledge discovery applications in many business areas, driven by the rapid growth of on-line data volumes. Fayyad et al. (“From Data Mining to Knowledge Discovery: An Overview,” in Chapter 1, Advances in Knowledge Discovery and Data Mining, American Association for Artificial Intelligence (1996)) present a good, though somewhat dated, overview of the field. Bigus (Data Mining with Neural Networks, McGraw-Hill (1996)) and Berry and Linoff (Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley & Sons (1997)), among others, have written introductory books on data mining that include good descriptions of several business applications.
With the widespread use of networked computers and the Internet, “electronic attacks” on such systems have become a serious problem. These unauthorized intrusions into computer systems and networks place unfortunate limitations on the users of the network systems, erode consumer confidence in providing confidential information to utilize such systems (e.g., for use in electronic commerce) and require the implementation of expensive and often cumbersome security measures to limit or stop such intrusions.
Intrusion detection systems have been developed to collect information from a variety of system and network sources and analyze the information for signs of unauthorized access to the system or network. A white paper published by the ICSA Intrusion Detection Systems Consortium in the spring of 1999, entitled “An Introduction to Intrusion Detection and Assessment” and incorporated herein fully by reference, provides a detailed discussion of the benefits and limitations of intrusion detection systems.
Commercial intrusion-detection systems (also referred to as “sensors” herein) often generate massive amounts of data. The data generally comprises “alarms” and sequences of alarms; the alarms merely indicate the occurrence of an event on the network or system. The occurrence of an alarm does not necessarily indicate that an intrusion event has occurred. An intrusion would likely generate many alarms or a particular sequence of alarms, which in their totality would indicate an intrusion and possibly generate a higher level “intrusion alarm.” Users of these systems use simple filters to screen alarms in order to cope with their sheer volume; little else is usually done with this data.
A good example of a user of intrusion detection systems is IBM. IBM provides real-time intrusion detection services to clients worldwide. Commercially available sensors, such as NetRanger from Cisco Systems, are deployed on customer networks. These sensors detect the occurrence of a variety of events, which in turn trigger alarms. All alarms are sent over the Internet to IBM's Network Operations Center (NOC) in Boulder, Colo., which provides 7×24 first-level monitoring and database storage of the alarms. Operators at the NOC deal with thousands of incoming alarms from each sensor every day, using sophisticated filtering and summarization tools to determine in real time the extent and source of potential attacks, i.e., to determine which alarms or alarm sequences indicate intrusion events. The filtering and summarization tools are typically developed in-house on an ad hoc basis and comprise tools that deal with the following: (i) receiving alarms as they come in from various sources around the world; (ii) translating alarms to a standard, vendor-independent format using internally-developed mapping tables and rules; (iii) assigning priority levels to alarms based on internally-developed static tables and rules; (iv) storing alarms in a database mainly for forensic purposes; (v) updating summary-level data (e.g., keeping a count of various alarm types); (vi) filtering alarms based on the assigned priority levels and summary thresholds; and (vii) presenting filtered streams of alarms and updated summaries to human operators so that the operators can decide whether an intrusion incident is occurring.
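By way of illustration only, steps (ii), (iii), (v), and (vi) of such a tool chain might be sketched in Python as follows. The mapping table, priority table, vendor and signature codes, thresholds, and function names are all hypothetical and are not taken from any actual NOC tooling; the sketch merely shows how static tables and simple count thresholds could drive the filtering.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical mapping table for step (ii): (vendor, vendor code) -> standard type.
VENDOR_TO_STANDARD = {
    ("vendor_x", "sig-scan"): "PORT_SCAN",
    ("vendor_x", "sig-ping"): "PING",
}

# Hypothetical static priority table for step (iii); higher means more severe.
STATIC_PRIORITY = {"PORT_SCAN": 3, "PING": 1}


@dataclass(frozen=True)
class Alarm:
    sensor: str
    alarm_type: str
    priority: int


def normalize(vendor, vendor_code, sensor):
    """Translate a raw alarm to the standard format and assign a static priority."""
    alarm_type = VENDOR_TO_STANDARD.get((vendor, vendor_code), "UNKNOWN")
    return Alarm(sensor, alarm_type, STATIC_PRIORITY.get(alarm_type, 0))


def process(raw_alarms, min_priority=2, count_threshold=3):
    """Update per-type counts (step v) and filter alarms for operators (step vi)."""
    counts = Counter()
    shown = []
    for vendor, vendor_code, sensor in raw_alarms:
        alarm = normalize(vendor, vendor_code, sensor)
        counts[alarm.alarm_type] += 1
        # Pass the alarm through if its static priority is high enough or
        # if its type has now been seen at least count_threshold times.
        if alarm.priority >= min_priority or counts[alarm.alarm_type] >= count_threshold:
            shown.append(alarm)
    return shown, counts
```

Note that the filter in this sketch consults only the alarm type and a running count; it does not consider which alarms preceded or followed a given alarm, or where the triggering traffic originated.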
Even though these tools perform admirably, their success depends critically on careful hand-crafting of the filtering and summarization rules. As the number of sensors deployed increases, the data volume rises, and this task becomes harder to keep up with. By necessity, most manually crafted rules are fairly simple, placing a lot of weight on priority levels statically pre-assigned to different alarm types and largely ignoring the context in which the alarms occur, such as precursor or successor alarms, the source or destination of the network traffic triggering the alarms, the timing of events, and the originating sensor. This is problematic, since it is often the context that determines the severity of an alarm or sequence of alarms, and failure to consider the context leads to one or both of (a) many false positive alarms and (b) many false negative alarms.
Accordingly, it would be desirable to have a method and system for identifying, in a historical data set, (a) commonly occurring events, event sequences, and event patterns and (b) their corresponding context, such as frequently occurring alarm events in an intrusion detection system that are not, in fact, indicia of unusual events, and for identifying, based on these identified events, sequences, or patterns, unusual events, sequences, or patterns occurring in a current data set.
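The general idea of learning what is common from historical data and then flagging what is uncommon in current data can be illustrated with a minimal Python sketch. This is a simple relative-frequency baseline offered only as an example and is not asserted to be the method of the invention; the event representation as an (alarm type, sensor) pair, the function names, and the 0.05 frequency cutoff are assumptions made for illustration.

```python
from collections import Counter


def build_baseline(historical_events):
    """Relative frequency of each (alarm_type, context) pair in the history."""
    counts = Counter(historical_events)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}


def find_unusual(current_events, baseline, min_frequency=0.05):
    """Return current events that were rare or absent in the historical set."""
    return [e for e in current_events if baseline.get(e, 0.0) < min_frequency]


# Hypothetical history: pings from sensor s1 dominate and are treated as routine.
historical = (
    [("PING", "s1")] * 95
    + [("PORT_SCAN", "s1")] * 4
    + [("LOGIN_FAIL", "s2")] * 1
)
baseline = build_baseline(historical)

current = [("PING", "s1"), ("PORT_SCAN", "s1"), ("PING", "s9")]
unusual = find_unusual(current, baseline)
```

In this sketch the same alarm type ("PING") is treated as routine from one sensor and as unusual from another, which is the kind of context sensitivity that statically assigned priority levels do not provide.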