The present invention relates to a sequential data examination method of determining whether or not sequential data belong to one or more categories.
To detect so-called “masquerading” or “spoofing”, in which an attacker gains unauthorized access to a computer by stealing a user's password and pretending to be that user, it is effective to use an anomaly detection system that examines whether sequential data entered into the computer contain any anomaly, namely, whether the entered sequential data have been created by a masquerader or spoofer. Typically, a conventional anomaly detection system first creates a profile defining a normal user's behavior (features appearing in the user-created sequential data). It then determines whether the entered sequential data have been created by the normal user or by a masquerader by comparing a profile of the entered sequential data under test with the profile of that user.
The sequential data to be tested typically include issued UNIX (registered trademark) commands and accessed files. The process of identifying the entered sequential data as normal or anomalous is divided into two steps. At the first step, features are extracted from the sequential data. At the second step, the extracted features are identified as normal or anomalous.
Typical conventional techniques of performing feature extraction (the first step) are “Histogram” and “N-grams”. In the histogram technique, frequency vectors of observed events within the sequential data are used as feature vectors. In the N-grams technique, N consecutive events are defined as one feature. [Non-patent Documents 1 to 3]
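The two feature-extraction techniques described above can be illustrated with a minimal sketch. The function names, the sample command sequence, and the vocabulary ordering are assumptions made for illustration only and do not appear in the cited documents.

```python
from collections import Counter


def histogram_features(events, vocabulary):
    """Return a frequency vector with one count per event type, in vocabulary order."""
    counts = Counter(events)
    return [counts[event] for event in vocabulary]


def ngram_features(events, n):
    """Count every run of n consecutive events as one feature."""
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))


# Hypothetical sequence of issued commands (illustrative data only).
commands = ["cd", "ls", "less", "ls", "less", "cd"]
vocabulary = sorted(set(commands))  # ['cd', 'less', 'ls']

hist = histogram_features(commands, vocabulary)
bigrams = ngram_features(commands, 2)
```

In this sketch the histogram keeps only how often each command occurs, and the bigram counts keep only which commands are immediately adjacent.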
Various approaches have been proposed as a technique of performing the second step, namely, identifying the extracted features as normal or anomalous. Such approaches typically include “Rule-based” [Non-patent Document 4], “Automaton” [Non-patent Document 5], “Bayesian Network” [Non-patent Document 6], “Naive Bayes” [Non-patent Document 7], “Neural Network” [Non-patent Document 8], “Markov Model” [Non-patent Document 9], and “Hidden Markov Model” [Non-patent Document 10].
The inventors of the present invention have proposed another method called “Eigen Co-occurrence Matrix (ECM)” which captures dynamic information on a user's behavior and extracts features from the user's sequential data [Non-patent Document 11]. The ECM approach correlates events while taking account of the sequential data. The event correlation focuses on the event pair and represents correlations of all event pairs as co-occurrence matrices. In the co-occurrence matrix, the strength of the correlation of each event pair is represented by the distance over which the event pair spreads and the frequency at which that event pair occurs.
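A co-occurrence matrix of event pairs can be sketched as follows. This is a simplified illustration, not the formulation of Non-patent Document 11: it assumes that a pair (a, b) is counted each time b appears within a fixed number of positions after a, and the function name and the `scope` parameter are hypothetical.

```python
def cooccurrence_matrix(events, vocabulary, scope):
    """Count, for each ordered pair (a, b), how often b follows a within `scope` positions.

    Assumption for illustration: every occurrence inside the window adds 1,
    regardless of the exact distance between the two events.
    """
    index = {event: i for i, event in enumerate(vocabulary)}
    size = len(vocabulary)
    matrix = [[0] * size for _ in range(size)]
    for i, first in enumerate(events):
        # Look ahead at most `scope` events after position i.
        for later in events[i + 1:i + 1 + scope]:
            matrix[index[first]][index[later]] += 1
    return matrix


# Hypothetical command sequence (illustrative data only).
commands = ["cd", "ls", "less", "ls", "less", "cd"]
vocabulary = ["cd", "ls", "less"]

matrix = cooccurrence_matrix(commands, vocabulary, scope=2)
```

Unlike a bigram count, the matrix in this sketch also records pairs of events that are not adjacent, as long as they fall within the chosen window.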
In the histogram approach, the feature is defined as a frequency vector of item (event) occurrences within a sequence. In the N-grams approach, the feature is defined as N consecutive items (events). These conventional approaches have two problems. First, dynamic information on a user's behavior appearing in the sequential data is lost: the characteristic features of each user, defined by the types of events appearing within his/her sequence and the order in which these events appear, are not available. Second, only the features of a single event or of adjacent events can be represented.
When using the ECM method proposed by the inventors of the present invention to distinguish an authorized user from a masquerader, it is appropriate to employ a statistical pattern recognition technique in which a co-occurrence matrix is handled as a pattern. The simplest pattern recognition technique is one based on pattern matching. When co-occurrence matrices are handled as patterns, the patterns become high-dimensional. In pattern matching, it is therefore effective to extract features (which compresses the information) for pattern recognition. The specific technique proposed by the inventors of the present invention determines whether or not sequential data belong to one or more categories (that is, whether the sequential data have been created by an authorized user) by computing feature vectors from co-occurrence matrices and checking, with a specified vector identification function, whether the Euclidean distance between the feature vector of the sequential data and the reference feature vectors used for determination is below a threshold. Although this technique attains a certain level of checking accuracy, there is a limit to how far that accuracy can be improved.
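The distance-based determination described above can be sketched as follows. This is a minimal illustration under stated assumptions: the feature vectors are taken as already computed, the test uses the distance to the nearest reference vector, and the function name, sample vectors, and threshold value are hypothetical.

```python
import math


def belongs_to_category(feature_vector, reference_vectors, threshold):
    """Return True if the nearest reference vector lies within `threshold`.

    Assumption for illustration: the smallest Euclidean distance to any
    reference vector is the quantity compared against the threshold.
    """
    nearest = min(math.dist(feature_vector, ref) for ref in reference_vectors)
    return nearest < threshold


# Hypothetical reference feature vectors for one authorized user.
references = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]]

accepted = belongs_to_category([1.0, 0.0, 2.5], references, threshold=1.0)
rejected = belongs_to_category([4.0, 4.0, 2.0], references, threshold=1.0)
```

Here the first test vector is at distance 0.5 from a reference vector and is accepted, while the second is at distance 5 from both reference vectors and is rejected.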
The non-patent documents referred to herein are:
[Non-patent Document 1] Ye, X. Li, Q. Chen, S. M. Emran, and M. Xu; “Probabilistic Techniques for Intrusion Detection Based on Computer Audit Data”; IEEE Transactions on Systems, Man, and Cybernetics, Vol. 31, pp. 266-274, 2001
[Non-patent Document 2] S. A. Hofmeyr, S. Forrest and A. Somayaji; “Intrusion Detection using Sequences of System Calls”; Journal of Computer Security, vol. 6, pp. 151-180, 1998
[Non-patent Document 3] W. Lee and S. J. Stolfo; “A framework for constructing features and models for intrusion detection systems”; Information and Systems Security, vol. 3, pp. 227-261, 200
[Non-patent Document 4] N. Habra, B. L. Charlier, A. Mounji, and I. Mathieu; “ASAX: Software Architecture and Rule-Based Language for Universal Audit Trail Analysis”; In Proc. of European Symposium on Research in Computer Security (ESORICS), pp. 435-450, 1992
[Non-patent Document 5] R. Sekar, M. Bendre, and P. Bollineni; “A Fast Automaton Based Method for Detecting Anomalous Program Behaviors”; In Proceedings of the 2001 IEEE Symposium on Security and Privacy, pp. 144-155, Oakland, May 2001
[Non-patent Document 6] W. DuMouchel; “Computer Intrusion Detection Based on Bayes Factors for Comparing Command Transition Probabilities”; Technical Report TR91, National Institute of Statistical Science (NISS), 1999
[Non-patent Document 7] R. A. Maxion and T. N. Townsend; “Masquerade Detection Using Truncated Command Lines”; In Proc. of the International Conference on Dependable Systems and Networks (DSN-02), pp. 219-228, 2002
[Non-patent Document 8] A. K. Ghosh, A. Schwartzbard, and M. Schatz; “A study in using neural networks for anomaly and misuse detection”; In Proc. of USENIX Security Symposium, pp. 141-151, 1999
[Non-patent Document 9] J. S. Tan, K. M. C., and R. A. Maxion; “Markov Chains, Classifiers and Intrusion Detection”; In Proc. of 14th IEEE Computer Security Foundations Workshop, pp. 206-219, 2001
[Non-patent Document 10] C. Warrender, S. Forrest, and B. A. Pearlmutter; “Detecting Intrusions using System Calls: Alternative Data Models”; In IEEE Symposium on Security and Privacy, pp. 133-145, 1999
[Non-patent Document 11] Mizuki Oka, Yoshihiro Oyama, and Kazuhiko Kato; “Eigen Co-occurrence Matrix Method for Masquerade Detection”; In Proceedings of 7th Programming and Applied Systems Workshop sponsored by Software Academy of Japan on Mar. 1, 2004