Methods for determining the degree of anomaly of multi-dimensional data, in systems where such data are sequentially generated or obtained, have been proposed in various real-world fields.
The following documents are considered herein:
[Patent Document 1] Published Unexamined Patent Application No. 10-254899
[Non-Patent Document 1] David Marchette, “A Statistical Method for Profiling Network Traffic”, Workshop on Intrusion Detection and Network Monitoring, 1999, pp. 119-128
[Non-Patent Document 2] Kenji Kita, Kazuhiko Tsuda, and Masami Shishibori, “Information retrieval algorithm”, Kyoritsu Shuppan, 2002
[Non-Patent Document 3] Nadeem Ahmed Syed, Huan Liu, and Kah Kay Sung, “Handling concept drifts in incremental learning with support vector machines”, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 317-321, 1999
[Non-Patent Document 4] Tom M. Mitchell, “Machine Learning”, McGraw Hill, Chapter 6, 1997
[Non-Patent Document 5] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, “Generative model-based clustering of directional data”, Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 19-28, 2003
[Non-Patent Document 6] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, Richard A. Harshman, “Indexing by Latent Semantic Analysis”, Journal of the American Society of Information Science, Vol. 41, No. 6, pp. 391-407, 1990
[Non-Patent Document 7] C. D. Manning and H. Schutze, “Foundation of Statistical Natural Language Processing”, MIT Press, Section 7.2.1, 2000
[Non-Patent Document 8] T. Joachims, “Text categorization with support vector machines: Learning with many relevant features”, Proceedings of ECML, 1998
[Non-Patent Document 9] H. Li and K. Yamanishi, “Text classification using ESC-based stochastic decision lists”, Proceedings of the ACM CIKM, pp. 122-130, 1999
[Non-Patent Document 10] M. Ghil, M. R. Allen, M. D. Dettinger, K. Ide, D. Kondrashov, M. E. Mann, A. W. Robertson, A. Saunders, Y. Tian, F. Varadi, and P. Yiou, “Advanced Spectral Methods for Climatic Time Series”, Reviews of Geophysics, 40 (2002), pp. 1-41, 2002
[Non-Patent Document 11] N. Golyandina, V. Nekrutkin and A. Zhigljavsky, “Analysis of Time Series Structure: SSA and Related techniques”, Chapman and Hall/CRC 2001
For example, in a study, the probability of connection requests to each port number of a computer system is represented by a feature vector and an intrusion into the computer system is detected by using a clustering technique for the feature vectors (see Non-Patent Document 1).
The method described above can also be used in text information processing. For example, in a text classification problem, a text is represented by a vector whose elements are the occurrence frequencies of words, or quantities obtained by converting those frequencies using tf-idf. This type of modeling of text data is called a vector-space model (see Non-Patent Document 2). A similarity, such as the cosine measure, between the text vector of a newly arrived text and the typical vector of each category is then calculated and, from the similarity, it is determined whether or not the text belongs to the category, or whether or not a classification performed by a classifier is proper (see Patent Document 1, for example). This processing classifies the text vector according to its dissimilarity to the known typical vector, and therefore the text classification problem can be regarded as a form of anomaly detection.
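The vector-space model described above can be sketched as follows. This is a minimal illustration only: the weighting tf × log(N/df) is one common tf-idf variant chosen here as an assumption, and the function names `tfidf_vectors` and `cosine_similarity` and the sample texts are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using the assumed weighting tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for words in tokenized for word in words})
    n_docs = len(tokenized)
    # Document frequency: number of texts in which each word occurs.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append(
            [counts[w] * math.log(n_docs / doc_freq[w]) for w in vocabulary]
        )
    return vocabulary, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample texts: two about markets, one about sports.
docs = ["stock market prices rise", "market prices fall", "team wins the match"]
vocab, vecs = tfidf_vectors(docs)
sim_related = cosine_similarity(vecs[0], vecs[1])
sim_unrelated = cosine_similarity(vecs[0], vecs[2])
```

In this toy data the first two texts share words and so have a positive similarity, while the first and third share none and have a similarity of zero.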
(1) Directional Data:
In anomaly detection in multi-dimensional data, the data in which an anomaly is to be detected is often normalized to directional data. Directional data can be defined as a vector whose L2 norm is normalized to a constant value such as 1 (that is, a vector in which the sum of squares of the elements is equal to a constant value such as 1). Directional data is therefore data in which only the direction carries meaning. For example, in a text classification problem, a text vector, which is a multi-dimensional vector, is generated based on the occurrence frequencies of words. Because the norm of the text vector grows with the total number of words in the text, the text vector must be normalized to a fixed norm, yielding a directional vector, before similarities can be compared properly.
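The normalization described above can be illustrated with a short sketch. The function name `to_directional` and the word-count vectors are hypothetical and chosen only for illustration.

```python
import math


def to_directional(vector):
    """Scale a vector so that its L2 norm equals 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("a zero vector has no direction")
    return [x / norm for x in vector]


# Hypothetical word-count vectors: the long text uses the same words
# in the same proportions as the short one, only ten times as often.
short_text = [1, 2, 2]
long_text = [10, 20, 20]

u_short = to_directional(short_text)
u_long = to_directional(long_text)
```

After normalization the two vectors coincide, so the difference in text length no longer affects the comparison.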
In some cases, a vector whose L1 norm, rather than its L2 norm, is normalized may be used, in the sense of normalizing probabilities (see Non-Patent Document 1). However, each element can readily be reformulated so as to represent a probability amplitude (that is, a quantity whose squared absolute value gives the probability), and therefore a problem using a vector whose L1 norm is normalized reduces to a directional data problem.
As has been described above, the problem of detecting an anomaly by comparing directional data, obtained by normalizing monitored data, with a reference vector can be applied to various fields. Hereinafter, an object in which an anomaly is detected in this way is called a “dynamic system.”
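The reformulation from probabilities to probability amplitudes can be sketched as follows, taking the square root of each element as the amplitude. The function name `to_amplitudes` and the sample probabilities are hypothetical.

```python
import math


def to_amplitudes(probabilities):
    """Map an L1-normalized probability vector to a vector of unit L2 norm."""
    if any(p < 0.0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # The square of each amplitude gives back the original probability.
    return [math.sqrt(p) for p in probabilities]


# Hypothetical connection-request probabilities for three port numbers.
port_probabilities = [0.5, 0.25, 0.25]
amplitudes = to_amplitudes(port_probabilities)
squared_norm = sum(a * a for a in amplitudes)
```

Because the squared amplitudes sum to the total probability, the resulting vector has an L2 norm of 1 and can be treated as directional data.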
(2) Distance Measure of Directional Data:
As the distance measure used in comparison between a directional data item and a reference vector, the cosine measure defined by Expression (1) is widely used.
[Expression 1]
$$ z \equiv 1 - r^{T} u \qquad (1) $$
Here, r denotes a predetermined reference vector (directional data), u denotes an observation vector (directional data) observed from a monitored dynamic system, and the superscript T denotes transpose. As apparent from Expression (1), z is equal to 0 if the observation vector matches the reference vector, but is equal to 1 if the observation vector is orthogonal to the reference vector. Because of this nature, z can be used as the index of the dissimilarity of the observation vector to the reference vector.
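As a small worked example of the dissimilarity z = 1 − rᵀu for unit-norm vectors, consider the sketch below. The function name `dissimilarity` and the two-dimensional vectors are hypothetical.

```python
def dissimilarity(r, u):
    """Return z = 1 - r^T u for a reference vector r and an observation u.

    Both vectors are assumed to be already normalized to unit L2 norm.
    """
    if len(r) != len(u):
        raise ValueError("r and u must have the same dimension")
    return 1.0 - sum(ri * ui for ri, ui in zip(r, u))


r = [1.0, 0.0]
z_same = dissimilarity(r, [1.0, 0.0])        # observation matches the reference
z_orthogonal = dissimilarity(r, [0.0, 1.0])  # observation orthogonal to it
z_opposite = dissimilarity(r, [-1.0, 0.0])   # observation points the other way
```

The first two cases give 0 and 1 as stated above; an observation pointing opposite to the reference gives the largest possible value, 2.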
(3) Anomaly Detection:
In a text classification problem, if the dissimilarity z in Expression (1) of an observation vector u obtained from given text data to the reference vector r corresponding to a given category exceeds a threshold zth (z>zth), it is usually determined that the text data does not belong to the category. That is, a reference vector r and a threshold zth are set for each category and the dissimilarity z is compared with the threshold zth for each category to determine whether or not the text data belongs to the category.
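The per-category determination described above can be sketched as follows. The category names, reference vectors, thresholds (written `z_th` in the code), and the function name `matching_categories` are all hypothetical values chosen for illustration.

```python
def dissimilarity(r, u):
    """Return z = 1 - r^T u for unit-norm vectors r and u."""
    return 1.0 - sum(ri * ui for ri, ui in zip(r, u))


def matching_categories(u, categories):
    """Return the names of the categories whose threshold is not exceeded.

    `categories` maps a name to a (reference vector, threshold) pair.
    A text is rejected from a category when z > z_th.
    """
    return [
        name
        for name, (r, z_th) in categories.items()
        if dissimilarity(r, u) <= z_th
    ]


# Hypothetical categories, each with its own reference vector and threshold.
categories = {
    "sports": ([1.0, 0.0], 0.3),
    "finance": ([0.0, 1.0], 0.3),
}
u = [0.8, 0.6]  # unit-norm observation vector
matches = matching_categories(u, categories)
```

Here z is about 0.2 for the first category and 0.4 for the second, so with a threshold of 0.3 the observation is accepted only by the first.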
The following are problems to be solved by the invention. The art described above has the following problems:
(1) Difficulty of Setting the Thresholding Condition
In the anomaly detecting method described above, appropriate determination criteria must be set. However, it is difficult to set appropriate determination criteria with the prior art. More specifically, if the nature of a text to be classified and the set of texts to be classified are known, the anomaly thresholding condition can be found based on the result of classification of the text data. However, in a case where unknown text data arrive sequentially online, it is difficult to set a threshold zth properly even if the values of already arrived data are available. This is because it is difficult to properly evaluate the size of each cluster resulting from the classification. In conventional approaches, the threshold zth is typically determined by comparing the deviation from the average value with the standard deviation, on the assumption that the dissimilarity z approximately follows a normal distribution. Generally, this assumption does not hold. It is especially inappropriate when directional data is used, because the directional data is normalized.
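The conventional approach mentioned above can be sketched as setting the threshold a fixed number of standard deviations above the mean. The multiplier `k`, the function name `normal_assumption_bounds`, and the sample values are assumptions made for illustration only.

```python
import statistics


def normal_assumption_bounds(z_history, k=3.0):
    """Return (upper, lower) bounds at mean +/- k standard deviations.

    This treats the dissimilarity values as if they were normally
    distributed, which is the assumption questioned in the text.
    """
    mean = statistics.mean(z_history)
    std = statistics.pstdev(z_history)
    return mean + k * std, mean - k * std


# Hypothetical dissimilarity values of already arrived data.
z_history = [0.01, 0.02, 0.01, 0.03, 0.02]
z_th, lower_bound = normal_assumption_bounds(z_history)
```

In this example the lower bound is negative, although z = 1 − rᵀu can never be negative for unit-norm vectors. The symmetric normal model therefore assigns probability to values that cannot occur, which is one way to see why the assumption fits poorly.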
(2) Difficulty of Updating the Thresholding Condition
In a state where observational data arrive in succession online, it is desirable that the thresholding condition be updated appropriately. However, except in special cases where the dissimilarity z follows a normal distribution, it is difficult to obtain the distribution function of the dissimilarity z. Accordingly, it is difficult to respond to changes in the thresholding condition over time with conventional approaches. For example, in a text classification problem, capturing the drift of categories is a significant practical challenge. Responding to changes in the thresholding condition over time is one of the main subjects in machine learning (see for example Non-Patent Document 3), and a solution to the problem is desired.
(3) Difficulty of Dealing with Directional Data
The number of degrees of freedom of directional data is smaller than the dimension of the vector space by 1 due to the condition that its norm is constant. Therefore, directional data is seemingly easier to deal with than vectors without normalization. However, directional data is statistically more difficult to deal with. That is, if each dimension of a multi-dimensional vector is independent, its dispersion can be properly modeled by using a multi-dimensional normal distribution. The normal distribution is considerably easy to deal with mathematically. For example, it is well known that a multi-dimensional vector classification problem can be formulated mathematically as a maximum likelihood estimation problem for a mixture of normal distributions and can be readily solved with the so-called expectation maximization method (see for example Non-Patent Document 4). Therefore, it may seem possible to deal with directional data using a normal distribution by neglecting the normalization condition and assuming the degrees of freedom to be independent of each other. However, it is empirically known that this method does not provide an appropriate model.
In this way, because the directional data u is normalized, the natural distribution of the directional data u is not a multi-dimensional normal distribution. Letting the direction corresponding to the reference vector r be the mean direction, the distribution that provides the maximum entropy for directional data u distributed around that direction is the von Mises-Fisher distribution shown in Expression (2).
[Expression 2]
$$ f(u \mid r, \Sigma) = \frac{\Sigma^{1 - N/2}}{(2\pi)^{N/2} \, I_{N/2 - 1}(1/\Sigma)} \exp\left( r^{T} u / \Sigma \right) \qquad (2) $$
Here, N denotes the dimension of the reference vector and the directional data, Σ denotes a scalar parameter that defines the variance of the von Mises-Fisher distribution, and I_ν(c) denotes the modified Bessel function of the first kind of order ν.
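For illustration, the density can be evaluated numerically as sketched below, reading Expression (2) as f(u | r, Σ) = Σ^(1−N/2) / ((2π)^(N/2) I_(N/2−1)(1/Σ)) · exp(rᵀu/Σ). The function names `bessel_i` and `vmf_density` are hypothetical, and the Bessel function is computed from its power series with a fixed number of terms, an assumption that is adequate only for moderate values of 1/Σ.

```python
import math


def bessel_i(order, x, terms=60):
    """Modified Bessel function of the first kind, from its power series.

    Assumes x > 0 and a moderate x; large arguments may overflow.
    """
    half = x / 2.0
    return sum(
        half ** (2 * k + order) / (math.factorial(k) * math.gamma(k + order + 1))
        for k in range(terms)
    )


def vmf_density(u, r, sigma):
    """Evaluate the density of Expression (2) for unit-norm u and r."""
    n = len(u)
    kappa = 1.0 / sigma  # concentration: small sigma means a tight cluster
    normalizer = sigma ** (1.0 - n / 2.0) / (
        (2.0 * math.pi) ** (n / 2.0) * bessel_i(n / 2.0 - 1.0, kappa)
    )
    return normalizer * math.exp(kappa * sum(ri * ui for ri, ui in zip(r, u)))


# Hypothetical two-dimensional example with the reference along the first axis.
r = [1.0, 0.0]
density_at_reference = vmf_density([1.0, 0.0], r, sigma=0.5)
density_opposite = vmf_density([-1.0, 0.0], r, sigma=0.5)
```

The density is largest in the direction of the reference vector and smallest in the opposite direction. It is a density with respect to the surface measure on the unit sphere, so for N = 2 it integrates to 1 around the unit circle.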
Considering that the maximum entropy principle gives the multi-dimensional normal distribution if the norm constraint is removed, it is clear that the von Mises-Fisher distribution is the most natural (most expressive) distribution for directional data. Accordingly, the anomaly detection problem for the directional data u can be formulated in principle by using the von Mises-Fisher distribution or its mixture models.
However, because the von Mises-Fisher distribution is difficult to deal with mathematically (especially because it contains the modified Bessel function), it has not been thoroughly discussed in the context of anomaly detection in the past. It was not until recently that a formulation of the von Mises-Fisher distribution with the expectation maximization method was discussed in the context of clustering (see Non-Patent Document 5). Moreover, the maximum likelihood estimation of the von Mises-Fisher distribution involves complex mathematical operations, including approximation of the special function in Expression (3), and the solution to the maximum likelihood equation is given as the solution to a transcendental equation. It is therefore difficult to provide rules for updating online the parameters that determine the distribution.