The present invention relates to a data analysis technique and, in particular, to a system for performing change analysis.
Change detection is one of the problems studied in data analysis. Traditionally, a change detection problem is formulated as a statistical test problem on the probability distribution of a data set of interest.
Detecting a change in data has great significance from an engineering standpoint. One application example of such change detection is automobile fault diagnosis. More specifically, automobile fault diagnosis aims to identify which component to replace in an automobile having some trouble. To this end, sensor data on the automobile under normal conditions is stored in advance, and is then compared with data obtained from the automobile having some trouble. In this diagnosis, what one wants to know is not whether the obtained data differs as a whole from the reference data, but which parameter causes the change.
Another application example of change detection is a change analysis on a customer profile list. There is a demand, for example, for obtaining information to develop a marketing strategy by comparing the list of customer profiles of the January-March quarter of this year with the list of customer profiles of the January-March quarter of last year. Specifically, if a company is losing customers having a certain attribute, the company desires to know how to stop such customer attrition.
In this way, in the field of data analysis, data is analyzed to form a model of the structure underlying the data, and the model is used to obtain some kind of knowledge or to make decisions. For example, in analyzing customer profiles for marketing, the valuable information is not whether there is a change between the two lists but, assuming that there is a change, which attribute is involved in the change between the two data sets.
In other words, in practice, detailed information such as “how and what change has occurred” is often more necessary than bulk information such as “whether or not a change has occurred.” In the description of this application, finding out detailed information on how and what change has occurred is referred to as a change analysis.
To define this problem formally, consider a data set XA composed of NA vectors and a data set XB composed of NB vectors. These data sets are expressed as follows.
X_A = \{ x_A^{(1)}, x_A^{(2)}, \ldots, x_A^{(N_A)} \}
X_B = \{ x_B^{(1)}, x_B^{(2)}, \ldots, x_B^{(N_B)} \}   [Formula 1]
Here, it is assumed that all the elements in each data set are same-dimensional (d-dimensional) vectors. The change detection problem is a problem for figuring out a difference between XA and XB to determine the significance of the difference, whereas the change analysis problem is a problem for expressing, given XA and XB, a rule explaining a difference between XA and XB by use of data attributes. These two problems are solved without being given any prior information as to the presence/absence of a difference. Accordingly, in machine learning terms, these problems are categorized in a class of unsupervised learning.
The foregoing problems are usually treated through two-sample tests in statistics. There are several types of two-sample tests. Consider, for example, a two-sample test for a normal population. On the assumption that the data set XA follows a multivariate Gaussian distribution with a mean vector μA and a variance-covariance matrix ΣA, this two-sample test answers whether or not the data set XB follows the same Gaussian distribution. Except for the special case where the two variance-covariance matrices can be assumed to be the same (that is, the case where ΣA=ΣB), it is not easy to find out which attribute of the data produces a difference between the two data sets. If a data sample is a vector of ten or more dimensions, it is almost impossible to identify an attribute producing a difference on the basis of the element values of a covariance matrix. In other words, the two-sample test gives, as a result of hypothesis testing, an answer indicating whether or not there is a difference between two distributions, but does not give any particular indicator from the viewpoint of the change analysis problem. The same is true of other formulations that use any kind of distance between probability distributions (a likelihood ratio, a Kolmogorov-Smirnov statistic, a Bregman divergence and the like).
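As a rough illustration of why such a distance yields only bulk information, the following minimal sketch computes a two-sample Kolmogorov-Smirnov statistic for one-dimensional samples. The function name ks_statistic and the restriction to one dimension are assumptions made for illustration only; the sketch is not part of any cited technique.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic for 1-D samples.

    Returns the largest gap between the two empirical cumulative
    distribution functions. (Hypothetical helper for illustration.)
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    # The largest gap is always attained at one of the observed values.
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )
```

The result is a single number between 0 and 1. It indicates how far apart the two samples are as a whole, but it carries no indication of which attribute of a multidimensional sample is responsible for the difference.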
Japanese Patent Application Publication No. 2001-22776 teaches a method of comparing sets of correlation rules extracted at different time points with each other in order to detect a temporal change in a database. However, such a correlation rule is only a rule that simply counts co-occurrences of two items, and is not applicable to change analysis for the aforementioned automobile fault diagnosis or customer profile lists. In addition, this method has an inherent problem in that an important rule may be buried among trivial rules.
Japanese Patent Application Publication No. 2003-316797 discloses that a feature analysis is performed on a multidimensional data set by focusing on a data sample of a particular dimension or data item of the multidimensional data set in which there is a change. In particular, this technique provides a data extraction processing function and a designation table for storing a data sample of a dimension or a data item whose change is to be detected. The data extraction processing function determines whether or not the data of the dimension or data item designated in the designation table has changed from the data of the last extraction. If the data has changed, the data is stored in a multidimensional database for change analysis, which is separate from the multidimensional database for normal analysis. Then, the multidimensional database for change analysis is analyzed. However, although Japanese Patent Application Publication No. 2003-316797 uses the term “change analysis,” it does not describe any specific technique for performing a change analysis.
Fang Li, George C. Runger, Eugene Tuv, “Supervised learning for change-point detection,” International Journal of Production Research, Vol. 44, No. 14, 15 Jul. 2006, 2853-2868, and Victor Eruhimov, Vladimir Martyanov, Eugene Tuv, George C. Runger, “CHANGE-POINT DETECTION WITH SUPERVISED LEARNING AND FEATURE SELECTION,” ICINCO 2007, International Conference on Informatics in Control, Automation and Robotics, disclose a change analysis in which feature selection is performed through supervised learning. According to this technique, the change analysis problem is reduced to a supervised learning problem that receives the input variables of a time-evolving process and outputs a time point. By means of this supervised learning, a variable whose average value has changed is found from among multiple variables.
The technique of Li et al. and Eruhimov et al. cited above, however, is capable of detecting only a change in the average value of a certain variable from among multiple variables, that is, only a relatively simple change. This technique is therefore of limited use for complicated analysis targets.
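The general idea of finding a changed variable through supervised learning can be sketched as follows. This is a simplified stand-in, not the algorithm of the cited papers: it assumes that a candidate change point (change_index) is already given, labels each time step as before or after that point, and scores each variable by how well a single threshold on that variable separates the two labels. The names stump_accuracy and rank_changed_variables are hypothetical.

```python
def stump_accuracy(values, labels):
    """Best accuracy of a single-threshold rule on one variable (labels are 0/1)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(label for _, label in pairs)
    # Baseline: no split at all, always predict the majority label.
    best = max(total_pos, n - total_pos)
    pos_left = 0
    for i in range(n - 1):
        pos_left += pairs[i][1]
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # a threshold cannot separate equal values
        left = i + 1
        # Rule "left side -> 0, right side -> 1": count correct predictions.
        correct = (left - pos_left) + (total_pos - pos_left)
        # The flipped rule gets n - correct right.
        best = max(best, correct, n - correct)
    return best / n


def rank_changed_variables(rows, change_index):
    """Rank variables by how well each one alone predicts before/after.

    rows: list of equal-length numeric sequences, one per time step.
    change_index: assumed change point; rows[change_index:] are "after".
    """
    labels = [0 if t < change_index else 1 for t in range(len(rows))]
    d = len(rows[0])
    scores = [stump_accuracy([row[j] for row in rows], labels) for j in range(d)]
    order = sorted(range(d), key=lambda j: scores[j], reverse=True)
    return order, scores
```

A variable whose average shifts at the change point receives a score near 1, while an unchanged variable stays near 0.5 when the two periods are of equal length. Because the score rests on a single threshold, a sketch of this kind can only reveal a shift in the level of one variable, which mirrors the limitation noted above.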
Note that if labeled data is given and the problem can be treated from the outset as a supervised learning problem (a classification problem), it is not technically difficult to combine change detection with classification. For instance, Hironori Takeuchi, Venkata Subramaniam, Tetsuya Nasukawa, and Shourya Roy, “Automatic Identification of Important Segments and Expressions for Mining of Business-Oriented Conversations at Contact Centers,” Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 458-467, Prague, June 2007, discloses a method that uses a classifier to investigate a change point in a data set, targeting a database of call logs at a call center.
However, this method handles a data set labeled with the success or failure of a reservation agreement, and investigates how the labeled data changes on the basis of the χ² statistic. Accordingly, this method is not directly applicable to general data sets other than frequency data.
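To make clear why a χ² statistic presupposes frequency data, the following minimal sketch computes the Pearson χ² statistic for a contingency table of counts. The function name chi_square and the example counts are hypothetical and are not taken from the cited paper.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table of counts.

    table: list of rows, each a list of non-negative counts. Every row
    total and column total is assumed to be positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column labels were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows = expression present / absent,
# columns = calls labeled success / failure.
print(chi_square([[30, 10], [10, 30]]))
```

The input is a table of counts per label. Real-valued measurements such as sensor readings would first have to be binned into counts, which is why a method of this kind does not carry over directly to general data sets.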
In addition, Japanese Application No. 2008-49729 is an example of the related art.