In many data mining applications, the testing process is a critical part of assessing the quality of the results. In the testing process, one typically applies the data mining approach to a variety of real data sets. The results of these tests can quantify, and provide insight into, the behavior of the data mining process. To explore such quantifications, it is often necessary to test the data mining application under a variety of conditions. Synthetic data sets are often quite useful for this purpose, because they can be generated using a wide range of parameters. The ability to change the nature of the underlying data set through such parameters is useful in many scenarios in which the sensitivity of an algorithm needs to be tested. Representative publications showing conventional arrangements of possible interest are: C. C. Aggarwal, “A Framework for Diagnosing Changes in Evolving Data Streams”, ACM SIGMOD 2003; and T. Zhang et al., “Fast Density Estimation Using CF-Kernel for Very Large Databases”, ACM KDD Conference, 1999.
While synthetic data sets have the advantage of being tunable in a wide variety of ways, they are often not as realistic as the data sets obtained in real applications. On the other hand, real data sets have the disadvantage that it is difficult to vary the behavior of the data set without losing the effectiveness of the underlying data mining algorithm.
This leads to the question of whether it is possible to generate data sets which have characteristics similar to those in the real domain. Such a problem is related to that of traffic generation, in which one generates a new data set using the characteristics of the underlying real data set. Accordingly, a need has been recognized in connection with addressing this and related issues.
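By way of illustration only, the idea of generating a new data set from the characteristics of a real one, while retaining a tunable parameter, can be sketched as follows. This is a minimal sketch under stated assumptions and not a method described in the cited publications: the function names `fit_column_stats` and `generate_synthetic` and the `spread` parameter are hypothetical, the small `real` data set is invented for the example, and each dimension is assumed to be independent and roughly Gaussian, so correlations between dimensions are ignored.

```python
import random
import statistics


def fit_column_stats(rows):
    """Return a (mean, standard deviation) pair for each column of `rows`.

    `rows` is a list of equal-length lists of numbers (the "real" data set).
    """
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.pstdev(col)) for col in columns]


def generate_synthetic(rows, n, spread=1.0, seed=None):
    """Draw `n` synthetic rows that mimic the per-column statistics of `rows`.

    Assumption: each column is sampled independently from a Gaussian whose
    mean matches the real column and whose standard deviation is the real
    column's standard deviation multiplied by the tunable `spread` factor.
    """
    rng = random.Random(seed)
    stats = fit_column_stats(rows)
    return [
        [rng.gauss(mu, spread * sigma) for mu, sigma in stats]
        for _ in range(n)
    ]


# Hypothetical "real" data set with two dimensions.
real = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0], [4.0, 16.0]]

# spread=1.0 mimics the real data; other values vary its behavior.
synthetic = generate_synthetic(real, n=1000, spread=1.0, seed=0)
```

Varying `spread` (for example, 0.5 or 2.0) would yield data sets that remain centered on the real data while being more or less dispersed, which is one simple way to probe the sensitivity of an algorithm.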