In recent years, a number of important data mining methods have been developed for problems such as clustering, similarity search, and outlier detection. Evaluating methods for any of these problems requires test data sets against which the quality of the results can be measured. Most current techniques generate test data sets by drawing data from standard probabilistic distributions; see, e.g., tools such as Datatect (available from Banner Software Inc. of Sacramento, Calif.), and companies such as Spatial Solutions Inc. (Hauppauge, N.Y.) and Crescent Consultants Limited (Derby, England).
For example, many data mining methodologies for the clustering problem assume that all the clusters in the data are of Gaussian shape and that each data point is generated from one of these clusters. This assumption often does not hold for real data sets.
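The Gaussian assumption described above can be illustrated with a minimal sketch. It is not the method of any tool named in this section; the function name `generate_gaussian_clusters` and its parameters (`centers`, `std_dev`, `points_per_cluster`, `seed`) are hypothetical and chosen only for illustration. It assumes spherical clusters with a single shared standard deviation.

```python
import random


def generate_gaussian_clusters(centers, std_dev, points_per_cluster, seed=0):
    """Return (point, cluster_index) pairs drawn from spherical Gaussian clusters.

    Hypothetical helper: every cluster shares one standard deviation, and
    each coordinate is sampled independently around the cluster center.
    """
    rng = random.Random(seed)  # fixed seed so the test data is reproducible
    data = []
    for label, center in enumerate(centers):
        for _ in range(points_per_cluster):
            # Each point is generated from exactly one cluster.
            point = tuple(rng.gauss(c, std_dev) for c in center)
            data.append((point, label))
    return data


# Two well-separated two-dimensional clusters of 200 points each.
centers = [(0.0, 0.0), (10.0, 10.0)]
data = generate_gaussian_clusters(centers, std_dev=1.0, points_per_cluster=200)
```

Every cluster produced this way is a roughly round cloud around its center, which is precisely the kind of regular shape that real data sets often fail to exhibit.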
Techniques of this kind cannot effectively capture the vagaries of real data sets, which can contain clusters of arbitrary and irregular shape. Thus, a need exists for improved test data generation techniques which overcome these and other limitations.