The design of data mining applications has received much attention in recent years. Examples of such applications include similarity determination and classification. In the context of data mining, it is assumed that the data set contains N objects, each of dimensionality d. Thus, in this data space, each object X can be represented by the d coordinates (x(1), . . . , x(d)). These d coordinates are also referred to as the features of the data, and the space they define is referred to as the feature space, which may reveal interesting characteristics of the data.
The effective design of distance functions used in similarity determination has been viewed as an important task in many data mining applications. The concept of similarity has been widely discussed in the data mining literature, and a significant amount of research has been devoted to similarity techniques, see, e.g., A. Hinneburg et al., “What is the nearest neighbor in High Dimensional Space?,” VLDB Conference, 2000; C. C. Aggarwal, “Re-designing distance functions and distance based applications for high dimensional data,” ACM SIGMOD Record, March 2001; and C. C. Aggarwal et al., “Reversing the dimensionality curse for similarity indexing in high dimensional space,” ACM SIGKDD Conference, 2001, the disclosures of which are incorporated by reference herein.
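By way of illustration only, the role of a distance function in similarity determination may be sketched as follows. The sketch assumes a Minkowski (L_p) distance and a linear scan over the data set; neither choice is taken from the works cited above, and the names lp_distance and nearest_neighbor, as well as the sample data, are hypothetical.

```python
from typing import Sequence


def lp_distance(x: Sequence[float], y: Sequence[float], p: float = 2.0) -> float:
    """Minkowski (L_p) distance between two objects given as d coordinates.

    Assumption: p >= 1. p = 2 gives the Euclidean distance and p = 1 the
    Manhattan distance.
    """
    if len(x) != len(y):
        raise ValueError("both objects must have the same dimensionality d")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def nearest_neighbor(query: Sequence[float],
                     data: Sequence[Sequence[float]],
                     p: float = 2.0) -> int:
    """Return the index of the object in `data` most similar to `query`.

    Similarity is taken to mean smallest L_p distance; the search is a
    plain linear scan over all N objects.
    """
    if not data:
        raise ValueError("data set must contain at least one object")
    return min(range(len(data)), key=lambda i: lp_distance(query, data[i], p))


# Hypothetical data set with N = 3 objects in d = 2 dimensions.
data = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
query = (0.9, 1.2)
closest = nearest_neighbor(query, data)
```

In this sketch the choice of p changes which objects are considered similar, which is one reason the design of the distance function is treated as a task in its own right.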
A different but related problem in data mining is the prediction of particular class labels from the feature attributes. In this problem, there is a set of features, and a special variable called the class variable. The class variable typically draws its value out of a discrete set of classes C(1), . . . C(k). A test instance is defined to be a data example for which only the feature variables are known, but the class variable is unknown. Training data is used in order to construct a model which relates the features in the training data to the class variable. This model can then be used in order to predict the class behavior of individual test instances, also referred to as class labeling. The problem of classification has been widely studied in the literature, e.g., J. Gehrke et al., “BOAT: Optimistic Decision Tree Construction,” ACM SIGMOD Conference Proceedings, pp. 169–180, 1999; J. Gehrke et al., “RainForest: A Framework for Fast Decision Tree Construction of Large Data Sets,” VLDB Conference Proceedings, 1998; R. Rastogi et al., “PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning,” VLDB Conference, 1998; J. Shafer et al., “SPRINT: A Scalable Parallel Classifier for Data Mining,” VLDB Conference, 1996; and M. Mehta et al., “SLIQ: A Fast Scalable Classifier for Data Mining,” EDBT Conference, 1996, the disclosures of which are incorporated by reference herein.
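By way of illustration only, the relationship between training data, a test instance, and a predicted class label may be sketched as follows. The sketch assumes a simple k-nearest-neighbor vote rather than the decision tree classifiers cited above; the function names, the labels "C1" and "C2", and the sample data are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

# A training example pairs the feature variables with a known class label.
TrainingExample = Tuple[Sequence[float], str]


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def predict_label(test_instance: Sequence[float],
                  training_data: List[TrainingExample],
                  k: int = 3) -> str:
    """Predict the class label of a test instance.

    The label is the majority class among the k training examples whose
    features are closest to the test instance.
    """
    if not training_data:
        raise ValueError("training data must not be empty")
    ranked = sorted(training_data, key=lambda ex: euclidean(test_instance, ex[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data with two classes.
training_data = [
    ((1.0, 1.0), "C1"), ((1.2, 0.8), "C1"), ((0.9, 1.1), "C1"),
    ((5.0, 5.0), "C2"), ((5.2, 4.9), "C2"), ((4.8, 5.1), "C2"),
]

# Only the feature variables of the test instance are known.
predicted = predict_label((1.1, 0.9), training_data)
```

Here the stored training examples play the role of the model, and class labeling reduces to a vote among the nearest examples.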
However, as sophisticated and, in some cases, complex as these similarity and classification techniques may be, these conventional automated techniques lack benefits that may be derived from human interaction during their design and application stages. Therefore, techniques are needed that effectively employ human interaction in order to design and/or perform data mining applications such as similarity determination and classification.