“Knowledge Discovery” (“KD”) can be used for a technology-supported analysis of data to produce results useful to a user. In particular, a variety of algorithmic modeling techniques have been produced including machine learning, applied statistics, pattern recognition, and data mining.
Knowledge Discovery may involve the use of certain “learning programs”, among others, to discover useful knowledge about a particular domain of inquiry from sets of domain-specific data. For example, many commercial fields may contain data on legitimate and fraudulent transactions. For such data, it is possible for analysts to apply algorithmic modeling techniques (“data mining programs”) to the data, and extract the patterns that can be used to identify future fraudulent transactions, e.g., sooner than if the fraud were reported by a customer after receiving his/her bill.
A KD process is one of the central notions of the field of Knowledge Discovery and Data Mining. The process generally is considered to comprise several stages, including preprocessing of data, application of an induction algorithm, post-processing of the output, evaluation, etc. In the past, the application of the data mining (e.g., induction) algorithm has been observed to account for 20% or less of the KD effort.
For example, a typical KD process template is shown in FIG. 1, which includes four KD-process stages, e.g., a selection stage, a stage of preprocessing the data, a stage of applying induction algorithms, and a stage of post-processing an output. In particular, original data 10 is automatically selected by a selection module 15 (i.e., an operator for a stage), or by a user using an input device (e.g., a mouse, a keyboard, etc.), to generate target data 20. Thereafter, the target data 20 is forwarded to a pre-processing module 25 (i.e., another KD procedure/operator) to produce pre-processed data 30. The pre-processed data 30 is then provided to an induction module 35 (i.e., another KD procedure/operator) which produces certain models and/or patterns 40. These models/patterns 40 are forwarded to a post-processing/interpretation module 45 which generates resultant knowledge data 50. Taken together, the KD procedures/operators form a KD process. The template shown in FIG. 1 is only exemplary; indeed, other possible templates and/or stages could be implemented.
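By way of illustration only, the four-stage template of FIG. 1 may be sketched as a chain of operators, each consuming the output of the preceding one. The function names, the toy transaction records, and the trivial "model" below are hypothetical stand-ins chosen for this sketch; they are not taken from FIG. 1 and do not represent any particular selection, pre-processing, induction, or post-processing module.

```python
from functools import reduce

def run_kd_process(original_data, stages):
    """Feed the data through each stage operator in order."""
    return reduce(lambda data, stage: stage(data), stages, original_data)

def select(rows):
    # Selection stage (hypothetical): keep only records with a known amount.
    return [r for r in rows if r["amount"] is not None]

def preprocess(rows):
    # Pre-processing stage (hypothetical): derive a categorical attribute.
    return [{**r, "large": r["amount"] > 100} for r in rows]

def induce(rows):
    # Induction stage (toy stand-in): fraud rate per value of "large".
    counts = {}
    for r in rows:
        total, frauds = counts.get(r["large"], (0, 0))
        counts[r["large"]] = (total + 1, frauds + int(r["fraud"]))
    return {k: frauds / total for k, (total, frauds) in counts.items()}

def postprocess(model):
    # Post-processing stage (hypothetical): render the model as readable text.
    return [f"large={k}: fraud rate {v:.2f}" for k, v in sorted(model.items())]

# Invented example records, for illustration only.
original_data = [
    {"amount": 250.0, "fraud": True},
    {"amount": 20.0, "fraud": False},
    {"amount": None, "fraud": False},
    {"amount": 400.0, "fraud": True},
    {"amount": 35.0, "fraud": False},
]

knowledge = run_kd_process(original_data, [select, preprocess, induce, postprocess])
```

In this sketch each intermediate value plays the role of the target data, the pre-processed data, and the models/patterns, respectively, and other templates could be expressed simply by passing a different list of stages.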
A KD process space can be viewed as including states and operators. The states include data sets and data-mining results, along with various descriptive characteristics. The operators may include various preprocessing algorithms, data-mining algorithms, and post-processing algorithms. An instance of the data mining process may be a series of operators that begins with a data set and ends with a mining result.
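The state/operator view described above may be sketched, under simplifying assumptions, as follows. The class names `State` and `Operator`, the two state kinds, and the example operators are hypothetical and serve only to show that an instance of the process is a series of operators leading from a data set to a mining result.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Sequence

@dataclass(frozen=True)
class State:
    kind: str                        # "data_set" or "mining_result"
    characteristics: FrozenSet[str]  # descriptive characteristics of the state

@dataclass(frozen=True)
class Operator:
    name: str
    stage: str                       # "preprocessing", "data_mining", or "postprocessing"
    apply: Callable[[State], State]  # maps one state to the next

def run_process(start: State, operators: Sequence[Operator]) -> State:
    """Apply the operators in order and return the final state."""
    state = start
    for op in operators:
        state = op.apply(state)
    return state

def is_process_instance(start: State, operators: Sequence[Operator]) -> bool:
    """True if the series begins with a data set and ends with a mining result."""
    return start.kind == "data_set" and run_process(start, operators).kind == "mining_result"

# Hypothetical operators used only for this illustration.
subsample = Operator(
    "random_subsample", "preprocessing",
    lambda s: State("data_set", s.characteristics | {"subsampled"}),
)
tree_inducer = Operator(
    "decision_tree", "data_mining",
    lambda s: State("mining_result", s.characteristics | {"decision_tree_model"}),
)

numeric_data = State("data_set", frozenset({"numeric"}))
final_state = run_process(numeric_data, [subsample, tree_inducer])
```

Here `is_process_instance(numeric_data, [subsample, tree_inducer])` holds, whereas a series consisting of `subsample` alone does not end in a mining result and so is not a complete instance.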
FIG. 2 shows three exemplary KD processes which may possibly be utilized for particular numeric data. In a first process 1, first numeric data 60 is provided to a decision-tree inducer 65 which generates a first model 69. In a second process 2, second numeric data 70 is pre-processed using a discretization procedure of numeric attributes 75 to be used in building a naïve Bayesian classifier 77, and thereafter a second model 79 is generated. In a third process 3, third numeric data 80 is pre-processed by taking a random subsample thereof 85, applying a discretization procedure 87 to the numeric data, and building a naïve Bayesian classifier 77 to produce a third model 89.
Intelligent Discovery Assistants (“IDAs”) are computer systems and processes which assist data mining users in exploring the space of valid KD processes. The space of valid KD processes includes those processes which do not violate fundamental constraints of their constituent techniques. For example, if an input data set includes numeric attributes (as is the case with the first, second and third data 60, 70, 80), a naïve Bayes procedure should not be applied directly to such data, since doing so would result in an invalid KD process. This is because the naïve Bayes procedure can only be utilized for categorical attributes, and not for numeric attributes. However, the entire second process 2 of FIG. 2 is considered to be valid because the second numeric data 70 is preprocessed using a discretization procedure 75, thereby transforming the numeric attributes of such data to categorical attributes. The IDAs utilize an explicit ontology of the KD techniques which defines the existing techniques and their properties. With such an ontology, an IDA can perform a search of the space of valid processes, treating the techniques as operators that change the world state, with preconditions that constrain their applicability.
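A minimal sketch of such a precondition-constrained search is given below. The tiny ontology, the `Technique` fields (`requires`, `forbids`, `adds`, `removes`), and the depth-limited enumeration are assumptions made for illustration; they do not reproduce the ontology or search strategy of any actual IDA. The sketch also assumes that a decision-tree inducer accepts both numeric and categorical attributes.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterator, List, Tuple

@dataclass(frozen=True)
class Technique:
    name: str
    requires: FrozenSet[str] = frozenset()  # attribute types that must be present
    forbids: FrozenSet[str] = frozenset()   # attribute types that must be absent
    adds: FrozenSet[str] = frozenset()      # attribute types produced
    removes: FrozenSet[str] = frozenset()   # attribute types eliminated
    is_inducer: bool = False                # True if the technique yields a model

    def applicable(self, attrs: FrozenSet[str]) -> bool:
        return self.requires <= attrs and not (self.forbids & attrs)

    def apply(self, attrs: FrozenSet[str]) -> FrozenSet[str]:
        return (attrs - self.removes) | self.adds

# Hypothetical ontology loosely mirroring the techniques named for FIG. 2.
ONTOLOGY = [
    Technique("random_subsample"),
    Technique("discretize", requires=frozenset({"numeric"}),
              adds=frozenset({"categorical"}), removes=frozenset({"numeric"})),
    Technique("decision_tree", is_inducer=True),
    Technique("naive_bayes", forbids=frozenset({"numeric"}), is_inducer=True),
]

def valid_processes(attrs: FrozenSet[str],
                    prefix: Tuple[Technique, ...] = (),
                    max_len: int = 3) -> Iterator[List[str]]:
    """Enumerate operator sequences whose preconditions all hold and that end in an inducer."""
    for tech in ONTOLOGY:
        if tech in prefix or not tech.applicable(attrs):
            continue  # skip repeated or inapplicable techniques
        plan = prefix + (tech,)
        if tech.is_inducer:
            yield [t.name for t in plan]
        elif len(plan) < max_len:
            yield from valid_processes(tech.apply(attrs), plan, max_len)

plans = list(valid_processes(frozenset({"numeric"})))
```

Under these assumptions, the enumeration for numeric input data includes the three processes of FIG. 2 (decision tree alone; discretization followed by naïve Bayes; random subsampling, discretization, then naïve Bayes) and excludes naïve Bayes applied directly to the numeric data.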
One of the disadvantages of a number of prior art systems and methods is that they may not significantly assist the data mining user with a selection of an appropriate set of the KD processes. In one particular scenario, when presented with a data set to mine, the KD user may be faced with a confusing array of choices. For example:
should the C4.5 technique be used on the data (as opposed to a naïve Bayes procedure or a logistic regression algorithm);
should discretization be used, and if so, which method;
should the data be sub-sampled;
should the resultant class description be pruned; and
should costs of a mis-classification be taken into consideration, etc.?
For a novice user, these choices are overwhelming. Many novice users simply use the algorithm that they are familiar with, with little pre-processing or post-processing. Even KD expert users do not have knowledge of each and every technique applicable to each type of data.