Data mining is a common term for the process of finding useful hidden dependencies or patterns in large amounts of data. The process by which such dependencies or patterns are found is typically called an algorithm. Data mining typically follows a workflow with several important stages: data preparation, training (also called building), testing, and application. Data preparation involves putting the data into a format that can be used by an algorithm. Training involves constructing a concise representation of the algorithm's findings about the data, referred to as the mining model. Testing involves validating that model. Application involves using the model to efficiently produce new, previously unknown information, such as projecting the data to predict future events.
FIG. 1 is a diagram illustrating the typical organizational flow of data mining. Data 100 is first prepared 102. This may include cleaning up the formatting of the data so that it is in a form usable by the system. A user then chooses which data to mine 104. This data is fed to a build model method 106, which builds a model based on the data. A test model method 108 then tests the model and determines whether it can be applied to other data. An application method 110 may then apply the model to other data, after which results may be obtained 112.
Data that needs to be mined may originate from a variety of sources. Each data mining algorithm (which describes how to build, test, and apply the model, among other things) may have different requirements for the data format it accepts as input and produces as output. Each mining algorithm vendor may create algorithms that build, test, and apply a certain model, and vendors have struggled to map various data sources to their algorithms' input/output requirements. Thus far, it has been all but impossible to use the software implementation of an algorithm with a new data source.
What is needed is a solution that allows data mining algorithms from different vendors to be plugged in without any change to the algorithm software implementation, and that can also be used to perform all the standard mining tasks.
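To make the idea of "plugging in" concrete, the following Python sketch shows one generic way such a separation could look. It is an assumption-laden illustration only and is not the mechanism of any particular system: the `Row` format, the `MiningAlgorithm` interface, the `REGISTRY`, and the two source adapters are all hypothetical names introduced here. The point it illustrates is that when every data source is adapted to one common row format, an algorithm written against that format can be used with a new source without being modified.

```python
import csv
import io
from abc import ABC, abstractmethod
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Type

Row = Dict[str, str]  # assumed common format shared by sources and algorithms


class MiningAlgorithm(ABC):
    """Hypothetical plug-in interface covering build, test, and apply."""

    @abstractmethod
    def build(self, rows: Iterable[Row], target: str) -> None:
        """Train the mining model from rows with a known target column."""

    @abstractmethod
    def apply(self, rows: Iterable[Row]) -> List[str]:
        """Return one predicted target value per row."""

    def test(self, rows: Iterable[Row], target: str) -> float:
        """Return the fraction of rows predicted correctly."""
        rows = list(rows)
        predictions = self.apply(rows)
        hits = sum(p == row[target] for p, row in zip(predictions, rows))
        return hits / len(rows)


class MajorityClass(MiningAlgorithm):
    """Toy vendor algorithm: always predicts the most common target value."""

    def build(self, rows: Iterable[Row], target: str) -> None:
        self.label = Counter(row[target] for row in rows).most_common(1)[0][0]

    def apply(self, rows: Iterable[Row]) -> List[str]:
        return [self.label for _ in rows]


# Registry of plugged-in algorithms, keyed by a name chosen at registration.
REGISTRY: Dict[str, Type[MiningAlgorithm]] = {}


def register(name: str, algorithm: Type[MiningAlgorithm]) -> None:
    REGISTRY[name] = algorithm


def csv_source(text: str) -> List[Row]:
    """Adapter: CSV text to the common row format."""
    return list(csv.DictReader(io.StringIO(text)))


def tuple_source(header: Sequence[str], records: Iterable[Sequence[str]]) -> List[Row]:
    """Adapter: in-memory tuples to the common row format."""
    return [dict(zip(header, record)) for record in records]


register("majority", MajorityClass)
algorithm = REGISTRY["majority"]()

# Build from one kind of source, then test and apply on another,
# without touching the algorithm's implementation.
algorithm.build(csv_source("color,label\nred,a\nblue,a\ngreen,b\n"), target="label")
score = algorithm.test(
    tuple_source(("color", "label"), [("red", "a"), ("pink", "b")]), target="label"
)
labels = algorithm.apply(tuple_source(("color",), [("red",), ("pink",)]))
print(score, labels)
```

In this sketch, supporting a new data source means writing one more adapter function, and supporting a new vendor algorithm means registering one more subclass.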