As the amount of data expands, processing, comprehending, and interpreting that data becomes more difficult. Patterns and trends are lost in the massive quantities of data stored in databases and data warehouses. Thus, there is a need for a powerful and efficient analytical tool that can process and derive interesting knowledge from the enormous amounts of data available.
Historically, the primary method for analyzing data was to construct well-structured hypotheses and test them against the data. Today, data mining offers a newer way of analyzing data. Data mining is an automated process whereby previously unknown relationships among data are discovered. The two main steps of data mining are modeling and scoring. These two steps are typically performed by a data mining tool.
Generally, modeling is the process of deriving a model or function by analyzing sets of training data. The derived model may be represented in various forms, such as classification rules, decision trees, mathematical formulae, or neural networks. If a model were a rule, for example, the rule could be a set of symptoms that a doctor uses to diagnose a disease. This rule can be derived from a set of patients who had the disease. Once derived, the rule is applied to a larger group of people to assist in determining whether they have that disease. In other words, the model generated from the training data is then used to make predictions about future patients.
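The symptom-rule example can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names `derive_rule` and `apply_rule`, the record layout, and the choice to take the symptoms common to all patients who had the disease are hypothetical and are not drawn from any particular data mining tool.

```python
def derive_rule(training_patients):
    """Modeling step: return the symptoms shared by every patient who had the disease."""
    positive = [set(p["symptoms"]) for p in training_patients if p["has_disease"]]
    if not positive:
        return frozenset()  # no positive examples, so no rule can be derived
    return frozenset(set.intersection(*positive))


def apply_rule(rule, symptoms):
    """Predict the disease only if the patient shows every symptom in the rule."""
    return bool(rule) and rule.issubset(symptoms)


# Hypothetical training data: each record lists symptoms and a known outcome.
training = [
    {"symptoms": {"fever", "cough", "fatigue"}, "has_disease": True},
    {"symptoms": {"fever", "cough", "headache"}, "has_disease": True},
    {"symptoms": {"headache"}, "has_disease": False},
]

rule = derive_rule(training)

# Apply the derived rule to patients outside the training set.
likely_case = apply_rule(rule, {"fever", "cough", "rash"})
unlikely_case = apply_rule(rule, {"fever"})
```

A real modeling step would typically weigh negative examples and tolerate noisy data; the intersection used here is chosen only because it makes the derived rule easy to read.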
Scoring involves making predictions with the generated model. The score may or may not be augmented with a numerical value that represents the certainty of each prediction. Scores are predictions, and can be presented along with associated data, e.g., confidence measures or overall accuracy measures. Currently, there are several methods for scoring with a model. One method of scoring involves using an Open DataBase Connectivity (ODBC) or Structured Query Language (SQL) cursor. With this method, the scoring occurs where the model is stored, and the scoring results are transmitted from the model's location to the database. In another method, a model may consist of data used by a C, C++ or Java function. The function may be wrapped in an application, which runs against the data stored in the database. However, this option also involves massive data movement and hence is inefficient.
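The cursor-based approach can be illustrated with a short Python sketch. Several things here are assumptions made for the example: an in-memory SQLite database stands in for the ODBC-accessible database, the table names `patients` and `scores` are hypothetical, and the model is a hard-coded symptom rule whose confidence is simply the fraction of rule symptoms present. The point of the sketch is that every row leaves the database to be scored and every result is sent back.

```python
import sqlite3

# Hypothetical model: a set of symptoms assumed to have been derived from training data.
RULE = {"fever", "cough"}


def score(symptoms):
    """Return (prediction, confidence) for one patient.

    Confidence is the fraction of rule symptoms present, an illustrative
    measure rather than a calibrated probability.
    """
    matched = len(RULE & symptoms) / len(RULE)
    return matched == 1.0, matched


def score_table(conn):
    """Fetch every row from the database, score it outside, and write results back."""
    rows = conn.execute("SELECT id, symptoms FROM patients ORDER BY id").fetchall()
    results = []
    for patient_id, symptom_text in rows:
        prediction, confidence = score(set(symptom_text.split(",")))
        results.append((patient_id, int(prediction), confidence))
    conn.executemany(
        "INSERT INTO scores (patient_id, prediction, confidence) VALUES (?, ?, ?)",
        results,
    )
    conn.commit()
    return results


# Stand-in database with a few hypothetical patient records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, symptoms TEXT)")
conn.execute(
    "CREATE TABLE scores (patient_id INTEGER, prediction INTEGER, confidence REAL)"
)
conn.executemany(
    "INSERT INTO patients (id, symptoms) VALUES (?, ?)",
    [(1, "fever,cough,rash"), (2, "fever"), (3, "headache")],
)

results = score_table(conn)
```

With three rows the round trip is negligible, but the same pattern applied to a large table transfers the full data set out of the database and a full set of scores back in.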
In terms of performance, these options are inefficient because they involve a large amount of data movement from the database to the mining tool's location. Further, many models are unusable because the execution time required to process the data is too long. Thus, there is a need for a more efficient data mining system and method.