1. Technical Field
The present invention is directed to the fields of data warehousing, data mining, and data modeling. More specifically, the present invention provides a model repository for use in creating, storing, organizing, locating, and managing a plurality of data models.
2. Description of the Related Art
Modern business enterprises generate large amounts of data concerning the operation and performance of their businesses. This data is typically stored within a large data warehouse, or some other large database infrastructure. Business analysts then review this voluminous data in order to make business recommendations. The data may be analyzed manually, in order to develop an intuition about the data or to pick up patterns in the data, or it may be analyzed using statistical software to determine trends, clusters of data, etc.
More recently, with the explosion of Internet-related traffic, business enterprises are generating volumes of data that are one or more orders of magnitude larger than before. This increase in scale has made it almost impossible to develop an intuition about the data or to pick up patterns in the data by simply examining the data in its original form. Similarly, this increase in scale has made it difficult to manually execute separate statistical analyses on the data.
As a result of this data explosion, data mining software has been developed. A data mining software application can search through the large volumes of data stored in the data warehouse and can identify patterns in the data using a variety of pattern-finding algorithms. These patterns are then used by the business analyst in order to make business recommendations. An example of such a data mining tool is Enterprise Miner™, available from SAS Institute, Inc., of Cary, N.C.
Each run of the data mining software is based on a specification. Part of the specification indicates which input data to use from the data warehouse. Another part of the specification activates one or more of the pattern-finding algorithms that are built into the data mining software. Other parts of the specification specify how to partition the data, how to assess the results, etc.
When the data mining tool is executed according to a particular specification, it generates a resulting analysis that is termed a model. The model contains information regarding the specification used for the run, including the name and location of the data set in the data warehouse that was analyzed, and also contains the resulting analysis, including any patterns that may have been detected in the data set. The model may also contain information regarding how well the pattern represents the analyzed data.
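By way of illustration only, the relationship between a specification and the model produced from it might be sketched as the following pair of records. This is a minimal sketch and not the format used by any particular data mining tool: the class names, field names, and example values (Specification, Model, fit_score, the data set name and location, and so on) are all hypothetical and are chosen only to mirror the parts of a specification and of a model described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Specification:
    """Hypothetical record of the settings for one data mining run."""
    data_set_name: str           # which input data to use
    data_set_location: str       # where that data resides in the data warehouse
    algorithms: List[str]        # pattern-finding algorithms to activate
    partition: Dict[str, float]  # how to partition the data (share per subset)
    assessment: str              # how to assess the results


@dataclass
class Model:
    """Hypothetical record of the analysis produced by one run."""
    specification: Specification                        # the specification used for the run
    patterns: List[str] = field(default_factory=list)   # patterns detected, if any
    fit_score: Optional[float] = None                   # how well the pattern represents the data

    def found_patterns(self) -> bool:
        """Return True when the run detected at least one pattern."""
        return bool(self.patterns)


def describe(model: Model) -> str:
    """Summarize a model in one line for a human reader."""
    spec = model.specification
    fit = "n/a" if model.fit_score is None else f"{model.fit_score:.2f}"
    return (
        f"{spec.data_set_name} @ {spec.data_set_location}: "
        f"{len(model.patterns)} pattern(s), fit={fit}"
    )


# Illustrative values only; none of these come from a real system.
example_spec = Specification(
    data_set_name="web_orders",
    data_set_location="warehouse/sales",
    algorithms=["decision_tree"],
    partition={"train": 0.7, "validate": 0.3},
    assessment="misclassification_rate",
)
example_model = Model(
    specification=example_spec,
    patterns=["repeat buyers cluster by region"],
    fit_score=0.82,
)
empty_model = Model(specification=example_spec)
```

Because the model carries its specification with it, a reader of the record can tell which data set was analyzed and how, without consulting the person who ran the tool.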
Over time, an enterprise having such a data mining tool may generate a multitude of different models, based on different input data, different data sampling techniques, different data partitions, different data mining algorithms, different assessment methods, etc. Not all of these models are useful, however. For example, some of the models may be better than others at predicting a particular outcome. Some of the models may be out of date. And some may not provide any useful results at all, or may not be able to predict any patterns in the data.
Typically, each person who generates models (i.e., a model creator) manually keeps track of his or her own collection of models. Thus, models are scattered around the enterprise—wherever anyone who is generating models happens to reside. There is no straightforward way for people who want to use models to know which ones (other than their own) are available and to find the one(s) appropriate for a given purpose. Tracking down or duplicating the generation of appropriate model(s) requires extensive human resources and time.
Thus, a problem with these types of data mining systems is the inability to effectively manage the multitude of data models that are generated, and the corresponding inability to distinguish useful models from those that have limited utility. This problem is amplified when the models are generated and used by a large number of users. If the models are generated and used by only one person, or a handful of persons, then those persons generally have a good idea of which models are available and may also keep a cheat-sheet noting which model is associated with which specifications, patterns, or other results. When a large number of persons are generating models, however, each model creator has less of a sense of which models are available and which ones are useful. In this latter situation, a particular model creator could duplicate much of the work already done by another model creator, because he or she did not know that the model, or something very close to it, had already been developed. In addition, people who use the models, but do not necessarily create them, typically have no idea which models are useful, up to date, or otherwise applicable to a particular analysis, or which would yield a particular result.
Thus, there remains a general need in this field for a system and method for creating, storing, organizing, locating and managing a plurality of models generated by a data mining application or other application.