1. Technical Field
The present invention relates to computerized methods, data processing systems, and computer program products for storing data mining models.
2. Discussion of the Related Art
Data mining refers in general to data-driven approaches for extracting hidden information from input data. The extracted information depends on the type of data mining performed and is collected in data mining models. This model information can be further analyzed or verified against further process information.
Data mining techniques typically need to consider how to effectively process large amounts of data. Consider manufacturing of products as an example. There, the input data may include various pieces of data relating to the origin and features of components. The aim of data mining in the context of manufacturing may be to resolve problems relating to quality analysis and quality assurance. Data mining may be used, for example, for root cause analysis, for early warning systems within the manufacturing plant, and for reducing warranty claims. As a second example, consider various information technology systems. There, data mining may be used for intrusion detection, system monitoring, and problem analysis. Data mining also has various other uses, for example, in retail and services, where typical customer behavior can be analyzed, and in medicine and life sciences for finding causal relations in clinical studies.
Three discovery methods of data mining are clustering, association rules, and sequences. Clustering data mining seeks to find distinct clusters or groups of data records having similar attributes. The records of one cluster should be homogeneous, and the records of two different clusters should be as heterogeneous as possible. Association rules are patterns describing which items occur frequently within transactions. Sequences data mining finds typical time-ordered sequences of items in given input data.
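As an illustration of the frequency counting that underlies association rules, the following minimal Python sketch computes the support of itemsets, that is, the fraction of transactions in which all items of the set occur together. The function name `frequent_itemsets`, the `min_support` threshold, and the sample transactions are hypothetical and chosen only for this example; practical association rule miners use far more efficient algorithms than this exhaustive count.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, size, min_support):
    """Return itemsets of the given size with support >= min_support.

    Support is the fraction of transactions containing every item of the set.
    """
    counts = Counter()
    for transaction in transactions:
        # Deduplicate and sort so that each itemset has one canonical form.
        items = sorted(set(transaction))
        for itemset in combinations(items, size):
            counts[itemset] += 1
    total = len(transactions)
    return {
        itemset: count / total
        for itemset, count in counts.items()
        if count / total >= min_support
    }


# Hypothetical input data: each inner list is one transaction.
transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "jam"],
]
pairs = frequent_itemsets(transactions, size=2, min_support=0.5)
```

Here `pairs` maps each frequent pair of items to its support. Changing `size` or `min_support` yields a different result from the same input data, which is one reason why many distinct models can arise from a single dataset.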
Traditional data mining scenarios assume that a single user designs and executes a data mining task once or a few times. Iterative data mining processes such as the Cross Industry Standard Process for Data Mining (CRISP-DM) have been designed for such scenarios. These processes are usually performed offline as a background task. However, many current and future data mining scenarios show rather different characteristics. First, data mining is not a planned task, but is rather invoked ad hoc in the course of an interactive analytical process. Second, data mining is invoked by many different users in parallel on the same datasets with partially overlapping tasks.
Interactive data mining puts much higher demands on response times than offline data mining. This is a serious problem, especially where the datasets being analyzed are rather large. If users invoke data mining independently of one another, many resources may be wasted on building the same or similar models repeatedly.
One way to address this problem is to use a general-purpose caching algorithm for data mining models. When a data mining model is built, it is not only returned to the user who issued the query but also stored in a cache. Such a cache allows several users to share the same model: if a model was already built for another user, it does not need to be rebuilt later on. However, users can specify a broad variety of parameters, and storage space is too limited to hold all built data mining models.
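The general-purpose caching approach described above can be sketched as follows. This is a minimal illustration and not a method taken from the related art: the class `ModelCache`, the method `get_or_build`, the `build` callback, and the least-recently-used eviction policy are all assumptions made for this example, and parameter values are assumed to be hashable.

```python
from collections import OrderedDict


class ModelCache:
    """Shared cache of built models with least-recently-used eviction (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity        # maximum number of models kept
        self._entries = OrderedDict()   # key -> model, oldest entry first
        self.builds = 0                 # number of times a model was actually built

    @staticmethod
    def _key(dataset_id, params):
        # Sort the parameters so that equal settings always give the same key.
        return (dataset_id, tuple(sorted(params.items())))

    def get_or_build(self, dataset_id, params, build):
        key = self._key(dataset_id, params)
        if key in self._entries:
            # Cache hit: mark the model as recently used and reuse it.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Cache miss: build the model and store it for later requests.
        model = build(dataset_id, params)
        self.builds += 1
        self._entries[key] = model
        if len(self._entries) > self.capacity:
            # Storage is limited, so drop the least recently used model.
            self._entries.popitem(last=False)
        return model


def build_model(dataset_id, params):
    # Hypothetical stand-in for an expensive model-building step.
    return f"model({dataset_id}, {sorted(params.items())})"


cache = ModelCache(capacity=2)
cache.get_or_build("sales", {"min_support": 0.1}, build_model)
cache.get_or_build("sales", {"min_support": 0.1}, build_model)  # served from the cache
cache.get_or_build("sales", {"min_support": 0.2}, build_model)
cache.get_or_build("sales", {"min_support": 0.3}, build_model)  # evicts the 0.1 model
```

The sketch also shows the limitation noted above: the cache only helps when a request repeats exactly the same parameters, and with limited capacity a slightly different request can evict a model that another user is about to need.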
In the field of databases, semantic caches were developed, which are aware of how queries are related to each other and exploit this information to allow for a more intelligent caching strategy. These database query caches are not suitable to support caching of data mining models.
In the field of association rule data mining, there are known techniques of caching. For example, a chunk-based cache can be provided that stores the results of association rule mining queries along with semantic information. However, this method is limited to itemsets.
Other methods are known for mapping new data mining queries onto existing materialized, that is, cached, data mining views of previous data mining queries. These approaches are also limited to association rule mining.
The existing approaches do not address the questions of which models should be cached and how a further reduction of response times can be effectively realized.