A recent trend in database applications is the installation of increasingly large data warehouses, built around relational database technology, in a growing number of enterprises. There is great demand for mining nuggets of knowledge from these data warehouses. Initial research on data mining concentrated on defining new mining operations and developing algorithms for them. Most early mining systems were built largely on file systems, with specialized data structures and buffer management strategies devised for each mining algorithm. Coupling between data mining and database systems was at best loose, and access to data in a database management system (DBMS) was provided through an Open Database Connectivity (ODBC) interface or a Structured Query Language (SQL) cursor interface. Such an interface is described, for example, in the IBM Intelligent Miner User's Guide, Version 1 Release 1, published by International Business Machines Corp., July 1996. Other interfaces are described by R. Agrawal et al. in the paper "The Quest Data Mining System," Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oreg., August 1996, and by J. Han et al. in "DMQL: A Data Mining Query Language For Relational Databases," Proc. of the 1996 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Montreal, Canada, May 1996.
Recently, researchers have started to focus on issues related to integrating mining with databases, including proposals to extend SQL to support mining operators. For instance, the query language DMQL proposed by Han et al. extends SQL with a collection of operators for mining characteristic rules, discriminant rules, classification rules, association rules, etc. In the paper "Discovery Board Application Programming Interface and Query Language for Database Mining," Proc. of the 2nd Int'l Conference on Knowledge Discovery and Data Mining, Oregon, August 1996, Imielinski et al. describe M-SQL, an extension of SQL with a special unified operator to generate and query a whole set of propositional rules. Another example is the mine rule operator proposed by Meo et al. for a generalized version of the association rule discovery problem, described in "A New SQL Like Operator For Mining Association Rules," Proc. of the 22nd Int'l Conference on Very Large Databases, India, September 1996. Tsur et al. also proposed "query flocks" for data mining using a generate-and-test model, as described in "Query Flocks: A Generalization of Association Rule Mining," available on the World Wide Web at http://db.stanford.edu/ullman/pub/flocks.ps, October 1997.
The issue of tightly coupling a mining algorithm with a relational database system from the systems point of view was addressed by R. Agrawal et al. in a paper entitled "Developing Tightly-coupled Data Mining Applications On A Relational Database System," Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Oregon, August 1996. This proposal makes use of user-defined functions (UDFs) in SQL statements to selectively push parts of the application that perform computations on data records into the database system. The objective here was to avoid one-at-a-time record retrieval from the database to the application address space, saving both the copying and process context switching costs. In the paper entitled "KESO: Minimizing Database Interaction," Proc. of the 3rd Int'l Conference on Knowledge Discovery and Data Mining, August 1997, A. Siebes et al. focus on developing a mining system with minimal database interaction. Another algorithm for finding association rules, SETM, was expressed in the form of SQL queries and described by M. Houtsma et al. in "Set-oriented Mining of Association Rules," Proc. of the Int'l Conference on Data Engineering, Taiwan, 1995. However, SETM is not efficient and there are no results reported on running it against a relational DBMS.
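The difference between cursor-style access and the UDF approach described above can be illustrated with a minimal sketch. It is not taken from any of the cited papers. It assumes Python's built-in sqlite3 module, a hypothetical table named sales, and a hypothetical user-defined function named tally. Because SQLite runs inside the application process, the sketch shows only the shape of the two access patterns, not the copying or context-switching savings that a client-server DBMS would see.

```python
import sqlite3
from collections import Counter

# Hypothetical toy data: (transaction id, item) pairs.
ROWS = [(1, "bread"), (1, "milk"), (2, "bread"),
        (2, "beer"), (3, "milk"), (3, "bread")]


def make_db():
    """Create an in-memory database holding the toy 'sales' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (tid INTEGER, item TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", ROWS)
    return conn


def count_loosely_coupled(conn):
    """Loose coupling: every record is fetched into the application loop."""
    counts = Counter()
    for (item,) in conn.execute("SELECT item FROM sales"):
        counts[item] += 1  # one record at a time crosses the interface
    return counts


def count_with_udf(conn):
    """UDF-style coupling: the engine invokes the function once per record."""
    counts = Counter()

    def tally(item):
        # Per-record computation pushed into the SQL statement.
        counts[item] += 1
        return 0

    conn.create_function("tally", 1, tally)
    # Wrapping the call in an aggregate is an illustrative choice: the engine
    # evaluates tally() for every row but returns only a single result row.
    conn.execute("SELECT sum(tally(item)) FROM sales").fetchone()
    return counts
```

Both functions produce the same item counts. In the second, the statement returns one row to the caller regardless of table size, which is the property the UDF approach relies on.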
Therefore, there is still a need for a method for efficiently mining data from an integrated database and data-mining system that has a shorter response time, requires less memory to operate, and does not suffer from the disadvantages discussed above.