1. Field of the Invention
The present invention relates generally to online searching for data dependencies in large databases and, more particularly, to an online method of mining data items in a large database.
2. Discussion of the Prior Art
Data mining, also known as knowledge discovery in databases, has been recognized as a new area for database research. The volume of data stored in electronic format has increased dramatically over the past two decades. The increased use of electronic data gathering devices, such as point-of-sale or remote sensing devices, has contributed to this explosion of available data. Data storage is becoming easier and more attractive to the business community as large amounts of computing power and data storage resources become available at increasingly reduced costs.
With much attention focused on the accumulation of data, there arose a complementary need to focus on how this valuable resource could be utilized. Businesses soon recognized that valuable insights could be gleaned by decision-makers who could make use of the stored data. By using data from bar code companies, or sales data from catalog companies, it is possible to gain valuable information about customer buying behavior. The derived information might be used, for example, by retailers in deciding which items to shelve in a supermarket, or in designing a well-targeted marketing program, among others. Numerous meaningful insights can be unearthed from the data using proper analysis techniques. In the most general sense, data mining is concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. The objective of data mining is to identify discernible patterns and trends in data and to infer association rules from these patterns.
Data mining technologies are characterized by intensive computations on large volumes of data. A large database may be defined as one consisting of a million records or more. In a typical application, end users will test association rules such as "75% of customers who buy Coke also buy corn chips", where 75% refers to the rule's confidence factor, that is, the percentage of transactions containing Coke that also contain corn chips. The support of the rule is the percentage of all transactions that contain both Coke and corn chips.
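By way of illustration only, the support and confidence of a single rule may be computed in one pass over the transactions, as in the following minimal sketch. The function name `support_and_confidence` and the five sample transactions are hypothetical and are not taken from any prior art system; the sample data are chosen so that the rule "Coke implies corn chips" has a confidence of 75%.

```python
from typing import Iterable, Set, Tuple


def support_and_confidence(
    transactions: Iterable[Set[str]],
    antecedent: Set[str],
    consequent: Set[str],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent => consequent.

    Hypothetical helper for illustration; it makes one pass over the data.
    """
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)

    n_total = 0        # all transactions seen
    n_antecedent = 0   # transactions containing the antecedent
    n_both = 0         # transactions containing antecedent and consequent

    for items in transactions:
        n_total += 1
        if antecedent <= items:          # subset test
            n_antecedent += 1
            if both <= items:
                n_both += 1

    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Assumed sample data: 4 of 5 transactions contain Coke, 3 of those
# also contain corn chips.
transactions = [
    {"Coke", "corn chips"},
    {"Coke", "corn chips", "salsa"},
    {"Coke", "corn chips", "bread"},
    {"Coke", "milk"},
    {"bread", "milk"},
]

support, confidence = support_and_confidence(
    transactions, {"Coke"}, {"corn chips"}
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

In this sketch the support is 3/5 and the confidence is 3/4. Each evaluation scans every transaction, so the cost of testing a rule in this manner grows with the size of the database.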
To date, the prior art has not addressed the issue of online mining but has instead focused on an itemset approach. A significant drawback of the itemset approach is that, as the user tests the database for association rules at differing values of support and confidence, multiple passes have to be made over the database, which may be on the order of gigabytes in size. For very large databases, this may involve a considerable amount of I/O and, in some situations, may lead to unacceptable response times for online queries.
A user must make multiple queries on a database because it is difficult to guess a priori how many rules might satisfy a given level of support and confidence. Typically, a user may be interested in only a few rules. This makes the problem all the more difficult, since the user may need to run the query multiple times in order to find levels of minimum support and minimum confidence that yield a manageable number of rules. In other words, the problem of mining association rules may require considerable manual parameter tuning by repeated queries before useful business information can be gleaned from the transaction database. The mining methods described heretofore are therefore unsuitable for repeated online queries, because the extensive disk I/O or computation leads to unacceptable response times.
Extending the capabilities of data mining to the Internet requires dynamic online methods rather than the batch-oriented method of the itemset approach. It is therefore a primary object of the invention to provide a computationally efficient method for making online queries on a database to evaluate the strength of association rules utilizing user-supplied levels of support and confidence as predictors.