In recent years, the progress of information automation has increased the computer databases of modern businesses to the point where a blizzard of numbers, facts and statistics is collected and stored, yet little information of any significance is extracted from the massive amounts of data. The problem is that conventional computer databases are powerful in the manner in which they house data, but unimaginative in the manner of searching through the data to extract useful information. Simply stated, the use of computers in business and scientific applications has generated data at a rate that has far outstripped the ability to process and analyze it effectively.
To address this problem, a practice known as data "mining" is being developed to identify and extract important information through patterns or relationships contained in available databases. Humans naturally and quickly "mine" important concepts when interpreting data. A person can scan a magazine's table of contents and easily select the articles related to the subject of interest. A person's ability to extract and identify relationships goes beyond the simple recognizing and naming of objects; it includes the ability to make judgments based on overall context or subtle correlations among diverse elements of the available data. Computers, on the other hand, cannot efficiently and accurately undertake the intuitive and judgmental interpretation of data. Computers can, however, undertake the quantitative aspects of data mining because they can quickly and accurately perform certain tasks that demand too much time or concentration from humans. Computers, using data mining programs and techniques, are ideally suited to the time-consuming and tedious task of breaking down vast amounts of data to expose categories and relationships within the data. These relationships can then be intuitively analyzed by human experts.
Data mining techniques are being used to sift through immense collections of data such as marketing, customer sales, production, financial and experimental data to "see" meaningful patterns or regularities and identify what is worth noting and what is not. For example, the use of bar-code scanners at supermarket checkouts typically results in millions of electronic records which, when mined, can show purchasing relationships among the various items shoppers buy. Analysis of large amounts of supermarket basket data (the items purchased by an individual shopper) can show how often items are purchased together, such as, for example, fruit juice, children's cereals and cookies. The results can be useful for decisions concerning inventory levels, product promotions, pricing, store layout or other factors which might be adjusted to changing business conditions. Similarly, credit card companies, telephone companies and insurers can mine their enormous collections of data for subtle patterns within thousands of customer transactions to identify risky customers or even fraudulent transactions as they are occurring. Data mining can also be used to analyze the voluminous alarm data generated by telecommunications and networking systems.
The size of the data set is essential in data mining: the larger the database, the more reliable the relationships which are uncovered. Large databases, unfortunately, have more records to sift through and require many passes through the records, or multiple passes through each record, to uncover any regularities. The number of items for which the relationships are sought is also important to the efficiency of data mining operations: the larger the number of items, the more passes through the records are required to extract reliable information.
Consider data mining of supermarket basket data as an example. A supermarket contains a set of items (its products), of which each shopper purchases a subset. In analyzing the volumes of subsets, it is desirable to find the "significant" purchases or large itemsets (sets of items, such as all transactions that included the purchase of fruit juice). A large itemset contains items (fruit juice, cookies) that appear in at least a preset number of subsets (customer purchases). In data mining parlance, this number is called the support threshold.
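The notions of itemset and support threshold can be made concrete with a short sketch. The basket data and the `support_count` helper below are hypothetical, chosen only to illustrate the definitions given above:

```python
# Hypothetical basket data: each transaction is the set of items one shopper bought.
transactions = [
    {"fruit juice", "cereal", "cookies"},
    {"fruit juice", "cookies"},
    {"fruit juice", "milk"},
    {"cereal", "cookies"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# With a support threshold of 2, {fruit juice, cookies} is a large itemset:
pair = frozenset({"fruit juice", "cookies"})
print(support_count(pair, transactions))  # → 2
```

Here an itemset is "large" when its support count meets or exceeds the preset support threshold.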
The best known method for finding large itemsets is the Apriori method described in the publication, Fast Algorithms for Mining Association Rules, by R. Agrawal and R. Srikant--Proceedings of the 20th VLDB Conference; Santiago, Chile, 1994. To discover large itemsets, the Apriori method makes multiple passes over the transaction records (see FIG. 1). In the first pass, the Apriori method counts the support of individual items and determines which of them are large, i.e., have minimum support. In each subsequent pass, the method starts with a seed set of itemsets found to be large in the previous pass. This seed set is used for generating new potentially large itemsets, called candidate itemsets, and the actual support for these candidate itemsets is counted during the pass over the data. At the end of the pass, the candidate itemsets which are actually large are determined, and they become the seed for the next pass. This process continues, pass after pass over the data, until no new large itemsets are found.
The Apriori method generates the candidate itemsets to be counted in a pass by using only the itemsets found to be large at the end of the previous pass--without considering the transactions during its pass over the database. The basic notion is that any subset of a large itemset must itself be large. Therefore, the candidate itemsets can be generated by joining large itemsets having fewer items, and deleting those that contain any subset that is not large. This procedure produces the complete set of candidate itemsets to be counted during the pass over the database.
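The level-wise search described above, including the join-and-prune candidate generation, can be sketched as follows. This is an illustrative Python rendering of the published Apriori scheme under the definitions used here, not a definitive implementation; the transaction data in the usage note is hypothetical:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Sketch of the Apriori level-wise search (Agrawal & Srikant, 1994)."""
    # Pass 1: count individual items and keep those with minimum support.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s for s, c in counts.items() if c >= min_support}
    all_large = set(large)
    k = 2
    while large:
        # Candidate generation: join large (k-1)-itemsets, then prune any
        # candidate containing a (k-1)-subset that is not large.
        candidates = set()
        for a in large:
            for b in large:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in large for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # One pass over the data counts the actual support of the candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        large = {c for c, n in counts.items() if n >= min_support}
        all_large |= large
        k += 1
    return all_large
```

Note that each iteration of the `while` loop corresponds to one full pass over the transaction data, which is precisely the cost the methods discussed below try to reduce.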
As can be seen, the shortcoming of the Apriori method is that as the size of the database increases, or the number of items searched increases, so does the number of passes over the database records, from top to bottom. In very large databases, the database activity overhead due to the repeated passes can increase the execution time and therefore reduce the efficiency of this method to an undesirable level.
A more recent data mining technique that attempts to avoid the limitations of the Apriori method is that disclosed by H. Toivonen in the paper, Sampling Large Databases for Association Rules, H. Toivonen, Proceedings of the 22nd VLDB Conference, Bombay, India, 1996. Toivonen presents database mining methods which attempt to make only one full pass over the database. The Toivonen method randomly draws a sample of records from the database, uses it to determine the relationships or patterns on the assumption that they probably hold for the entire database, and then verifies the results with the rest of the database.
The Toivonen method partitions the database into sections small enough to be handled in main memory, thereby reducing I/O activity to the database (which normally resides on disk). Then, an evaluation for large itemsets for a random part is performed in main memory without further database activity. The method uses the random sample and makes a series of passes over the data to determine which items are frequently found. Each pass builds on the previous collection of frequently found items until the method efficiently finds a superset of the collection of frequently found sets. In order to increase accuracy, the superset is determined by applying this "level-wise" method on the sample in main memory, and by using a lowered frequency threshold. The method then uses the rest of the database to compute the exact frequencies of the sets found. This approach attempts only one full pass over the database, and two passes in the worst case.
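The sampling idea can be sketched as follows. This is a simplified, assumption-laden Python illustration: the published Toivonen method additionally checks the "negative border" of the sample result to detect missed itemsets (which may force the worst-case second pass), a refinement omitted here, and the function and parameter names are hypothetical:

```python
import random

def frequent_itemsets(transactions, min_count):
    """Simple level-wise frequent-itemset search, run entirely in memory."""
    items = {frozenset([i]) for t in transactions for i in t}
    large = {s for s in items
             if sum(1 for t in transactions if s <= t) >= min_count}
    result, k = set(large), 2
    while large:
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        large = {c for c in candidates
                 if sum(1 for t in transactions if c <= t) >= min_count}
        result |= large
        k += 1
    return result

def sampled_large_itemsets(transactions, min_support, sample_size,
                           lowering=0.9, seed=0):
    """Sketch of the sampling idea in Toivonen (1996): mine a random sample
    in main memory with a lowered support threshold, then make one full
    pass over the database to verify the candidates exactly."""
    rng = random.Random(seed)
    sample = rng.sample(transactions, sample_size)
    # Lowering the threshold on the sample reduces the chance that a
    # genuinely large itemset is missed by the sample.
    lowered = max(1, int(lowering * min_support * sample_size
                         / len(transactions)))
    candidates = frequent_itemsets(sample, lowered)
    # Single full pass: compute exact supports of the candidate itemsets.
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= min_support}
```

The in-memory mining touches only the sample, so the only full-database work is the final verification pass.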
Basically, this method uses a random sample of the relation to find approximate regularities, and applies those results to the entire database. The significant shortcoming of the Toivonen method, however, is that it trades off accuracy against the efficiency achieved by reducing database activity.
Because of the shortcomings of the current data mining techniques, what is needed is a method and apparatus for accurately finding large itemsets while increasing the efficiency of the database activity through a reduction in the number of passes over the database.
Accordingly, it is an object of the present invention to provide a data mining method and apparatus which reduces the number of passes made over the database while keeping the number of different itemsets which are counted in any pass low, as compared with the prior art methods.
It is still another object of the present invention to accomplish the above-stated objects by utilizing a data mining method and apparatus which is simple in design and use, and efficient to perform with regard to database activity.
The foregoing objects and advantages of the invention are illustrative of those advantages which can be achieved by the present invention and are not intended to be exhaustive or limiting of the possible advantages which can be realized. Thus, these and other objects and advantages of the invention will be apparent from the description herein or can be learned from practicing the invention, both as embodied herein and as modified in view of any variation which may be apparent to those skilled in the art. Accordingly, the present invention resides in the novel methods, arrangements, combinations and improvements herein shown and described.