1. Field of the Invention
This invention relates generally to data processing, and more particularly to "computer database mining" in which association rules which characterize a relationship between significant transactions, that are recorded in a database, are identified. In particular, the invention concerns the identification (i.e., mining) of itemsets or rules in a large database of transactions that are selected not because they are frequent, but because they show surprising patterns along time.
2. Description of the Related Art
Customer purchasing habits can provide invaluable marketing information for a wide variety of applications. For example, retailers can create more effective store displays and more effectively control inventory that otherwise would be possible if they know that, given a consumer's purchase of a first set of items (a first itemset), the same consumer can be expected, with some degree of likelihood of occurrence, to purchase a particular second set of items (a second itemset) along with the first set of items. In other words, it is helpful from a marketing standpoint to know the association between the first itemset and the second itemset (the association rule) in a transaction. For example, it would be desirable for a retailer of automotive parts and supplies to be aware of an association rule expressing the fact that 90% of the consumers who purchase automobile batteries and battery cables (the first itemset) also purchase post brushes and battery post cleansers (referred to as the "consequent" in the terminology of the invention).
Advertisers too may benefit from a thorough knowledge of such consumer purchasing tendencies since they may change their advertising based upon the information mined from the database. In addition, catalog companies may be able to conduct more effective mass mailings if they know the tendencies of consumers to purchase particular sets of items with other set of items. It is understood, however, that although this discussion focuses on the marketing applications of the invention, database mining and, hence, the principles of the invention, are useful in many other areas, such as business or science, for example.
Until recently, building large detailed databases that could chronicle thousands or even millions of consumer transactions was impractical. In addition, the derivation of useful information from these large databases (i.e., mining the databases) was highly impractical due to the large amounts of data in the database which required enormous amount of computer processing time to analyze. Consequently, in the past, marketing and advertising strategies have been based upon anecdotal evidence of purchasing habits, if any at all, and thus have been susceptible to inefficiencies in consumer targeting that have been difficult if not impossible to overcome.
Modem technology, such as larger, faster storage systems and faster microprocessors, have permitted the building of large databases of consumer transactions. In addition, the bar-code reader may almost instantaneously read so called basket data (i.e., when a particular item from a particular lot was purchased by a consumer, how many items the consumer purchased, and so on) so that the basket data may be stored. In addition, when the purchase is made with, for example, a credit card, the identity of the purchaser is also known and may be recorded along with the basket data.
However, building a transactions database is only part of the challenge. Another important part of the marketing is mining the database for useful information, such as the association rules. The database mining, however, becomes problematic as the size of the database expands into the gigabyte or terabyte size.
Not surprisingly, many methods have been developed for mining these large databases. The problem of mining association rules from large databases was first introduced in 1993 at the ACM SIGMOD Conference of Management of Data in a paper entitled, "Mining Association Rules Between Sets of Items in a Large Database" by Rakesh Agrawal, Tomasz Imielinski and Arun Swami. In general, the input, from which association rules are mined, consists of a set of transactions where each transaction contains a set of literals (i.e., items). An example of an association rule is that 30% of the transactions that contain beer and potato chips also contain diapers and that 2% of all transactions contains all of these items. In this example, 30% is the confidence of the association rule and 2% is the support of the rule. The typical problem is to find all of the association rules that satisfy user-specified minimum support and confidence constraints. As described above, this mining of association rules may be useful, for example, to such applications as market basket analysis, cross-marketing, catalog design, loss-leader analysis, fraud detection, health insurance, medical research and telecommunications diagnosis.
To better understand the context of the invention, a brief overview of typical association rules and their derivation is now provided. First let I={1.sub.1, 1.sub.2, . . . 1.sub.m } be a set of literals called items. Let D be a set of transactions, where each transaction T is a set of items such that T{character pullout}I. Therefore, a transaction T contains a set A of some items in I if A{character pullout}T. An association rule is an implication of the form A{character pullout}B, where A.OR right.I, B.OR right.I, and A.solthalfcircle.B=.O slashed.. The rule A{character pullout}B holds true in the transaction set D with a confidence "c" if c % of transaction in D that contain A also contain B (i.e., the confidence in the conditional probability p(B.vertline.A)). The rule A{character pullout}B has support "s" in the transaction set D if s % of the transactions in D contain A.orgate.B (i.e., the support is the probability of the intersection of the events). Given a set of transactions D, the computational task of a typical mining system is to generate all associations rules that have a support value and a confidence value greater than a user-specified minimum support value and minimum confidence value. Because a user of the system must estimate and select an appropriate confidence level and an appropriate support level, the typical association rule mining system may generate too few or too many association rules depending on the confidence and support values that are chosen. Therefore, the effectiveness of the typical association rule mining system is substantially dependent upon the user selecting the appropriate values for the confidence and support variables.
key limitation of existing itemset and rule mining systems are that they look for prevalent patterns, not surprising ones. These broad regularities (i.e., prevalent patterns) are already known to people who understand the particular domain (i.e., a domain expert). What these experts not want from large data warehouses are surprising, novel and non-trivial patterns which can indicate unexpected phenomena. Thus, there is a broad consensus that the success of a data mining system will depend critically on the ability to go beyond obvious patterns and find novel and useful patterns. Otherwise, as described above, the results of data mining will often be a large number of association rules which lack novelty, making it overwhelming difficult and unrewarding for the analyst to sift through them. In addition, a domain expert who is already familiar with the application domain is very unlikely to be satisfied with merely the prevalent patterns, because (1) presumably the company he works for is already exploiting them to the extent possible (because they already know about them) and (2) the competition also already knows about these patterns. Therefore, for data mining to be useful, it is desirable to discover the surprising patterns.
There are some systems and methods for discovering sequential patterns in transactions. These sequential pattern discovery systems discover a sequence of itemsets that are disjoint and a sequential pattern is characterized as interesting as long as its support/confidence, as described above, taken over all data sequences goes over a user-specified threshold. An example of a sequential pattern is that, "80% of the time, within a week of renting Star Wars, the customer also rents "Return of the Jedi". A sequential pattern discovery system does not discover itemsets whose support has changed over time. It is desirable, however, to be able to discover surprising temporal patterns (i.e., an itemset whose support changes over time) in transactions. This is quite distinct from temporal patterns as will become apparent in what follows.
There are also conventional systems which determine a local estimate of support in which the local estimate of support needs to have a time window specified by the user. The problem is that a small window may lead to poor statistical confidence of estimated parameters while a large window may skip over a surprising segment and average it into a larger, uninteresting one. Thus, for this conventional system, the selection by the user of window size, smoothing function, test statistic and surprise threshold are all critical choices best made by one highly experienced in the particular domain. The selection of variables is also critical for conventional approaches based on frequency transformations. It is desirable, however, to avoid any need for variable tuning and instead base the analysis on only one assumed property that the baskets of market data are independently drawn from a possibly drifting distribution.
A simple measure of surprise is the extent that an itemset departs from a model which may be estimated using standard multivariate time series analysis. Unfortunately, such simple estimation involves models, smoothing windows and parameters whose optimal choices can vary dramatically from one application to another and therefore do not provide a good measure of the surprise. Therefore, a system and method for discovering surprising temporal patterns which avoids these and other problems of the conventional systems is desirable and it is to this end that the present invention is directed.