1. Field of the Invention
The present invention relates to data management systems, and more particularly to a system that uses online learning techniques to make predictions about records in a stream of incoming data.
2. Related Art
Organizations today collect and process an ever-increasing amount of business transaction data. To handle this transaction data, an organization will often establish a "data warehouse" comprising data extracted from online transaction processing (OLTP) systems. This transaction data is typically aggregated from multiple sources and is greatly transformed prior to being stored in the data warehouse. Thus, maintaining a data warehouse involves labor-intensive and expensive preprocessing and offline manual preparation. Nonetheless, corporations spend billions of dollars annually to create these data repositories because of the extraordinary value of the information stored within them when used for purposes of business analysis and planning.
A number of tools are used by analysts to examine and analyze the information in a data warehouse in order to model business problems and plan future actions. Online Analytic Processing (OLAP) tools are used to confirm hypotheses about the data. Using the interactive querying and data manipulation capabilities of OLAP tools, an analyst can look at the data from multiple views. This allows the analyst to compare and contrast different slices of the data. For example, one query might retrieve the total sales dollars in each of five regions for the last three quarters, while a second query might focus on sales volumes for specific products. In short, OLAP tools simply provide automated support for the traditional tasks of a back-office business analyst.
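The kind of slicing query described above can be illustrated with a short sketch (the field names and sales figures below are hypothetical and used only for illustration; a real OLAP tool would operate against a data warehouse rather than an in-memory list):

```python
from collections import defaultdict

# Hypothetical transaction records; field names and values are illustrative.
transactions = [
    {"region": "East", "quarter": "Q1", "product": "widget", "sales": 1200.0},
    {"region": "East", "quarter": "Q2", "product": "gadget", "sales": 800.0},
    {"region": "West", "quarter": "Q1", "product": "widget", "sales": 950.0},
    {"region": "West", "quarter": "Q2", "product": "widget", "sales": 400.0},
]

def total_sales_by(records, field):
    """Aggregate total sales along one dimension -- one 'slice' of the data."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[field]] += rec["sales"]
    return dict(totals)

# Two different views of the same data, analogous to the two queries above.
by_region = total_sales_by(transactions, "region")
by_product = total_sales_by(transactions, "product")
```

Each call slices the same underlying transactions along a different dimension, which is the essence of the multi-view comparison an analyst performs with an OLAP tool.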
In contrast to OLAP tools, which are used to confirm hypotheses, data mining systems are used to generate hypotheses. Data mining systems use various learning algorithms to discover relationships in data and to make predictions that are not apparent, or that are too complex to extract through conventional statistical techniques. Data mining systems automate and assist statistical analysis by packaging one or more learning algorithms (e.g., neural networks, rule induction, and clustering) with a set of utilities for extracting data from a data warehouse. Using a data mining system, an analyst can, for example, generate rules and generalizations about data.
Analytical systems, such as OLAP tools and data mining systems, presuppose the existence of a data warehouse. Hence, they suffer from two shortcomings of data warehouses: (1) loss of data detail and (2) delayed access to data. Loss of detail occurs because the data stored in a data warehouse is typically aggregated from multiple sources. During this aggregation process, valuable levels of detail in the raw data are lost. For example, daily variations in product sales are lost if the data in the warehouse is aggregated by month. The second shortcoming of data warehouses is the delayed access to the data. This arises because it takes time to process the raw transaction data prior to storing it in the data warehouse. The time required for processing can range from overnight to several weeks.
Systems that use stale warehouse data do not function well in today's rapidly changing business environment because, as the business environment changes, a plan that is based on an outdated internal model will not respond appropriately to changing market conditions. Consequently, the dynamic nature of today's business environment demands a way for business systems to react reflexively and adaptively to business events as they occur, at the detailed level of individual transactions. Hence, what are needed are analytic tools that can be used in real-time, in conjunction with OLTP systems.
Another use for collected (i.e., historical) data is in data prediction. For example, historical data can be used to predict missing data values. As with the above-described traditional system models, traditional data prediction systems suffer from the use of stale data. These data prediction systems are typically trained offline, in batch mode, using only historical data. Consequently, these systems make predictions about incoming, new data using a prediction model that is based on older data that may no longer be representative of current incoming data.
FIG. 1 illustrates a traditional data processing system including OLAP tools 114 and data mining system 118. In the illustrated system, client computer systems 102, 104 and 106 communicate with application server 108. These communications include data input from client computer systems 102, 104 and 106. These communications are processed by application server 108 and are formatted for storage in transactional database 110. Client computer systems 102, 104 and 106 can additionally communicate directly with transactional database 110. This communication pathway is illustrated with the dashed lines. From transactional database 110, the data is subjected to a number of processes, such as extraction, transformation, aggregation and cleansing before it is placed in data warehouse 116.
From data warehouse 116, the data can be processed in a number of ways. First, it can be directly formatted from data warehouse 116 to produce reports 130. Second, it can be processed through OLAP tools into reports 126. As illustrated, this process does not occur automatically; it must be manually performed by an operator 120. Finally, it can be processed through a data mining system 118 into reports 128 and into a model database 124. Again, this process must be manually performed by an operator 122. Not shown explicitly in FIG. 1 is the communication network, or group of networks that couple together and facilitate communication between the various components of the system.
Another approach to building data prediction systems is rooted in academic work by the computational learning theory community in the area of "online learning." Online learning takes place in a sequence of trials. In each trial, a data record is presented to a learner, whose goal is to accurately predict whether or not the given data record has a specific property. The learner makes a prediction about whether the data record has the property, and then receives feedback about whether the prediction was correct. This feedback is used to update a model that the learner uses to make subsequent predictions. In an online learning system, there is no distinction between training and testing, since both occur within a given trial.
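The trial structure described above can be sketched as follows. The function and the trivial learner below are illustrative only; any learner exposing `predict` and `update` methods (hypothetical names) fits the loop:

```python
def online_learning_session(learner, trials):
    """Run a sequence of online learning trials.

    Each trial presents a data record, asks the learner for a binary
    prediction, reveals the true label, and lets the learner update its
    model from the feedback.  There is no train/test split: training
    and testing both occur within every trial.
    """
    mistakes = 0
    for record, label in trials:
        prediction = learner.predict(record)   # learner commits first
        if prediction != label:
            mistakes += 1
        learner.update(record, label)          # feedback drives learning
    return mistakes

class MajorityLearner:
    """Trivial illustrative learner: predicts the label seen most often so far."""
    def __init__(self):
        self.counts = {True: 0, False: 0}
    def predict(self, record):
        return self.counts[True] >= self.counts[False]
    def update(self, record, label):
        self.counts[label] += 1
```

Note that the prediction is committed before the true label is revealed; the mistake count accumulated by this loop is exactly the quantity that the mistake bound model, discussed next, analyzes in the worst case.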
One online learning algorithm, called Winnow, was described by Nicholas Littlestone of UC Santa Cruz. See "Mistake Bounds and Logarithmic Linear-Threshold Learning Algorithms," by Nicholas Littlestone, a Ph.D. Dissertation from the University of California at Santa Cruz, 1989. Winnow has been shown to efficiently learn any linear threshold function. Linear threshold functions are an important class of knowledge representation, and they have long been used to represent a wide range of concepts in learning systems, including Boolean disjunctions and conjunctions of features.
Winnow's design was also based on the mistake bound model of learning. The mistake bound model of learning is an approach to the formal mathematical analysis of the worst-case behavior of a learning algorithm. In this model, it is assumed that the learner's goal is to make as few mistaken predictions as possible. Further, it is assumed that the presentation of examples to the learner is under the control of an adversary, whose goal is to select a sequence of trials in a way that maximizes the number of mistakes made by the learner. Using the mistake-bound model, one can prove upper and lower bounds on the number of mistakes made by a learner in the worst case.
The key aspect of Winnow, and similar algorithms, is that their mistake bounds grow linearly with the number of relevant features, but grow only logarithmically with the total number of features. All field/value pairs in a data record are features, while only the subset of field/value pairs that proves to be pertinent to the prediction undertaken are relevant features. Therefore, the total number of features is the total number of field/value pairs in the incoming data record, while the number of relevant features is the smaller number of field/value pairs in the subset.
Winnow has been analyzed in the presence of various kinds of noise, as well as in cases where no linear-threshold function can make perfect classifications. It has been proven, under some assumptions on the type of noise, that Winnow still learns as well as the best linear threshold function could learn, while retaining its favorable dependence on the total number of features and on the number of relevant features. In contrast to Bayesian approaches, the algorithm makes no independence assumptions, or any other assumptions, about the attributes.
Winnow is a mistake-driven algorithm; that is, it updates its model only when a mistake is made, and it only updates those parts of the model directly involved in making the mistake. This leads to significant implementation efficiencies compared with implementations of previous approaches to learning linear-threshold functions. Further, Winnow is a multiplicative-update algorithm; that is, the method used to update its state when a mistake has been made involves multiplication. This is an important factor in both the formal analysis of the algorithm and the algorithm's ability to learn to ignore irrelevant features quickly.
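The mistake-driven, multiplicative-update behavior just described can be sketched as follows. This is a minimal illustration of a Winnow-style learner over Boolean feature vectors; the parameter choices (promotion factor alpha = 2, threshold equal to the number of features) follow common presentations of the algorithm and are not prescriptive:

```python
class Winnow:
    """Minimal sketch of a Winnow-style linear-threshold learner
    for Boolean feature vectors (illustrative parameter choices)."""

    def __init__(self, n_features, alpha=2.0):
        self.alpha = alpha
        self.threshold = float(n_features)
        self.weights = [1.0] * n_features   # all features start equal

    def predict(self, x):
        # Linear threshold prediction over the active (x_i == 1) features.
        score = sum(w for w, xi in zip(self.weights, x) if xi)
        return score >= self.threshold

    def update(self, x, label):
        # Mistake-driven: the model changes only when a prediction is
        # wrong, and only the weights of the active features are touched.
        if self.predict(x) == label:
            return
        # Multiplicative update: promote on a false negative,
        # demote on a false positive.
        factor = self.alpha if label else 1.0 / self.alpha
        for i, xi in enumerate(x):
            if xi:
                self.weights[i] *= factor
```

Because demotion repeatedly halves the weights of features active in false positives, the weights of irrelevant features decay geometrically, which is the mechanism behind the algorithm's ability to discard irrelevant features quickly.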
Theoretical analyses of the Winnow family of algorithms have predicted an excellent ability to deal with large numbers of features and to adapt to new trends. This extremely good learning behavior in high-dimensional feature spaces and in the presence of irrelevant features is an important property that allows one to separate the learning problem from that of selecting the features. Therefore, a large set of features can be used and the algorithm will eventually discard those that do not contribute to the accuracy of the resulting set of predictions. This removes one of the major burdens associated with data preparation in an OLAP or data mining effort: the user is freed from the need to select relevant features in advance.
Although systems such as Winnow can "learn" linear threshold functions and the like, this learning takes place by updating numerical weights that are used to produce a functional output. One disadvantage of this type of learning is that the numerical weight values are not very meaningful to human decision-makers. Human decision-makers are better suited to understand association rules such as, "computer systems ordered with a 300 MHz processor and a 17-inch monitor have a 70% probability of including 64 megabytes of memory."
Hence, what is needed is an online learning system that identifies association rules between fields in incoming data records. For example, this type of association rule might say that a first value in a first field of a data record is predictive of a second value in a second field of the data record.
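One way such field-to-field association rules could be estimated incrementally is sketched below. This is not the claimed invention, only an illustration of the rule form described above: co-occurrence counts over a stream of records yield the confidence that one field/value pair predicts another (all field and value names are hypothetical):

```python
from collections import defaultdict

class RuleLearner:
    """Illustrative sketch: estimates one-field association rules of the
    form 'value v1 in field f1 predicts value v2 in field f2' by counting
    co-occurrences in a stream of incoming records."""

    def __init__(self):
        self.pair_counts = defaultdict(lambda: defaultdict(int))
        self.antecedent_counts = defaultdict(int)

    def observe(self, record):
        # Update counts online, one record at a time.
        items = list(record.items())
        for f1, v1 in items:
            self.antecedent_counts[(f1, v1)] += 1
            for f2, v2 in items:
                if f1 != f2:
                    self.pair_counts[(f1, v1)][(f2, v2)] += 1

    def confidence(self, antecedent, consequent):
        """Estimated P(consequent field/value | antecedent field/value)."""
        total = self.antecedent_counts[antecedent]
        if total == 0:
            return 0.0
        return self.pair_counts[antecedent][consequent] / total
```

A rule such as the 70% memory-configuration example above corresponds to a confidence of 0.7 for the antecedent/consequent pair in question, a quantity that is directly meaningful to a human decision-maker in a way that raw numerical weights are not.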