Big data architectures, that is, architectures for processing high-velocity, high-volume, high-variability “big data streams”, are enabling new applications for knowledge discovery. Big data is a term used for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Such architectures are used today in finance for identifying transaction risks, in retail sales for personalizing marketing campaigns, in computer security for identifying malware and illegal network traffic, and in medicine for generating targets for new diagnostics and therapeutics. Further applications include grouping of web search results and news aggregation. Applications in these areas are also sometimes referred to as “complex event processing” or “event stream processing”. Often these applications include processes for classifying data by classification schemes generated from previously acquired data. Generally, the data being classified is multi-dimensional, that is, it includes many attributes or variables.
Techniques for generating grouping-segmentation-classification schemes include univariate or multivariate distribution methods, such as the Gini index and ROC AUC, and clustering methods, such as K-means and Ward's method. The clustering methods are computationally intensive, meaning that they generally cannot be applied in real time for complex event processing.
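The computational cost of clustering can be seen in a minimal sketch of K-means (Lloyd's algorithm), shown here in Python purely as an illustration and not as the method of any cited reference. The function name `kmeans` and its parameters `iterations` and `seed` are assumptions made for this example. Each pass over the data computes a distance from every point to every centroid, so the work grows with the number of points, clusters, dimensions, and iterations.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster equal-length numeric tuples with Lloyd's algorithm.

    Illustrative sketch only: a fixed iteration count is used instead
    of a convergence test, and initial centroids are sampled from the data.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: n * k squared-distance computations per pass.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


# Hypothetical two-dimensional data forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
```

Because every iteration revisits the full data set, a scheme of this kind is typically trained offline on previously acquired data rather than recomputed for each incoming event.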
An example of training a classification scheme by clustering, in order to categorize technical support requests, is described by Barrachina and O'Driscoll, “A big data methodology for categorizing technical support requests using Hadoop and Mahout”, Journal of Big Data, 2014 1:1.
Several methods have been disclosed in the prior art for updating classification schemes to reflect changing data patterns.
Hulten et al., “Mining Time-Changing Data Streams” (Proc. Seventh ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, ACM Press, 2001) describes a system for generating decision trees “from high-speed, time changing data streams”. The system updates a decision tree with a window of examples. Forman, “Tackling Concept Drift by Temporal Inductive Transfer” (SIGIR '06, August 2006, ACM Press) describes reclassifying news on a regular basis, such as daily.
U.S. Pat. No. 9,111,218 to Lewis, et al. describes receiving a stream of documents and classifying each document according to a customer support issue or sentiment. The method includes assigning classification topics. Drift of one or more of the classifications is monitored, and when the drift exceeds a predetermined threshold range, “the plurality of documents are re-clustered into the increased number of groups”.
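The general idea of a threshold-triggered update can be sketched as follows. This is a hypothetical illustration and not the method claimed in the cited patent: it assumes that drift is measured as the Euclidean distance between the centroid of a reference set and the centroid of a recent window, and the names `centroid`, `drift_exceeds`, and `threshold` are invented for the example.

```python
import math


def centroid(points):
    """Return the coordinate-wise mean of equal-length numeric tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def drift_exceeds(reference, recent, threshold):
    """Report whether the recent window has drifted past the threshold.

    Assumption for this sketch: drift is the Euclidean distance between
    the centroid of the reference data and that of the recent window.
    """
    return math.dist(centroid(reference), centroid(recent)) > threshold


# Hypothetical data: the recent window's centroid has moved by 1.0.
reference = [(0.0, 0.0), (2.0, 0.0)]
recent = [(1.0, 0.0), (1.0, 2.0)]
needs_reclustering = drift_exceeds(reference, recent, threshold=0.5)
```

In a scheme of this kind, the comparatively cheap drift check runs continuously, and the expensive re-clustering step runs only when the check reports that the threshold has been exceeded.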
A further example is U.S. Pat. No. 8,919,653 to Olmstead, describing a classification scheme updated when exceptions are received for an automated checkout system. In the event of an exception, an outlet displays a visual representation of the exception, allowing a customer to clear the exception in an unassisted manner.