With the increasing availability of data storage and computing power, it is feasible for a variety of businesses and entities to collect, store, and analyze large amounts of data related to their activities. For example, web retailers, stock brokers, and even gamblers “crunch the numbers” to gain an advantage over their competition. In some instances, an entity is looking for a trend, such as sales correlated with a new web layout. In other instances, an entity is looking for an outlier, indicating an activity, asset, or performance that has been overlooked. The outlier might be bad, indicating a genetic propensity for a disease, an underperforming business unit, or insider trading. Alternatively, the outlier might be good, indicating a teaching style that is boosting class scores or a minor-league pitcher who will succeed in the big leagues.
As the amount of data collected by businesses increases, the efficiency of automated techniques for analyzing that data should increase in order to allow for timely analysis. Furthermore, most items or events may be characterized by many different markers, such as time, place, cost, date, color, style, etc., resulting in huge multivariable data sets. Looking for trends while taking all of these variables into consideration is a formidable task. That is, even with today's powerful computing resources, identifying outliers from a data set of thousands of events, where each event has twenty or more variables, can require substantial time and expense.
Techniques such as clustering can help to reduce the time and complexity of identifying outliers (and trends) by binning data with similar characteristics. Clustering typically involves assigning multidimensional data to subsets, or clusters, so that the observations in the same cluster are similar in some sense. Once the clusters are identified and filled, the sizes and relationships of the clusters can be compared in order to spot trends (big clusters) or outliers (clusters that are “far away” from the others).
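By way of illustration only, the following sketch shows one way the clustering and comparison described above might be carried out. It is a minimal example under stated assumptions, not a description of any particular method: the use of k-means, Euclidean distance, a farthest-first choice of starting centroids, the two-dimensional sample points, and the names `farthest_first_init`, `kmeans`, and `isolation` are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def farthest_first_init(points, k):
    """Pick k starting centroids (assumed heuristic): begin with the first
    point, then repeatedly add the point farthest from those already chosen."""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


def kmeans(points, k, iterations=10):
    """Minimal k-means: alternate assignment and centroid-update steps."""
    centroids = farthest_first_init(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return labels, centroids


def isolation(c, centroids):
    """Distance from centroid c to the nearest other centroid."""
    return min(math.dist(centroids[c], other)
               for i, other in enumerate(centroids) if i != c)


# Hypothetical observations: two dense groups and a small, distant group.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2), (0.9, 0.8),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.2),
    (20.0, 20.0), (20.5, 19.5),
]
k = 3
labels, centroids = kmeans(points, k)

sizes = Counter(labels)
# "Trend": the cluster holding the most observations.
trend_cluster = max(sizes, key=sizes.get)
# "Outlier": the cluster whose centroid lies farthest from its nearest neighbor.
outlier_cluster = max(range(k), key=lambda c: isolation(c, centroids))

print("cluster sizes:", dict(sizes))
print("largest cluster:", trend_cluster)
print("most isolated cluster:", outlier_cluster)
```

In this sketch the number of clusters is fixed in advance and every point is compared against every centroid on each pass, so the cost grows with the number of observations, the number of variables, and the number of clusters.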
Typically, clustering achieves its most profound results when using classifications that are unsupervised. That is, the data is not labeled ahead of time (“fraudulent”, “cancer”, or “profitable”) and used as a training data set to direct classification of similar data. Instead, the data itself becomes the source of the labels, and the technique can uncover unexpected correlations that represent overlooked opportunities. By looking at the data holistically, including multiple dimensions, it is possible to identify groupings that were not appreciated previously. For example, unsupervised clustering is able to segment customers with similar behaviors so that marketers can target specific, previously unidentified segments with highly customized content. Such cluster-analysis-driven marketing campaigns have already proven effective.
Nonetheless, even “unsupervised” clustering must have rules to direct the clustering toward a valuable result. Without some rules, the clustering will have no significance.
Of course, adding the complexity of unsupervised classification to a large multivariable data set can make meaningful analyses untenable because of the massive computational resources needed. The cost of preparing and performing such analyses is prohibitive for most entities. Accordingly, there is a need for improved methods of unsupervised clustering that allow clusters to be efficiently determined. Such techniques would allow more computing resources to be directed to comparing the clusters in order to identify trends and outliers.