The present invention relates to data clustering, also known as data segmentation, and more particularly to such clustering or segmentation that is oriented toward the goal of prediction.
Data clustering is a useful technique for gaining insights from data sets that are large, such as purchase data from electronic commerce ("e-commerce") Internet web sites, Internet web site viewing data, etc. In this approach, the data can be considered as items, rows, cases, or records, having various features, variables, or columns. For example, items could be users, and features could be web page viewings on a news-oriented web site.
Items are generally clustered together such that items within a given cluster have similar features in some sense, and items with dissimilar features are within different clusters. Once clusters are formed by a given machine-learning algorithm or other technique, a data analyst can then inspect and examine the clusters to gain insights into the relationships among the data. For example, an analyst may learn that one cluster of web users frequents sports stories, another cluster of users frequents tabloid stories, etc. These insights can then be used for various purposes.
In many clustering applications, these purposes include making predictions. For example, a user's cluster might be used to predict what ads the user is likely to click on (that is, select). In general, when clusters are used to make predictions, it is usually important to distinguish features that are known at the time the predictions are made, which are referred to as input features or variables, from features that are to be predicted, which are referred to as output features or variables. For example, the stories on web pages read by a user could be considered the input variables, while the ads clicked on by the user could be considered the output variables.
In prior art modeling and clustering techniques, input and output variables are treated symmetrically. More specifically, the techniques explicitly or implicitly learn data models that are good estimates of the joint probability of inputs and outputs, which can be expressed as p(inputs, outputs). However, data analysts typically already have the input data at their disposal and wish to predict the output data. That is, rather than estimating the joint probability of inputs and outputs, analysts are usually more concerned with predicting the probability of outputs conditioned on the inputs, which can be expressed as p(outputs|inputs). For this and other reasons, there is a need for the present invention.
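For reference, the two quantities named above are related by the standard definition of conditional probability. The notation below is illustrative only and is not taken from the specification:

```latex
\[
p(\text{outputs} \mid \text{inputs})
  = \frac{p(\text{inputs}, \text{outputs})}{p(\text{inputs})}
\]
```

Under this identity, a model of the joint probability must also account for p(inputs), a quantity that is not itself needed when the inputs are already known at prediction time.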
This invention relates to goal-oriented data clustering. In one embodiment, a computer-implemented method is operable on a number of variables that have a predetermined representation. The variables include input-only variables, output-only variables, and both input-and-output variables. The method generates a model that has a bottleneck architecture. The model includes a top layer of nodes of at least the input-only variables, one or more middle layers of hidden nodes, and a bottom layer of nodes of the output-only and the input-and-output variables. At least one cluster is then determined from this model.
Embodiments of the invention provide for advantages not found within the prior art. For example, models according to embodiments of the invention treat input and output variables asymmetrically. Variables that are input only (that is, only used for prediction) are represented by the nodes of the top layer of the model, and in one embodiment, are the only variables represented by the top-layer nodes. Variables that are output only (that is, which are only predicted) are represented by the nodes of the bottom layer of the model. Variables that can be either input or output variables, or both (that is, variables that are used for prediction, are predicted, or both), are represented at least by the bottom-layer nodes, and in one embodiment, also by nodes within the top layer of the model. Thus, for example, embodiments of the invention can provide for the estimation of p(output variables|input variables), in distinction to the prior art.
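The assignment of variables to layers described above can be sketched as follows. This is a minimal illustration only, not the claimed method: the names Role, BottleneckLayout, and build_layout, the io_in_top flag, and the example variable names are all hypothetical, and the sketch covers only which variables sit in which layer, not how the model is learned or how clusters are determined from it.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    """Hypothetical labels for the three kinds of variables in the text."""
    INPUT_ONLY = "input-only"
    OUTPUT_ONLY = "output-only"
    INPUT_AND_OUTPUT = "input-and-output"


@dataclass
class BottleneckLayout:
    """Names of the nodes in each layer of a bottleneck-style model."""
    top: list
    hidden: list
    bottom: list


def build_layout(roles, n_hidden=1, io_in_top=False):
    """Assign variables to layers according to their role.

    roles     -- dict mapping variable name to Role
    n_hidden  -- number of hidden nodes in a single middle layer (assumption)
    io_in_top -- if True, input-and-output variables also appear in the top
                 layer; if False, the top layer holds input-only variables
    """
    # The top layer always holds the input-only variables.
    top = [name for name, role in roles.items() if role is Role.INPUT_ONLY]
    if io_in_top:
        top += [name for name, role in roles.items()
                if role is Role.INPUT_AND_OUTPUT]

    # The bottom layer holds everything that may be predicted.
    bottom = [name for name, role in roles.items()
              if role in (Role.OUTPUT_ONLY, Role.INPUT_AND_OUTPUT)]

    # Hidden nodes get placeholder names; their states are not modeled here.
    hidden = [f"h{i}" for i in range(n_hidden)]
    return BottleneckLayout(top=top, hidden=hidden, bottom=bottom)


# Hypothetical variables for a news web site.
example_roles = {
    "read_sports_story": Role.INPUT_ONLY,
    "clicked_ad": Role.OUTPUT_ONLY,
    "read_news_story": Role.INPUT_AND_OUTPUT,
}
layout = build_layout(example_roles, n_hidden=2)
print(layout)
```

With io_in_top left at False, the top layer contains only the input-only variables; setting it to True mirrors the embodiment in which input-and-output variables are also represented in the top layer.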
The invention includes computer-implemented methods, machine-readable media, computerized systems, and computers of varying scopes. Other aspects, embodiments and advantages of the invention, beyond those described here, will become apparent by reading the detailed description and with reference to the drawings.