One-class collaborative filtering is a problem that arises naturally in many different settings. One such setting is the analysis of clickstream data, which refers to a list of links on which a particular user has clicked. Clickstream data contains only information identifying which websites were visited by a user during a given period of observation. Moreover, clickstream data does not provide any indication of why a user did not visit sites that were not visited. It could be because the user does not like a particular site, because the user did not know about the site, or because the site was visited outside the period of observation, to name just a few examples. Clickstream data does not account for any of these reasons. In addition, clickstream data tends to be sparse. As used herein, the terms “sparse” and “sparsity” refer to a data set in which the number of unobserved items greatly exceeds the number of observed items.
In certain circumstances, it may be desirable to predict a user's interests based on clickstream data or other sparse data. Sparse data regarding items purchased by a user may be used to predict other items the user might prefer from a larger data set, without any explicit ratings or other background information. In addition, sparse data regarding which software modules a user has already installed may be used to predict additional modules the user might prefer, without any explicit feedback about those modules from the user. Effective prediction of user interest allows a provider to deliver content the user is more likely to enjoy or prefer, such as personalized news, advertisements or the like. In making such predictions, it is desirable to identify websites that have not yet been visited by the user, but that the user is likely to prefer.
In a one-class collaborative filtering problem relating to predicting items for which the user may express a preference, items for which the user has already expressed a preference (e.g., web pages actually clicked on) are assigned a particular value. For example, a logical “one” may correspond to preference by the user. The number of items for which the user has actually expressed a preference is likely to be small relative to the universe of available items, so the resulting data is sparse. A matrix may be constructed to represent the universe of available items, with a logical “one” occupying all positions corresponding to items for which the user has actually expressed a preference.
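The matrix described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `observed`, `to_rows`, and `sparsity` are assumptions introduced here, and unobserved positions are deliberately left as `None` because the text has not yet said how they should be treated.

```python
from typing import Dict, List, Optional, Set


def to_rows(observed: Dict[int, Set[int]], n_users: int,
            n_items: int) -> List[List[Optional[int]]]:
    """Expand per-user sets of preferred items into a user-by-item matrix.

    Observed preferences become 1; every other position stays None,
    meaning that nothing is known about it.
    """
    return [
        [1 if item in observed.get(user, set()) else None
         for item in range(n_items)]
        for user in range(n_users)
    ]


def sparsity(observed: Dict[int, Set[int]], n_users: int, n_items: int) -> float:
    """Fraction of matrix positions that carry no observation."""
    n_observed = sum(len(items) for items in observed.values())
    return 1.0 - n_observed / (n_users * n_items)


# Hypothetical toy data: user 0 clicked items 1 and 3, user 2 clicked item 0.
observed = {0: {1, 3}, 2: {0}}
matrix = to_rows(observed, n_users=3, n_items=4)
print(matrix)
print(sparsity(observed, n_users=3, n_items=4))
```

Storing only the sets of observed items, rather than the expanded matrix, keeps memory proportional to the number of observations; the expanded form is shown only to make the layout concrete.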
When attempting to predict an item a particular user might prefer, there are essentially two known strategies for treating the items for which the user has not explicitly expressed a preference. In the first approach, which is based on a singular value decomposition (referred to as “SVD” herein), the items for which the user has not explicitly expressed a preference are all assumed to have the same specific value for the weighted likelihood that the user will prefer them. For example, when predicting web pages a user may prefer based on sparse data regarding the web pages visited by the user, logical zeroes may be used for all web pages not visited by the user. This corresponds to an initial assumption that the user will not prefer those web pages. Subsequent iterations of predictive data may be calculated based on another matrix that represents a confidence in the prediction, where the confidence is based on user preference data obtained from other users. Such a scenario is essentially premised on the notion that the degree to which a user is unlikely to prefer any given item not chosen by the user may be inferred from the preference data of other users. For example, a prediction algorithm may assign a high confidence (for example, 0.95) to the assumption that the user will not prefer a particular item if many other users with similar demographic profiles have shown a high likelihood of not preferring that item. A low confidence (for example, 0.05) may be assigned to the assumption that the user will not prefer a particular item if many other users with similar demographic profiles have shown a high likelihood of preferring the item. A prediction that a particular user will prefer an item for which no preference data relative to that user is available may be made by selecting an item that is sufficiently highly preferred by other users who have some characteristics in common with the particular user.
Moreover, if the weighted likelihood that the user will prefer an item based on data obtained from other users exceeds a certain preset level, the item may be presented to the user as a prediction via, for example, a web browser or the like.
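The thresholding step just described can be illustrated with a deliberately simple sketch. It is not an SVD and not the method of any particular system: it assumes, purely for illustration, that "users with characteristics in common" means users who share at least one preferred item with the target user, and that the weighted likelihood for an item is the fraction of those peers who prefer it. The function name `recommend_from_peers` and the default threshold are hypothetical.

```python
from typing import Dict, Hashable, Set


def recommend_from_peers(observed: Dict[Hashable, Set[int]], user: Hashable,
                         threshold: float = 0.5) -> Dict[int, float]:
    """Score the target user's unseen items by how many peers prefer them.

    Assumption: a peer is any other user sharing at least one preferred item.
    One minus the returned score can be read as the confidence that the
    target user will NOT prefer the item.
    """
    mine = observed.get(user, set())
    peers = [items for other, items in observed.items()
             if other != user and items & mine]
    if not peers:
        return {}
    candidates = set().union(*peers) - mine
    scores = {item: sum(item in peer for peer in peers) / len(peers)
              for item in candidates}
    # Only items whose score exceeds the preset level are presented.
    return {item: score for item, score in scores.items() if score > threshold}


# Hypothetical toy data.
observed = {"a": {1, 2}, "b": {1, 3}, "c": {2, 3}, "d": {4}, "e": {1, 5}}
print(recommend_from_peers(observed, "a"))
```

In this toy data, user "a" has three peers ("b", "c", and "e"). Item 3 is preferred by two of them and passes the 0.5 threshold, while item 5 is preferred by only one and is withheld.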
The second approach treats the likelihood that a user will prefer each specific item for which no preference data relative to the particular user is available as missing, rather than substituting an arbitrary value. An example of this approach is an alternating least squares methodology, which may be referred to as “ALS” herein. In such an approach, all non-missing values (for example, values corresponding to items the user is known to prefer) are the same (for example, logical “ones”). In contrast, unobserved values are explicitly left blank. Regularization is needed to enforce any kind of generalization (that is, to avoid a trivial solution that predicts the same value for every missing data instance).
Approaches to one-class collaborative filtering problems may employ different weighting schemes based on whether a value is present or missing and, optionally, based on the individual user and item under consideration. Such weighting is intended to improve the predictive power of collaborative filtering models compared to (i) SVD methods that substitute zeroes for all missing values, and (ii) ALS methods that are capable of ignoring missing values. In the case of the ALS approach in which there is only a single non-missing value (for example, a logical “one” to show that a user is known to prefer a particular item), the ALS method generalizes only due to a regularization of latent feature vectors. Only recently has it been suggested to use a weighted variant of ALS to balance the two extremes above. The weighted variant can be used to weight the missing values after substituting logical zeroes for them, which has been shown to yield better predictions in practice.
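One common way to write the weighted problem, offered here as a sketch rather than as the formulation of any specific method, is as a regularized weighted low-rank approximation. All symbols are assumptions introduced for illustration: R is the n-by-m matrix with R_ij = 1 for observed preferences and 0 for substituted values, W holds the non-negative weights, u_i and v_j are the latent feature vectors of user i and item j (the rows of U and V), and λ is the regularization strength.

```latex
\min_{U,V}\; \sum_{i=1}^{n}\sum_{j=1}^{m} W_{ij}\,\bigl(R_{ij} - u_i^{\top} v_j\bigr)^{2}
  \;+\; \lambda\,\bigl(\lVert U\rVert_F^{2} + \lVert V\rVert_F^{2}\bigr)
```

Under these assumptions, setting W_ij = 1 for every entry corresponds to substituting zeroes with full confidence, setting W_ij = 0 for every missing entry corresponds to ignoring missing values, and intermediate weights on the missing entries lie between the two extremes.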
There are disadvantages to the methods discussed above that require the substitution of default values (such as logical “zeroes”) for missing values. Such substitution seems necessary when the substituted values are subsequently given weights corresponding to a confidence level in the arbitrary likelihood value assigned to the item. This is problematic because the most practically relevant case is that of a large but sparse matrix (for example, n users by m items with Θ(m+n) non-missing values). If the number of latent variables is treated as a constant, substituting all missing values increases the runtime complexity from O(n+m) to Ω(n*m). Because collaborative filtering relies on a large number of users and is usually performed on extremely sparse matrices, such an increase in runtime makes obtaining a solution practically intractable, especially for the most attractive data sets. In contrast, unweighted ALS methodologies can accommodate the missing values in a way that allows for runtimes in O(n+m) but, as mentioned above, such methodologies lack the good generalization performance of their weighted counterparts.
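A rough count makes the gap concrete. The sketch below only counts how many matrix entries a single training pass must visit; the function name `entries_processed`, the matrix sizes, and the figure of about ten observations per user and item are hypothetical values chosen for illustration.

```python
def entries_processed(n_users: int, n_items: int, n_observed: int,
                      substitute_missing: bool) -> int:
    """Count the matrix entries one training pass must visit.

    With substitution, every cell of the n-by-m matrix becomes a training
    example; without it, only the observed cells are visited.
    """
    if substitute_missing:
        return n_users * n_items
    return n_observed


# Hypothetical sizes, chosen only for illustration.
n_users, n_items = 1_000_000, 100_000
n_observed = 10 * (n_users + n_items)  # assumed: observations grow like n + m

sparse_cost = entries_processed(n_users, n_items, n_observed, substitute_missing=False)
dense_cost = entries_processed(n_users, n_items, n_observed, substitute_missing=True)
print(sparse_cost, dense_cost, dense_cost / sparse_cost)
```

With these assumed sizes, the sparse pass touches 11 million entries while the substituted matrix has 100 billion, a difference of roughly four orders of magnitude per pass.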
One attempt to overcome the large increase in runtime complexity of an SVD-type methodology with weighting employs an ensemble technique that runs collaborative filtering multiple times. Each time, only a relatively small sub-sampled fraction of the negative examples (arbitrarily weighted likelihood values) is used. This sub-sampling approach makes the ensemble methodology feasible in practice from a computational cost standpoint, but it does so at the cost of (i) decreasing the number of negative examples considered during training, which reduces the expected quality of results, while (ii) still increasing the runtime considerably compared to the case of ALS without substituting any examples. The runtime increases because the costly collaborative filtering base algorithm is run multiple times, and each run operates on a larger data set than in the sparse case.
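The structure of such an ensemble can be sketched as follows. This is a minimal outline under stated assumptions, not the implementation of any particular technique: negatives are drawn uniformly at random per user, the base learner is passed in as a hypothetical callable `fit_and_score` that returns a score per (user, item) pair, and the ensemble simply averages the scores each pair received.

```python
import random
from typing import Callable, Dict, Set, Tuple

Scores = Dict[Tuple[int, int], float]
Interactions = Dict[int, Set[int]]


def sample_negatives(observed: Interactions, n_items: int, fraction: float,
                     rng: random.Random) -> Interactions:
    """Pick, per user, a random fraction of unobserved items to treat as negatives."""
    sampled = {}
    for user, positives in observed.items():
        unobserved = [item for item in range(n_items) if item not in positives]
        k = int(len(unobserved) * fraction)
        sampled[user] = set(rng.sample(unobserved, k))
    return sampled


def ensemble_scores(observed: Interactions, n_items: int, n_rounds: int,
                    fraction: float,
                    fit_and_score: Callable[[Interactions, Interactions], Scores],
                    seed: int = 0) -> Scores:
    """Run the base learner once per round on fresh negatives and average its scores."""
    rng = random.Random(seed)
    totals: Dict[Tuple[int, int], float] = {}
    counts: Dict[Tuple[int, int], int] = {}
    for _ in range(n_rounds):
        negatives = sample_negatives(observed, n_items, fraction, rng)
        # The costly base algorithm is executed here, once per round.
        for key, value in fit_and_score(observed, negatives).items():
            totals[key] = totals.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}
```

Both drawbacks noted above are visible in the outline: each round sees only `fraction` of the possible negatives, and the base learner is invoked `n_rounds` times on data that is larger than the positives alone.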