An online system may employ a prediction model for various uses, such as determining the likelihood that a user will click on an ad or select an item within search results. The online system can then take various actions based on its predictions, such as serving one or more ads having high click probability, or displaying items having high click probability in prominent positions within search results.
In one case, the prediction model may produce a prediction by identifying a collection of features which describe an event (where an “event” generally corresponds to the circumstance in which a prediction is being made). Those features collectively form a feature vector. The prediction model then maps the feature vector into a prediction. Some features correspond to individual attribute values (such as different user IDs, different ad IDs, etc.), while other features correspond to combinations of attribute values (such as different combinations of user IDs and ad IDs). The feature vector in this case includes a feature for each individual attribute value and each combination of attribute values; however, the feature vector will be sparsely populated for any given event, meaning that it will include only a small number of non-zero features when it is used to describe any particular event. As can be appreciated, the feature space associated with the above-described type of prediction model may have very high dimensionality, because the number of combination features grows with the product of the numbers of distinct values of the combined attributes. Among other problems, it is difficult to train this kind of prediction model in an efficient manner, particularly in those cases in which the prediction model is complex (e.g., non-linear) in nature.
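By way of illustration only, the following minimal sketch shows how a single event may yield a sparsely populated feature vector even though the overall feature space is very large. The function names (`build_sparse_features`, `feature_space_size`), the attribute names, and the cardinalities (one million user IDs, ten thousand ad IDs) are hypothetical and chosen only for this example; the sketch also assumes that combinations are limited to pairs of attribute values, which is one possibility among many.

```python
from itertools import combinations


def build_sparse_features(event):
    """Return only the non-zero features of an event as {feature_name: 1.0}."""
    # Individual attribute values, e.g. "user_id=u42" (hypothetical naming).
    singles = [f"{name}={value}" for name, value in sorted(event.items())]
    # Pairwise combinations of attribute values, e.g. "ad_id=a7&user_id=u42".
    pairs = [f"{a}&{b}" for a, b in combinations(singles, 2)]
    return {feature: 1.0 for feature in singles + pairs}


def feature_space_size(cardinalities):
    """Count dimensions: one per attribute value plus one per cross-attribute pair."""
    sizes = list(cardinalities.values())
    single_count = sum(sizes)
    pair_count = sum(a * b for a, b in combinations(sizes, 2))
    return single_count + pair_count


# Hypothetical event and cardinalities, for illustration only.
event = {"user_id": "u42", "ad_id": "a7"}
features = build_sparse_features(event)
total = feature_space_size({"user_id": 1_000_000, "ad_id": 10_000})

print(f"{len(features)} non-zero features out of {total} dimensions")
```

Under these assumed cardinalities, the event activates three features (two individual attribute values and one combination), while the full feature space spans roughly ten billion dimensions. Storing only the non-zero entries, as the dictionary does here, is one common way to represent such a vector.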