Conventionally, at the start of a machine learning process, a machine learning system may contain potential features that are not correlated with the target (e.g., the system may contain a billion features, only a hundred of which are indicative of a prediction). Similarly, a machine learning model may be trained on data that is not representative of the distribution of data to which the model will be applied. As an example, a model may be configured to predict a video that a user is likely to watch, based on a currently viewed video. The training data used to generate the model is unlikely to include features of new videos that are not yet part of the corpus of videos at the time the model is trained. Accordingly, a model trained only on data that is not representative of the distribution of data to which it is applied may not perform optimally.
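
By way of illustration only, the two conditions described above may be sketched as follows. This is a minimal sketch under stated assumptions and is not a method described in this document: the function names (`pearson`, `weakly_correlated`, `unseen_fraction`), the choice of Pearson correlation as the measure of correlation, the threshold of 0.1, and the toy data are all hypothetical and chosen for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant (an assumption made
    here so that constant features are treated as uncorrelated).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def weakly_correlated(features, target, threshold=0.1):
    """Names of features whose absolute correlation with the target
    falls below the (hypothetical) threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) < threshold
    ]


def unseen_fraction(training_ids, serving_ids):
    """Fraction of distinct serving-time items absent from the training corpus."""
    serving = set(serving_ids)
    if not serving:
        return 0.0
    return len(serving - set(training_ids)) / len(serving)


# Toy data: one feature tracks the target, the other does not.
target = [0, 0, 1, 1, 0, 1]
features = {
    "watch_time": [1, 2, 8, 9, 2, 7],
    "noise": [5, 3, 4, 3, 4, 5],
}
flagged = weakly_correlated(features, target)

# Toy corpus: two of the four videos seen at serving time are new.
new_video_share = unseen_fraction(["v1", "v2", "v3"], ["v2", "v3", "v4", "v5"])
```

In this sketch, `flagged` lists the features that carry little signal about the target, and `new_video_share` gives a rough indication of how far the data seen at serving time has drifted from the training corpus.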