Large output spaces are ubiquitous in modern machine learning problems. Examples include extreme multiclass or multilabel classification with many classes, language modeling over large vocabularies, and metric learning with a large number of pairwise distance constraints. In all such problems, a key bottleneck in training is the evaluation of the loss function and its gradient. The loss functions typically used for these problems require an enumeration of all possible outputs, so a single evaluation can take time linear in the number of outputs. This is a significant bottleneck for iterative methods such as gradient descent, since each training step can then require a huge number of operations.
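As a concrete illustration (a sketch, not taken from this work), consider the standard softmax cross-entropy loss: computing the normalizer and the gradient both touch every output score, so the cost of one evaluation grows linearly with the number of outputs.

```python
import numpy as np

def softmax_cross_entropy(scores, target):
    """Full softmax cross-entropy loss and its gradient.

    Both the normalizer and the gradient touch every entry of
    `scores`, so each evaluation costs O(num_outputs).
    """
    shifted = scores - scores.max()   # shift for numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum()           # normalizer: O(num_outputs)
    loss = -np.log(probs[target])
    grad = probs.copy()               # dense gradient: O(num_outputs)
    grad[target] -= 1.0
    return loss, grad

# With a million outputs (e.g. a large vocabulary), every gradient
# step of an iterative optimizer pays this linear cost.
rng = np.random.default_rng(0)
num_outputs = 1_000_000
scores = rng.normal(size=num_outputs)
loss, grad = softmax_cross_entropy(scores, target=42)
```

The dense gradient `probs - onehot(target)` is what makes naive training expensive here; the methods motivated by this paragraph aim to avoid enumerating all outputs at each step.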