There are two broad approaches to performing machine learning (ML) on Graphics Processing Units (GPUs). The first is to create hand-crafted libraries of GPU kernels, each targeting a specific algorithm such as support vector machines, neural networks, or decision trees. The second is to implement ML processes by composing, or stitching together, a sequence of invocations of GPU-accelerated primitive operations, such as matrix-vector multiplication. The first approach lacks flexibility and is difficult to customize, while the second can potentially be very inefficient.
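To make the contrast concrete, the following is a minimal CPU-side sketch in plain Python, not GPU code and not taken from any particular library. It computes the same quantity, the least-squares gradient X^T (X w - y), in two ways: by stitching together primitive operations, and in a single hand-written pass. The function names (`matvec`, `transpose_matvec`, `vec_sub`, `gradient_composed`, `gradient_fused`) and the choice of example computation are assumptions made purely for illustration. On a GPU, each primitive call in the composed version would typically correspond to a separate kernel launch whose result is written to device memory before the next one reads it, which is one plausible source of the inefficiency mentioned above.

```python
# Illustrative stand-ins for GPU-accelerated primitives (hypothetical names).
# On a real GPU, each call would typically be its own kernel launch.

def matvec(A, x):
    """Dense matrix-vector product A @ x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose_matvec(A, x):
    """Dense product A^T @ x, accumulated row by row."""
    out = [0.0] * len(A[0])
    for row, xi in zip(A, x):
        for j, a in enumerate(row):
            out[j] += a * xi
    return out


def vec_sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def gradient_composed(X, w, y):
    """Second approach: stitch primitives together.

    Two intermediate vectors are materialized between the three calls.
    """
    pred = matvec(X, w)        # intermediate 1: X w
    resid = vec_sub(pred, y)   # intermediate 2: X w - y
    return transpose_matvec(X, resid)


def gradient_fused(X, w, y):
    """First approach in spirit: one hand-written pass over X.

    Each row is read once and no intermediate vector is stored.
    """
    grad = [0.0] * len(w)
    for row, yi in zip(X, y):
        r = sum(a * b for a, b in zip(row, w)) - yi
        for j, a in enumerate(row):
            grad[j] += a * r
    return grad


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w = [0.5, -1.0]
y = [1.0, 0.0, -1.0]

print(gradient_composed(X, w, y))
print(gradient_fused(X, w, y))
```

Both functions return the same result. The composed version is easy to rearrange for a different algorithm but passes over the data several times, while the fused version is specific to this one computation.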