The booming of deep learning is fueled by both large data sets and large neural networks. Training a Deep Neural Network (DNN) with a large dataset is extremely computation intensive. Training requires machines with special hardware configurations such as accelerators and highspeed networking technologies with low latency and high throughput to achieve realistic training time. For a typical data science workflow, the data preparation and featurization stages and the later model evaluation stage can be run on less expensive commodity hardware such as an Apache Spark cluster at scale in the MapReduce distributed computing pattern. At the same time, some other more computationally intensive workloads, such as DNN training, may call for tightly coupled parallel implementation built upon the Message Passing Interface (MPI) framework and including accelerators to enable high performance parallelism. However, the machines with accelerators, such as Graphics Processing Units (GPUs), are generally expensive, non-commodity machines but these machines may only be partially utilized, remaining dormant when their special-purpose computing resources are not being utilized. This results in expensive, non-commodity computing resources being underutilized.