Machine learning (ML) is an increasingly popular application in clouds and datacenters. Current software for distributed ML leverages the specific properties of ML programs to achieve high performance. However, such software cannot adapt elastically to changes in resource availability in the multi-user (or multi-tenant) environments in which it runs, such as modern clouds and datacenters, where the set of running jobs and the available computation resources (CPU, memory, etc.) change constantly. It is therefore highly desirable for applications executing in such environments to be elastic: able to opportunistically use additional resources when offered, and to gracefully cede acquired resources when requested.
Elasticity benefits both the individual job and the cluster as a whole. An elastic job can use idle resources to complete in less time, and can avoid halting completely when some of its resources are evicted. A cluster-wide job scheduler can dynamically re-allocate resources to speed up urgent real-time or interactive jobs, and can ensure fairness by preventing jobs from holding frequently requested resources for long periods of time.