Large-scale networked systems are commonplace platforms employed in a variety of settings for running applications and maintaining data for business and operational functions. For instance, a data center (e.g., a physical cloud computing platform) may provide a variety of services (e.g., web applications, email services, search engine services, etc.) for a plurality of customers simultaneously. These large-scale networked systems typically include many resources distributed throughout the data center, throughout multiple data centers in a region, or across multiple regions around the globe. A resource can be a physical machine or a virtual machine (VM) running on a physical node or host. The data center runs on hardware components (e.g., power supplies, racks, and Network Interface Controllers (NICs)) and software components (e.g., applications, Application Programming Interfaces (APIs), and databases) that rely on each other to operate.
Large-scale public cloud providers invest billions of dollars in their cloud infrastructure and operate hundreds of thousands of servers across the globe, and new data centers continue to be built and expanded. However, even with state-of-the-art cluster management and scheduling techniques, the average resource utilization in data centers is often low. Several reasons for this low utilization are common to many data centers: some capacity is reserved as a buffer to handle the consequences of failures; natural demand fluctuation leaves capacity unused at certain times; servers are over-provisioned to handle load spikes; fragmentation at the node and cluster level prevents all machines from being fully utilized; churn induces empty capacity; and so forth.
Unutilized computing resources that can be used at least temporarily at a computing platform may be referred to as transient resources. Running latency-insensitive jobs en masse on transient resources could be key to increasing resource utilization. However, traditional distributed data processing systems such as Hadoop or Spark (Apache Spark™ is an open source cluster computing framework) are designed to run on dedicated hardware, and they perform poorly on transient resources because of the excessive cost of the cascading recomputations typically required after transient resources fail or become unavailable.