In general, the present application is directed to systems and methods of scheduling applications and tasks running on a server. More specifically, the present invention is directed to task packing, a strategy to schedule tasks on a subset of available servers instead of distributing them evenly.
A common application processing architecture is to have a pool of servers (a Cluster) processing different tasks. Tasks may be distributed amongst different servers using, among other things: (i) a Central Queue, which may be used in job processing frameworks like Python's Celery or Ruby's DelayedJob, in which each server in the cluster may poll the queue to obtain tasks to run on itself; and/or (ii) a Central Scheduler, which may actively hand out tasks, for example Apache YARN, used by Big Data frameworks like MapReduce, Tez and Spark, or Kubernetes, used to schedule Containers over a server farm. Schedulers such as YARN (which may be utilized by MapReduce, Tez, and/or Spark) try by default to allocate resources to tasks based on the availability of resources on each server, as well as locality constraints specified by the task. If multiple servers satisfy these constraints, such schedulers generally allocate resources uniformly among qualified servers.
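The uniform allocation behavior described above may be illustrated with a minimal sketch. It is not YARN's actual implementation. The `Server` class, the slot-based capacity model, and the `schedule_uniform` function are hypothetical names introduced only for illustration, and "uniform" is approximated here by always choosing the qualified server with the most free slots.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical server model: capacity is counted in task slots."""
    name: str
    capacity: int
    running: list = field(default_factory=list)

    @property
    def free(self) -> int:
        return self.capacity - len(self.running)


def schedule_uniform(servers, task, locality=None):
    """Place `task` on the qualified server with the most free slots.

    A server qualifies if it has a free slot and, when a locality
    constraint is given, its name is in the `locality` set.
    """
    qualified = [
        s for s in servers
        if s.free > 0 and (locality is None or s.name in locality)
    ]
    if not qualified:
        return None  # no server can take the task right now
    target = max(qualified, key=lambda s: s.free)
    target.running.append(task)
    return target


servers = [Server("s1", 4), Server("s2", 4), Server("s3", 4)]
for t in ["t1", "t2", "t3"]:
    schedule_uniform(servers, t)

loads = [len(s.running) for s in servers]
print(loads)
```

Under these assumptions, three tasks end up on three different servers, so every server in the pool holds work even though a single server could have run all three.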
This design works well in on-premise, fixed-size cluster deployments, ensuring that a single node failure doesn't have much impact on running jobs. It also helps to utilize as many nodes as possible and avoids over-utilization of a single node. While YARN tries its best to keep track of resource utilization, it is inherently hard to account for usage of resources like network links accurately, and spreading the load around is an easy way to avoid resource hot-spots.
In a cloud deployment it is common to have an elastic cluster such as Qubole's Auto-Scaling Hadoop/Spark cluster. Users configure a minimum and maximum cluster size and the cluster automatically scales up and down according to the workload and other factors.
Critical to downscaling is finding nodes that can be removed from the cluster. Unlike HTTP requests to a web application, Big Data applications are frequently long-running. Moreover, tasks run by such applications are not stateless (unlike HTTP requests). They leave behind state on local disks that may be needed for the lifetime of the application.
For example, (i) Tasks launched by Map-Reduce may run for a long time because of data skew or because the number of tasks is small relative to the amount of data; and/or (ii) a Hive Query may run for days, and the process coordinating this query (the Hive JVM) has to run for an equivalent time.
In such a scenario, a uniform resource allocation strategy becomes a huge drawback. Incoming tasks are evenly distributed to all available and qualified nodes. Most nodes are either running active tasks or have state from previous ones that blocks Qubole's cluster management from deprovisioning the nodes and downscaling. As a result, once the cluster scales up, it's difficult to downscale—even if the current workload can be run on a much smaller number of nodes.
Such Uniform Scheduling may fit well in on-premise, fixed-size cluster deployments, such that, for example, a single server failure may not have much impact on running applications. This Uniform Scheduling may also help to utilize as many servers as possible and may avoid pressuring a single server beyond its limits. However, there are at least two situations where this default algorithm may cause issues.
First, Uniform Scheduling may prove to be a detriment when Tasks are long-running. This may become a drawback in a cloud deployment in the context of an Auto-Scaling Cluster. A server cannot be deprovisioned (or it may be undesirable to deprovision such a server) if it always has tasks running on it, and this may be highly likely because tasks are always scheduled uniformly amongst all available servers. Big Data workloads, in particular, may have many long-running tasks. For example, (i) Tasks launched by Map-Reduce may run for a long time because of data skew or because the number of tasks is small relative to the amount of data; and/or (ii) a Hive Query may run for days, and the process coordinating this query (the Hive JVM) has to run for an equivalent time.
Second, even in the case of short-running tasks, a problem may arise if the task leaves behind state (such as files) on the Server that may be required over a long interval. As an example, in Big Data processing frameworks like Map-Reduce and Spark, a task may leave ‘Shuffle’ data on local disks that may be streamed to other tasks over a potentially long interval of time. When such data is left behind, removing the affected nodes from the cluster may be impossible or not permitted.
The inability (or undesirability) to deprovision may result in higher expense, as servers may not be deprovisioned even if the current workload can be accommodated on a much smaller number of servers. Accordingly, this may increase the running cost of products such as Qubole's Auto-Scaling Big-Data clusters and its multi-tenant Hive Server tier.
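The two blocking conditions discussed above (tasks still running on a server, and state left behind by finished tasks) may be expressed as a simple eligibility check. This is a minimal sketch under stated assumptions: the `Node` class, its fields, and `can_deprovision` are hypothetical and do not reflect any particular cluster manager's API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical view of a cluster node for downscaling decisions."""
    name: str
    running_tasks: set = field(default_factory=set)
    # Identifiers of applications whose leftover state (for example,
    # shuffle files on local disk) may still be needed.
    retained_state: set = field(default_factory=set)


def can_deprovision(node: Node) -> bool:
    """A node is removable only if it is idle and holds no needed state."""
    return not node.running_tasks and not node.retained_state


nodes = [
    Node("n1", running_tasks={"t1"}),        # blocked: active task
    Node("n2", retained_state={"app-7"}),    # blocked: leftover shuffle data
    Node("n3"),                              # idle and stateless
]
removable = [n.name for n in nodes if can_deprovision(n)]
print(removable)
```

In this illustration only `n3` qualifies for removal. The node `n2` runs nothing, yet it is still blocked by the state it retains.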
Another way to characterize this behavior is through the utilization percentage of the entire cluster. For example, in the prior art this generally hovers at approximately 20-30% and is clearly not cost-effective.
Accordingly, there is a need in the art for task scheduling that is non-uniform, that may support both current and future processing workloads (such as, for example, those found in cloud-based big-data processing), and that may provide economic advantages as a result.
In response to these needs, the present invention presents a new scheduling algorithm called “Task Packing” or “Container Packing.” As discussed in greater detail below, each term may be used to describe a new resource allocation strategy that may make more nodes available for downscaling in an elastic computing environment, while avoiding hot spots in the cluster and trying to honor data locality preferences. In other words, the present invention may provide a strategy to schedule tasks on a subset of the available servers instead of distributing them evenly. This may allow some servers to be deprovisioned if the cluster is not utilized fully (or as desired), and hence may allow improved downscaling and cluster utilization. Task Packing may also take locality and other placement constraints into account where feasible. In practice, such Task Packing has been seen to result in hardware savings of more than 40%.
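One possible packing policy may be sketched as follows. This section does not specify the algorithm, so the sketch is an assumption and not the claimed method. The `Server` class, the slot-based capacity model, the `schedule_packed` function, and the rule "prefer the busiest server that still has a free slot" are all hypothetical. Locality is treated as a preference that is honored only when a preferred server has room.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical server model: capacity is counted in task slots."""
    name: str
    capacity: int
    running: list = field(default_factory=list)

    @property
    def free(self) -> int:
        return self.capacity - len(self.running)


def schedule_packed(servers, task, preferred=None):
    """Place `task` on the busiest server that still has a free slot.

    If `preferred` (a set of server names) is given and at least one
    preferred server has room, only those servers are considered.
    """
    candidates = [s for s in servers if s.free > 0]
    if not candidates:
        return None  # cluster is full; a real system might upscale here
    if preferred:
        local = [s for s in candidates if s.name in preferred]
        if local:
            candidates = local
    target = max(candidates, key=lambda s: len(s.running))
    target.running.append(task)
    return target


servers = [Server("s1", 4), Server("s2", 4), Server("s3", 4)]
for t in ["t1", "t2", "t3", "t4", "t5"]:
    schedule_packed(servers, t)

loads = [len(s.running) for s in servers]
idle = [s.name for s in servers if not s.running]
print(loads, idle)
```

Under these assumptions, five tasks fill `s1` and spill one task onto `s2`, leaving `s3` idle and therefore a candidate for deprovisioning. A uniform policy would have placed work on all three servers. The per-server capacity limit is what keeps this simple policy from overloading a single server.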