Distributed resource management tools such as the Sun Grid Engine (“SGE”) and SLURM enable higher utilization, better workload throughput, and higher end-user productivity from existing compute resources. See Templeton, 2008, “Beginner's Guide to Sun Grid Engine 6.2,” White Paper; and Pascual et al., 2009, “Job Scheduling Strategies for Parallel Processing,” Lecture Notes in Computer Science, 5798: 138-144. ISBN 978-3-642-04632-2. doi:10.1007/978-3-642-04633-9_8. For instance, SGE transparently selects the resources that are best suited for each segment of work, and distributes the workload across a resource pool while shielding end users from the inner workings of the compute cluster. First, it allocates exclusive and/or non-exclusive access to resources (computer nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending jobs. Similarly, SLURM (i) provides exclusive and/or non-exclusive access to resources (computer nodes) to users for some duration of time so they can perform work, (ii) provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes, and (iii) arbitrates contention for resources by managing a queue of pending jobs.
Thus, central to such distributed schedulers is that users who have computational jobs to be performed, each represented by a script, submit their scripts to the distributed scheduler, such as SGE or SLURM, and the scheduler finds a computer in the network that is available to run each computational job.
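The submit-and-dispatch behavior described above can be illustrated with a minimal sketch. It is not the SGE or SLURM implementation; the class names (ToyScheduler, Node, Job), the first-in-first-out policy, and the core-count matching rule are all assumptions introduced here purely for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cores: int


@dataclass
class Job:
    script: str  # hypothetical identifier for the submitted job script
    cores: int   # cores the job needs (assumed to be the only resource)


@dataclass
class ToyScheduler:
    nodes: list
    pending: deque = field(default_factory=deque)
    running: dict = field(default_factory=dict)  # script -> (Node, Job)

    def submit(self, job):
        """Queue a job script, then try to place queued work."""
        self.pending.append(job)
        self.dispatch()

    def dispatch(self):
        """Start pending jobs in FIFO order while the head job fits on a node."""
        while self.pending:
            job = self.pending[0]
            node = next((n for n in self.nodes if n.free_cores >= job.cores), None)
            if node is None:
                break  # contention: the head of the queue waits for free cores
            node.free_cores -= job.cores
            self.running[job.script] = (node, job)
            self.pending.popleft()

    def finish(self, script):
        """Release a finished job's cores and re-run dispatch."""
        node, job = self.running.pop(script)
        node.free_cores += job.cores
        self.dispatch()


sched = ToyScheduler([Node("node-a", 4), Node("node-b", 2)])
sched.submit(Job("align.sh", 4))  # fits on node-a
sched.submit(Job("sort.sh", 2))   # fits on node-b
sched.submit(Job("call.sh", 4))   # no node has 4 free cores, so it waits
sched.finish("align.sh")          # frees node-a; call.sh is then started
```

In this sketch the node list is fixed when the scheduler is created, which mirrors the fixed-size cluster that conventional schedulers were designed around.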
A drawback with such conventional schedulers is that they were developed prior to cloud computing. One aspect of cloud computing is that the network that is available to run a computational job is dynamic. When computational resources are not required, end users do not need to pay for them. In other words, rather than being a fixed size, the available cluster of computing resources can be scaled up or down on a dynamic basis as a function of current computational need. Conventional schedulers do not satisfactorily handle this dynamic element of cloud computing. For instance, if SGE is applied to a cloud based computing network and one of the computers in the network disappears (because the network is being scaled down due to current decreased computational demand), SGE does not handle the situation satisfactorily.
With the advent of cloud computing, operations groups running distributed computing jobs expect to be able to add resources to, and remove resources from, clusters without having to restart nodes. However, such a feature is not satisfactorily supported by conventional distributed computing schedulers.
Moreover, sole reliance on cloud based solutions for distributed scheduling of computing jobs has drawbacks, particularly in instances where the distributed computational jobs require breaking a dataset into tens, hundreds, or thousands of chunks that are each processed on independent CPU cores using algorithms that take the independent CPU cores minutes, tens of minutes, or hours to complete. For instance, some cloud based solutions, such as AWS Batch, spin up an entire virtual node for each such chunk. See the Internet, at aws.amazon.com/blogs/aws/aws-batch-run-batch-computing-jobs-on-aws. This results in a two- to five-minute overhead per submitted job, and thus substantially reduces the efficiency of short jobs. It also reduces the efficiency of jobs that do not perfectly fit the memory or processor availability of the computer they are run on. Another cloud based solution is AMAZON WEB SERVICES' (AWS) EC2 Spot Instances. See the Internet, at aws.amazon.com/ec2/spot/. AWS EC2 Spot Instances is a real-time (second price) auction where customers (or software running on behalf of customers) submit electronic bids for computers. The bid remains active, and the customer has access to the computer and is charged for it, until the customer gives up the computer or someone else offers a higher bid. As with on-demand instances provided by AWS, the customer can select a pre-configured or custom Amazon Machine Image (AMI), configure security and network access to their Spot instance, choose from multiple instance types and locations, use static IP endpoints, and attach persistent block storage to their Spot instances. Similarly, the customer can pay for each instance by the hour with no up-front commitments. Other cloud based solutions, such as AWS Lambda, are designed to work with small computing projects. See the Internet, at aws.amazon.com/lambda/. AWS Lambda is not optimized for larger jobs that run for longer, such as a pipeline that requires 30 CPU cores for several hours.
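The effect of the two- to five-minute per-job overhead noted above on short jobs can be estimated with simple arithmetic. The following is a minimal sketch that assumes the overhead is paid once per submitted job and simply adds to the job's run time; the function name useful_fraction and the example run times are illustrative assumptions, not measurements.

```python
def useful_fraction(job_minutes, overhead_minutes):
    """Fraction of wall-clock time spent running the job itself,
    assuming a fixed start-up overhead is added to every job."""
    return job_minutes / (job_minutes + overhead_minutes)


# Illustrative run times (minutes) against the two- and five-minute overheads.
for job_minutes in (1, 10, 120):
    for overhead_minutes in (2, 5):
        fraction = useful_fraction(job_minutes, overhead_minutes)
        print(f"{job_minutes:>4} min job, {overhead_minutes} min overhead: "
              f"{fraction:.1%} useful")
```

Under these assumptions, a one-minute chunk spends at most one third of its wall-clock time on useful work, whereas a two-hour chunk is barely affected, which is why the overhead weighs most heavily on datasets split into many short chunks.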
Additionally, such cloud based solutions have the drawback of supporting only some programming languages, such as Node.js, Java, Ruby, C#, Go, Python, or PHP, while offering unsatisfactory support for, offering no support for, or outright prohibiting other programming languages. If cloud based solutions did not time out, provided ample memory support for each chunk, did not spin up a complete virtual node for each chunk, imposed no restrictions on which programming languages can be used, and did all this in a cost effective manner, then distributed scheduling solutions might not be necessary. However, in practice, cloud based solutions do have the above-identified drawbacks. Accordingly, improved distributed scheduling, even in the context of cloud computing resources, is necessary in order to ensure that each job has the proper resources and is run as economically as practically possible.
Given these circumstances, what is needed in the art are improved distributed scheduling tools that can handle the dynamic environment of cloud based computing, where resources in the computing network emerge and disappear on a dynamic basis.