Computing environments may use a cluster of remote resource nodes to execute jobs. The resource nodes are usually computer processors, or full computing elements. Clusters may be deployed to improve speed and/or reliability over a single computer.
Grid computing is a form of cluster computing which uses the resources of many separate computers connected by a network. Grid computing allows the execution of distributed algorithms on a cluster of heterogeneous machines. These machines usually have very different computing power and resources (memory, networking, etc.). This is both a challenge and an advantage to scheduling algorithms.
Job schedulers provide a means to send jobs for batch execution on remote computers. In this context, a job is defined as a set of processes which should run on a single computer or on multiple computers in parallel. One of the main challenges of job schedulers is resource allocation, that is, the ability to match jobs with the exact resources needed to run them. Resource requirements can range from hardware requirements (e.g., the number of cluster nodes or the required memory of each node) to prerequisite software packages.
In general, all known approaches to this problem are either static or dynamic. In the former, there is an a priori mapping between resources and jobs. The latter requires that a job be submitted together with a specification of the resources (types, amounts) it needs for successful execution, e.g., number of machines, amount of memory, predefined software packages. Once a job is submitted, it is added to a job queue. The scheduler fetches jobs from the queue according to a predefined policy (e.g., FIFO, shortest-job-first). Then, it selects resources for the job. This phase is called “resource allocation”. If the required resources are found, the job is launched for execution on the selected computer. Otherwise, the job scheduler may try another job from the queue.
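The dynamic approach described above can be illustrated with a minimal sketch. It assumes a FIFO policy and a single resource type (memory); the names Machine, Job, allocate and schedule_pass are hypothetical and are not taken from any particular scheduler.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Machine:
    name: str
    memory_gb: int
    busy: bool = False


@dataclass
class Job:
    name: str
    requested_memory_gb: int  # resource specification submitted with the job


def allocate(job: Job, machines: list) -> Optional[Machine]:
    """Resource allocation: return the first idle machine that satisfies
    the job's requested memory, or None if no such machine exists."""
    for machine in machines:
        if not machine.busy and machine.memory_gb >= job.requested_memory_gb:
            return machine
    return None


def schedule_pass(queue: deque, machines: list) -> list:
    """Make one pass over the queue in FIFO order. Jobs whose resources are
    found are launched; the others stay queued and the next job is tried."""
    launched = []
    waiting = deque()
    while queue:
        job = queue.popleft()
        machine = allocate(job, machines)
        if machine is None:
            waiting.append(job)  # resources not found, try another job
        else:
            machine.busy = True  # stands in for launching the job
            launched.append((job.name, machine.name))
    queue.extend(waiting)
    return launched


machines = [Machine("A", 4), Machine("B", 16)]
queue = deque([Job("big", 32), Job("small", 2)])
launched = schedule_pass(queue, machines)
remaining = [job.name for job in queue]
```

In this sketch the job "big" cannot be matched with any machine, so the scheduler moves on and launches "small" while "big" remains in the queue.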
Resource allocation becomes even more challenging for grid infrastructure, where the clusters are composed of heterogeneous resources which may join and leave these clusters dynamically. Naturally, static approaches do not fit these grid infrastructures, and dynamic approaches are not well optimized for heterogeneous platforms. The main problem stems from the fact that users often overestimate the job requirements. As a result of this overestimation, jobs may occupy extra resources while blocking other jobs which could otherwise use these resources.
Consider the following scenario: Assume two machines, M1 and M2, and two jobs, J1 and J2. Assume M1 has a larger memory size than M2. Initially, J1 can run on either M1 or M2. However, the resource allocator matches it with machine M1, since the user requested a memory size that is larger than that of M2 but available on M1. Later, J2 arrives. Due to its memory size request, the only machine it can use is M1. Now J2 is blocked until J1 completes or a new node with at least the same memory size as M1 is added to the cluster.
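The scenario can be reproduced with a small sketch. The memory sizes (16 and 8 units), the requested amounts, and the policy of choosing the smallest free machine that fits are illustrative assumptions, as is the hypothetical helper place; the scenario itself does not specify them.

```python
def place(requests, capacities):
    """Assign jobs in arrival order to the smallest free machine whose
    memory covers the requested amount (an assumed allocation policy).

    requests:   list of (job_name, requested_memory)
    capacities: dict mapping machine name to memory size
    Returns a dict mapping job name to machine name, or None if blocked.
    """
    free = dict(capacities)
    placement = {}
    for job, requested in requests:
        candidates = [m for m, cap in free.items() if cap >= requested]
        if candidates:
            chosen = min(candidates, key=lambda m: free[m])
            del free[chosen]  # the machine is occupied until the job completes
            placement[job] = chosen
        else:
            placement[job] = None  # blocked: no free machine is large enough
    return placement


capacities = {"M1": 16, "M2": 8}

# J1 would fit in 4 units, but the user requests 12, which only M1 can provide.
overestimated = place([("J1", 12), ("J2", 12)], capacities)

# With an accurate request, J1 is placed on M2 and M1 stays free for J2.
accurate = place([("J1", 4), ("J2", 12)], capacities)

print(overestimated)  # {'J1': 'M1', 'J2': None}
print(accurate)       # {'J1': 'M2', 'J2': 'M1'}
```

Under these assumptions, the overestimated request is the only difference between the two runs, and it alone causes J2 to be blocked.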