In distributed computing, different computers within a network share one or more resources. Grid computing is a form of distributed computing. In a grid computing system, grid resources are shared, turning a loosely coupled computer network into a “super virtual computer.” A grid computing system (also referred to herein as simply the “grid”) can be as simple as a collection of similar computers running the same operating system or as complex as inter-networked systems composed of every computer platform one can think of. With a proper user interface, accessing a grid computing system looks no different from accessing a local machine's resources. Every authorized computer has access to the grid's pooled processing power and storage capacity. Thus, grid computing systems work on a principle of pooled resources.
In high performance computing (“HPC”), “preemptive scheduling” refers to a process in which a pending high-priority workload takes resources away from a currently running workload of lower priority. The program managing workload distribution designates the relative priorities of the scheduled workloads. A workload (also interchangeably referred to herein as a “job”) refers to a set of tasks and/or processes to be performed to accomplish a desired end result and/or create an output.
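By way of illustration only, the selection of workloads to preempt might be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any particular grid management tool: the `Workload` record, the `select_victims` function, the notion of resource "slots," and the convention that a larger number means a higher priority are all hypothetical and do not appear in the description above.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    priority: int  # assumed convention: larger number = higher priority
    slots: int     # hypothetical unit of resources the workload needs or holds


def select_victims(running, pending, free_slots):
    """Pick running workloads to preempt so that `pending` can start.

    Returns [] if `pending` already fits in the free slots, a list of
    lower-priority workloads to preempt otherwise, or None if preempting
    every lower-priority workload would still not free enough slots.
    """
    needed = pending.slots - free_slots
    if needed <= 0:
        return []
    # Only strictly lower-priority workloads are eligible; take the
    # lowest-priority ones first.
    candidates = sorted(
        (w for w in running if w.priority < pending.priority),
        key=lambda w: w.priority,
    )
    victims = []
    for w in candidates:
        victims.append(w)
        needed -= w.slots
        if needed <= 0:
            return victims
    return None
```

In this sketch a workload of equal priority is never preempted, which is one possible policy among several.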
Referring to FIGS. 4-6, current grid management tools release preempted workload resources in one of three ways: by suspending the workload, by saving its state and moving it (also referred to as “check-pointing it”), or by killing and rescheduling it. Suspending a workload means that the system will pause it until the very same resources it was using are again available. Check-pointing a workload means that the system will save its state in external storage, terminate the process, and then restart it from the last saved point after finding new resources on which to run it. Killing a workload means that the system will terminate the process, and reschedule it to run from the beginning.
FIG. 4 illustrates a state transition diagram of a low-priority workload currently running (state 401) on a grid, which is preempted by killing it. When the action taken to preempt is killing it (action 402), the workload is terminated and returned to the pending queue (state 403) to be rescheduled, losing any work it had already performed. When it is resumed (action 404), it has to start from the beginning before it is once again a running job (state 405).
FIG. 5 illustrates a state transition diagram of a low-priority workload currently running (state 501) on a grid, which is preempted by suspending it. Suspending the workload (state 503) improves on the situation in FIG. 4 because the job is paused and retains all the work it had done up to the point at which it was paused (action 502). However, the penalty for suspension is that the job can only be resumed on the same resources/hosts that it was previously running on (because this is where its state was saved). This means that the paused low-priority workload must wait for the higher-priority job that interrupted it to end and release its resources (action 504) before resuming (state 505), whereas if the low-priority workload had been killed as in FIG. 4, it would immediately be free to restart on any resource that becomes available in the grid.
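The trade-off between killing and suspending can be illustrated with a small state model. This is a hedged sketch only: the `Job` class, its method names, the `State` enumeration, and the host names are hypothetical and are not taken from the figures or from any grid management tool. It assumes that a killed job loses its progress but may restart on any host, while a suspended job keeps its progress but may resume only on the host it was running on.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUSPENDED = "suspended"


class Job:
    def __init__(self, name, total_steps):
        self.name = name
        self.total_steps = total_steps
        self.done = 0          # units of work completed so far
        self.state = State.PENDING
        self.host = None

    def start(self, host):
        if self.state is not State.PENDING:
            raise RuntimeError("only a pending job can be started")
        self.host = host
        self.state = State.RUNNING

    def work(self, steps):
        if self.state is not State.RUNNING:
            raise RuntimeError("job is not running")
        self.done = min(self.total_steps, self.done + steps)

    def kill(self):
        # Preempt by killing: progress is lost, but any host may be used next.
        self.done = 0
        self.host = None
        self.state = State.PENDING

    def suspend(self):
        # Preempt by suspending: progress is kept, the job stays tied to its host.
        if self.state is not State.RUNNING:
            raise RuntimeError("only a running job can be suspended")
        self.state = State.SUSPENDED

    def resume(self, host):
        if self.state is not State.SUSPENDED:
            raise RuntimeError("only a suspended job can be resumed")
        if host != self.host:
            raise ValueError("a suspended job can resume only on its original host")
        self.state = State.RUNNING
```

For example, a job started on a hypothetical host `"hostA"`, advanced by four steps, and then killed can be started again on `"hostB"` with `done == 0`, whereas the same job suspended instead raises `ValueError` on `resume("hostB")` and keeps `done == 4` on `resume("hostA")`.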
FIG. 6 illustrates a state transition diagram of a low-priority workload that is periodically check-pointed, then preempted, then resumed from a saved check-point. Check-pointing may be considered the best of both worlds because it externally saves the preempted workload's state so that it can be resumed anywhere in the grid. As the workload is running (state 601), its state is periodically or dynamically saved (action 612) to an external location 610. When the workload is killed (action 602), it is returned to the pending queue (state 603) to be rescheduled in the grid. Once it is rescheduled (action 604), the preempted workload can then be restarted on the new grid resources using its most recent state (action 614), which had been saved in the external storage 610.
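The check-point and restart cycle described above can be sketched in a few lines. The sketch is illustrative only and makes several assumptions that are not in the description: a JSON file stands in for the external storage, the workload is a simple running sum, and the function names `save_checkpoint`, `load_checkpoint`, and `run_workload`, as well as the `preempt_at` parameter used to simulate a kill, are hypothetical.

```python
import json
import os


def save_checkpoint(path, state):
    # Write to a temporary file first, then replace the old checkpoint,
    # so that a kill during the write does not corrupt the saved state.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # Return the last saved state, or None if no checkpoint exists yet.
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None


def run_workload(path, total_steps, checkpoint_every, preempt_at=None):
    """Sum the integers 0..total_steps-1, check-pointing periodically.

    Returns the result on completion, or None if the run is "killed"
    at step `preempt_at` (a stand-in for preemption in this sketch).
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            return None  # work done since the last checkpoint is lost
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run_workload` a second time with the same `path` picks up from the most recent saved state rather than from the beginning; only the steps performed after the last check-point are repeated.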
Notice, however, that in each case illustrated in FIGS. 4-6, the workload goes from a state in which it is running (e.g., states 401, 501, 601), to one in which it is not (e.g., states 403, 503, 603) because it is interrupted by the preemption process. Often, this essentially amounts to killing the workload even if pausing was the intent: for instance, any licenses that the workload had been using might have been reclaimed in the interim, or its network connections may have timed out. For such reasons, the workload may not be able to resume or restart after it is preempted, regardless of which preemption action was taken.