1. Field of the Invention
The present invention relates generally to a data processing system and in particular to a method and system for scheduling grid jobs. More particularly, the present invention is directed to a computer-implemented method, apparatus, and computer-usable program code for scheduling grid jobs using a dynamic grid scheduling policy.
2. Description of the Related Art
In the 1990s, computer scientists began exploring the design and development of a computer infrastructure, referred to as the computational grid, whose design was modeled on electrical power grids. Grid computing was initially designed for use with large-scale, resource-intensive scientific applications, such as the Search for Extraterrestrial Intelligence (SETI) program's computing grid. This type of application requires more resources than a small number of computing devices in a single administrative domain can provide. Since then, grid computing has become more prevalent as a mechanism for handling computing jobs. For example, grid computing is implemented by the World Community Grid, which is an Internet-connected grid of computers that are used to advance public interest and scientific research projects for the benefit of humanity.
Computing jobs that are executable by a grid computing system are referred to as grid jobs. For example, a grid job may perform a mathematical calculation or obtain a sampling result based on input for a particular sampling formula. Grid jobs may also perform computations as part of a larger grid project. Examples of grid projects include applications for modeling protein folding, financial modeling, earthquake simulation, and climate/weather modeling.
A computational grid enables computer resources from geographically distributed computing devices to be shared and aggregated in order to solve large-scale, resource-intensive problems. A computational grid may also be referred to as just a “grid.” To build a grid, both low-level and high-level services are needed. The grid's low-level services include security, information, directory, and resource management services. The high-level services include tools for application development, resource management, resource scheduling, and the like. Among these services, resource management and scheduling tend to be the most challenging to perform optimally.
Static schedulers manage a dedicated set of resources on a set of nodes that are defined by the configuration of a computing device on which the static scheduler is executed. A node is a computing device that contains resources, such as processing speed, memory capacity, and hard disk space. Examples of static schedulers include Load Sharing Facility from Platform Computing and Portable Batch System from Altair. Computing jobs that need to be executed are held in a job queue managed by the static scheduler. A static scheduler then uses a static scheduling policy to determine how resources in the set of nodes managed by the static scheduler are utilized.
Static scheduling policies are often defined for particular classes of applications that require particular types of resources. For example, some static scheduling policies may specialize in allocating node resources to parallel jobs that require identical platforms. Other static scheduling policies may specialize in executing jobs on nodes having resources that are logically partitioned based on factors such as processing speed, architecture, and type of operating system.
Static scheduling policies can fail to efficiently utilize resources for applications and resources that fall outside the scope of the policy. In one example, multiple parallel jobs that require multiple nodes may be executed by nodes that are managed by a static scheduling policy that specializes in scheduling parallel jobs. However, in this example, inefficiencies can occur when an insufficient number of nodes are available to process the next parallel job in a job queue. The static scheduler may then fail to utilize idle nodes to execute non-parallel jobs. For example, non-parallel jobs that could be run on the idle nodes may be queued in a job queue that is not associated with the nodes.
Also, the static scheduler associated with the idle nodes may not have the functionality to execute non-parallel jobs. As a result, resources that could be used to execute non-parallel jobs are left idle, and are therefore under-utilized. Although some static schedulers use a backfill policy to utilize idle nodes, jobs that are submitted to a static scheduler tend to belong to the same job class, so the job queue seldom contains a job that fits the idle nodes, and the idle nodes remain unused. For example, a static scheduler having the infrastructure for parallel processing may fail to receive non-parallel jobs for processing if the user or application submitting jobs to the static scheduler submits only parallel jobs.
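The idle-node problem described above can be illustrated with a minimal Python sketch. The `Job` class, the two scheduling functions, and the node counts are hypothetical names and values chosen for illustration; they are not taken from any particular static scheduler. The backfill function is deliberately simplified and ignores job run-time estimates and reservations, which practical backfill policies typically consider.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes_needed: int  # number of nodes the job must hold at the same time


def schedule_fifo(queue, free_nodes):
    """Strict first-in, first-out: stop at the first job that does not fit."""
    started = []
    while queue and queue[0].nodes_needed <= free_nodes:
        job = queue.popleft()
        free_nodes -= job.nodes_needed
        started.append(job.name)
    return started, free_nodes


def schedule_with_backfill(queue, free_nodes):
    """Simplified backfill: skip jobs that do not fit, start later jobs that do.

    Assumption: run-time estimates and reservations are ignored here.
    """
    started = []
    waiting = deque()
    while queue:
        job = queue.popleft()
        if job.nodes_needed <= free_nodes:
            free_nodes -= job.nodes_needed
            started.append(job.name)
        else:
            waiting.append(job)  # keep skipped jobs in their original order
    queue.extend(waiting)
    return started, free_nodes


if __name__ == "__main__":
    # Only parallel jobs are queued: backfill has nothing small to start.
    parallel_only = deque([Job("p1", 4), Job("p2", 8)])
    print(schedule_with_backfill(parallel_only, free_nodes=6))

    # A mixed queue lets backfill use the nodes that p2 cannot.
    mixed = deque([Job("p1", 4), Job("p2", 8), Job("s1", 1), Job("s2", 1)])
    print(schedule_with_backfill(mixed, free_nodes=6))
```

In this sketch, a queue holding only parallel jobs leaves two of six nodes idle under either policy, whereas a queue that also holds single-node jobs allows the simplified backfill to use every node. This mirrors the observation that backfill is of little help when all submitted jobs belong to the same job class.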
The above example also illustrates that static schedulers having static scheduling policies are ineffective at scheduling jobs in a heterogeneous grid computing system having multiple application and resource types. A heterogeneous grid computing system is a grid computing system in which at least two static schedulers in the grid computing system have different properties. For example, the static schedulers may have different statuses, manage different resource types, administer different policies, or specialize in different application or job types. The static schedulers in a heterogeneous grid computing system may also originate from different sources, entities, or manufacturers. In a heterogeneous grid computing system that is managed by a set of static scheduling policies, no single global policy exists that dynamically schedules jobs across the resources managed under that set of policies while still taking into account the instructions set forth by each policy in the set.
Also, each static scheduler that manages nodes in a heterogeneous environment may use a different system-specific language. Because the various languages of the multiple static schedulers are not normalized into a uniform medium, problems may arise in coordinating the policies, status information, and resource availability information of each static scheduler into a comprehensive policy for managing all of the nodes in the heterogeneous environment.
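As a rough illustration of what normalizing scheduler-specific information into a uniform medium might involve, the following Python sketch converts node status reports from two schedulers into one common record. Both input formats ("A" as a text line, "B" as a dictionary reporting memory in gigabytes), the `NodeStatus` record, and all field names are hypothetical assumptions made for this example; they do not correspond to the output of any actual scheduler product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeStatus:
    """Hypothetical uniform record shared by all schedulers."""
    scheduler: str
    node: str
    free_cpus: int
    free_memory_mb: int


def from_scheduler_a(line):
    """Parse hypothetical format A, e.g. 'node01 cpus=4 mem=2048M'."""
    node, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    return NodeStatus(
        scheduler="A",
        node=node,
        free_cpus=int(fields["cpus"]),
        free_memory_mb=int(fields["mem"].rstrip("M")),
    )


def from_scheduler_b(record):
    """Convert hypothetical format B, a dict that reports memory in gigabytes."""
    return NodeStatus(
        scheduler="B",
        node=record["host"],
        free_cpus=record["idle_procs"],
        free_memory_mb=int(record["free_gb"] * 1024),
    )


if __name__ == "__main__":
    statuses = [
        from_scheduler_a("node01 cpus=4 mem=2048M"),
        from_scheduler_b({"host": "n7", "idle_procs": 2, "free_gb": 1.5}),
    ]
    # With a common record, availability can be compared across schedulers.
    print(sum(status.free_cpus for status in statuses))
```

Once every scheduler's report is expressed in the same record, resource availability can be aggregated or compared across schedulers without reference to each scheduler's own format.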