A frequently employed strategy for solving a complex problem is to break it into a number of smaller, more manageable problems and then to coordinate the results from all of the smaller problems. This strategy can be applied recursively. In the field of information processing systems, complex problems are managed using several well-known techniques such as grid computing, gang scheduling, scavenger scheduling, and supercomputing.
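The recursive form of this strategy can be illustrated with a minimal sketch. The problem chosen here (summing a list of numbers), the function name `solve`, and the `threshold` parameter are assumptions made purely for illustration and do not come from any of the systems discussed.

```python
from typing import List


def solve(problem: List[int], threshold: int = 4) -> int:
    """Sum a list by recursive decomposition (illustrative only)."""
    # Base case: the problem is small enough to solve directly.
    if len(problem) <= threshold:
        return sum(problem)
    # Break the problem into two smaller problems.
    mid = len(problem) // 2
    left = solve(problem[:mid], threshold)
    right = solve(problem[mid:], threshold)
    # Coordinate the results of the smaller problems.
    return left + right
```

In a distributed setting, the two recursive calls would correspond to sub-tasks dispatched to different processors, and the final addition to the coordination step.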
The term “grid computing” emerged in the early 1990s, using the metaphor of electric power grids to describe its essence. Just as electric power grids enable easy access to electric power for all, so do computing grids provide easy access to computing resources. A “task” or “job” is described to a grid computing system, and the grid computing system then provides the power to carry out that task. Tasks range in complexity from a very simple single thread or process, i.e., a sub-task, to a very large and complex collection of threads and processes.
The term “gang scheduling” refers to dispatching multiple sub-tasks, for example the threads and processes that constitute a given job, in parallel, with the expectation that each individual sub-task has a consumer-producer relationship with one or more of the other sub-tasks. By scheduling multiple sub-tasks together, the wait time for interaction between interdependent sub-tasks can be reduced. On the other hand, many sub-tasks may be dispatched that are never employed, and sub-tasks are never shared between jobs. One example of gang scheduling is the LLNL Gang Scheduler, in which jobs are classified according to priority and dispatched accordingly. Under this scheme, jobs compete with each other for the available computing resources.
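The idea of dispatching all sub-tasks of a job together, with jobs competing by priority, can be sketched as follows. This is not the algorithm of the LLNL Gang Scheduler; the names `GangJob` and `dispatch_gangs`, the convention that a larger number means a higher priority, and the all-or-nothing slot model are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GangJob:
    name: str
    priority: int   # assumption: larger value means more urgent
    subtasks: int   # number of sub-tasks that must run together


def dispatch_gangs(jobs: List[GangJob], free_slots: int) -> List[str]:
    """Dispatch whole gangs in priority order; a job runs only if
    every one of its sub-tasks can be placed at the same time."""
    dispatched = []
    for job in sorted(jobs, key=lambda j: -j.priority):
        if job.subtasks <= free_slots:
            dispatched.append(job.name)
            free_slots -= job.subtasks
    return dispatched
```

Because a gang is placed as a unit, a low-priority job with many sub-tasks can be left waiting even when some slots remain free, which illustrates the competition for resources described above.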
The term “scavenger scheduling” describes the harvesting of unused central processing unit (CPU) cycles, as performed by systems such as Condor and LoadLeveler. Typically, an interactive user's system is idle during various periods of the day, for example during the lunch hour and late at night. The interactive system is therefore available on loan for other tasks during these idle periods. Scavenger systems “collect” both jobs and idle systems and run a matching algorithm to dispatch jobs to idle systems. However, scavenger systems often lose and gain processors unpredictably. In addition, a significant amount of time can be wasted just preparing to run a job on an idle processor; the setup and dismantle time can exceed the time actually spent running the job.
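The matching step and the overhead problem can be made concrete with a small sketch. It does not reproduce the matching logic of Condor or LoadLeveler; the classes `Job` and `IdleSystem`, the first-come pairing in `match`, and the timing fields are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Job:
    name: str
    run_seconds: float        # assumed estimate of actual running time


@dataclass
class IdleSystem:
    host: str
    setup_seconds: float      # time to prepare the system for a job
    teardown_seconds: float   # time to dismantle after the job


def match(jobs: List[Job], systems: List[IdleSystem]) -> List[Tuple[Job, IdleSystem]]:
    """Pair queued jobs with idle systems in arrival order (hypothetical policy)."""
    return list(zip(jobs, systems))


def overhead_dominates(job: Job, system: IdleSystem) -> bool:
    """True when setup plus dismantle time exceeds the job's running time."""
    return system.setup_seconds + system.teardown_seconds > job.run_seconds
```

For a job that runs for 30 seconds on a system needing 40 seconds of setup and 10 seconds of dismantling, `overhead_dominates` returns `True`, which is the wasteful case described above.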
The term “supercomputing”, associated with machines such as the Cray-2 and Blue Gene, has evolved over the years. In the 1970s, the term referred to fast scalar processors, and in the 1980s, it referred to vector processors. By the 1990s, the term had evolved to include parallel processors, where “basic” processors could be replicated on the order of thousands of times to form a physical supercomputer. Presently, computer clusters, which are usually homogeneous but sometimes heterogeneous, use high-speed interconnections to form a virtual supercomputer. Each supercomputing platform provides the foundation for solving complex problems; however, the determination of how to employ the assembled processing power remains with the operator of the computing system.
These approaches to solving complex problems in a computing environment lack desirable functionality. Each of these paradigms attempts to solve a particular aspect of managing a large-scale fault-tolerant computing environment by essentially employing a one-size-fits-all approach to scheduling and dispatching. None of the prior approaches adequately manages large-scale distributed jobs in the presence of user-defined policies, e.g., security, priority, and consumption policies. Policies are used to define the ways in which system components can interact with each other. In general, the policies specify event-condition-action tuples but do not give a clear and reliable picture of all possible runtime flows through a system. In addition, prior solutions to complex problem management were not adept at understanding, observing, and coordinating job behavior or at handling catastrophes such as task component failure, node failure, network faults, lost messages, and duplicate messages.
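A policy expressed as an event-condition-action tuple can be sketched as shown below. The `Policy` class, the `handle` function, and the event and context names used with them are hypothetical and are not taken from any particular policy system.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Policy:
    event: str                                   # event that triggers the rule
    condition: Callable[[Dict[str, Any]], bool]  # guard evaluated on the context
    action: Callable[[Dict[str, Any]], str]      # action taken when the guard holds


def handle(event: str, context: Dict[str, Any], policies: List[Policy]) -> List[str]:
    """Run the action of every policy whose event matches and whose condition holds."""
    return [
        p.action(context)
        for p in policies
        if p.event == event and p.condition(context)
    ]
```

Each tuple is evaluated in isolation against the current event and context. Nothing in such a rule set states which sequences of events and actions can occur over the life of a job, which is why tuples of this kind do not by themselves give a picture of all possible runtime flows.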
Furthermore, managing the state of a large job in an undisciplined fashion is prone to faults that require human intervention, an unacceptable circumstance for jobs that dispatch sub-tasks to hundreds, thousands, or even tens of thousands of nodes. Even more complexity is introduced when sub-task sharing between jobs is enabled.