Distributed or grid computing provides the ability to share and allocate processing requests and resources among various nodes, computers or server farms within a grid. A server farm is generally a group of networked servers or, alternatively, a networked multi-processor computing environment, in which work is distributed among multiple processors.
Workload is distributed among individual components or processors of servers. Work requests, or workflows, having one or more executable commands are submitted for execution on a grid. A workflow includes both “jobs” and “tasks.” A “work” is a superclass of jobs and tasks. A job is, for example, a collection of work and can contain any number of jobs or tasks. These workflows execute commands using one or more resources on the grid. A resource is generally something that is consumed during execution of a workflow or job, such as a machine on the grid. Resources that are distributed throughout the grid include various objects. An object is a self-contained module of data and associated processing that resides in a process space. There can be one or multiple objects per process. These objects can be distributed throughout various portions of the grid, e.g., in various geographic locations.
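By way of illustration only, the relationship among work, jobs and tasks described above can be sketched as a simple composite structure. The class names, the `commands` method and the assumption that a task carries a single command string are hypothetical and are chosen for this sketch; they are not taken from any particular grid system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Work:
    """Hypothetical superclass of jobs and tasks."""
    name: str

    def commands(self) -> List[str]:
        raise NotImplementedError


@dataclass
class Task(Work):
    """Leaf unit of work; assumed here to hold one executable command."""
    command: str = ""

    def commands(self) -> List[str]:
        return [self.command]


@dataclass
class Job(Work):
    """A collection of work: contains any number of jobs or tasks."""
    children: List[Work] = field(default_factory=list)

    def commands(self) -> List[str]:
        # Flatten nested jobs depth-first, preserving insertion order.
        out: List[str] = []
        for child in self.children:
            out.extend(child.commands())
        return out


# Example workflow: a job holding one task and one nested job.
nightly = Job("nightly", [
    Task("compile", "make all"),
    Job("tests", [Task("unit", "make test")]),
])
print(nightly.commands())
```

In this sketch a job may nest other jobs to any depth, which mirrors the statement that a job contains any number of jobs or tasks.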
Objects can reside on various servers or server farms. A server farm environment can include different classes of resources, machine types and architectures, operating systems, storage and hardware. Server farms are typically coupled with a layer of load-balancing or distributed resource management (DRM) software to perform numerous tasks, such as managing and tracking processing demand, selecting machines on which to run a given task or process, and scheduling tasks for execution.
An important aspect of managing a computing system, particularly a distributed or grid-based computing system, is the task of managing failures in a workflow. Managing workflow on a grid, however, can be difficult since a grid is less stable than, for example, a server farm. For example, certain machines on a grid may not be configured to be as stable as others. Further, users may attempt to use many or all available machines on the grid, which, in turn, can cause a job within a workflow to fail.
Accordingly, there exists a need for a method and system for managing workflow failures within a grid. Further, there exists a need for a method and system for retrying a workflow or elements thereof upon a workflow failure. Embodiments of the invention fulfill these needs.