Distributed or grid computing provides the ability to share and allocate processing requests and resources among various nodes, computers or server farm(s) within a grid. A server farm is generally a group of networked servers or, alternatively, a networked multi-processor computing environment, in which work is distributed between multiple processors. Workload is distributed between individual components or processors of servers. Networked servers of a grid can be geographically dispersed. Grid computing can be confined to a network of computer workstations within a company or it can be a public collaboration.
Resources that are distributed throughout the grid include various objects. An object is a self-contained module of data and associated processing that resides in a process space. There can be one object per process or tens of thousands of objects per process.
A server farm environment can include different classes of resources, machine types and architectures, operating systems, storage and hardware. Server farms are typically coupled with a layer of load-balancing or distributed resource management (DRM) software to perform numerous tasks, such as managing and tracking processing demand, selecting machines on which to run a given task or process, and scheduling tasks for execution.
In any computing or processing system, however, certain usage patterns can produce resource conflicts that create a “deadlock” situation. Referring to FIG. 9, a classic deadlock situation 900 exists when a first processing entity 910 is waiting for a first resource 920 that is presently locked by a second processing entity 912. The second entity 912 cannot release the first resource 920 until the second entity 912 has completed its processing. The second entity 912 cannot complete its processing because the second entity 912 itself is waiting for a second resource 922 to be freed before the second entity 912 can continue its processing. The second resource 922 is locked by the first processing entity 910, which cannot release the second resource 922 until the first processing entity 910 has processed the first resource 920. As a result, a deadlock 900 exists: neither processing entity can proceed because the resource needed for processing by each processing entity is held by the other processing entity.
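By way of illustration only, the circular wait of FIG. 9 can be modeled as a cycle in a "wait-for" relationship between processing entities. The following is a minimal sketch, not a description of any embodiment. It assumes that each entity waits on at most one resource and that each resource has at most one holder. The function name `find_wait_cycle` and the identifiers `entity_910`, `entity_912`, `resource_920` and `resource_922` are hypothetical labels chosen to mirror the reference numerals above.

```python
from typing import Dict, List, Optional


def find_wait_cycle(
    holders: Dict[str, str],
    waiting_for: Dict[str, str],
) -> Optional[List[str]]:
    """Return the entities forming a circular wait, or None if there is none.

    holders:     resource -> entity currently locking it (assumed single holder)
    waiting_for: entity   -> resource it is blocked on (assumed at most one)
    """
    for start in waiting_for:
        path: List[str] = []
        seen = set()
        entity: Optional[str] = start
        # Follow the chain: entity -> resource it waits on -> holder of that resource.
        while entity is not None and entity not in seen:
            seen.add(entity)
            path.append(entity)
            resource = waiting_for.get(entity)
            entity = holders.get(resource) if resource is not None else None
        if entity is not None:
            # The chain returned to an entity already visited: a circular wait.
            return path[path.index(entity):]
    return None


# The situation of FIG. 9, expressed with hypothetical labels.
holders = {"resource_920": "entity_912", "resource_922": "entity_910"}
waiting_for = {"entity_910": "resource_920", "entity_912": "resource_922"}
print(find_wait_cycle(holders, waiting_for))  # ['entity_910', 'entity_912']
```

If the first processing entity 910 did not hold the second resource 922, the chain would end at the second entity 912 and the function would return `None`, indicating that no deadlock exists under these assumptions.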
Deadlocks can become much more complex in real-world workflow processing, where a large number of processing entities (e.g., processes, threads, users, nodes, etc.) can be involved in a deadlock situation. With a workflow, when a job request is issued (e.g., using a Job Request Language (JRL)), the job request may include dependencies, aggregation, conditional dependencies, and retries, which can cause or complicate deadlock situations. Such deadlocks can, in turn, cause unacceptable levels of delay and processing inefficiency.
In a normal processing system, it can be difficult to identify and address deadlocks. Identification of deadlocks can become even more difficult when deadlocks occur in distributed computing networks, such as grid-based processing systems, that may include multiple processing entities in different grids in a job-controlled networked environment.
Accordingly, there exists a need for a method and system for detecting and addressing deadlocks in a grid-based computing system. Embodiments fulfill this need.