The increasing complexity of electronic tasks (e.g. executable programs such as computational tasks, command execution, and data collection) has increased the demand for resources used in accomplishing such tasks. Resources include hardware that aids in completing electronic tasks, such as servers, clients, mainframe computers, networks, network storage, databases, memory, CPU time, and scientific instruments. Resources may also include software, available network services, and other non-hardware resources.
One response to the increased demand for resources has been the development of networked computing grid systems, which operate to integrate resources from otherwise independent grid participants. Computing grid systems generally include hardware and software infrastructure configured to form a virtual organization composed of multiple resources in often geographically dispersed locations.
Grid systems have become increasingly large and complex, often comprising thousands of machines executing hundreds of thousands of electronic tasks, or “jobs,” on any given day. Managing such systems has become increasingly difficult, particularly identifying and correcting errors or “exception conditions” occurring within such systems. Further, providing appropriate levels of security for grid systems has grown more challenging as grids expand in size and complexity. Thus, manual procedures for managing grid systems are quickly becoming outdated.
Exception condition monitoring has previously been accomplished by monitoring the status of the machines or “hosts” providing the resources within the grid. More particularly, exception condition monitoring has typically involved analyzing attributes of the host such as the host's memory capacity, processing capabilities, and input/output, and evaluating whether the host is operating properly. Such exception condition monitoring can be problematic as it monitors the operation of the host instead of the status of the job being executed on the host. Thus, current exception condition monitoring techniques fail to identify exception conditions associated with jobs and job execution.
Therefore, improved systems and methods for automating functions associated with managing large-scale distributed computing grid systems are desired, in which autonomic monitoring is provided to evaluate the execution of jobs on the grid and to correct such execution when exception conditions are detected.