In more detail, a distributed computing architecture (see FIG. 1) consists of software agents installed on a number of worker clients (5) and one or more dedicated distributed computing management servers (2). There may also be requesting worker clients with software that allows them to submit jobs along with lists of their required resources.
An agent running on a processing worker client detects when the system is idle, notifies the management server (2) that the system is available for processing, and requests an application package from the server. It runs the software when it has spare CPU cycles and sends the results back to the server.
The distributed computing management servers (2) have several roles. They take distributed computing requests (1), divide their large processing tasks into smaller units of work (jobs; 17) that can run on individual systems, send application packages and some client management software to the idle worker clients that request them (15; 16), monitor the status of the job being run by each worker client, and assemble the results sent back by the clients (18).
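The dividing and assembling roles can be illustrated by a minimal sketch. The function names, the fixed job size, and the sum-of-squares workload are assumptions made for illustration only; they are not taken from the described system, in which each job would run on a separate worker client rather than in one process.

```python
from typing import List, Sequence


def split_into_jobs(items: Sequence[int], job_size: int) -> List[List[int]]:
    """Divide one large task into smaller units of work (jobs)."""
    if job_size <= 0:
        raise ValueError("job_size must be positive")
    return [list(items[i:i + job_size]) for i in range(0, len(items), job_size)]


def run_job(job: List[int]) -> int:
    """Hypothetical stand-in for the work done by one worker client."""
    return sum(x * x for x in job)


def assemble(partials: List[int]) -> int:
    """Combine the per-job results sent back by the worker clients."""
    return sum(partials)


task = list(range(10))
jobs = split_into_jobs(task, job_size=4)
results = [run_job(job) for job in jobs]  # here sequential; in practice one client per job
total = assemble(results)
```

The sketch only works because the illustrative workload can be computed per job and combined afterwards, which is one sense in which an application must be suitable for distributed computing.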
If the management server does not hear from a processing worker client for a certain period of time, for example because the user has disconnected the system or is using it heavily for long periods, it may send the same application to another idle system. Alternatively, it may have sent out the package to several systems at once, assuming that one or more sets of results will be returned. The server also manages any security, policy, or other management functions as necessary.
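The reassignment of a job after a period of silence might be sketched as follows. The `Assignment` record, the `silence_limit` parameter, and the client names are hypothetical and serve only to make the bookkeeping concrete.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Assignment:
    job_id: int
    client: str
    last_heard: float  # time of the last message from the client, in seconds


def reassign_silent_jobs(assignments: Dict[int, Assignment],
                         idle_clients: List[str],
                         now: float,
                         silence_limit: float) -> List[int]:
    """Move jobs of silent clients to idle clients; return the moved job ids."""
    moved = []
    for job_id, a in sorted(assignments.items()):
        # A job is moved only if its client has been silent too long
        # and an idle client is still available to take it over.
        if now - a.last_heard > silence_limit and idle_clients:
            a.client = idle_clients.pop(0)
            a.last_heard = now
            moved.append(job_id)
    return moved


assignments = {
    1: Assignment(1, "client-a", last_heard=0.0),
    2: Assignment(2, "client-b", last_heard=90.0),
}
moved = reassign_silent_jobs(assignments, ["client-c"], now=100.0, silence_limit=60.0)
```

In this example only job 1 is moved, since its client has been silent for 100 seconds while the client of job 2 reported 10 seconds ago.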
The complexity of a distributed computing architecture increases with its size and with the type of environment. A larger environment that includes multiple departments, partners, or participants across the Web requires complex resource identification, policy management, authentication, encryption, etc.
Obviously, the application itself must be suitable for distributed computing.
Prior Art
In a distributed computing environment with many worker clients, there is the problem of assuring the completion of a job assigned to a specific worker client if that worker client fails, e.g. due to a loss of its network connection or its over-utilization. The present approach to solving that problem is to assign the job to another worker client (failover system) and to restart the job on that new worker client from the beginning. An essential disadvantage is that the computation already done by the failed worker client is lost, at least back to the last checkpoint if checkpointing is implemented.
The term checkpointing as used in the present patent application means a designated point in a program where processing is interrupted and all status information is recorded in order to restart the process at that point, thus avoiding the necessity of repeating the processing from the beginning.
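A minimal sketch of checkpointing in this sense is given below. The JSON state file, the function names, and the `stop_after` parameter used to simulate an interruption are all assumptions made for illustration; a real application would record whatever status information it needs to resume.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Record the status information; write to a temp file, then rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last recorded state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0, "partial_sum": 0}


def run(data, path, checkpoint_every, stop_after=None):
    """Sum data, checkpointing periodically; stop_after simulates a failure."""
    state = load_checkpoint(path)
    steps = 0
    for i in range(state["next_index"], len(data)):
        if stop_after is not None and steps == stop_after:
            return None  # simulated interruption; work since last checkpoint is lost
        state["partial_sum"] += data[i]
        state["next_index"] = i + 1
        steps += 1
        if state["next_index"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["partial_sum"]


data = list(range(1, 11))
path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first = run(data, path, checkpoint_every=3, stop_after=7)  # interrupted run
second = run(data, path, checkpoint_every=3)               # restart from checkpoint
```

In this example the first run is interrupted after seven elements, but the last checkpoint covers only six, so the restarted run repeats one element before completing the sum.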
Furthermore, that approach requires detecting the failure of the worker client either by a so-called heartbeat (very resource intensive and difficult to implement in a distributed computing infrastructure) or by a timeout set to the estimated completion time plus an additional safety margin. However, the latter implies that the distributed computing management server restarts the computation only at a point in time when the computation should already have been completed. The result is a large delay in finishing the computation.
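The size of that delay can be illustrated with a small calculation. The function name and the assumption that the replacement worker client needs the same estimated time as the failed one are illustrative only.

```python
def completion_time_with_timeout(estimated: float, margin: float) -> float:
    """Time until results arrive when the first worker client fails silently
    and the job is restarted from the beginning once the timeout expires."""
    timeout = estimated + margin  # failure is noticed only at this point
    return timeout + estimated    # assumed: the restart takes the estimated time


# Example: a job estimated at 60 minutes with a 15 minute safety margin.
with_failure = completion_time_with_timeout(estimated=60.0, margin=15.0)
without_failure = 60.0
delay = with_failure - without_failure
```

Under these assumptions the results arrive after 135 minutes instead of 60, regardless of how early the first worker client actually failed.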
If checkpointing is implemented, it can be implemented at either of two layers.
Checkpointing on the worker client protects against application software failures. The worker client can automatically restart the computation of the assigned workload.
Checkpointing on the central distributed management server protects against all failures in the distributed computing infrastructure. However, it is very expensive in terms of resource consumption: every worker client needs to stay in contact with the central distributed management server, which requires reliable network connections and a large amount of computing power on the management server.