Grid computing infrastructures have been developed over the past several years to enable fast execution of computations using collections of distributed computers. Among the most important remaining challenges is the efficient execution of large-scale, sophisticated computations on unreliable computers. This challenge arises, for example, in the Sloan Digital Sky Survey computations, which have sophisticated task dependencies. When a computer fails to correctly execute an assigned task, the execution may be delayed, because dependent tasks cannot start until the assigned task has been executed successfully. It is plausible that task dependencies and worker reliabilities play a significant role in how quickly a computation can be executed. Therefore, one would like to determine the relationships among these factors and to develop algorithms for quick execution of complex computations on unreliable computers.
A similar problem arises when managing projects such as production planning or software development. Here, a collection of activities is given together with precedence constraints among them. Workers can be assigned to perform the activities. In practice, a worker assigned to an activity may fail to perform it. For example, if an activity consists of writing a piece of code and testing it, the test may fail. The manager of the project may be able to estimate the success probability of a worker assigned to an activity based on prior experience with the worker. The manager may also be able to assign several workers to the same activity redundantly. For example, two workers may independently write a piece of code and test it; if at least one test succeeds, the activity is completed. Thus the manager faces the problem of how to assign workers to activities, possibly in parallel and redundantly, over the course of the project, so as to minimize the total time needed to complete the project.
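To make the effect of redundant assignment concrete, the following minimal sketch computes the probability that an activity is completed when several workers attempt it at once. It assumes that the workers succeed or fail independently of one another, as in the two-worker example above; the function names `completion_probability` and `expected_rounds` are hypothetical and introduced only for illustration. The second function additionally assumes that the same assignment is repeated in successive rounds with independent outcomes, which is a simplification and not a model taken from the text.

```python
from math import prod


def completion_probability(success_probs):
    """Probability that at least one assigned worker succeeds.

    Assumes workers succeed or fail independently (an assumption of this
    sketch). `success_probs` holds one estimated success probability per
    worker assigned to the activity.
    """
    # The activity fails only if every assigned worker fails.
    return 1.0 - prod(1.0 - p for p in success_probs)


def expected_rounds(success_probs):
    """Expected number of rounds until the activity is completed.

    Assumes the same workers retry every round with independent outcomes,
    so the number of rounds is geometrically distributed.
    """
    q = completion_probability(success_probs)
    if q <= 0.0:
        raise ValueError("the activity can never be completed")
    return 1.0 / q


if __name__ == "__main__":
    # One worker versus two redundant workers with the same reliability.
    print(completion_probability([0.5]))
    print(completion_probability([0.5, 0.5]))
    print(expected_rounds([0.5, 0.5]))
```

Under these assumptions, a second worker of the same reliability raises the per-round completion probability, but that worker is then unavailable for other activities, which is the trade-off the manager must weigh.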