Some programming models for processing and generating large data sets with a parallel, distributed algorithm on a cluster of connected servers, such as MapReduce, use a non-work-conserving detect/restart model for fault tolerance. A master node monitors the status of each worker node. When it detects a failure on a worker node, the master node reschedules the affected tasks on a different worker node to recover the lost intermediate data. The failed worker node is then removed from the group, which reduces the total number of worker nodes by one. In a large-scale system, reducing the number of worker nodes may decrease the throughput of the entire system.
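
The detect/restart behavior described above can be illustrated with a minimal sketch. It is not taken from MapReduce or any specific framework: the `Worker` and `Master` classes, the `alive` flag standing in for a real health check, and the least-loaded placement policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker node; `alive` stands in for a real health check."""
    name: str
    alive: bool = True
    tasks: list = field(default_factory=list)


class Master:
    """Hypothetical master node implementing a simple detect/restart loop."""

    def __init__(self, workers):
        self.workers = list(workers)

    def assign(self, task):
        # Assumed policy: place the task on the least-loaded worker.
        target = min(self.workers, key=lambda w: len(w.tasks))
        target.tasks.append(task)
        return target

    def check_workers(self):
        # Detect: collect workers that no longer report as alive.
        failed = [w for w in self.workers if not w.alive]
        for dead in failed:
            # The failed worker leaves the group, shrinking the pool by one.
            self.workers.remove(dead)
            if not self.workers:
                raise RuntimeError("no surviving workers to reschedule onto")
            # Restart: rerun each affected task on a surviving worker.
            for task in dead.tasks:
                self.assign(task)
            dead.tasks = []
        return failed


if __name__ == "__main__":
    master = Master([Worker("w0"), Worker("w1"), Worker("w2")])
    for i in range(6):
        master.assign(f"task-{i}")

    master.workers[1].alive = False  # simulate a failure on w1
    master.check_workers()

    print("workers remaining:", len(master.workers))
    for w in master.workers:
        print(w.name, w.tasks)
```

After the simulated failure, the same six tasks are spread over two workers instead of three, so each survivor carries a larger share of the work. This is the throughput cost the paragraph refers to.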