In a communications grid that includes a network of computer nodes executing a job, a node may fail. The failure of a single node may cause the entire grid to fail, and with it the entire job, so that the job must be restarted from the beginning. For a job that involves a large data set or that takes a long time to complete, such a failure may be especially problematic.