1. Technical Field
This invention pertains in general to distributed computing and in particular to fault tolerance in a distributed computing system.
2. Background Information
In graph processing, a computing problem is represented by a graph having a set of vertices connected by a set of edges. The edges may have associated weights indicating, e.g., a distance represented by the edge or a cost incurred by traversing the edge. The graph can be used to model a real-world condition, and then the graph processing can act on the graph to analyze the modeled condition. For example, the World Wide Web can be represented as a graph where web pages are vertices and links among the pages are edges. In this example, graph processing can analyze the graph to provide information to a search engine process that ranks search results. Similarly, a social network can be represented as a graph and graph processing can analyze the graph to learn about the relationships in the social network. Graphs can also be used to model transportation routes, paths of disease outbreaks, citation relationships among published works, and similarities among different documents.
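As a minimal illustration of the representation described above, the following sketch stores a directed graph with weighted edges and computes a simple property of it. The `Graph` class, its method names, and the page names are hypothetical and chosen only for illustration; counting inbound links is a stand-in for an analysis step, not the ranking method of any particular search engine.

```python
from collections import defaultdict


class Graph:
    """Directed graph with weighted edges (hypothetical helper for illustration)."""

    def __init__(self):
        self.vertices = set()
        # Maps each source vertex to {target vertex: edge weight}.
        self.edges = defaultdict(dict)

    def add_edge(self, source, target, weight=1.0):
        """Add an edge; the weight may model a distance or a traversal cost."""
        self.vertices.update((source, target))
        self.edges[source][target] = weight

    def in_degree(self):
        """Count the incoming edges of every vertex."""
        counts = {vertex: 0 for vertex in self.vertices}
        for targets in self.edges.values():
            for target in targets:
                counts[target] += 1
        return counts


# Web pages are vertices and links between pages are edges.
web = Graph()
web.add_edge("a.html", "b.html")
web.add_edge("c.html", "b.html")
web.add_edge("b.html", "a.html")

inbound = web.in_degree()
```

Here `inbound` maps each page to the number of links pointing at it, the kind of per-vertex value that a graph analysis might pass on to another process.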
In a distributed system, different parts of the graph may be processed by different worker systems in a cluster, and the processing performed by some worker systems may depend on input received from other worker systems. Accordingly, synchronizing the processing among the cluster's worker systems and delivering input efficiently between them are both challenges. Providing fault tolerance and recovering from the failure of a worker system present challenges of their own.
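The dependency between worker systems can be made concrete with a small sketch. It assumes integer vertex identifiers and a modulo assignment of vertices to workers; neither assumption comes from the text above, and `assign_worker` and `cross_worker_edges` are hypothetical names. An edge whose endpoints land on different workers is one along which input must be delivered between worker systems.

```python
def assign_worker(vertex_id: int, num_workers: int) -> int:
    """Assumed partitioning scheme: vertex id modulo the number of workers."""
    return vertex_id % num_workers


def cross_worker_edges(edges, num_workers):
    """Return the edges whose source and target are assigned to different workers."""
    return [
        (source, target)
        for source, target in edges
        if assign_worker(source, num_workers) != assign_worker(target, num_workers)
    ]


# Five directed edges over four vertices, split across two workers.
edges = [(0, 1), (1, 2), (2, 0), (0, 2), (3, 1)]
remote = cross_worker_edges(edges, num_workers=2)
```

With two workers, vertices 0 and 2 fall on one worker and vertices 1 and 3 on the other, so `remote` holds only the edges (0, 1) and (1, 2). Each such edge is a point where one worker must wait for, or receive, data produced by another.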