From large clusters serving as back-ends to large-scale peer-to-peer (P2P) networks, distributed systems are important to many of today's Internet services. A distributed system can involve many nodes, numbering in the tens, hundreds, thousands, millions, or more. Each node instance may be, for example, a process, an application, a physical device, some combination thereof, and so forth. Each individual node of a distributed system can interact with one, two, or many other nodes of the distributed system. Such interactions may occur once or may be repeated one or more times.
The nodes of a distributed system usually exchange messages with one another. Each node also functions locally by acting on local resources. These various actions and interactions result in many different non-deterministic concurrent events within the distributed system. The protocols of distributed systems typically involve complex interactions among a collection of networked machines, and they must cope with failures ranging from network-wide problems to the crashes of individual nodes. Intricate sequences of events can trigger complex errors as a result of mishandled corner cases.
As a result of these concurrent events and the sheer number of nodal instances, it is especially challenging to design, implement, and test distributed systems. For example, bugs in distributed systems are usually difficult to analyze, and it is even more difficult to diagnose them and identify their cause or causes. In fact, the most challenging bugs are typically not those that crash the distributed system immediately, but those that corrupt certain design properties and thus drive the system to unexpected behaviors after long execution runs.