A bug is an error, flaw, mistake, failure, fault, or undocumented feature in a computer program that prevents the program from behaving as intended, for example by producing an incorrect result. Many bugs arise from mistakes in a program's code or its design, and some are caused by compilers that generate incorrect code.
Distributed systems are hardware and software systems that contain more than one processing element or storage element, concurrent processes, or multiple programs, running under a loosely or tightly controlled regime. In a distributed system, a computer program is split into parts that run simultaneously on multiple computers or nodes and communicate with one another over a network. Distributed programs often must accommodate heterogeneous environments, network links of varying latencies, and unpredictable failures in the network or the nodes.
Distributed systems are becoming increasingly important as more and more infrastructure is distributed for performance, scalability, and reliability. Distributed systems are complicated and prone to bugs because they must correctly handle all possible events, including rare ones. Such events include failures in the form of a node crash, a network partition, a message or packet loss, or a disk failure, for example.
Distributed systems are difficult to test because of complicated interactions between the components of the system and because failures, events, and message deliveries are unpredictable. Complicated dependencies within a distributed system make it particularly challenging to enumerate the cases that the system must handle, and identifying bugs is correspondingly difficult. The current practice of finding bugs in distributed systems typically involves some form of random testing, such as network simulation, end-to-end testing, or log analysis. These techniques are not effective at finding bugs that appear only in rare cases, and they cannot reproduce such a bug when it does appear during a test.
Model checkers have been used to find errors in both the design and the implementation of distributed systems. A traditional model checker takes as input an abstract model of the system to be checked and explores the states that are reachable under that model. Writing an abstract model of a distributed system is costly and error prone, which makes applying model checking to distributed systems prohibitively expensive.
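To make the idea of exploring states from an abstract model concrete, the following is a minimal sketch of an explicit-state model checker written in Python. It is illustrative only and is not taken from any particular tool. The toy primary/backup model, the state fields, and the names check, make_next, and durable are all hypothetical, and the model assumes a single client write, one replication message, and at most one primary crash.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, NamedTuple, Optional, Tuple


class S(NamedTuple):
    """Hypothetical abstract state of a toy primary/backup store."""
    primary_has: bool = False  # primary has accepted the write
    in_flight: bool = False    # replication message is on the network
    backup_has: bool = False   # backup has stored the write
    acked: bool = False        # client has been told the write succeeded
    primary_up: bool = True    # primary has not crashed


def make_next(ack_early: bool) -> Callable[[S], Iterable[Tuple[str, S]]]:
    """Build the transition relation; ack_early=True models a buggy design
    that acknowledges the client before the backup has the data."""
    def next_states(s: S) -> Iterable[Tuple[str, S]]:
        if s.primary_up and not s.primary_has:
            yield "client_write", s._replace(
                primary_has=True, in_flight=True, acked=ack_early)
        if s.in_flight:
            # The network may deliver the message or lose it.
            yield "deliver", s._replace(
                in_flight=False, backup_has=True, acked=True)
            yield "drop", s._replace(in_flight=False)
        if s.primary_up:
            yield "crash", s._replace(primary_up=False)
    return next_states


def durable(s: S) -> bool:
    """Invariant: an acknowledged write is never definitively lost."""
    lost = not s.primary_up and not s.in_flight and not s.backup_has
    return not (s.acked and lost)


def check(initial: S,
          next_states: Callable[[S], Iterable[Tuple[str, S]]],
          invariant: Callable[[S], bool]) -> Optional[List[str]]:
    """Breadth-first search over reachable states. Returns a shortest
    sequence of events leading to a violation, or None if none exists."""
    parent: Dict[S, Optional[Tuple[S, str]]] = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace: List[str] = []
            while parent[state] is not None:
                state, action = parent[state]
                trace.append(action)
            return trace[::-1]
        for action, nxt in next_states(state):
            if nxt not in parent:  # visit each state only once
                parent[nxt] = (state, action)
                queue.append(nxt)
    return None


if __name__ == "__main__":
    print(check(S(), make_next(ack_early=True), durable))
    # prints ['client_write', 'drop', 'crash']
    print(check(S(), make_next(ack_early=False), durable))
    # prints None
```

Because the search enumerates every reachable state, it finds the rare interleaving (a dropped message followed by a crash) deterministically and returns it as a reproducible trace. The sketch also illustrates the cost noted above: the transition relation had to be written by hand, separately from any real implementation, and the result is only as trustworthy as that hand-written model.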