A distributed computing system is a group of processing units, frequently referred to as “nodes”, that work together to present a unified system to a user. These systems can range from relatively small and simple, such as multi-component single systems, to world-wide and complex, such as some grid computing systems. These systems are usually deployed to improve the speed and/or availability of computing services over that provided by a single processing unit alone. Alternatively, distributed computing systems can be used to achieve desired levels of speed and availability within cost constraints.
Distributed systems can be generally described in terms of how they are designed to take advantage of various computing concepts, including specialization, redundancy, isolation, and parallelism. Different types of systems are distinguished by the tradeoffs made in emphasizing one or more of these attributes and by the ways in which the system deals with the difficulties imposed by distributed computing, such as latency, network faults, and cooperation overhead.
Most distributed systems have multiple comparable processing units available for work. If there is a problem with any particular processing unit, other units can be brought in to handle the requests which would have gone to the problem unit. For example, some services are deployed on “clusters,” or interconnected groups of computers, so that the service can continue even if some of the individual computers go down. The resulting reliability is generally referred to as “high availability” and a distributed system designed to achieve this goal is a high availability system.
Part of the reason distributed systems use redundancy to achieve high availability is because each processing unit can be isolated from the larger system. Intrusions, errors, and security faults can be physically and logically separated from the rest of the system, limiting damage and promoting the continuing availability of the system. Further, distributed systems can be designed so that errors in one node can be prevented from spreading to other nodes. The ability to deal with errors is generally referred to as “fault tolerance.”
One difficulty with fault tolerance is that the cost and complexity of the system grows faster than the gains in reliability from fault-tolerant systems. Thus, systems are generally engineered to be fault tolerant only within certain limits. One consequence is that fault-tolerant, high-availability systems frequently degrade gracefully under moderate pressure, but severe pressure can cause abrupt instability and experience cascading failures, even across isolation boundaries.