A distributed program is one in which elements of the program execute simultaneously on multiple processors. Examples of distributed programs include network services such as simple two-party or conference telephone calls, shared network file systems, and Local Area Network operating systems.
Inherent in a distributed program is the problem of providing data distribution and synchronization among processors. The current approach to distributing data in a network is via messages. These messages are packetized by the application processes and distributed using complex protocol stacks such as TCP/IP or OSI. These protocols provide reliable distribution of packets and network session management. Elements of distributed programs communicate with one another via multiple point-to-point sessions established through the network.
The difficult problem of coordinating the asynchronous processes that make up the distributed program is left to the application. Typically, this coordination leaves a set of processors in a "wait" state because they have reached an "abstraction barrier", that is, a point at which certain processors cannot continue until other processors in the system reach a comparable point, at which time the processors share some data. At the barrier, the processors must all agree, by exchanging messages, that they hold the same values for the same set of data. The processors can then proceed independently until another abstraction barrier is reached, whereupon the message exchange must be repeated. Implementing this coordination is a complex and error-prone task. For example, each processor must keep track of which other processors in the network must agree before any given abstraction barrier can be passed. This is exceedingly difficult in practice because the location at which each process actually executes is allocated dynamically, based on the total system resources available. System recovery after a failure of one or more elements in the network is particularly difficult. Recovery is usually triggered by timeouts, which occur asynchronously in different parts of the network. For the system to function reliably, these timeouts must account for propagation delays in the network.
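The bookkeeping burden described above can be illustrated with a minimal sketch. It is not taken from any particular system: the class name `BarrierState`, its fields, and the peer labels are hypothetical, and message delivery, timeouts, and failure recovery are deliberately left out. It shows only that each processor must know which peers it is waiting on and must compare their reported values with its own before passing the barrier.

```python
from dataclasses import dataclass, field


@dataclass
class BarrierState:
    """Hypothetical per-processor bookkeeping for one abstraction barrier."""

    expected_peers: frozenset  # peers that must agree before the barrier is passed
    local_value: int           # this processor's value for the shared data
    reports: dict = field(default_factory=dict)  # peer -> value reported so far

    def receive(self, peer, value):
        """Record a value reported by a peer; ignore peers not part of this barrier."""
        if peer in self.expected_peers:
            self.reports[peer] = value

    def missing(self):
        """Return the peers that have not yet reported."""
        return set(self.expected_peers) - set(self.reports)

    def can_pass(self):
        """True only when every expected peer has reported the same value we hold."""
        if self.missing():
            return False
        return all(v == self.local_value for v in self.reports.values())


# Processor "A" waits for "B" and "C" to agree on the value 7.
state = BarrierState(expected_peers=frozenset({"B", "C"}), local_value=7)
state.receive("B", 7)
blocked = state.can_pass()   # still waiting on "C"
state.receive("C", 7)
released = state.can_pass()  # all peers agree
```

Even in this simplified form, the set of expected peers must be known in advance, which is exactly what becomes hard when processes are placed dynamically.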
In a network of processors where multiple distributed applications are executing simultaneously, isolating problems that result from inadvertent interference between applications is the most difficult and time-consuming part of system testing and debugging. In single-processor environments, this problem has been solved by memory protection schemes. With such schemes, virtual memory space is partitioned by the operating system and assigned to processes running within the system. If a process makes an illegal memory reference (i.e., to an address outside its assigned space), the event is detected by the hardware and trapped. A process of a distributed application, however, must reference objects outside its own memory space, since it must be able to communicate with other parts of the system. Because there is no protection for this type of shared space, inadvertent accesses outside a process's own memory space go undetected and eventually result in unpredictable system behavior.
The underlying network itself is responsible for the actual delivery of the data packets. At its simplest level, the network may be a single wire such as an Ethernet, on which all messages are broadcast to each computer in the network. As demand grows, however, a single wire can no longer carry all the traffic, and the network becomes a set of local area networks interconnected by switching elements which direct traffic between the local networks as necessary. When this happens, the operation and management of the network increase in complexity. In very large networks such as the public switched telephone network, hundreds or even thousands of these switching elements are required. In these switched networks, message routing and the provision of adequate reliability in the event of hardware or software failures are extremely complex problems. Such a network is often described as one that does not scale gracefully. Scalability means that the information required for processing is finite at each processor and is independent of the size of the network.