The present invention relates generally to distributed systems, and in particular to self-directed distributed systems.
Computers initially consisted of a single machine built from processors, memories, and Input/Output devices. Now, many computers are interconnected to form networks and distributed systems. A distributed system is a collection of computers that acts like a single machine to its users. In other words, the users are not aware of the existence of multiple independent computers.
Many goals motivate connecting widely separated computers together in a distributed system. A distributed system allows its users to conveniently share load, data, documents, and ideas. Distributed systems also allow users to take advantage of unique resources such as unusual computers, Input/Output devices, and large databases located at remote sites. If the workload at one computer becomes more than that computer can handle, some of it can be transferred to another computer with a lighter workload. Furthermore, the reliability of a distributed system can be much higher than the reliability of any of its components.
The processors in a distributed system can be used in several ways. For example, they can be used as dedicated processors, where each processor performs a specific function, or as pool processors. Pool processing can be more efficient than dedicated processing because pool processors do not have specifically assigned tasks. Instead, when a user needs to run a program for which no server processor exists, one or more processors from the pool are temporarily assigned. When the job is finished, the processors are returned to the pool to wait for reassignment.
In this configuration, it is advantageous to design programs as collections of cooperating processes so that each process can run on a separate processor, which is faster than having all of the processes share a single processor.
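The pool arrangement described above can be illustrated with a minimal sketch. The class name `ProcessorPool`, its methods, and the processor identifiers are hypothetical names chosen for illustration; they do not come from any particular system.

```python
class ProcessorPool:
    """Hypothetical pool of processors with no permanently assigned tasks."""

    def __init__(self, processor_ids):
        self._idle = list(processor_ids)   # processors waiting for work
        self._assigned = {}                # job name -> processors on loan

    def assign(self, job, count):
        """Temporarily assign `count` idle processors to `job`."""
        if count > len(self._idle):
            raise RuntimeError("not enough idle processors in the pool")
        granted = [self._idle.pop(0) for _ in range(count)]
        self._assigned[job] = granted
        return granted

    def release(self, job):
        """Return the job's processors to the pool when the job finishes."""
        self._idle.extend(self._assigned.pop(job))

    @property
    def idle(self):
        return list(self._idle)


pool = ProcessorPool(["cpu0", "cpu1", "cpu2"])
granted = pool.assign("render", 2)   # one processor per cooperating process
pool.release("render")               # processors wait for reassignment
```

Each cooperating process of the job would run on one of the granted processors; how work is dispatched to them is outside the scope of this sketch.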
However, there is additional complexity when multiple processors work simultaneously and support multiple asynchronous tasks which execute concurrently. Each process executes with unpredictable speed and generates actions or events which must be recognized by other cooperating processes. Therefore, cooperating processes in a multi-processor environment must often communicate and synchronize with each other. Execution of one process can influence the other via communication. Often the processes that communicate do so via a synchronization mechanism. The synchronization mechanism is used to delay execution of a process in order to satisfy ordering of actions among cooperating processes.
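As a rough illustration of a synchronization mechanism that delays one process to enforce an ordering of actions, the following sketch uses two threads on a single machine as stand-ins for cooperating processes. The function and variable names are hypothetical, and a thread-level `Event` is only an analogy for the inter-processor case.

```python
import threading


def run_ordered():
    """Consumer is delayed until the producer signals that its action occurred."""
    log = []
    data_ready = threading.Event()   # the synchronization mechanism

    def producer():
        log.append("produced")
        data_ready.set()             # announce the event to the other process

    def consumer():
        data_ready.wait()            # execution is delayed until the event arrives
        log.append("consumed")

    # The consumer is started first, yet the ordering is still enforced.
    c = threading.Thread(target=consumer)
    p = threading.Thread(target=producer)
    c.start()
    p.start()
    c.join()
    p.join()
    return log


print(run_ordered())  # ['produced', 'consumed']
```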
For example, when several cooperating processes compete for a certain type of resource, such as a printer or a database, the resource must be controlled so that it is never in use by more than one process at a time (mutual exclusion). Because processes modify the state of the resource, proper operation requires that the resource be granted to at most one process at a time.
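The mutual exclusion requirement can be sketched as follows. This is an illustration only: it assumes threads on one machine sharing a single lock, which is a centralized arrangement and not a distributed protocol. The `Printer` class and its fields are hypothetical.

```python
import threading


class Printer:
    """Hypothetical shared resource whose state is modified by each user."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active_users = 0
        self.max_concurrent_users = 0
        self.jobs = []                      # (owner, page) in printed order

    def print_job(self, owner, pages):
        with self._lock:                    # at most one process holds the printer
            self.active_users += 1
            self.max_concurrent_users = max(
                self.max_concurrent_users, self.active_users
            )
            for page in range(pages):
                self.jobs.append((owner, page))
            self.active_users -= 1


def run_demo(num_processes=4, pages=50):
    printer = Printer()
    workers = [
        threading.Thread(target=printer.print_job, args=(f"p{i}", pages))
        for i in range(num_processes)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return printer


printer = run_demo()
print(printer.max_concurrent_users)  # 1
```

Because the lock is held for the whole job, the pages of different owners are never interleaved in `printer.jobs`.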
When resource usage is under the centralized control of an operating system, mutually exclusive use of resources is implemented via conventions used by processes for signalling the operating system that specific resources are requested or released. However, when resource usage is not under centralized control of an operating system, control is based on system status variables residing in each processor (self-directed). The processes themselves must bear the responsibility for controlling their progress to implement mutual exclusion.
A desirable method for mutual exclusion in a self-directed system must account for the varying speeds of processes executing on different processors, and for possible race conditions. For example, two processes on different processors can each start acquiring a desired resource before either has had sufficient opportunity to prevent the other from acquiring it. Moreover, no process requesting the use of the resource should have to wait indefinitely for other processes requesting or using the resource. Preferably, under no circumstances should mutual exclusion be achieved by completely blocking the use of the resource from any one or more of the processes requiring it. Therefore, asynchronous processes in a multi-processor environment must communicate and synchronize to implement proper resource allocation.
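The race condition described above can be reproduced with a small deterministic simulation. The model is an assumption made for illustration: each process consults only its local status information, and a claim made at time t becomes visible to the other processors at t plus a fixed propagation delay. The function name and time units are hypothetical.

```python
def simulate_naive_claim(start_times, propagation_delay):
    """Return the ids of processes that believe they acquired the resource.

    Each process claims the resource if no earlier claim is visible to it yet.
    This naive check-then-claim scheme is NOT a correct protocol; it only
    demonstrates how the race arises.
    """
    holders = []
    claims = []  # (claim_time, process_id)
    for pid, t in sorted(enumerate(start_times), key=lambda item: item[1]):
        visible = [c for c in claims if c[0] + propagation_delay <= t]
        if not visible:              # resource looks free from this processor
            claims.append((t, pid))
            holders.append(pid)
    return holders


# Second process starts before the first claim has propagated: both acquire.
print(simulate_naive_claim([0.0, 0.5], propagation_delay=2.0))  # [0, 1]

# Second process starts after propagation completes: only one acquires.
print(simulate_naive_claim([0.0, 3.0], propagation_delay=2.0))  # [0]
```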
Since a distributed system is a collection of interconnected processors, the performance of the distributed system is highly sensitive to communication time between the processors. Existing protocols for mutual exclusion do not account for communication delays caused by the interconnection topology in a distributed system and are therefore unsuitable for use in distributed systems. The protocols for registering and reacting to status information to determine system behavior mostly apply to multi-tasking on a single processor where interprocessor communication is not an issue.
There are protocols for mutual exclusion in multi-processor systems where several processors are interconnected. These protocols assume inter-processor communication times on the order of instruction times. However, the timing and synchronization assumptions of these protocols are inappropriate to distributed systems because signal propagation times in distributed systems exceed instruction times. Therefore, a seemingly workable and efficient protocol that assumes fast signal propagation will be unworkable or inefficient in a computing environment where signal propagation times are longer than the protocol assumes, because the signal propagation delay must be added to the processing time.
Other protocols require every process desiring to use a shared resource to first broadcast a signal to all other processes and then confirm that no signal from another process for the same resource has arrived. A major problem with such protocols for mutual exclusion of asynchronously interacting processes in self-directed distributed systems is that they call for the broadcast of a preempt signal, and then a wait delay until propagation is completed, before testing the availability of a shared resource. The wait delay slows down the system.
A further problem is that every process has to wait for the longest propagation time. This results in further degradation of system response time. Often, distributed systems are utilized in interactive or real-time applications where response time is critical. Examples of such applications include operations, security, defense, and air traffic control. For example, in an air traffic control room, as a result of delays in gaining access to a shared resource, the system response times might be dangerously slow and two controllers might be delayed or locked out, resulting in minutes of blacked-out screens. Where processors are more distant, problems of this nature become more serious.
Even in those applications where response time is not critical, expensive processing time is wasted by idling processors while broadcast signals make their way throughout a distributed system to sort out who gets what resource.
Advances in semiconductor technology and circuit design have enabled modern processors to operate at increasingly higher speeds. Every new generation of processors is designed to surpass the previous ones in terms of speed. As the gap between processor and communication speeds widens, existing protocols for mutual exclusion in distributed systems become even more inefficient, if not impractical. That is because the processors idle while waiting for signal propagation. Furthermore, any upgrade of the processors in such a distributed system for speed is stifled because the slow communication speeds limit the performance of the entire distributed system.
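The cost of a broadcast-and-wait protocol can be made concrete with simple arithmetic. In the sketch below, the delay and instruction-time figures are hypothetical values chosen for illustration, and the function name is likewise an assumption.

```python
def idle_instructions(propagation_delays_ns, instruction_time_ns):
    """Instruction slots lost while waiting for a broadcast to propagate.

    Every process must wait for the longest propagation time before it can
    test the availability of the shared resource.
    """
    longest_wait_ns = max(propagation_delays_ns)
    return longest_wait_ns // instruction_time_ns


# Hypothetical propagation delays (nanoseconds) to three other processors.
delays_ns = [5_000, 40_000, 200_000]

print(idle_instructions(delays_ns, instruction_time_ns=10))  # 20000
print(idle_instructions(delays_ns, instruction_time_ns=1))   # 200000
```

With the same interconnect, a processor that is ten times faster wastes ten times as many instruction slots per acquisition, which is why processor upgrades alone do not improve such a system.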
Thus, there is a need for a method for mutual exclusion of asynchronously interacting processors in self-directed distributed systems wherein the processors can operate with close to minimum delay. There is also a need for such a protocol whereby the system can be reliably utilized in time-critical applications. There is also a need for such a protocol whereby investments in processor upgrades are not stifled by slow signal propagation speeds. There is also a need for such a protocol wherein computing time is not wasted by idling processors.