The prior art discloses numerous communication protocols by which the processors in a multi-processor data processing system share information. The particular communication protocol employed depends on the particular design and operating constraints of the multi-processor system.
A multi-processor system configuration may be thought of as a plurality of processing units sharing a logical communications channel. The logical communications channel may take the form of memory shared among the processing units in which messages from one processing unit to another processing unit may be placed. Alternatively, the logical communications channel may take the form of a communications network through which messages from one processing unit to another processing unit may travel.
For communication purposes, such prior art multi-processor computer systems can be generally categorized as tightly-coupled systems, closely-coupled systems, and loosely-coupled or distributed multi-processor systems.
Tightly-coupled systems have identical processing units which are physically very close to each other, can access the same memory and run the same operating system. The communications media between them are extremely fast. They may consist of shared memory, or they may include signalling over proprietary buses or other methods specific to the computer system in question. The communications protocols used are also very specific, special-purpose protocols which may be entirely implemented in hardware, but in any case add very little overhead to the communication. The advantage of such a system is the ability to use several processors together to process the workload of the system.
Distributed systems are physically separated, by as little as a few feet or as much as thousands of miles. The communications medium is typically an industry standard such as telephone lines, satellite links, or local area networks like Ethernet or Token Ring. The processors in a distributed system may be totally different from each other. Such systems frequently run entirely different operating systems and are entirely independent of each other, but cooperate to allow the sharing of data. They allow data to be distributed over more systems as the volume of data increases and to be duplicated in more than one system for higher availability. The communications protocols used by such distributed systems tend also to be industry standards, such as Systems Network Architecture ("SNA") or Transmission Control Protocol and Internet Protocol ("TCP/IP"). (Ethernet is a registered trademark of Xerox Corporation. Token Ring and SNA are trademarks of IBM Corporation.)
Closely-coupled or "clustered" systems attempt to combine the advantages of the two other organizations. They are typically at least in the same building, if not the same room, and may use either a standard communications medium such as Ethernet or a special-purpose one such as Digital Equipment Corporation's cluster interconnect bus. The processors are usually similar and compatible with each other. They run the same operating system in each machine and they cooperate far more closely than in distributed systems, to allow other facilities besides data sharing. The goal is generally to give the user the illusion of a single system.
More recently, a proposal to interconnect a plurality of virtual memory data processing units in a clustered configuration was disclosed in a paper entitled "Memory Coherence in Shared Virtual Storage Systems" authored by Kai Li and Paul Hudak and presented at the Fifth Annual Association for Computing Machinery Symposium on Principles of Distributed Computing, 1986. In the proposed cluster of machines, all the units have the same type of operating system and can address the same virtual memory space.
Each unit of the clustered configuration therefore shares the set of addresses in its virtual memory system with the other units in the configuration and the page fault handling mechanism is extended to fetch pages from other units as well as from that processor's secondary storage. When a unit of such a clustered system suffers a page fault, it is possible to handle the page fault by requesting a copy of the page from another unit rather than from secondary storage. This has the advantage that the other unit may have the page in its memory, and can respond far faster than a secondary storage device can. It can be seen that there may be copies of a given page in several units of such a cluster, so a unit which suffers a page fault may not know where to request a copy of the page. Further, unless special steps are taken, anomalies, such as two units changing a page at the same time, may occur. Also, it is important to ensure that when a page is read, the results of the most recent write operation are seen, even though the reader and writer may be in physically separate processors. In order for this kind of sharing to work properly, a system protocol may be established to do such things as grant permission to change a page, find the owner of the page, and determine when to return a page to the owner. This kind of system protocol involves a great deal of communication between the systems in the various units.
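The kind of system protocol described above can be illustrated with a minimal sketch. It is a toy single-owner, multiple-reader scheme written for illustration only, not the protocol of the cited paper or of any particular system. The names Unit, Cluster, create, read, and write are hypothetical, the directory is assumed to be globally visible, and the message counts are simplifying assumptions (two messages per remote fetch or ownership transfer, one per invalidation).

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """One processing unit with a local page cache (hypothetical model)."""
    name: str
    pages: dict = field(default_factory=dict)  # page number -> contents


class Cluster:
    """Toy directory tracking, per page, the owner and the units holding copies."""

    def __init__(self, units):
        self.units = {u.name: u for u in units}
        self.owner = {}     # page number -> name of the owning unit
        self.copies = {}    # page number -> set of unit names holding a copy
        self.messages = 0   # running count of inter-unit messages (assumed costs)

    def create(self, unit, page, data):
        """Place a new page in one unit, which becomes its owner."""
        self.units[unit].pages[page] = data
        self.owner[page] = unit
        self.copies[page] = {unit}

    def read(self, unit, page):
        """Read a page, fetching a copy from the owner on a page fault."""
        local = self.units[unit].pages
        if page not in local:                       # read fault
            owner = self.owner[page]
            self.messages += 2                      # request + reply
            local[page] = self.units[owner].pages[page]
            self.copies[page].add(unit)
        return local[page]

    def write(self, unit, page, data):
        """Write a page after obtaining ownership and invalidating other copies."""
        if self.owner[page] != unit:                # write fault
            self.messages += 2                      # ownership request + grant
        for other in self.copies[page] - {unit}:    # no stale copy may survive
            del self.units[other].pages[page]
            self.messages += 1                      # one invalidation each
        self.owner[page] = unit
        self.copies[page] = {unit}
        self.units[unit].pages[page] = data
```

Even in this small model, a single write to a page held by two other units costs four messages, which suggests why the cost of each individual message matters so much in such a configuration.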
In the past, numerous standard communications protocols have been developed for transmitting information between remote processors, including the SNA developed by IBM Corporation and numerous specific protocols for use with SNA as exemplified by the "document interchange protocol" described in U.S. Pat. Nos. 4,648,061 and 4,532,588.
A fundamental problem in communication systems which must be addressed by communication protocols is whether there is sufficient buffer space in the memory of the receiving processor to receive the message. In protocols typically used in the past in distributed systems, even if there is not sufficient buffer space, an acknowledgement (ACK) at the communications link level has already been sent to the sending processor by the receiving processor which informs the sending processor that the message was successfully received. The communications link level acknowledgement or "ACK" is normally sent by hardware such as a communications link adapter or by lower levels of software. Even though the link level "ACK" has been sent informing the sending processor that the message was successfully received by the receiving processor, the message must be discarded by the receiving processor if there is not sufficient buffer space available in the receiving processor for the message. Thus, at a higher software level, another message has to be sent by the receiving processor to the sending processor that the message had to be discarded or that there were no buffers available for the message received. This fundamental problem has created the necessity of employing complex hardware and software architecture such as SNA, especially in a distributed data processing system.
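The buffer-space problem described above can be sketched as follows. This is a simplified model under stated assumptions, not an implementation of SNA or of any real link adapter. The class name Receiver, the method link_receive, and the reply strings LINK_ACK and NO_BUFFER are hypothetical.

```python
class Receiver:
    """Hypothetical receiving processor with a fixed number of message buffers."""

    def __init__(self, buffers):
        self.free = buffers      # buffers still available
        self.delivered = []      # messages actually accepted

    def link_receive(self, msg):
        """Return every reply the sender sees for one incoming message."""
        # The link level acknowledges receipt without knowing whether
        # a buffer is available for the message.
        replies = ["LINK_ACK"]
        if self.free > 0:
            self.free -= 1
            self.delivered.append(msg)
        else:
            # The message is discarded, so higher-level software must send
            # a second message reporting that no buffer was available.
            replies.append("NO_BUFFER")
        return replies
```

With a single buffer, the first message draws only the link-level acknowledgement, while the second draws both the acknowledgement and the later no-buffer notice, so the sender cannot treat the acknowledgement alone as proof of delivery.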
In loosely-coupled, multi-processor configurations disclosed in the prior art which employ a message-passing model, the individual kernels running on separate processing units send messages containing requests for service to other processing units within the configuration that manage configuration-wide shared resources. Reliance on a "message-passing" model has undoubtedly occurred because message passing corresponds naturally to the underlying communications connections among the processing units, which are generally believed to constitute the primary performance bottleneck in a loosely-coupled configuration; however, message passing as a model for system coupling has several drawbacks.
The difficulty of directly sharing complex data structures (e.g. control blocks containing pointers) among processors in message passing systems is well-known and is discussed in a paper entitled "A Value Transmission Method For Abstract Data Types" by M. Herlihy and B. Liskov and published in the ACM Transactions on Programming Languages and Systems, Vol. 4, No. 4 in October of 1982, which is herein incorporated by reference. This subject is further discussed in a doctoral thesis entitled "Remote Procedure Call", by B. Nelson, and published by Carnegie Mellon University in May of 1981, incorporated herein by reference.
In order to share a list of elements between two components of an operating system executing on separate processing units within a multi-processor configuration, which is itself a relatively common requirement, the elements have to be packed into a format suitable for transmission at the sending component, transmitted from the sending component to the receiving component, then unpacked at the receiving component. This sequence of operations is inefficient in processor utilization.
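The pack, transmit, and unpack sequence described above can be sketched for a simple linked list of integers. The wire format used here (a 4-byte element count followed by 4-byte signed integers in network byte order) and the names Node, pack_list, and unpack_list are assumptions chosen for illustration; they are not taken from any system discussed in this document.

```python
import struct


class Node:
    """Element of a singly linked list; 'next' is a local pointer."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def pack_list(head):
    """Sending component: flatten the list into bytes suitable for transmission."""
    values = []
    node = head
    while node is not None:      # pointers are meaningless remotely,
        values.append(node.value)  # so only the values are collected
        node = node.next
    return struct.pack(f"!I{len(values)}i", len(values), *values)


def unpack_list(payload):
    """Receiving component: rebuild an equivalent list from the bytes."""
    (count,) = struct.unpack_from("!I", payload, 0)
    values = struct.unpack_from(f"!{count}i", payload, 4)
    head = None
    for value in reversed(values):   # rebuild back to front
        head = Node(value, head)
    return head
```

Both components must walk every element, and the receiver ends up with a separate copy rather than the sender's structure, so any later change on one side requires the whole sequence to be repeated.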
More importantly, this sequence of operations is complex and unwieldy. The primary drawback of message passing is that it forces both the sending and receiving components into awkward and complex architectures that tend to be costly and difficult to implement, debug, augment, and maintain. Since the kernel of a typical general purpose operating system tends to be composed of many interacting components, the implications of designing the operating system of a multi-processor configuration around a message passing model tend to be large.
In a clustered-type configuration where the plurality of units share virtual memory similar to tightly-coupled units, a communications protocol is needed to effect rapid transfers between the kernels of the operating system in the various units. The selected protocol should address the high level of communications traffic associated with such issues as: obtaining a page from the shared virtual memory in the event of a page fault, how to manage concurrent requests for the same page if one unit wants to write to that page while other units want to read from that page, and various other situations that are common to functions that share stored data. As previously noted, standard communication protocols which have been developed in the past for loosely-coupled distributed computer systems can impose significant overhead for communications in terms of the number of instructions necessary to transmit the data from one unit to another, thereby effectively cancelling out the potential benefits of having a shared virtual memory cluster of data processors.
Accordingly, the present invention is directed to a novel, lightweight communications protocol designed to allow trusted kernels of the operating systems of the processors in a clustered-type multi-processor system to communicate control information and data over a high-speed communications link.