This application is related to co-pending U.S. patent application Ser. No. 09/041,568, entitled xe2x80x9cCache Coherence Unit for Interconnecting Multiprocessor Nodes Having Pipelined Snoopy Protocol,xe2x80x9d filed on Mar. 12, 1998 now pending; co-pending U.S. patent application Ser. No. 09/003,771, entitled xe2x80x9cMemory Protection Mechanism for a Distributed Shared Memory Multiprocessor with Integrated Message Passing Support,xe2x80x9d filed on Jan. 7, 1998 now pending; co-pending U.S. patent application Ser. No. 09/003,721, entitled xe2x80x9cCache Coherence Unit with Integrated Message Passing and Memory Protection for a Distributed, Shared Memory Multiprocessor System,xe2x80x9d filed on Jan. 7, 1998 now pending; and co-pending U.S. patent application Ser. No. 09/281,714 xe2x80x9cSplit Sparse Directory for a Distributed Shared Memory Multiprocessor System,xe2x80x9d filed on Mar. 30, 1999, which are hereby incorporated by reference now pending.
1. Technical Field
This invention relates generally to computer network messaging and more particular to avoiding deadlock while controlling messages in a multi-node computer network.
2. Discussion of Background Art
In multi-node computer networks, nodes communicate with each other by passing network messages through an interconnect. These network messages support different forms of communication between nodes, depending on the nature and requirements of the network. In parallel processing systems, for example, the network messages specifically support cache-coherence communication in shared-memory multiprocessor systems, and support message-passing communication in distributed-memory multi-computer systems. Frequently, a single computer system supports more than one form of message communication.
For a network to operate correctly, it is important to prevent deadlock while controlling network messages. In general, deadlock occurs when all four of the following conditions are met: (1) mutual exclusion in which a resource is assigned to one process; (2) hold and wait in which resources are acquired incrementally and processes may hold one resource while waiting for another; (3) no preemption, in which allocated resources cannot be forcibly acquired by another process; and (4) circular wait in which two or more processes form a circular chain of dependency with each process waiting for a resource held by another.
In the context of network messaging, xe2x80x9cresourcesxe2x80x9d are defined as the buffer spaces available to hold network messages while in transit from one node to another node and xe2x80x9cprocessesxe2x80x9d are defined as the nodes which generate and consume the network messages. When deadlock occurs, some nodes in the network are unable to make progress (i.e., service the network messages). Without appropriate recovery measures, the network must initiate a reset or interrupt, which may result in a loss of messages and cause damage to the system as a whole.
Deadlock may be dealt with by any of several techniques including prevention, avoidance, and detection and recovery. Prevention techniques remove one of the four conditions described above, thereby making it impossible for a deadlock to occur. Avoidance techniques check for the deadlock conditions before allocating each resource, and allow the allocation only if there is no possibility of deadlock. Detection and recovery techniques do not prevent or avoid deadlock, but detect deadlock situations after they occur and then recover from those deadlock situations.
One common technique for avoiding deadlock provides two separate interconnects, or two separate channels within the same interconnect, for request and reply messages. In this technique, a node guarantees sufficient buffering for its reply messages by limiting the number of requests it has outstanding. An example of this is described in ANSI/IEEE Std. 1596-1992, Scalable Coherence Interface (SCI) (1992). In networks that only allow a simple request-reply messaging protocol, this technique is sufficient to avoid deadlock.
With more sophisticated messaging protocols, such as those that allow request forwarding, the two-interconnect technique may be extended by increasing the number of interconnects. However, the number of required independent interconnects corresponds to the maximum length of the dependence chains in the messaging protocols.
Another technique, described by Lenoski and Weber in Scalable Shared-Memory Multiprocessing (1995), allows request forwarding messaging with two separate interconnect channels, but couples the two channels with a back-off mechanism in the messaging protocol. When a potential deadlock situation is detected the back-off mechanism reverts to a request-reply transaction by sending a negative acknowledgement reply to all requests which need forwarding until the potential deadlock situation is resolved.
However, requiring separate interconnects or interconnect channels for request and reply messages imposes additional overhead on the interconnection network and its management structures. The multiple-interconnect techniques also impose complexity on the messaging protocol because messages on the separate interconnects can not be ordered with respect to one another and simplifying assumptions about message ordering cannot be made. Having back-off mechanisms also adds complexity to the messaging protocol.
Another technique, which employs detection and recovery, involves extending the buffer space available when deadlock is detected. This has been implemented in the Alewife machine by Kubiatowicz and Agarwal, xe2x80x9cAnatomy of a Message in the Alewife Multiprocessor,xe2x80x9d Proceedings of the 7th International Conference on Supercomputing (1993). In the Alewife approach, a network interface chip signals an interrupt to the processor when its output queue has been blocked for some specified period of time. The processor then empties the input queue into local memory.
The Alewife approach emulates a virtually infinite buffer by augmenting the input queue in local memory whenever it overflows. But it does not address management of this buffer size. Moreover, Alewife relies on first detecting a potential deadlock situation and then resolving the deadlock situation in software by having the processor extend the queue into local memory. This is not always feasible because the processor may have outstanding requests that are caught in the same deadlock and, without special abort and fault recovery mechanisms, cannot service an interrupt until this deadlock has been resolved.
What is required, therefore, is a technique to avoid messaging deadlock that does not increase the interconnect-management overhead required to support separate interconnect channels for request and reply messages and that eliminates the complexities of back-off mechanism support and software-managed deadlock recovery.
The present invention provides a computer architecture for avoiding a deadlock while controlling messages between nodes in a multi-node computer network.
The invention, avoiding deadlock, inserts a buffer and associated control circuitry between the output of a node and the network in order to buffer all outgoing network messages from that node. Proper sizing of the buffer along with associated flow control circuitry guarantees sufficient buffering such that the buffer does not overflow and at least one in a group of nodes involved in a circular wait is always able to service incoming messages, thereby facilitating forward progress and avoiding deadlock.
To effectively manage the buffer, network messages are classified into preferably three types based on their service requirements and messaging protocols. In the preferred embodiment, these message types are called reliable transaction messages, posted messages, and unreliable transaction messages. The invention reserves a quota for each of the message types in the buffer and, based on this quota, controls the number of network messages of each type that is outstanding at any one time. The total buffer size is the sum of the space requirements of these message types. In one embodiment the invention further differentiates within each type of message to allow a more precise space requirement. Other considerations to reduce the buffer size are also incorporated.
The architecture of the invention thus includes a buffer for receiving outgoing messages to the interconnect, a decoder for decoding the messages into different message types, and an output controller coupled to the buffer and to the decoder for controlling the passage of the messages through the buffer.
The invention avoids the interconnect-management overhead required to support separate request and reply channels. The invention also eliminates the complexities in the messaging protocols that support message back-off and those associated with software-managed deadlock recovery procedures.