The invention relates to computer processors and memory systems. More particularly, the invention relates to optimizing coherent memory access operations within multiprocessor computer systems having distributed shared memory architectures.
Multiprocessor, or parallel processing, computer systems rely on a plurality of microprocessors to handle computing tasks in parallel to reduce overall execution time. One common implementation of a multiprocessor system is the xe2x80x9csingle bus architecture, in which a plurality of processors are interconnected through a single bus. However, because of the limited bandwidth of the single bus also limits the number of processors that can be interconnected thereto, recently networked multiprocessor systems have also been developed, which utilize processors or groups of processors connected to one another across an interconnection fabric, e.g., a network, and communicating via xe2x80x9cpacketsxe2x80x9d or messages.
Typically, in a networked multiprocessor system includes a plurality of nodes or clusters interconnected via a network. For example, FIG. 1 shows an exemplary networked multiprocessor system 100, in which a plurality of nodes 102 are interconnected to each other via the interconnection fabric 101, e.g., a network. By way of an example, only two nodes are shown. However, the networked multiprocessor system 100 may have any number of nodes. Moreover, although, in FIG. 1, the interconnection fabric 101 is shown to provide interconnections only between the nodes 102, all system entities, including the cells 103, the processors 105 and the memories 104, are interconnected, and communicate, with the rest of the system through the interconnection fabric 101.
Each of the nodes 102 of the networked multiprocessor system 100 may be further divided into a smaller hierarchical unitsxe2x80x94referred herein as xe2x80x9ccellsxe2x80x9d 103xe2x80x94, which comprises a plurality of processors 105 and a shared memory 104. Each processor 105 may comprise any processing elements that may share data within the distributed shared memory in the system, e.g., a microprocessor, an I/O device or the like. The grouping into nodes and/or cells of the system entities may be made physically and/or logically.
Each of the shared memory 104 may comprise a portion of the shared memory for the system 100, and may include a memory controller and/or a coherency controller (not shown) to control memory accesses thereto from various processors in the system, and to monitor the status of local copies of the memory stored in caches in various processors in the system using a coherency directory that are maintained in each node or within each cell.
As shown in FIG. 2, a typical shared memory 104 receives memory request packets through the blocking (BL) queue 204, receives response packets from processors 105 through the processor return (PR) queue 203, and sends memory return packets in response to the memory requests through the MR queue 202. The PL/BL queue 201 is a buffer to hold incoming memory requests, where a BL transaction is a transaction involving a memory access request to the shared memory 104, e.g., a memory read transaction, and where a PR transaction is a transaction involving a response from a processor 105 in response to a coherency check request from a shared memory 104 and/or a write back of a copy of a cache line in its cache back to the memory 104.
One significant problem that exists with many networked computer systems is that of deadlock, where nodes may in effect xe2x80x9clock upxe2x80x9d due to an inability to pass packets or messages to other nodes. In particular, in some networked computer systems, sending a primary request to another node may result in the receiving or destination node sending out one or more secondary requests, e.g., to notify other nodes having local copies of a memory block that the copies of the memory block in their respective caches must be marked invalid (often referred to as xe2x80x9cinvalidationxe2x80x9d requests). However, if one or more secondary requests cannot be sent by the destination node, the node may block receipt of new requests from other nodes. This may result in two nodes each waiting for responses from the other.
For example, if a shared memory 104 receives a BL transaction, e.g., a memory read request from a processor, the memory 104 may send a MR request, e.g., a coherency check request, to other processor(s) to ensure that no copies of the requested data reside in caches of other processor(s). The shared memory 104 then must wait for processor responses (PR) from the other processor(s) before it can issue a MR transaction that satisfies the memory read request. However, if the MR transaction to other processors could not be issued, e.g., because the MR queue is full at the time, then no other transaction can be received by the shared memory 104, and thus a deadlock occurs.
For optimal performance of the shared memory system, it is crucial that the exchange of packets among the processors and shared memories continuously flow.
Prior attempts to address the deadlock problem includes provision of MR queues which can hold as many memory return transactions as BL transactions in the PR/BL queue may generate. Unfortunately, however, this approach requires a large buffer area. The control logic necessary to control this larger buffer area also becomes very large and complex, and thus is more difficult to design and too slow to operate.
Moreover, because the number of PR transactions and the number of BL transactions that can be processed at any given time are each fixedly arranged and independent with respect to each other, a conventional shared memory system cannot dynamically adapt to process more of BL transactions or more of the PR transactions as the realtime need may require, and is thus inflexible.
Another prior attempt to address the deadlock problem is to provide a special entry type dedicated to handle write-back PR transactions. This approach requires an additional design effort, which is difficult, often xe2x80x9cbug pronexe2x80x9d, and the resulting design is often very inflexible.
Moreover, conventional memory access transaction queues allow only one MR transaction to be added per clock cycle, and thus are inefficient.
Thus, there is a need for more efficient method and device for memory access in a distributed shared memory system, which prevents occurrences of deadlocks, which does not require a large MR queue, and which does not require a design of a special dedicated entries for write back transactions.
There is also a need for more flexible method and device for memory access in a distributed shared memory system that may be dynamically adaptive to the current requirement for processing various memory access transactions.
There is also a need for a distributed queuing mechanism that allows multiple addition of entries in a clock cycle.
In accordance with the principles of the present invention, a method of preventing a deadlock in a distributed shared memory system having a memory access request transaction queue having a plurality of queue slots comprises preventing a blocking flow control class transaction from being processed in at least one of the plurality of queue slots.
In addition, in accordance with the principles of the present invention, an apparatus for preventing a deadlock in a distributed shared memory system having a memory access request transaction queue having a plurality of queue slots comprises a coherency controller configured to prevent a blocking flow control class transaction from being processed in at least one of the plurality of queue slots.
In accordance with another aspect of the principles of the present invention, a method of providing a distributed memory return transaction queue in a distributed shared memory system having a memory access request transaction queue processing one or more entries comprises providing each of the one or more entries a plurality of addition lines, providing a global pointer that indicates a last occupied position in the distributed memory return transaction queue, providing a plurality of entry pointers each associated with the one or more entries, and allowing each of the one or more entries to add one or more memory return transaction at locations in the distributed memory return transaction queue based on the global pointer and associated ones of the plurality of entry pointers through the a plurality of addition lines during a clock cycle.