Multiprocessor, or parallel processing, computer systems rely on a plurality of microprocessors to handle computing tasks in parallel to reduce overall execution time. One common implementation of a multiprocessor system is the “single bus architecture, in which a plurality of processors are interconnected through a single bus. However, because of the limited bandwidth of the single bus also limits the number of processors that can be interconnected thereto, recently networked multiprocessor systems have also been developed, which utilize processors or groups of processors connected to one another across an interconnection fabric, e.g., a network, and communicating via “packets” or messages.
Typically, in a networked multiprocessor system includes a plurality of nodes or clusters interconnected via a network. For example, FIG. 1 shows an exemplary networked multiprocessor system 100, in which a plurality of nodes 102 are interconnected to each other via the interconnection fabric 101, e.g., a network. By way of an example, only two nodes are shown. However, the networked multiprocessor system 100 may have any number of nodes. Moreover, although, in FIG. 1, the interconnection fabric 101 is shown to provide interconnections only between the nodes 102, all system entities, including the cells 103, the processors 105 and the memories 104, are interconnected, and communicate, with the rest of the system through the interconnection fabric 101.
Each of the nodes 102 of the networked multiprocessor system 100 may be further divided into a smaller hierarchical units—referred herein as “cells” 103—, which comprises a plurality of processors 105 and a shared memory 104. Each processor 105 may comprise any processing elements that may share data within the distributed shared memory in the system, e.g., a microprocessor, an I/O device or the like. The grouping into nodes and/or cells of the system entities may be made physically and/or logically.
Each of the shared memory 104 may comprise a portion of the shared memory for the system 100, and may include a memory controller and/or a coherency controller (not shown) to control memory accesses thereto from various processors in the system, and to monitor the status of local copies of the memory stored in caches in various processors in the system using a coherency directory that are maintained in each node or within each cell. As shown in FIG. 2, a typical shared memory 104 receives memory request packets through the blocking (BL) queue 204, receives response packets from processors 105 through the processor return (PR) queue 203, and sends memory return packets in response to the memory requests through the MR queue 202. The PL/BL queue 201 is a buffer to hold incoming memory requests, where a BL transaction is a transaction involving a memory access request to the shared memory 104, e.g., a memory read transaction, and where a PR transaction is a transaction involving a response from a processor 105 in response to a coherency check request from a shared memory 104 and/or a write back of a copy of a cache line in its cache back to the memory 104.
One significant problem that exists with many networked computer systems is that of deadlock, where nodes may in effect “lock up” due to an inability to pass packets or messages to other nodes. In particular, in some networked computer systems, sending a primary request to another node may result in the receiving or destination node sending out one or more secondary requests, e.g., to notify other nodes having local copies of a memory block that the copies of the memory block in their respective caches must be marked invalid (often referred to as “invalidation” requests). However, if one or more secondary requests cannot be sent by the destination node, the node may block receipt of new requests from other nodes. This may result in two nodes each waiting for responses from the other.
For example, if a shared memory 104 receives a BL transaction, e.g., a memory read request from a processor, the memory 104 may send a MR request, e.g., a coherency check request, to other processor(s) to ensure that no copies of the requested data reside in caches of other processor(s). The shared memory 104 then must wait for processor responses (PR) from the other processor(s) before it can issue a MR transaction that satisfies the memory read request. However, if the MR transaction to other processors could not be issued, e.g., because the MR queue is full at the time, then no other transaction can be received by the shared memory 104, and thus a deadlock occurs.
For optimal performance of the shared memory system, it is crucial that the exchange of packets among the processors and shared memories continuously flow.
Prior attempts to address the deadlock problem includes provision of MR queues which can hold as many memory return transactions as BL transactions in the PR/BL queue may generate. Unfortunately, however, this approach requires a large buffer area. The control logic necessary to control this larger buffer area also becomes very large and complex, and thus is more difficult to design and too slow to operate.
Moreover, because the number of PR transactions and the number of BL transactions that can be processed at any given time are each fixedly arranged and independent with respect to each other, a conventional shared memory system cannot dynamically adapt to process more of BL transactions or more of the PR transactions as the realtime need may require, and is thus inflexible.
Another prior attempt to address the deadlock problem is to provide a special entry type dedicated to handle write-back PR transactions. This approach requires an additional design effort, which is difficult, often “bug prone”, and the resulting design is often very inflexible.
Moreover, conventional memory access transaction queues allow only one MR transaction to be added per clock cycle, and thus are inefficient.
Thus, there is a need for more efficient method and device for memory access in a distributed shared memory system, which prevents occurrences of deadlocks, which does not require a large MR queue, and which does not require a design of a special dedicated entries for write back transactions.
There is also a need for more flexible method and device for memory access in a distributed shared memory system that may be dynamically adaptive to the current requirement for processing various memory access transactions. There is also a need for a distributed queuing mechanism that allows multiple addition of entries in a clock cycle.