The present invention relates generally to shared memory computer systems and more particularly relates to systems and methods for performing memory access arbitration among transactions in an arbitration queue.
Multiprocessor computer architectures are known in the art and are recognized as overcoming limitations of single processor systems in terms of processing speed and transaction throughput. Typically, such multiprocessor systems are "shared memory" systems where multiple processors on a bus, or a number of busses, share a single global memory. In some shared memory multiprocessor systems, memory is uniformly accessible to each processor, which simplifies the task of dynamic load distribution. Processing of complex tasks can then be distributed among various processors in the multiprocessor system while data used in the processing is substantially equally available to each of the processors undertaking any portion of the complex task. Similarly, programmers writing code for typical shared memory systems do not need to be concerned with issues of data partitioning, as each of the processors has access to and shares the same, consistent global memory.
Many multiprocessor systems suffer disadvantages in that system bandwidth and scalability are limited. Although multiprocessor systems may be capable of executing many millions of instructions per second, the shared memory resources and the system bus connecting the processors to the memory present a bottleneck as complex processing loads are spread among more processors, each needing access to the global memory. As more processors are added to a system to perform complex tasks, the demand for memory access also increases. However, at some point, adding more processors does not necessarily translate into faster processing, i.e., typical systems are not fully scalable. The decrease in performance is generally due to the bottleneck created by the increased number of processors needing access to the memory and to the transport mechanism, e.g., a bus, to and from memory.
Alternative architectures are known which seek to relieve such bandwidth constraints. Computer architectures based on Cache Coherent Non-Uniform Memory Access (CCNUMA) are known in the art. CCNUMA architectures are typically characterized as having distributed global memory. Generally, CCNUMA machines include a number of processing nodes which are connected through a high bandwidth, low latency interconnection network. The processing nodes will generally include one or more high-performance processors, associated cache, and a portion of a global shared memory. Cache coherence, i.e., the consistency and integrity of shared data stored in multiple caches, is typically maintained by a directory-based, write-invalidate cache coherency protocol, as known in the art. To determine the status of caches, each processing node typically has a directory memory corresponding to its respective portion of the shared physical memory. For each line or discrete addressable block of memory, the directory memory stores an indication of remote nodes that are caching that same line.
One known implementation of the CCNUMA architecture is known as "DASH" (Directory Architecture for Shared memory), developed at the Computer Systems Laboratory at Stanford University. The DASH architecture, described in the Directory-Based Cache Coherence Protocol for the DASH Multiprocessor, Lenoski et al., Proceedings of the 14th Int'l Symp. Computer Architecture, IEEE CS Press, 1990, pp. 148-159, which is incorporated herein by reference, consists of a number of processing nodes connected through a high-bandwidth, low-latency interconnection network. As is typical in CCNUMA machines, the physical memory is distributed among the nodes of the multiprocessor, with all memory accessible to each node. Each processing node consists of: a small number of high-performance processors; their respective individual caches; a portion of the shared-memory; a common cache for pending remote accesses; and a directory controller interfacing the node to the network.
The DASH system places a significant burden relating to memory consistency on the software developed for the system. In effecting memory consistency in the DASH implementation of CCNUMA architecture, a "release consistency" model is implemented, which is characterized in that memory operations issued by a given processor are allowed to be observed and completed out of order with respect to other processors. Ordering of memory operations is only effected under limited circumstances. Protection of variables in memory is left to the programmer developing software for the DASH multiprocessor, as under the DASH release consistency model the hardware only ensures that memory operations are completed prior to releasing a lock on the pertinent memory. Accordingly, the release consistency model for memory consistency in DASH is a weakly ordered model. It is generally accepted that the DASH model for implementing memory correctness significantly complicates programming and cache coherency.
A problem in multi-processor, shared memory systems is that memory access among the multiple processors must be controlled in a manner such that data read from and written to memory does not become corrupted or incoherent. Because the multiple processors may seek to perform conflicting operations on memory locations, such as simultaneously read from and write to a particular location, it is imperative that a memory management scheme be employed. Memory arbitration schemes for performing such memory and cache management are known. For example, a basic arbitration scheme may simply involve a first in-first out (FIFO) buffer which manages memory access by always giving priority to the oldest entry in the buffer.
While a FIFO scheme is effective at avoiding memory conflicts, it does have attendant disadvantages. For example, the type of operation in the respective buffer entries is not given any weight in this arbitration scheme. As a result, it is possible to have alternating read and write requests throughout the buffer which, as they are serviced in turn, require the memory bus to be frequently "turned around" (changed from read to write), which is a time-consuming and inefficient operation. Another disadvantage is that if the resource required to service the oldest entry in the buffer is unavailable during the current cycle, all other operations must still wait their turn in the FIFO buffer even if all conditions to perform their respective operations are satisfied. Thus, system latency increases in such a system.
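The turnaround cost described above can be illustrated with a minimal sketch. The function name `count_turnarounds` and the "R"/"W" labels are hypothetical and chosen only for illustration; the sketch merely counts how often the bus direction changes when a sequence of operations is serviced strictly in order, and compares an alternating FIFO order with the same operations arranged by type.

```python
def count_turnarounds(ops):
    """Count read/write direction changes when ops are serviced in list order."""
    turnarounds = 0
    for prev, cur in zip(ops, ops[1:]):
        if prev != cur:
            turnarounds += 1
    return turnarounds


# Alternating requests, serviced strictly oldest-first as in a FIFO buffer.
fifo_order = ["R", "W", "R", "W", "R", "W"]

# The same requests arranged by type (illustrative comparison only).
grouped_order = sorted(fifo_order)

print(count_turnarounds(fifo_order))     # 5 direction changes
print(count_turnarounds(grouped_order))  # 1 direction change
```

The sketch ignores timing and resource availability; it is intended only to show why servicing order affects the number of bus turnarounds.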
Accordingly, as processors become faster and multiprocessor shared memory systems become more complex, there is a growing need for improved systems and methods for memory management including new arbitration schemes and circuits.
It is an object of the invention to provide a system for management of distributed shared memory which provides enhanced performance with respect to system bandwidth and latency.
It is a further object of the present invention to provide a memory arbitration scheme which reduces memory bus turnaround while not adversely affecting system latency.
It is yet another object of the present invention to provide an arbitration queue in which entries can be serviced from any point in the queue and in which higher order entries ripple down to fill the voids in the queue created by previously serviced entries.
In accordance with the present method of memory arbitration in a system including shared system memory, cache memory and at least one processor submitting transactions to the system memory, the arbitration process includes placing memory transactions in entries in an arbitration queue. The status of the entries with respect to the cache is determined prior to selecting a transaction to be serviced from the queue. Entries are then selected to participate in arbitration based at least in part upon the cache status. For example, if the transaction status is invalid, that transaction cannot be serviced until a write back from cache to system memory is complete. If the status indicates a cache hit, that entry can participate in arbitration and, if selected, can be serviced from cache.
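By way of illustration only, the cache-status gating described above might be sketched as follows. The `Entry` fields, the status labels "hit", "miss" and "invalid", and the oldest-first choice among eligible entries are assumptions made for this sketch and are not taken from the description; in particular, the treatment of a "miss" entry as serviceable from system memory is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    txn_id: str
    age: int           # larger value = older entry (assumption)
    cache_status: str  # "hit", "miss", or "invalid" (hypothetical labels)


def eligible(entries: List[Entry]) -> List[Entry]:
    """Entries whose status is invalid must wait for the write back to complete."""
    return [e for e in entries if e.cache_status != "invalid"]


def select(entries: List[Entry]) -> Optional[Tuple[Entry, str]]:
    """Pick the oldest eligible entry (assumed policy) and where it is serviced from."""
    candidates = eligible(entries)
    if not candidates:
        return None
    winner = max(candidates, key=lambda e: e.age)
    source = "cache" if winner.cache_status == "hit" else "memory"
    return winner, source


queue = [
    Entry("t0", age=3, cache_status="invalid"),  # oldest, but not yet serviceable
    Entry("t1", age=2, cache_status="hit"),
    Entry("t2", age=1, cache_status="miss"),
]
winner, source = select(queue)
print(winner.txn_id, source)  # t1 cache
```

In this sketch the oldest entry does not block the queue: because its status is invalid, it is simply excluded from the current round of arbitration and the next eligible entry is serviced.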
In accordance with another aspect of the present method of memory arbitration, before conducting arbitration the entries in the arbitration queue are grouped according to at least one transaction parameter. Arbitration can then proceed among the groups to select one group of entries for servicing. From the selected group, transactions are preferably serviced from oldest to newest. Preferably, the transaction parameters are selected to optimize bandwidth and latency. Parameters can include memory bank, write to bank, read from bank, read, write and the like.
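The grouping step described above can be sketched as follows, under stated assumptions. Transactions are modeled as hypothetical `(id, op, bank)` tuples held oldest-first in a list, the grouping parameter is supplied as a `key` function, and the group-selection policy (remain with the previously serviced group if it still has entries, otherwise take the group containing the oldest entry) is an illustrative choice rather than the policy of the invention.

```python
from collections import defaultdict


def group_entries(queue, key):
    """Group transactions by a parameter; each group keeps oldest-to-newest order."""
    groups = defaultdict(list)
    for txn in queue:  # queue[0] is assumed to be the oldest entry
        groups[key(txn)].append(txn)
    return groups


def arbitrate(queue, key, last_group=None):
    """Select one group and return (group label, its entries oldest first)."""
    groups = group_entries(queue, key)
    if not groups:
        return None, []
    if last_group in groups:
        chosen = last_group      # assumed policy: avoid a bus turnaround
    else:
        chosen = key(queue[0])   # assumed fallback: group of the oldest entry
    return chosen, groups[chosen]


# Hypothetical transactions: (id, operation, memory bank), oldest first.
queue = [("t0", "R", 0), ("t1", "W", 0), ("t2", "R", 0), ("t3", "R", 1)]

by_op = lambda t: t[1]              # group by read / write
by_op_bank = lambda t: (t[1], t[2])  # group by "read from bank" / "write to bank"

print(arbitrate(queue, by_op))
print(arbitrate(queue, by_op, last_group="W"))
print(arbitrate(queue, by_op_bank))
```

A policy that always remains with the previous group could starve other groups; a practical arbiter would presumably bound how long one group is favored, which this sketch does not attempt to model.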
In the present arbitration methods, transactions can be serviced from any location in the arbitration queue. As a result, openings at intermediate positions in the queue can occur. To efficiently utilize the full capacity of the queue, a collapsible queue arrangement can be used.
In accordance with one embodiment of a collapsible arbitration queue, a number of registers corresponding to the number of entries in the queue are employed. A plurality of 2:1 multiplexers are interposed between the registers such that one multiplexer is interposed between a higher order register and a subsequent register with the output of the higher order register being coupled to a first input of the one multiplexer and the output of the subsequent register being coupled to a second input of the one multiplexer. An output of the one multiplexer is coupled to the subsequent register and a Mux control line is coupled to the one multiplexer to direct the contents of one of the first and second multiplexer inputs to the multiplexer output. In this way, the multiplexer select line associated with the higher order register and subsequent register determines whether the subsequent register is refreshed with its current contents or receives the contents of the higher order register.
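The register and multiplexer arrangement described above can be modeled behaviorally as follows. This is a minimal software sketch, not a circuit description: index 0 is taken to be the lowest order register, the names `mux2`, `clock` and `selects_for_service` are hypothetical, and the highest order register is assumed to load an empty value (`None`) when its select is asserted, since it has no higher order neighbor.

```python
def mux2(higher, current, select):
    """2:1 multiplexer: pass the higher order register's output when select is set."""
    return higher if select else current


def clock(registers, selects):
    """One clock edge: every register loads its multiplexer output simultaneously."""
    n = len(registers)
    nxt = []
    for i in range(n):
        # Assumption: the top register sees an empty input from above.
        higher = registers[i + 1] if i + 1 < n else None
        nxt.append(mux2(higher, registers[i], selects[i]))
    return nxt


def selects_for_service(n, k):
    """Select lines that collapse the queue after position k has been serviced."""
    return [1 if i >= k else 0 for i in range(n)]


regs = ["A", "B", "C", "D"]  # index 0 = lowest order (oldest) entry

# Service the entry at position 1 ("B"); higher order entries ripple down.
regs_after = clock(regs, selects_for_service(len(regs), 1))
print(regs_after)  # ['A', 'C', 'D', None]

# With all selects low, every register is refreshed with its current contents.
print(clock(regs, [0, 0, 0, 0]))  # ['A', 'B', 'C', 'D']
```

Because the next state is computed entirely from the old register contents, the model reflects registers that update on a common clock edge rather than one after another.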
These and other objects and features of the invention will become apparent from the detailed description of preferred embodiments in conjunction with the accompanying drawings.