The present invention relates to a computer memory access mechanism for distributed shared-memory multiprocessors, and relates more particularly to a non-inclusive memory access mechanism.
In the present invention, the local memory in each node of the shared memory is used as a backing store for blocks discarded from the processor cache, so that address binding to the local memory is delayed until the blocks are discarded from the processor cache. This avoids enforcement of the inclusion property and the long latency that the inclusion property causes.
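The delayed binding described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed mechanism: the class name `NonInclusiveNode`, the LRU choice of victim, and the frame counts are all hypothetical, and coherence and local-memory replacement are not modelled.

```python
from collections import OrderedDict


class NonInclusiveNode:
    """Hypothetical node: a small processor cache backed by local memory."""

    def __init__(self, cache_frames, memory_frames):
        self.cache = OrderedDict()  # block address -> data, oldest first
        self.memory = {}            # local memory used as a backing store
        self.cache_frames = cache_frames
        self.memory_frames = memory_frames

    def fill(self, addr, data):
        """Bring a block into the processor cache without touching local memory."""
        if addr in self.cache:
            self.cache.move_to_end(addr)
            self.cache[addr] = data
            return
        if len(self.cache) >= self.cache_frames:
            # Assumed LRU victim selection; the source does not name a policy.
            victim, victim_data = self.cache.popitem(last=False)
            self._bind_to_memory(victim, victim_data)
        self.cache[addr] = data

    def _bind_to_memory(self, addr, data):
        # Address binding to local memory happens here, at discard time.
        if addr not in self.memory and len(self.memory) >= self.memory_frames:
            raise RuntimeError("local memory full; replacement is not modelled")
        self.memory[addr] = data
```

In this sketch a block occupies a local-memory frame only after it has been discarded from the cache, so a block that is invalidated while still cached never consumes local-memory space.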
Cache-coherent non-uniform memory architecture (CC-NUMA), such as the Stanford DASH, and cache only memory architecture (COMA), such as the Kendall Square Research KSR-1, are two well-known Distributed Shared Memory (DSM) architectures. Under hardware control, both such architectures provide dynamic data migration and replication at processor caches. The difference between COMA and CC-NUMA is that COMA organizes each node's local memory as a cache to shared address space without traditional main memory, while CC-NUMA utilizes each node's local memory as a portion of shared main memory.
COMA provides dynamic migration and replication of memory blocks at the local memory under hardware control. There is some redundancy in a COMA machine between the processor cache and the local memory because their functionality overlaps (an inclusion property). COMA treats the local memory in each node, called attraction memory (AM), as a cache for the shared address space and binds the address of data to an AM block frame when the data are brought to a node, as in a traditional processor cache. A system that organizes the local memory as a cache has the disadvantage of requiring tag storage overhead and additional unallocated storage space for replicated data.
Also, the AM in a COMA machine needs coherence enforcement which may create larger overhead than in traditional multiprocessors because of its huge size and the lack of traditional memory. Due to memory block placement in such a machine, when a miss happens at an AM, a block may need to be replaced to make space for the block coming from a remote node to satisfy the miss. Since the replaced block may be the last valid copy in the system, in order to avoid disk write-back, it is desirable to have the block be relocated to some other AM that has space for it. Finding a node whose AM has space for the replaced block can create a serious problem, although a hierarchical directory scheme can provide a deterministic searching mechanism.
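The relocation problem described above can be sketched as follows. This is a simplified illustration, not the hierarchical directory scheme mentioned in the text: the function `relocate_last_copy`, the linear scan over nodes, and the set-based model of each AM are assumptions made for the example.

```python
def relocate_last_copy(block, home, attraction_memories, capacity):
    """Move `block` from node `home` to the first other node with a free AM frame.

    `attraction_memories` maps a node id to the set of blocks held in its AM.
    Returns the id of the receiving node, or None if every other AM is full,
    in which case the block would have to be written back to disk.
    """
    for node_id, am in attraction_memories.items():
        # Assumed linear search; a real machine would consult a directory.
        if node_id != home and len(am) < capacity:
            am.add(block)
            attraction_memories[home].discard(block)
            return node_id
    return None
```

The scan visits every node in the worst case, which hints at why locating free space for a replaced block can be costly without a deterministic search structure.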
In COMA machines, the size of a typical AM is relatively large compared to traditional cache memory. Although this huge size of the AM tends to eliminate capacity misses, it may also create more coherence activity, longer memory access latency, and increased memory overhead. COMA machines generally utilize an invalidation policy for enforcing coherency. Also, the inclusion property is generally applied between the processor cache and its AM as in multilevel processor caches: i.e., the cache of a processor is always a subset of its AM. As such, a memory block invalidation at the AM causes any corresponding data in the processor caches of the node to be invalidated.
With the large block size of AM in COMA machines, as compared to traditional processor cache memory, the inclusion property creates premature invalidations at the caches and may offset the advantage of low capacity misses provided by AM. This can cause excessive coherence traffic because of the higher probability of a memory block being in a "shared" state. With the invalidation policy, this will limit the attainable hit rate at the AM.
Accesses to a slow AM introduce long inter/intra-node communication latency. The inclusion property can be enforced either by broadcasting invalidation signals to the processor cache when a block in the AM is replaced, or by attaching to each AM block entry an inclusion bit that indicates whether a copy of the data is in the processor cache. However, both schemes generate intra-node traffic, which lengthens the critical path of the memory access latency.
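The two enforcement schemes can be compared with a small sketch. The class `InclusiveNode`, its message counter, and the assumption that clearing an inclusion bit costs one intra-node message are hypothetical choices made for illustration; the source states only that both schemes generate intra-node traffic.

```python
class InclusiveNode:
    """Hypothetical node whose processor cache is kept a subset of its AM."""

    def __init__(self, use_inclusion_bit):
        self.use_inclusion_bit = use_inclusion_bit
        self.cache = set()
        self.am = {}  # block -> inclusion bit (ignored in broadcast mode)
        self.intra_node_messages = 0

    def load(self, block):
        """Bind the block to the AM and copy it into the processor cache."""
        self.am[block] = True
        self.cache.add(block)

    def cache_evict(self, block):
        self.cache.discard(block)
        if self.use_inclusion_bit:
            # Assumption: the cache notifies the AM so the bit can be cleared.
            self.am[block] = False
            self.intra_node_messages += 1

    def am_remove(self, block):
        """Replace or invalidate a block at the AM, preserving inclusion."""
        in_cache = self.am.pop(block)
        if not self.use_inclusion_bit or in_cache:
            # Broadcast mode always signals the cache; bit mode only if needed.
            self.cache.discard(block)
            self.intra_node_messages += 1
```

Under these assumptions the inclusion bit avoids a signal when the block is no longer cached, but pays for it with a message at cache eviction time, so neither scheme is free of intra-node traffic.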
By contrast, in CC-NUMA machines there is no inclusion property between the processor cache and its local memory. In most cases, the shared data in the processor cache, which will be invalidated, do not exist in the local memory of a CC-NUMA machine. This saves memory space and reduces the number of accesses to the slow local memory in CC-NUMA machines as compared to those accesses in COMA machines.
COMA has the disadvantage of longer latency than CC-NUMA for some memory accesses; in particular, the hierarchical directory scheme in COMA has been observed to incur higher network latency. Although the non-hierarchical COMA-F can reduce the network latency overhead, data accesses caused by cache misses can still have longer latency than in CC-NUMA because of the time required to check the AM tag, which is necessary for those accesses.
Moreover, the inclusion property enforced between the processor cache and its AM causes frequent accesses to the AM. For example, consider the following situation in view of the simple configuration shown in FIGS. 1A-1C, which show two nodes (Node.sub.i and Node.sub.j, with processors P.sub.i and P.sub.j) at three sequential points in time, wherein each node includes a single processor, 2 processor cache blocks and 4 AM cache blocks.
Initially (at t.sub.0, see FIG. 1A), at Node.sub.i the processor cache of processor P.sub.i is using memory blocks A and D while its 4-block AM contains memory blocks A-D, and at Node.sub.j one of the processor cache blocks being used by processor P.sub.j is block E, while one block of its AM is block E. If processor P.sub.i wants to dump its block A to read the shared block E at time t.sub.1, block A in the Node.sub.i AM must be replaced by block E to maintain inclusion of block E at Node.sub.i: block E is copied to the AM frame originally occupied by block A at Node.sub.i and then to the P.sub.i processor cache (see FIG. 1B). Further, if the AM block A being replaced by block E is the last copy of that block in the system, it must be relocated to a free AM frame at another node, which requires at least two more AM accesses (a read and a write) to store block A at the remote location.
Moreover, as illustrated in FIG. 1C, if at time t.sub.2 the processor P.sub.j at Node.sub.j writes to data in block E of its processor cache, then under the write-invalidation policy all copies of block E in the AMs (at both nodes), as well as block E in the P.sub.i processor cache, must be invalidated, leaving the just-written copy in the P.sub.j processor cache as the only valid copy of block E. Access to the AM tag is necessary to invalidate block E at Node.sub.i. Using traditional schemes to enforce the inclusion property therefore lengthens the critical path of the memory access latency. Allowing incoming messages from the network to access the processor cache directly can hide the extra AM accesses from the critical path. However, the extra AM accesses can still lengthen the latency of other memory accesses because of contention at the AM.
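The sequence of FIGS. 1A-1C can be traced with a short sketch. The names `ComaNode`, `read_shared`, and `write_block` are hypothetical, each set holds only valid copies, block A is assumed to be the last copy and to be relocated to a free AM frame at Node.sub.j, and the access counter charges one AM access per read, write, or tag invalidation, which is an accounting assumption rather than a figure from the source.

```python
class ComaNode:
    """Hypothetical COMA node holding sets of valid blocks."""

    def __init__(self, cache, am):
        self.cache = set(cache)  # processor cache (2 frames in FIG. 1)
        self.am = set(am)        # attraction memory (4 frames in FIG. 1)
        self.am_accesses = 0


def read_shared(reader, owner, block, victim, victim_is_last_copy):
    """t1: reader drops `victim` and reads `block`, which `owner` holds."""
    reader.cache.discard(victim)
    if victim_is_last_copy:
        reader.am_accesses += 1  # read the victim out of the local AM
        owner.am.add(victim)     # assumes the remote AM has a free frame
        owner.am_accesses += 1   # write the victim into the remote AM
    reader.am.discard(victim)
    reader.am.add(block)         # inclusion: bind the block in the AM first
    reader.am_accesses += 1
    reader.cache.add(block)


def write_block(writer, others, block):
    """t2: writer updates `block`; every other copy is invalidated."""
    writer.am.discard(block)     # the writer's AM copy is now stale
    writer.am_accesses += 1
    for node in others:
        if block in node.am:
            node.am.discard(block)
            node.am_accesses += 1  # AM tag access needed to invalidate
            node.cache.discard(block)
```

Starting from `ComaNode({"A", "D"}, {"A", "B", "C", "D"})` for Node.sub.i and `ComaNode({"E"}, {"E"})` for Node.sub.j, calling `read_shared` and then `write_block` leaves block E valid only in the cache of Node.sub.j, and each step adds AM accesses at both nodes.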
Because of the ill effects of the inclusion property in COMA machines, it may be advantageous to relax the inclusion property somewhat to improve their performance. However, if the inclusion property is relaxed just for read-only (shared) data, there is a limitation in taking advantage of CC-NUMA since the writable (shared) block can be in an invalidated state. Also, the read-only block may soon be replaced from the node since the processor cache is small. This may decrease the hit rate if the read-only block is referenced shortly thereafter. If the writable shared block is allowed to reside in the processor cache without saving a copy in the AM, the coherence control may be complicated, since more state information needs to be controlled at the cache.
Further, if address binding of data to an AM frame occurs when a missing block is brought to the node (block-incoming time), the inclusion property may waste AM space in the sense that a block which will soon be swapped or invalidated uses up AM space. If the block is not bound to the AM and will not be invalidated in the near future, the block may be replaced from the node because of the small size of the processor cache. Thus, either approach limits the utilization of the large caching space of the AM, thereby diminishing the advantage of COMA machines.
Accordingly, there is a need for a variation on the COMA design that avoids enforcement of the inclusion property and the long latency due to the inclusion property.