A shared memory multiprocessor (SMP) consists of processor nodes and memories combined into a scalable configuration. Each node has one or more processors and its own local memory. Optionally, each node has a cache and a cache controller for accessing main memory efficiently and enforcing consistency. Unlike a network of workstations, all nodes of an SMP share the same global address space. Hence, software techniques for mapping the global address space into local addresses are typically not needed in an SMP. An SMP also has fast interconnection networks that are used to access the distributed memory and pass consistency information. In some systems, the physical memory is distributed; these machines are referred to as non-uniform memory access (NUMA) machines. Machine types other than NUMA also exist. In an exemplary system, each processor in a node generally has a write-through first-level cache and a write-back second-level cache. If there is more than one processor per node, cache coherence must be maintained between processors within a node as well as between nodes.
Because access to main memory is slow compared to the processor speed, hardware caches are necessary for acceptable performance. However, since all processors (and caches) share the same global address space, it is possible that two different caches will cache the same data line (address) at the same time. If one processor updates the data in its cache without informing the other processor in some manner, an inconsistency results, and it becomes possible that the other processor will use a stale data value. The goal of cache coherency is to enforce consistency to ensure proper execution of programs in this parallel environment.
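The stale-value scenario described above can be illustrated with a minimal sketch. It is a toy model only, not a description of any particular hardware: the `ToyCache` class, its `peers` list, the `run` helper, and the address `0x100` are all hypothetical names chosen for illustration, and the enforcement shown (write-through plus invalidation of peer copies) is just one of several possible strategies.

```python
class ToyCache:
    """Toy private cache (hypothetical model): address -> value, no eviction."""

    def __init__(self, memory):
        self.memory = memory   # shared "main memory" dict: address -> value
        self.lines = {}        # this cache's private copies
        self.peers = []        # caches to notify on a write (empty = no coherence)

    def read(self, addr):
        if addr not in self.lines:               # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]                  # hit: return the cached copy

    def write(self, addr, value):
        self.lines[addr] = value                 # update the local copy
        self.memory[addr] = value                # write through to main memory
        for peer in self.peers:                  # assumed write-invalidate step
            peer.lines.pop(addr, None)


def run(coherent):
    """Return the value cache B reads after cache A writes the same line."""
    memory = {0x100: 1}
    a, b = ToyCache(memory), ToyCache(memory)
    if coherent:
        a.peers, b.peers = [b], [a]
    a.read(0x100)
    b.read(0x100)              # both caches now hold the line
    a.write(0x100, 2)          # A updates the line
    return b.read(0x100)


print(run(coherent=False))     # B still sees the old value
print(run(coherent=True))      # B's copy was invalidated, so it refetches
```

Without any notification, cache B keeps returning its old copy even though main memory has been updated; with the invalidation step, B misses on its next read and fetches the current value.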
There are at least two major factors affecting cache mechanisms: performance and implementation cost. The need for greater performance is clear: programs designed for shared memory multiprocessors typically have very long execution times, so any performance increase is beneficial.
If the time to access main memory is too long, performance degrades significantly, and potential parallelism is lost. Implementation cost is also an issue because the performance must be obtained at a reasonable cost. Implementation costs are incurred by adding coherence hardware or by developing compilers that enforce consistency. In addition to these two major factors, there are four primary issues to consider when designing a cache coherence mechanism. The first is the coherence detection strategy, which is how the system detects possibly incoherent memory accesses. The second is the coherence enforcement strategy, which is how cache entries are changed to guarantee coherence (that is, by updating or invalidating). The third is the precision of block-sharing information, which is how sharing information for cache and memory blocks is stored. The fourth is the cache block size, which is the size of a line in the cache, and how it affects system performance.
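As a rough illustration of how block size interacts with the enforcement strategy, the following sketch counts invalidation messages for a sequence of writes. It assumes a write-invalidate policy and a precise per-block sharer set; the function name `count_invalidations`, the write trace, and the block sizes are hypothetical values chosen for the example, not taken from any specific system.

```python
def count_invalidations(writes, block_size):
    """Count invalidations under an assumed write-invalidate policy.

    writes: list of (processor_id, byte_address) pairs, in program order.
    block_size: cache line size in bytes.
    """
    holders = {}               # block number -> processors holding a valid copy
    invalidations = 0
    for proc, addr in writes:
        block = addr // block_size               # which line the address maps to
        sharers = holders.get(block, set())
        invalidations += len(sharers - {proc})   # every other holder is invalidated
        holders[block] = {proc}                  # the writer is now the only holder
    return invalidations


# Two processors alternately write to two different addresses.
trace = [(0, 0), (1, 32), (0, 0), (1, 32)]

print(count_invalidations(trace, block_size=16))   # addresses fall in separate lines
print(count_invalidations(trace, block_size=64))   # addresses share one line
```

With 16-byte lines the two addresses map to different blocks and no invalidations occur; with 64-byte lines they share a block, so each write after the first invalidates the other processor's copy even though the processors never touch the same data.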
Custom DRAM chips (CDRAM) were introduced to eliminate some of these inherent problems. However, while CDRAMs can have extremely high bandwidth on dedicated buses, they often lack any logic execution capability, which compounds the coherency problem.
Therefore, there is a need for maintaining coherency in large caches in a manner that addresses at least some of the problems of conventional coherency maintenance in very large caches.