Large-scale multiprocessors with a single address space and coherent caches offer a flexible and powerful computing environment. Together, the single address space and coherent caches ease the problems of data partitioning and dynamic load balancing. They also provide better support for parallelizing compilers, standard operating systems, and multiprogramming, thus enabling more flexible and effective use of the machine. Currently, many research groups are pursuing the design and construction of such multiprocessors. As research in this area has progressed, two variants have emerged: Cache-Coherent Non-Uniform Memory Access (CC-NUMA) machines and Cache-Only Memory Architecture (COMA) machines.
Both CC-NUMA and COMA machines have a distributed main memory, a scalable interconnection network, and directory-based cache coherence. Distributed main memory and a scalable interconnection network are essential for providing the required scalable memory bandwidth. Directory-based schemes provide cache coherence while consuming only a small fraction of the system bandwidth and without requiring message broadcasts. In contrast to CC-NUMA machines, COMA machines convert the per-node main memory into a large secondary or tertiary cache, called the Attraction Memory (AM). The conversion is made by adding tags to cache-line-sized partitions of data in main memory. As a consequence, the location of a data item in the machine is decoupled from its physical address, and the data item is automatically migrated or replicated in main memory depending on the memory reference pattern.
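The effect of tagging main memory can be illustrated with a minimal sketch. It is not a description of any particular COMA machine: the class `AttractionMemory`, the function `read`, the 64-byte line size, and the set-associative organization are all assumptions made for illustration. The linear search over nodes is only a stand-in for the directory lookup a real machine would use.

```python
LINE_SIZE = 64  # bytes per line; an assumed value, not taken from the text


class AttractionMemory:
    """One node's main memory, modeled as a tagged, set-associative cache."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # Each set maps a tag to line data. The tag records which physical
        # line currently occupies the slot, so any line can live on any node.
        self.sets = [dict() for _ in range(num_sets)]

    def _index_and_tag(self, addr):
        line = addr // LINE_SIZE
        return line % self.num_sets, line // self.num_sets

    def lookup(self, addr):
        """Return the line holding addr, or None if it is not present here."""
        index, tag = self._index_and_tag(addr)
        return self.sets[index].get(tag)

    def install(self, addr, data):
        """Place a line in this node's memory (no replacement policy here)."""
        index, tag = self._index_and_tag(addr)
        slot = self.sets[index]
        if tag not in slot and len(slot) >= self.ways:
            raise RuntimeError("set full; a replacement policy is needed")
        slot[tag] = data


def read(nodes, requester, addr):
    """Read addr from node `requester`, replicating the line locally on a miss."""
    local = nodes[requester].lookup(addr)
    if local is not None:
        return local
    # Stand-in for the directory lookup: scan the other nodes for a copy.
    for i, node in enumerate(nodes):
        if i != requester:
            data = node.lookup(addr)
            if data is not None:
                nodes[requester].install(addr, data)  # replicate near the reader
                return data
    raise KeyError(addr)
```

After the first remote read, later reads of the same line by that node hit in its own attraction memory, which is the replication behavior described above.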
An advantage of COMA machines is that they can reduce the average cache miss latency, because data is dynamically migrated and replicated at the main-memory level. However, there are also several disadvantages. First, allowing data to migrate at the memory level requires a mechanism to locate the data on a miss. To avoid broadcasting such requests, current machines use a hierarchical directory structure, which increases the miss latency for global requests. Second, the coherence protocol is more complex because it must ensure that the last copy of a data item is not replaced in the attraction memory (i.e., main memory). Third, compared to CC-NUMA machines, there is additional complexity in the design of the main-memory subsystem and in the interface to the disk subsystem.
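The last-copy constraint can be made concrete with a small sketch. The names `Node` and `evict` are hypothetical, and relocating the line to the first node with free space is an assumed policy chosen for brevity; it is not the replacement strategy of any specific machine.

```python
class Node:
    """A processing node whose attraction memory holds a bounded number of lines."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.lines = {}  # line address -> data

    def has_room(self):
        return len(self.lines) < self.capacity


def evict(nodes, victim_node, addr):
    """Evict addr from victim_node without losing the last copy in the machine.

    Returns the name of the node that received the relocated line, or None
    if another copy already existed and the line was simply dropped.
    """
    data = victim_node.lines[addr]
    others = [n for n in nodes if n is not victim_node]
    # If any other node holds a copy, this one can be discarded safely.
    if any(addr in n.lines for n in others):
        del victim_node.lines[addr]
        return None
    # Otherwise this is the last copy and must be moved, not dropped.
    for n in others:
        if n.has_room():
            n.lines[addr] = data
            del victim_node.lines[addr]
            return n.name
    raise RuntimeError("no node can accept the last copy")
```

A CC-NUMA machine needs no such step, since every line has a fixed home location in main memory; this extra case is one source of the added protocol complexity noted above.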