Cache memories have been used to improve processor performance while maintaining reasonable system costs. A cache memory is a very fast buffer comprising an array of local storage cells used by one or more processors to hold copies of frequently requested data. A typical cache memory system comprises a hierarchy of memory structures, which usually includes a local (L1), on-chip cache that represents the first level in the hierarchy. A secondary (L2) cache is often associated with the processor to provide an intermediate level of cache memory between the processor and main memory. Main memory, also commonly referred to as system or bulk memory, lies at the bottom (i.e., slowest, largest) level of the memory hierarchy.
In a conventional computer system, a processor is coupled to a system bus that provides access to main memory. An additional backside bus may be utilized to couple the processor to an L2 cache memory. Other system architectures may couple the L2 cache memory to the system bus via its own dedicated bus. Most often, L2 cache memory comprises a static random access memory (SRAM) that includes a data array, a cache directory, and cache management logic. The cache directory usually includes a tag array, tag status bits, and least recently used (LRU) bits. (Each directory entry is called a “tag”.) The tag RAM contains the main memory addresses of code and data stored in the data RAM, plus additional status bits used by the cache management logic.
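The directory organization described above can be illustrated with a minimal sketch. It is not drawn from any particular design: the set-associative layout, the line size, the number of sets and ways, the single valid bit standing in for the tag status bits, and the counter-based LRU field are all assumptions chosen for illustration, as are the names `DirectoryEntry` and `CacheDirectory`.

```python
from dataclasses import dataclass

# Assumed geometry, chosen only to keep the example small.
LINE_BYTES = 64
NUM_SETS = 4
WAYS = 2


@dataclass
class DirectoryEntry:
    """One directory entry (a "tag") with its status and LRU information."""
    tag: int = 0
    valid: bool = False  # a single status bit (assumption: real designs hold more)
    lru: int = 0         # timestamp of last use; smaller means older


class CacheDirectory:
    """Hypothetical set-associative directory; the data array is omitted."""

    def __init__(self, num_sets=NUM_SETS, ways=WAYS, line_bytes=LINE_BYTES):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.sets = [[DirectoryEntry() for _ in range(ways)]
                     for _ in range(num_sets)]
        self.clock = 0  # advances on every access to order entries by recency

    def split(self, address):
        """Return (tag, set index) for a main memory address."""
        line = address // self.line_bytes
        return line // self.num_sets, line % self.num_sets

    def access(self, address):
        """Return True on a hit; on a miss, fill an entry and return False."""
        tag, index = self.split(address)
        self.clock += 1
        entries = self.sets[index]
        for entry in entries:
            if entry.valid and entry.tag == tag:
                entry.lru = self.clock
                return True
        # Miss: prefer an invalid entry, otherwise evict the least recently used.
        victim = min(entries, key=lambda e: (e.valid, e.lru))
        victim.tag, victim.valid, victim.lru = tag, True, self.clock
        return False
```

With this geometry, addresses 0, 256, and 512 all map to set 0, so accessing them in turn exercises the LRU replacement in a two-way set.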
Today, many integrated circuit manufacturers are designing chips with multiple processing cores, also known as chip multiprocessors (CMPs). The basic idea of CMPs is to extract thread-level parallelism once instruction-level parallelism enters the territory of diminishing returns. Increasing the number of processing elements on a chip places severe demands on memory bandwidth because of the many execution contexts that could all be running simultaneously. Memory bandwidth is pin-limited: the number of pins connecting a processor chip to memory does not grow at the same rate as the number of transistors or the number of processors on the chip. Therefore, the bandwidth to memory is becoming a performance bottleneck.
To alleviate the memory bandwidth bottleneck, large on-die cache memories are needed. Large on-die cache memories are typically subdivided into multiple cache memory banks, which are then coupled to a wide data bus (e.g., 32 bytes, or 256 bits, wide). In a very large cache memory comprising multiple banks, one problem that arises is the large resistive-capacitive (RC) signal delay associated with the long bus lines when driven at a high clock rate (e.g., 1 GHz). Further, various banks of the cache may be wired differently and employ different access technologies.
In non-uniform cache architecture (NUCA) caches, the latency to a bank generally depends on the bank's proximity to the device making the request, which frequently is a core or a processor. NUCA takes advantage of the faster response times of banks closer to the processor and allows farther banks to respond more slowly.
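The dependence of latency on proximity can be illustrated with a small sketch. The model is an assumption made for illustration only: banks are placed on a grid, distance is measured in grid steps (Manhattan distance), and the constants `BASE_CYCLES` and `CYCLES_PER_HOP` are hypothetical values rather than figures from any actual design.

```python
from dataclasses import dataclass

# Hypothetical timing constants, in clock cycles.
BASE_CYCLES = 3      # assumed time to access a bank once it is reached
CYCLES_PER_HOP = 2   # assumed wire delay per grid step


@dataclass(frozen=True)
class Bank:
    """A cache bank at an assumed (x, y) position on the die."""
    name: str
    x: int
    y: int


def access_latency(requester, bank):
    """Latency in cycles from a requester at (x, y) to the given bank."""
    hops = abs(bank.x - requester[0]) + abs(bank.y - requester[1])
    return BASE_CYCLES + CYCLES_PER_HOP * hops


def banks_by_latency(requester, banks):
    """Order banks from fastest to slowest as seen by one requester."""
    return sorted(banks, key=lambda b: access_latency(requester, b))
```

Under these assumptions, a core at (0, 0) reaches a bank at (0, 1) in 5 cycles and a bank at (3, 2) in 13 cycles, so the same bank may be fast for one core and slow for another.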