In order to enhance processing throughput, computer processor systems typically employ a series of cache memory nodes to provide low-latency access to the blocks of memory stored therein, as compared with the latency of accessing main memory. The cache memory nodes hold copies of frequently accessed memory blocks from main memory.
In a system with multiple processors and multiple cache memory nodes, multiple copies of the same memory block may exist within the system. Two examples of such systems are shown in FIGS. 1A and 1B. Systems such as these require a mechanism for keeping the multiple copies of each memory block synchronised, such that when the value stored in a particular copy of a particular memory block in a particular cache memory node is changed, that change is reflected in every other cache memory node and main memory node in the system that holds a copy of said particular memory block.
Often a coherency protocol, such as Modified, Owned, Shared, Invalid (MOSI), is used to manage such synchronisation, maintaining consistency between cache memory nodes and main memory nodes. One method the coherency protocol may use to accomplish this is to delay committing a change to a value in a cache memory node until all other copies of the memory block containing that value have been deleted from the other cache memory nodes in the system, stalling the processor node connected to that cache memory node in the meantime.
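The invalidate-before-write behaviour described above can be sketched as follows. This is a minimal illustrative model only, not any particular hardware implementation; the names `CacheNode` and `write_block` are hypothetical.

```python
# Illustrative sketch: a MOSI-style protocol delays committing a write
# until every other cache's copy of the block has been invalidated.
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    OWNED = "O"
    SHARED = "S"
    INVALID = "I"

class CacheNode:
    def __init__(self):
        self.blocks = {}  # block address -> (state, value)

    def state_of(self, addr):
        state, _ = self.blocks.get(addr, (State.INVALID, None))
        return state

def write_block(writer, others, addr, value):
    """Invalidate all other copies, then commit the write locally."""
    for node in others:
        if node.state_of(addr) != State.INVALID:
            node.blocks[addr] = (State.INVALID, None)  # delete the copy
    writer.blocks[addr] = (State.MODIFIED, value)      # now commit

# Usage: two caches share a block; a write by one invalidates the other.
a, b = CacheNode(), CacheNode()
a.blocks[0x40] = (State.SHARED, 7)
b.blocks[0x40] = (State.SHARED, 7)
write_block(a, [b], 0x40, 8)
```

In a real processor the stall corresponds to the time spent waiting for the invalidation loop to complete before the write is realised.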
In order for the coherency protocol to function, it needs a way of discovering which memory blocks are stored in each cache memory node in the system. Often a directory node is used to provide the coherency protocol with a list of the memory blocks held in each cache memory node.
As the number of processor nodes in a multi-processor system increases, the directory node becomes a bottleneck, because every memory access performed by any cache memory node in the system must reference the directory node, limiting the performance of the system. This in turn limits the maximum number of processors that can operate efficiently in the system. Thus a need exists to increase the capacity of the directory node.
US2003/0005237 provides a processor-cache operational scheme and topology within a multi-processor data processing system having a shared lower-level cache (or memory), by which the number of coherency buses is reduced and more efficient snoop resolution and coherency operations with the processor caches are provided. As illustrated in FIG. 2, L2 cache 209 includes a copy 207A′, 207B′, 207C′ and 207D′ of each L1 directory 207A, 207B, 207C and 207D. Precise images of L1 directories 207A, 207B, 207C and 207D are maintained whenever the L1 caches 205A, 205B, 205C and 205D are modified, either by local processor operations or by other external operations. The illustrated cache configuration and coherency operational characteristic eliminates the need to issue coherency operations (e.g. snoops) directly to the L1 directories within the processor modules.
Although US2003/0005237 does provide a performance improvement over a system that must issue snoops, it has a significant weakness. The L2 cache 209 contains only a single copy of each L1 directory, and can therefore service a request from only one processor node at a time. For example, if all four processor nodes 201A, 201B, 201C and 201D were to issue coherent memory requests A0, A1, A2 and A3 at the same time, the requests would need to be queued and serviced one at a time. For this reason the L2 cache may become a performance bottleneck.
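The serialisation described above can be quantified with a back-of-envelope sketch: if each directory copy services one request per cycle (an illustrative assumption, not a figure from the cited document), four simultaneous requests queue behind a single copy for four cycles, while replicated copies would drain the queue proportionally faster.

```python
# Back-of-envelope model of directory serialisation. The one-request-
# per-cycle service rate is an illustrative assumption.
import math

def cycles_to_drain(num_requests, directory_copies):
    """Cycles needed to service all requests if each directory copy
    handles one request per cycle."""
    return math.ceil(num_requests / directory_copies)

# Four simultaneous requests (A0..A3) against a single directory copy
# are serviced one at a time, as in the example above.
single = cycles_to_drain(4, 1)
replicated = cycles_to_drain(4, 4)
```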