It is known that, in order to overcome the limitations of scalability of symmetrical multi-processor architectures (several processors connected to a system bus by means of which they have access to a shared memory), amongst various solutions, a new type of architecture defined as "cache-coherent, non-uniform memory access" architecture has been proposed.
This modular architecture is based on the grouping of the various processors in a plurality of "nodes" and on the division of the working memory of the system into a plurality of local memories, one per node.
Each node thus comprises one or more processors which communicate with the local memory by means of a local bus. Each node also comprises a bridge for interconnecting the node with other nodes by means of a communication channel in order to form a network of intercommunicating nodes.
The communication channel, which is known per se, may be a network (a mesh router), a ring, several rings, a network of rings, or the like.
Each processor of a node can access, by means of the interconnection bridge and the communication channel, data held in the local memory of any of the other nodes, which is regarded as remote memory, by sending a message to the node, the memory of which contains the required data.
Operations by which a processor accesses the local memory of its own node are fairly quick: they require only access to the local bus and the presentation, on the local bus, of a memory address, of a code which defines the type of operation required and, for a write operation, of the data to be written. In the case of data resident in or destined for other nodes, on the other hand, it is necessary not only to access the local bus but also to activate the interconnection bridge, to send a message to the destination node by means of the communication channel and, by means of the interconnection bridge and the local bus of the destination node, to obtain access to the memory resources of the destination node, which returns a response message, including the required data where appropriate, by the same path.
Even if they are carried out by hardware without any software intervention, these operations take much longer (even by one order of magnitude) to execute than local memory-access operations.
For this reason, architecture of this type is defined as "non-uniform memory access" ("NUMA") architecture.
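The effect of this non-uniformity on overall performance can be illustrated with a simple weighted average. The following is a minimal sketch only: the function name and the latency figures (100 ns local, 1000 ns remote, consistent with the order-of-magnitude gap mentioned above) are illustrative assumptions and are not taken from any particular system.

```python
def average_access_time(local_ns: float, remote_ns: float, remote_fraction: float) -> float:
    """Mean memory-access time when a given fraction of accesses is remote.

    All parameters are hypothetical; the model ignores caches and contention.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must lie in [0, 1]")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


# Assumed figures: a remote access costs ten times a local one,
# and one access in four is remote.
print(average_access_time(100.0, 1000.0, 0.25))  # 325.0
```

Under these assumed figures, a quarter of the accesses being remote more than triples the mean access time, which is why reducing the cost of remote accesses matters.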
It is advisable to reduce access time as much as possible, both in the case of local memory access and in the case of access to the memories of other nodes.
For this purpose, the various processors are provided, in known manner, with at least one cache and preferably two associative cache levels for storing the most frequently used blocks of data, which are copies of blocks contained in the working memory.
Unlike the local memories which, for cost reasons, are constituted by large-capacity dynamic memories (DRAM), the caches are implemented by much faster static memories (SRAM) and are associative (at least the first-level ones are preferably "fully associative").
A problem therefore arises in ensuring the coherence of the data which is replicated in the various caches and in the local memories.
Within each node this can be achieved very simply, in known manner, by means of "bus watching" or "snooping" operations on the local bus and the use of suitable coherence protocols such as, for example, that known by the acronym MESI.
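By way of illustration, the snooping behaviour within a node can be sketched as a small state machine. The sketch below is a simplified model under stated assumptions: it tracks a single block, the class and method names are hypothetical, and write-back of modified data to memory is not modelled. It shows only the standard MESI state transitions (Modified, Exclusive, Shared, Invalid) caused by reads and writes observed on the local bus.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class SnoopingBus:
    """MESI state of ONE block in each cache attached to one local bus."""

    def __init__(self, num_caches: int):
        self.states = [State.INVALID] * num_caches

    def read(self, cache_id: int) -> None:
        if self.states[cache_id] is not State.INVALID:
            return  # read hit: no bus transaction, state unchanged
        # Read miss: the other caches snoop the bus read.
        others_hold_copy = False
        for i, state in enumerate(self.states):
            if i != cache_id and state is not State.INVALID:
                others_hold_copy = True
                self.states[i] = State.SHARED  # M, E or S all end up in S
        self.states[cache_id] = State.SHARED if others_hold_copy else State.EXCLUSIVE

    def write(self, cache_id: int) -> None:
        # From M or E the write is silent; from S or I the other copies
        # are invalidated by a bus transaction first.
        if self.states[cache_id] not in (State.MODIFIED, State.EXCLUSIVE):
            for i in range(len(self.states)):
                if i != cache_id:
                    self.states[i] = State.INVALID
        self.states[cache_id] = State.MODIFIED


bus = SnoopingBus(3)
bus.read(0)   # cache 0: Exclusive
bus.read(1)   # caches 0 and 1: Shared
bus.write(1)  # cache 1: Modified, cache 0: Invalid
```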
However, the first- and second-level caches associated with each processor of a node may also contain data resident in the local memory or in the caches of other nodes.
This considerably complicates the problem of ensuring the data coherence.
In fact any local data resident in the local memory of a node may be replicated in one or more of the caches of other nodes.
It is therefore necessary for every local operation which modifies a datum in the local memory of the node, or which implicitly invalidates it (by modification of a copy held in a cache), to be communicated to the other nodes in order to invalidate any copies present therein. It is generally preferred to invalidate copies rather than to update them, since invalidation is simpler and quicker.
To avoid this burden, which limits the performance of the system, it has been proposed (for example, in Proceedings of the 17th Annual International Symposium on Computer Architecture, IEEE 1990, pages 148-159: D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, J. Hennessy, "The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor") to associate with every local memory a directly mapped "directory" which is formed with the same technology as the local memory, that is, DRAM technology, and which specifies, for each block of data in the local memory, whether and in which other nodes it is replicated and possibly whether it has been modified in one of these nodes.
As a further development, to reduce the size of the directory and to increase its speed, it has been proposed to form this directory as an associative static SRAM memory.
Only the transactions which require the execution of coherence operations are thus communicated to the other nodes.
On the other hand, it is necessary to bear in mind that a datum stored in the local memory of one node may be replicated in a cache of another node and may be modified therein.
It is therefore necessary, when the modification takes place, for the operation to be reported to the node in whose local memory the datum is resident, in order to update the state of the directory and, where necessary, to invalidate any copies of the datum held in caches.
The use of the directory associated with the local memory ensures the coherence of the data between the nodes; these architectures are therefore defined as cc-NUMA architectures.
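The role of the directory can be sketched as follows. This is a minimal illustrative model, not the DASH protocol itself: the class names, method names and the dictionary-based storage are assumptions made for the example, and data transfer and write-back are not modelled. It shows how the directory records which other nodes hold a copy of a local block, so that invalidation messages are sent only to those nodes, and how a modification made in another node is recorded.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DirectoryEntry:
    sharers: set = field(default_factory=set)  # other nodes holding a copy
    modified_by: Optional[int] = None          # node holding a modified copy, if any


class Directory:
    """Directory associated with the local memory of one node (hypothetical model)."""

    def __init__(self):
        self.entries = {}  # block address -> DirectoryEntry

    def remote_read(self, block: int, node: int) -> None:
        entry = self.entries.setdefault(block, DirectoryEntry())
        if entry.modified_by not in (None, node):
            # Assumption: the modifying node has written the block back.
            entry.modified_by = None
        entry.sharers.add(node)

    def local_write(self, block: int) -> list:
        """Return the nodes that must receive an invalidation message."""
        entry = self.entries.pop(block, None)
        if entry is None:
            return []  # block not replicated: no inter-node traffic needed
        return sorted(entry.sharers)

    def remote_write(self, block: int, node: int) -> list:
        """Record a modification made in another node; return nodes to invalidate."""
        entry = self.entries.setdefault(block, DirectoryEntry())
        to_invalidate = sorted(entry.sharers - {node})
        entry.sharers = {node}
        entry.modified_by = node
        return to_invalidate


directory = Directory()
directory.remote_read(0x100, node=2)
directory.remote_read(0x100, node=3)
targets = directory.remote_write(0x100, node=3)  # only node 2 is notified
```

A write to a block that has no directory entry returns an empty list, which corresponds to the case in which no coherence message needs to leave the node.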
However, the use of a directory associated with the local memory does not solve the problem of speeding up access to data resident in the local memory of other nodes and thus improving the performance of the system as a whole.
To achieve this result, use is made of a so-called remote cache (RC) which stores locally in a node the blocks of data most recently used and retrieved from remote memories, that is, from the local memories of other nodes.
This remote cache, which has to serve all of the processors of a node, is a third-level cache additional to the caches of the various processors of the node.
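The behaviour of such a remote cache can be sketched as follows. The example is a simplified model under stated assumptions: the class name, the `fetch_remote` callback (which stands in for the exchange of messages with the other node) and the least-recently-used replacement policy are illustrative choices, not details given above.

```python
from collections import OrderedDict


class RemoteCache:
    """Holds the most recently used blocks retrieved from other nodes."""

    def __init__(self, capacity_blocks: int, fetch_remote):
        if capacity_blocks < 1:
            raise ValueError("capacity_blocks must be at least 1")
        self.capacity = capacity_blocks
        self.fetch_remote = fetch_remote  # hypothetical: (node, address) -> block
        self.blocks = OrderedDict()       # (node, address) -> block, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, home_node: int, address: int):
        key = (home_node, address)
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.blocks[key]
        self.misses += 1
        block = self.fetch_remote(home_node, address)  # slow inter-node access
        self.blocks[key] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return block

    def invalidate(self, home_node: int, address: int) -> None:
        """Drop a block when the node holding it in local memory invalidates it."""
        self.blocks.pop((home_node, address), None)


def fetch_from_node(node, address):
    # Stand-in for the message exchange over the communication channel.
    return f"block@{node}:{address:#x}"


remote_cache = RemoteCache(capacity_blocks=2, fetch_remote=fetch_from_node)
remote_cache.read(1, 0x0)  # miss: fetched from node 1
remote_cache.read(1, 0x0)  # hit: served locally
```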
Known systems with cc-NUMA architecture therefore integrate this remote cache as a component associated with the interconnection bridge or remote controller of the node. As a consequence, the remote cache is either fast but of limited capacity, if implemented as static memory (SRAM), or of large capacity but slow, both in executing the access operation and in validating/invalidating it, if implemented as DRAM.
It has also been proposed to implement the remote cache with a hybrid structure, as DRAM for storing blocks of data and as SRAM for storing the "TAGS" identifying the blocks and their state, so as to speed up the validation/invalidation of the access operations and the possible activation of the exchange of messages between nodes, if required.
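The benefit of keeping the tags separate from the data can be sketched as follows. The model is illustrative only: a direct-mapped organization, a 64-byte block size and the class and attribute names are all assumptions made for the example. The lists `tags` and `valid` stand for the fast SRAM part, the list `data` stands for the slow DRAM part, and the counter `data_accesses` shows that misses and invalidations are resolved without touching the data store.

```python
class HybridRemoteCache:
    """Remote cache with a separate tag store (SRAM) and data store (DRAM)."""

    def __init__(self, num_sets: int, block_size: int = 64):
        self.num_sets = num_sets
        self.block_size = block_size
        self.tags = [None] * num_sets    # models the SRAM tag array
        self.valid = [False] * num_sets  # models the per-block state
        self.data = [None] * num_sets    # models the DRAM data array
        self.data_accesses = 0           # counts slow data-array accesses

    def _split(self, address: int):
        block_number = address // self.block_size
        return block_number % self.num_sets, block_number // self.num_sets

    def lookup(self, address: int):
        index, tag = self._split(address)
        if self.valid[index] and self.tags[index] == tag:
            self.data_accesses += 1      # hit: only now is the data array read
            return self.data[index]
        return None                      # miss decided from the tags alone

    def fill(self, address: int, block) -> None:
        index, tag = self._split(address)
        self.tags[index], self.valid[index] = tag, True
        self.data[index] = block
        self.data_accesses += 1

    def invalidate(self, address: int) -> None:
        index, tag = self._split(address)
        if self.valid[index] and self.tags[index] == tag:
            self.valid[index] = False    # touches only the tag/state array


cache = HybridRemoteCache(num_sets=4)
cache.fill(0x0, b"remote block")
cache.invalidate(0x0)  # no access to the data array is needed
```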
However, the implementation of the remote cache as an independent memory also requires the support of a dedicated memory control unit. It is also inflexible: although the memory capacity can be configured within the design limits, it is fixed at installation, since it depends on the number of memory components installed, and it cannot be varied upon initialization (booting) of the system according to user requirements which may arise at any particular time.