A plurality of processors share a memory in a shared-memory multi-processor system. Therefore, copies of the same data block in the memory may exist at the same time in a plurality of cache memories. Thus, state information indicating the states of data blocks needs to be managed to execute data processing while maintaining the cache coherency.
Main examples of the states of the data blocks include shared (S; Shared), exclusive not updated (clean) (E; Exclusive Clean), exclusive updated (dirty) (M; Modified), and invalid (I; Invalid). The cache protocol including these four states, i.e., M, E, S, and I, is referred to as MESI. Hereinafter, the shared state will be expressed as “S” or “Shared”, the exclusive not updated state will be expressed as “E” or “Exclusive”, the exclusive updated state will be expressed as “M” or “Modified”, and the invalid state will be expressed as “I” or “Invalid”.
The state S is a state in which the data block to be processed is a read-only data block, and the referencing processor does not have a right to update the data block to be processed. The same data block as the data block to be processed may exist in another cache memory.
The state E is a state in which the same data block as the data block to be processed does not exist in other cache memories, and the referencing processor has a right to update the data block to be processed. The data block to be processed is not updated, and the data block coincides with the data block to be processed in the memory.
The state M is a state in which the same data block as the data block to be processed does not exist in other cache memories, and the referencing processor has a right to update the data block to be processed. The data block to be processed has been updated, and the content of the data block is different from that of the data block to be processed in the memory. Therefore, the data block to be processed is the only latest information.
The state I is a state in which the data block to be processed is invalid, i.e., is not held in the cache memory.
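The four states described above can be sketched, for illustration only, as the following minimal model. The names `MesiState`, `can_update`, and `differs_from_memory` are assumptions introduced here for clarity and are not part of any particular implementation.

```python
from enum import Enum

class MesiState(Enum):
    MODIFIED = "M"   # exclusive and updated: the only latest copy; differs from memory
    EXCLUSIVE = "E"  # exclusive and not updated: coincides with memory; no other copies
    SHARED = "S"     # read-only: the same block may exist in other cache memories
    INVALID = "I"    # the block is not valid in this cache memory

def can_update(state: MesiState) -> bool:
    # Only the exclusive-type states carry the right to update the block.
    return state in (MesiState.MODIFIED, MesiState.EXCLUSIVE)

def differs_from_memory(state: MesiState) -> bool:
    # Only an M block has been updated and no longer matches memory.
    return state is MesiState.MODIFIED
```

A Shared block is thus readable but not writable, while only a Modified block must eventually be written back to memory.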
The state information of the data blocks in the cache memory is usually registered in cache tags with entries corresponding to the lines of the cache memory.
For example, it is assumed that a data block to be loaded exists in a cache memory in the state S when a load command is executed in a processor. In this case, the processor can use the data block as it is. However, if the processor tries to execute a store command, the store command cannot be processed because the state S is a state without the right to update the data block. Therefore, the state of the data block needs to be changed into the state E or the state M, i.e., a state with an exclusive right. For example, when the data block is in the state S, the processor transmits, to the one or more other cache memories that hold the data block to be processed, a request for invalidating the data block to be processed in those cache memories, and the processor itself makes a transition to an exclusive state.
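The Shared-to-exclusive upgrade on a store can be sketched as follows. The dict layout with `"state"` and `"data"` keys and the function name `store` are illustrative assumptions only, not any real protocol interface.

```python
def store(line, value, peer_lines):
    # `line` is this processor's copy of the block; `peer_lines` are the
    # copies held by other cache memories (illustrative dicts, not a real API).
    if line["state"] == "I":
        raise ValueError("miss: the data block must first be acquired")
    if line["state"] == "S":
        # A store cannot proceed in the state S (no right to update), so
        # first request invalidation of every other copy of the block ...
        for peer in peer_lines:
            peer["state"] = "I"
    # ... then perform the store; the block now differs from memory, i.e. M.
    line["data"] = value
    line["state"] = "M"
    return line
```

After the call, the requesting cache holds the only valid, latest copy in the state M, and the former sharers hold the block in the state I.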
When the processor executes a load command or a store command, it may happen that the processor does not have the data block to be loaded or stored. In this case, the processor needs to acquire the data block to be processed. However, the data block in the memory may not be the latest. More specifically, there is a possibility that one of the cache memories has the data block to be processed in the state M. In this case, the processor needs to search for the data block in the state M to maintain the cache coherency.
As described, to perform the coherent control of the cache memory efficiently, it is important to recognize in which cache memory and in which state the data block to be processed exists. For this purpose, a snoop-based method and a directory-based method are known.
In the directory-based method, one of the nodes performs central management of the state information of one data block. The nodes are units including processors, memories, and controllers of the processors and the memories. The nodes further include directory storage units which hold directory information, i.e. information indicating to which cache memories and in which states the data blocks of the memories belonging to the nodes are fetched.
It is assumed that a processor has issued a request for a data block. A node to which the processor of the request source belongs will be called a “local node”. A node to which a memory that includes the data block to be processed belongs will be called a “home node”. In other words, the home node is a node including the directory storage unit that manages the data block to be processed. When a response is generated from another cache memory as a result of the request, the node to which the cache memory belongs will be called a “remote node”.
The directory storage unit has information related to all data blocks fetched to the cache memories. The directory storage unit stores information, such as which cache memory has fetched (or copied) a data block and whether there is a possibility that the data block is rewritten. The possibility of rewriting denotes that the data block has already been rewritten or will be rewritten.
In the directory-based method, the processor that has requested the data block recognizes, from the directory storage unit of the home node that manages the data block, in which cache memory and in which state the requested data block exists. The home node that has received the data request acquires the directory information from the directory storage unit and executes necessary processing. If the entries of the directory storage unit correspond one-to-one with the data blocks, the directory storage unit is often arranged on the memory. Whether the entries of the directory storage unit correspond to all data blocks in the memory depends on the implementation.
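The home-node processing described above can be sketched, under assumed data layouts, as follows. Here `directory` maps a block address to an entry with a holder set and a dirty flag; this entry format, and the function name `handle_request`, are assumptions for illustration only.

```python
def handle_request(directory, memory, address):
    # `directory` maps a block address to {"holders": set, "dirty": bool}
    # (an assumed layout); `memory` maps addresses to data values.
    entry = directory.get(address)
    if entry is not None and entry["dirty"]:
        # Some cache holds the block in the state M: memory is stale, so
        # the request must be forwarded to the node holding that copy.
        return ("forward", entry["holders"])
    # Otherwise memory is up to date (possibly alongside Shared copies),
    # and the home node can serve the data directly.
    return ("memory", memory[address])
```

The first branch corresponds to obtaining the data from a remote node's cache; the second corresponds to serving it from the home node's memory.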
In the directory-based method, the performance of the processor can be improved by tuning the software to allocate the data of the memory as close to the processor that uses the data as possible. If data requested by a processor exists in the memory of the node of the processor, that is, if the local node and the home node coincide with each other, the request and the data do not have to be transmitted and received between nodes. Therefore, there is no latency caused by the transmission and reception, and the load on the network can be reduced.
It is quite normal that a plurality of nodes use a data block. Even if the software is tuned as much as possible, the use of the same data block by a plurality of cache memories cannot be completely prevented. Therefore, it would be significantly difficult to completely match the local node and the home node. However, in a network configuration in which the distances between the nodes are not all uniform and differ depending on the nodes, even if a plurality of nodes use the same data block, the process can be speeded up by allocating the process to nodes at close distances and tuning the software to put the data on one of those nodes, thereby reducing the distance between the processing node and the data.
However, access to the memory essentially requires much time. Therefore, if the directory information in the memory is to be read out, the latency of the directory access becomes a bottleneck for improving the performance of the processor. In reality, the directory information in the memory is read out to recognize the target node, and the request is transmitted to the node. Therefore, much time is required to process the request. A longer time is further required if the directory storage unit is under control of another node.
Specifically, this is equivalent to a case illustrated in FIG. 15, for example. FIG. 15 illustrates a case of acquiring data from a cache of a remote node R. In other words, based on a request from a local node L, a directory storage unit of a home node H (a node different from the local node L in this case) is accessed, and then a cache memory of the remote node R is further accessed. Meanwhile, FIG. 16 illustrates a case of acquiring data from a memory of the home node H. The processing time is significantly longer in the case illustrated in FIG. 15 compared to the case illustrated in FIG. 16.
Therefore, a directory cache storage unit that has only part of the directory information can be included to execute fast processing by the home node. The ability to quickly read out part of the directory information from the directory cache storage unit is effective in increasing the speed even if the home node is different from the local node. In other words, a request to another node can be transmitted without accessing the memory, and the process can be speeded up.
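The directory-cache fast path can be sketched as follows. The two-level lookup shown here, along with the names `lookup_directory`, `dir_cache`, and `dir_in_memory`, is an illustrative assumption, not a description of any particular implementation.

```python
def lookup_directory(dir_cache, dir_in_memory, address):
    # `dir_cache` holds only part of the directory information; hitting it
    # avoids the long-latency read of the directory kept in memory.
    if address in dir_cache:
        return dir_cache[address], "fast"
    entry = dir_in_memory.get(address)   # slow path: memory access required
    if entry is not None:
        dir_cache[address] = entry       # fill the directory cache
    return entry, "slow"
```

On a directory-cache hit, the home node can transmit a request to another node immediately, without waiting for the memory-resident directory to be read out.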
For example, there is a known technique in which, in a computer system including an integrated directory and a processor cache, the presence of a directory entry recorded in a cache memory subsystem indicates that the corresponding line is cached in a Modified, Exclusive, or Owned state, and the absence of the directory entry indicates that the line is cached in a Shared or Invalid state.
Patent Document 1: Japanese National Publication of International Patent Application No. 2006-501546