1. Field of the Invention
The present invention relates to a cache coherency control for a multi-processor system in which multiple processors share a main memory.
2. Description of Related Art
A “snoop method” is a known technique for ensuring coherency among caches in a multi-processor system in which multiple processors share a main memory. In the snoop method, the caches of the processors “handshake” with each other so that each processor learns of any renewal of data stored in the caches of the other processors. Each processor thereby knows which cache holds the latest data and purges a line in its own cache as necessary so that it can obtain the latest data, thereby maintaining cache coherency.
To access a main memory, an ordinary processor supports both access via a cache and direct access to the main memory. In the case of access via the cache, the processing differs depending on whether the cache of the processor operates in a “write-through method” or in a “write-back method”.
The write-through method is a method in which, when a CPU (Central Processing Unit) performs a write to the main memory, the renewed data is not only stored in the cache but also written to the main memory at the same time.
The write-back method is a method in which, when the CPU performs a write to the main memory, the renewed data stays in the cache and is not written back to the main memory until a given condition is satisfied. The condition for writing back may be, for example, a case where the number of read/write operations for a frame address becomes equal to or greater than the number of ways of the cache, a case where another processor requests access to the cache line to be written back, and the like.
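The difference between the two methods can be illustrated with a minimal sketch. The class `Cache`, its `flush` method, and the dictionary standing in for the main memory are hypothetical names introduced only for this illustration; they are not part of any system described here.

```python
class Cache:
    """Toy cache in front of a dict that stands in for the main memory."""

    def __init__(self, memory, write_back):
        self.memory = memory          # dict: address -> value
        self.write_back = write_back  # True: write-back, False: write-through
        self.lines = {}               # address -> value held in the cache
        self.dirty = set()            # addresses whose cached value is newer

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_back:
            self.dirty.add(addr)       # memory update is deferred
        else:
            self.memory[addr] = value  # memory is updated at the same time

    def flush(self, addr):
        """Write a dirty line back once a write-back condition is satisfied."""
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)


memory_wt = {0xA: 0}
Cache(memory_wt, write_back=False).write(0xA, 1)

memory_wb = {0xA: 0}
wb = Cache(memory_wb, write_back=True)
wb.write(0xA, 1)
memory_before_flush = memory_wb[0xA]  # main memory still holds the old value
wb.flush(0xA)
```

In the write-back case the main memory holds stale data until `flush` runs, which is why other processors need a coherency mechanism to find the latest data.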
In a multi-processor system including processors each of which has a cache operating in the write-back method, the snoop method uses an invalidation-type protocol.
In the invalidation-type protocol, when a cache renews data at an address that is being looked up by a plurality of caches, the cache lines corresponding to that address in all the other caches are invalidated. The invalidated cache lines enter the invalid state. Thus, no other cache continues to hold old data after the line has been renewed, and cache coherency is achieved. Examples of the invalidation-type protocol include the MESI protocol, the MOSI protocol, and the like.
For example, in the case of the MESI protocol, the cache in each processor is managed using the following four states.
State 1: The targeted data, i.e., the data that is the target of a command issued by the processor, does not exist in the cache. This state is also hereinafter referred to as an “I (Invalid) state”.
State 2: The targeted data exists in the cache and is the same as the data which is stored in the main memory. Furthermore, the targeted data also exists in caches of other processors. This state is also hereinafter referred to as an “S (Shared-Unmodified) state”.
State 3: The targeted data exists only in the cache of one of the processors and is the same as the data which is stored in the main memory. This state is also hereinafter referred to as an “E (Exclusive) state”.
State 4: The targeted data exists only in the cache of one of the processors, and is different from the data which is stored in the main memory. In this state, the data which exists in the cache is the latest data that has not yet been written back to the main memory. This state is also hereinafter referred to as an “M (Modified) state”.
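The four states above can be summarized as a small classification function. This is only a sketch: the names `State` and `classify` and the three boolean inputs are assumptions made for illustration, and the function assumes its inputs are mutually consistent (for example, modified data is held by only one cache).

```python
from enum import Enum


class State(Enum):
    I = "Invalid"
    S = "Shared-Unmodified"
    E = "Exclusive"
    M = "Modified"


def classify(in_cache, same_as_memory, in_other_caches):
    """Map the three conditions used in States 1 to 4 to a MESI state."""
    if not in_cache:
        return State.I   # State 1: the targeted data is not in the cache
    if not same_as_memory:
        return State.M   # State 4: latest data, not yet written back
    if in_other_caches:
        return State.S   # State 2: clean copy also held by other caches
    return State.E       # State 3: clean copy held by this cache only
```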
In the multi-processor system, a problem may arise in the cache coherency in a case where a plurality of the processors read data of the same cache line address (hereinafter the “cache line address” is simply referred to as an “address”) at nearly the same time. A multi-processor system 1 as shown in FIG. 9 will be described as an example.
The multi-processor system 1 shown in FIG. 9 includes a plurality of CPUs 10 to 40 (four CPUs are shown as an example), a shared bus 50, and a main memory 70.
CPUs 10 to 40 are connected to the shared bus 50, and may communicate with each other and may access the main memory 70 via the shared bus 50. The CPUs 10 to 40 respectively include a cache 12, a cache 22, a cache 32, and a cache 42, and these caches operate in the write-back method.
For example, when the CPU 10 reads data of an address (assumed to be an address A) in the main memory 70, if the data does not exist in the cache 12 of the CPU 10, a “cache miss” occurs. In this case, the CPU 10 outputs not only a read request to the main memory 70, but also a snoop request corresponding to the address A. The snoop request is received by all other CPUs connected to the shared bus 50.
Each CPU that receives the snoop request looks up the state of its own cache. In a case where the data of the address A exists in its own cache, namely, in a case of a “cache hit”, the CPU having the cache hit transfers the data to the CPU 10. In a case where the cache state of the CPU having the cache hit is the state 4 (M state), the CPU also writes back the data to the main memory 70.
Whether a cache in the state 3 (E state) or in the state 4 (M state) then transitions to the state 1 (I state) or to the state 2 (S state) depends on the design of the system.
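The snoop handling described above can be sketched as follows. The function `handle_snoop`, the `(state, data)` tuple layout, and the `state_after_transfer` parameter are hypothetical and serve only to make the design choice between the I state and the S state explicit.

```python
def handle_snoop(cache, memory, addr, state_after_transfer="I"):
    """Answer a read snoop for addr.

    cache:  dict mapping address -> (state, data), state in {"M", "E", "S"}
    memory: dict mapping address -> data (stands in for the main memory)
    Returns the data on a cache hit, or None on a cache miss.
    """
    if addr not in cache:
        return None                    # cache miss: the line is in the I state
    state, data = cache[addr]
    if state == "M":
        memory[addr] = data            # M state: also write back to memory
    if state in ("E", "M"):
        if state_after_transfer == "I":
            del cache[addr]            # design choice: give the line up
        else:
            cache[addr] = ("S", data)  # design choice: keep a shared copy
    return data                        # cache hit: transfer to the requester


memory = {0xA: 0}
cache_20 = {0xA: ("M", 7)}
supplied = handle_snoop(cache_20, memory, 0xA)
```

With the default setting, the responding cache drops the line after supplying it, so the requester can later renew the data without issuing an invalidation request.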
Normally, a CPU reads data from the cache or the main memory in order to renew the data. If a cache in the E state or the M state transitions to the S state after outputting the data to another processor, then, when the CPU that received the data renews the data and stores the renewed data in its own cache, that CPU must output a request for invalidating the corresponding cache lines of the other CPUs.
Thus, a large amount of invalidation-request traffic occurs on the shared bus 50, and the efficiency of the multi-processor system 1 deteriorates. In contrast, if the cache in the E state or the M state transitions to the I state after outputting the data, this traffic is avoided and the efficiency of the multi-processor system is improved.
Suppose that the CPU 10 incurs a cache miss for the data corresponding to an address “B”, for example, issues a read request to the main memory 70, and sends a snoop request to the other CPUs, and that the other CPUs also incur cache misses. The CPU 10 then reads the missed data from the main memory 70 according to its own read request. A case where the CPU 20 incurs a cache miss for the data corresponding to the address “B” while the CPU 10 is reading the missed data will be described below.
In this case, the CPU 20 sends a read request for the address “B” to the main memory 70 and sends a snoop request to the other CPUs. When the CPU 10 receives the snoop request from the CPU 20, the CPU 10 incurs a cache miss because the data is not yet stored in the cache 12, and outputs to the CPU 20 a message indicating that the CPU 10 does not have the data corresponding to the address “B”. It is assumed that the CPU 30 and the CPU 40 also incur cache misses and output messages to the CPU 20 indicating that they do not have the data corresponding to the address “B”.
The CPU 10 continues the read operation for the data corresponding to the address “B”, and receives the data corresponding to the address “B” from the main memory 70. All of the CPU 20, the CPU 30, and the CPU 40 incurred cache misses with respect to the address “B”. Thus, the CPU 10 stores the data corresponding to the address “B” in the cache 12 in the E state (i.e., the CPUs other than the CPU 10 are regarded as not having the data corresponding to the address “B”).
On the other hand, the CPU 20 also receives the data of the address “B” from the main memory 70 because the CPU 20 has requested that data from the main memory 70. The CPU 20 also stores the data corresponding to the address “B” in the E state because all of the CPU 10, the CPU 30, and the CPU 40 incurred cache misses and output messages to the CPU 20 indicating that they do not have the data corresponding to the address “B”. In other words, the CPU 20, just like the CPU 10, recognizes that the data corresponding to the address “B” is stored only in its own cache, even though the caches of both the CPUs 10 and 20 have the data corresponding to the address “B”.
As a result, in the multi-processor system 1, the caches of two processors (i.e., the CPUs 10 and 20) hold the data corresponding to the same address “B” in the E (Exclusive) state. This results in a breakdown of the cache coherency with respect to the address “B”.
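The sequence above can be reproduced with a toy model. The function `state_to_install` and the per-CPU dictionaries are assumptions made for this sketch; the key point it models is that the snoop replies are evaluated while both reads are still in flight, before either cache holds the data.

```python
def state_to_install(requester, caches, addr):
    """Choose E if every other cache reports a miss for addr, otherwise S."""
    others_hit = any(
        addr in cache for cpu, cache in caches.items() if cpu != requester
    )
    return "S" if others_hit else "E"


# Caches of the CPUs 10 to 40; none of them holds address "B" yet.
caches = {10: {}, 20: {}, 30: {}, 40: {}}

# Both snoop rounds complete while both reads are still in flight,
# so every reply seen by the CPU 10 and by the CPU 20 is a cache miss.
state_10 = state_to_install(10, caches, "B")
state_20 = state_to_install(20, caches, "B")

# The read data then arrives from the main memory at both CPUs.
caches[10]["B"] = state_10
caches[20]["B"] = state_20

exclusive_holders = sorted(
    cpu for cpu, cache in caches.items() if cache.get("B") == "E"
)
```

After the run, `exclusive_holders` contains both 10 and 20, which is the incoherent outcome described in the text.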
This problem (condition) is caused by the fact that the CPU 10 replies “cache miss” in response to the snoop request from another CPU while reading the data from the main memory. Patent Document 1 discloses a method to solve this problem (section [0062] to [0066] in Patent Document 1).
In the method disclosed in Patent Document 1, when a processor receives a snoop request from another processor while it is reading data corresponding to a certain address from the main memory, the processor sends an “RTY” signal to the processor which has sent the snoop request. The “RTY” signal indicates that the snoop request and the read request to the main memory are terminated and are to be retried. A processor which receives the “RTY” signal then retries the snoop request and the read request.
With the method described above, the processor is prevented from replying “cache miss” while the processor is reading the data from the main memory. Thus, the cache coherency is maintained.
[Patent Document 1] Japanese Patent Laid-Open No. 2003-150573
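The retry behavior can be sketched as a snoop reply that distinguishes a read in flight from a plain miss. The function `snoop_reply`, the `pending_reads` set, and the reply strings other than “RTY” are hypothetical names chosen for this illustration and are not taken from Patent Document 1.

```python
def snoop_reply(cache, pending_reads, addr):
    """Reply to a snoop request for addr.

    cache:         dict of addresses currently held in the cache
    pending_reads: set of addresses being read from the main memory
    """
    if addr in pending_reads:
        return "RTY"   # a read is in flight: ask the requester to retry
    return "HIT" if addr in cache else "MISS"


cache_10, pending_10 = {}, {"B"}          # the CPU 10 is reading address "B"
first_reply = snoop_reply(cache_10, pending_10, "B")

pending_10.discard("B")                   # the read data arrives
cache_10["B"] = "E"
second_reply = snoop_reply(cache_10, pending_10, "B")
```

The first snoop is answered with “RTY” instead of a miss, and the retried snoop finds the data in the cache, so the two caches can no longer both install the line in the E state.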
However, if the method disclosed in Patent Document 1 is applied to the multi-processor system 1 shown in FIG. 9, a problem arises in a case where a plurality of the processors read the same address at nearly the same time.
FIG. 10 shows an example of a timing chart of the case where the method disclosed in the Patent Document 1 is applied to the multi-processor system 1 shown in FIG. 9. In FIG. 10, SR, RR, RD, RTY denote “snoop request”, “read request”, “read data”, and “retry request”, respectively. T1, T2, . . . denote timings.
In FIG. 10, the snoop requests relating to the CPU 10 are described, and other snoop requests are omitted in FIG. 10. The “RTY”, which is outputted by the CPU 10 to the inter-coupling network when the CPU 10 receives the snoop request from other CPUs while reading data from the main memory 70, is omitted.
In the example shown in FIG. 10, at T0, the CPU 10 incurs a cache miss with respect to data corresponding to an address “C”. Accordingly, the CPU 10 outputs a read request RR10 to the main memory 70, and outputs snoop requests (SR102, SR103, SR104) to the CPUs 20, 30, and 40. It is assumed that the CPUs 20 to 40 also incur cache misses with respect to the data corresponding to the address “C”, and do not respond to the snoop requests.
At T1, the read request RR10 from the CPU 10 is issued to the main memory 70 via the inter-coupling network.
At T2, the CPU 20 incurs a cache miss with respect to the data corresponding to the address “C”, and outputs a read request RR20 to the main memory 70 and outputs the snoop request. Thus, at T3, the CPU 10 receives the snoop request SR201, and the inter-coupling network receives the read request RR20.
Because the CPU 10 is reading the data corresponding to the address “C” from the main memory 70, the inter-coupling network outputs a retry request RTY20 to the CPU 20 at T4.
Thereafter, the CPU 30 and CPU 40, which incur cache misses with respect to the data corresponding to the address “C”, receive retry requests RTY30 and RTY40 from the inter-coupling network.
At T14, the main memory 70 outputs the data corresponding to the address “C” (read data RD10) to the CPU 10 via the inter-coupling network. The data is sent by the inter-coupling network to the CPU 10 at T15.
At T16, the CPU 10 receives the read data RD10, and stores the read data RD10 in the cache 12. Thus, the cache 12 of the CPU 10 transitions from the I state to the E state.
In response to the retry request RTY20, the CPU 20 again outputs the read request RR20 with respect to the data corresponding to the address “C” and the snoop request at T17. At this time, the CPU 10 gets a cache hit in response to the snoop request sent from the CPU 20, and thus outputs the data RD10 to the CPU 20 as read data RD20A (T19). At this moment, the cache 12 of the CPU 10 transitions from the E state to the I (Invalid) state. In other words, the cache 12 of the CPU 10 transitions to the Invalid state soon after it has transitioned to the Exclusive state.
When the CPU 20 receives the read data RD20A from the CPU 10 at T20, the CPU 20 stores the read data RD20A in the cache 22.
The main memory 70 also outputs read data RD20B to the CPU 20 at T21 in response to the read request RR20, but the CPU 20 discards the read data RD20B because the latest data, which was sent from the CPU 10, is already stored in the cache 22 of the CPU 20.
The read data RD10 is stored in the cache 12 of the CPU 10 only in a period T16 to T18. After the period T16 to T18, the read data RD10 becomes invalid because the cache 12 transits to the I state from the E state.
The CPU 10 usually reads data in order to renew it. If the CPU 10 outputs the read data to another processor before renewing the data and storing it in the cache 12, the state of the cache returns to the I (Invalid) state, and the CPU 10 incurs a cache miss again even though it has just read the data from the main memory. Accordingly, a handshake with another CPU is needed to complete the renewal of the read data, which results in a longer latency. As a result, the processing efficiency of the system deteriorates.
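The effect on the CPU 10 can be traced with a short sketch. The timings T16 and T19 follow the description of FIG. 10 above; the event names and the final “later” write are hypothetical and are added only to show where the second cache miss occurs.

```python
# Events seen by the cache 12 of the CPU 10 for address "C".
events = [
    ("T16", "fill"),       # read data RD10 arrives from the main memory
    ("T19", "snoop_hit"),  # data is output to the CPU 20 as RD20A
    ("later", "write"),    # hypothetical: the CPU 10 tries to renew the data
]

state = "I"
trace = []
write_missed = None
for time, event in events:
    if event == "fill":
        state = "E"
    elif event == "snoop_hit" and state in ("E", "M"):
        state = "I"                    # the line is given up after the transfer
    elif event == "write":
        write_missed = state == "I"    # the renewal needs the line again
    trace.append((time, state))
```

In this trace the line is held in the E state only between the fill and the snoop hit, so the later write finds the line in the I state and must obtain the data again.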