1. Field of the Invention
The present invention relates to a device that controls the contents of a cache memory provided for each processor of a multiprocessor system. In particular, the invention relates to a cache device that controls the contents of a cache memory in accordance with a predetermined protocol.
2. Description of the Background Art
A multiprocessor system can execute intended processing at high speed. More specifically, the intended processing is divided into a plurality of tasks that can be executed in parallel, and each processor executes the task assigned to it so that the processing proceeds in parallel. In this case, data communication is performed between the processors if data such as a variable is commonly used by tasks executed by different processors. Such data, which is commonly used by the different processors (tasks) and is stored in a shared memory to be described later, is referred to as “shared data”.
If data communication between the processors is frequently performed in the multiprocessor system, it is appropriate to perform the communication via a memory that is shared by the processors, which will be referred to as a “shared memory” hereinafter. Among the schemes for coupling the shared memory and each processor, a shared bus scheme requires the simplest and least expensive hardware structure. In this shared bus scheme, a bus for accessing the shared memory is shared by a plurality of processors. If contention in data communication on the shared bus (which will be referred to as “bus contention” hereinafter) does not occur, the cost of access to the shared memory is low. However, an increase in the number of processors causes bus contention, and therefore increases the overhead of data communication. A snoop cache has been proposed as a method of significantly reducing bus contention, and is now practically employed in many multiprocessor systems.
In the multiprocessor system, traffic on the shared bus increases if each processor directly accesses the shared memory whenever shared data is accessed. Therefore, a private cache memory is provided for each processor so that shared data read from the shared memory is written (copied) into the cache memory, and the cache memory is accessed instead of the shared memory when access to the shared data is requested thereafter. Thereby, the bus contention described above can be suppressed.
In the multiprocessor system, since the contents of the shared memory (e.g., of a main memory device) are copied onto a plurality of cache memories, inconsistency in contents can occur between the plurality of cache memories in addition to inconsistency between the main memory device and each cache memory, making it difficult to maintain cache coherence (i.e., coherence of the shared data written into the cache memories). The snoop cache overcomes this problem by utilizing features of the shared bus. In a multiprocessor system utilizing a shared bus, since all data communication is performed via the one shared bus, the behavior of the system can be determined by monitoring the flow of data on the shared bus. According to the snoop cache, transactions on the shared bus are actively snooped by the cache memories, and predetermined processing for maintaining cache coherence is performed when a transaction affecting the contents of a cache memory is detected. The manner of handling data for ensuring this consistency of contents is referred to as a cache consistency protocol. Cache consistency protocols can be classified by when consistency in contents is achieved (at the time of write through or at the time of write back) as well as by the manner (whether contents are to be invalidated or updated).
The cache consistency protocol in which the contents become consistent at the time of write through is referred to as a “write through cache”. The write through cache is the simplest cache consistency protocol, in which each line (to be referred to as a “cache line” hereinafter) of the cache memory can attain only two states, i.e., a valid (V) state and an invalid (I) state. Therefore, only one bit is required for a tag (which will be referred to as a “state tag” hereinafter). According to the write through cache, the control can be simple, but traffic on the shared bus increases because the shared bus is utilized every time data writing is performed. A cache line in the above description forms the unit of access performed by designating an address in the cache memory once.
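For purposes of illustration only, the write through behavior described above can be sketched as a toy software model, assuming a word-addressed shared memory and caches that snoop every write on the bus. All names (WriteThroughCache, snoop_write, etc.) are illustrative assumptions, not taken from any actual device.

```python
from enum import Enum

class State(Enum):
    V = "valid"
    I = "invalid"

class WriteThroughCache:
    def __init__(self, memory, bus):
        self.memory = memory   # shared memory modeled as dict: addr -> value
        self.bus = bus         # list of all caches attached to the shared bus
        self.lines = {}        # addr -> (state tag, value)
        bus.append(self)

    def read(self, addr):
        state, value = self.lines.get(addr, (State.I, None))
        if state is State.I:               # cache mishit: fetch from shared memory
            value = self.memory[addr]
            self.lines[addr] = (State.V, value)
        return value

    def write(self, addr, value):
        self.memory[addr] = value          # every write goes through to shared memory
        self.lines[addr] = (State.V, value)
        for cache in self.bus:             # other caches snoop the bus write
            if cache is not self:
                cache.snoop_write(addr)

    def snoop_write(self, addr):
        if addr in self.lines:             # invalidate the stale local copy
            self.lines[addr] = (State.I, None)
```

Note that, exactly as stated above, every write uses the shared bus, which is the source of the increased traffic of this protocol.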
Description will now be given of the basic protocol of the write back cache, which is a cache consistency protocol classified as invalidating the contents at the time of write back, and of the Symmetry protocol, the Illinois protocol and the Berkeley protocol, which belong to the write back cache.
First, the basic protocol will be described. According to the write back cache, the state tag of a cache line requires a bit representing a state of valid (V) or invalid (I), and additionally requires a bit representing whether the contents of the cache line and the shared memory are consistent (C: Clean) or inconsistent (D: Dirty). Information relating to consistency has no meaning for a cache line in the invalid (I) state. Consequently, the state of the cache line selectively attains I (invalid), C (consistent with the contents of the shared memory) or D (inconsistent with the contents of the shared memory). The I (invalid) state is a state in which the contents of the cache line are not ensured.
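The three states described above can be summarized, for illustration, as a small transition table. The event names (read_miss, local_write, and so on) are assumptions introduced here, not terms from the protocol itself.

```python
# Illustrative state transitions for the basic write back protocol (I/C/D).
TRANSITIONS = {
    ("I", "read_miss"):    "C",  # line filled from the shared memory
    ("C", "local_write"):  "D",  # contents now differ from the shared memory
    ("D", "write_back"):   "C",  # contents flushed to the shared memory
    ("C", "remote_write"): "I",  # another cache invalidated our copy
    ("D", "remote_read"):  "C",  # written back, then supplied to the requester
}

def next_state(state, event):
    # events with no entry leave the state unchanged
    return TRANSITIONS.get((state, event), state)
```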
The Symmetry protocol will now be described. A variable used only in a certain program (task) is usually located in a stack region, and is accessed only by a specific processor. Therefore, the cache line corresponding to such a variable is arranged only in the cache memory of that specific processor. In the basic protocol described above, it is necessary to issue a signal for invalidating a copy of the variable when writing is first effected on the variable, even if no copy of the variable is present in the other cache memories. This can be avoided by adding to the state tag of the cache line a bit that represents whether a copy of the cache line (variable) is present in another cache memory or not. By the addition of such a bit, the state tag can selectively represent a state E (Exclusive: no copy is present in another cache memory) and a state S (Shared: a copy may be present in another cache memory) in addition to the foregoing states.
The Illinois protocol will now be described. When an access request is present and a cache line matching the address related to the access request is not present in the cache memory (i.e., when a cache mishit occurs), data is always read out from the shared memory according to the Symmetry protocol. According to the Illinois protocol, however, if a copy of the contents of the cache line causing the cache mishit is held by another (i.e., a different) cache memory, the copy is transferred from that different cache memory to the cache memory causing the cache mishit. In general, the cache memory operates faster than the shared memory, so that the Illinois protocol would seem preferable. However, if the copy is held in two or more different cache memories, it is difficult to select the one from which the copy is to be transferred.
When a predetermined processor in the multiprocessor system has a cache line with a state tag of “D”, and a different processor operates to access this cache line via the shared bus, the predetermined processor first transfers the cache line to the shared memory, then updates the state tag of this cache line to “C”, and thereafter transfers the contents of the cache line to the different processor. According to the Symmetry protocol, the above processing is performed by writing the cache line back into the shared memory, and then transferring the contents of the cache line from the shared memory to the different processor. According to the Illinois protocol, however, when the contents of the cache memory of the predetermined processor are written back into the shared memory, the data is simultaneously written into the corresponding cache line of the different processor.
Description will now be given of the Berkeley protocol. After the predetermined processor described above performs data writing and changes the corresponding state tag of the cache line to “D”, a different processor may perform reading from this cache line. In this case, the contents of this cache line are directly transferred to the different processor without using the shared memory. As a result, the cache line is in the “Dirty” state, and a copy of this cache line is shared by a plurality of cache memories, so that the state tag represents “DS” (Dirty Shared). However, if copies of the cache line bearing the state tag “DS” are present in a plurality of cache memories, it is necessary to determine the processor from which the contents of the cache line are to be supplied when a further different processor performs reading from the same cache line. Accordingly, the processor (or the cache memory corresponding to this processor) that is finally responsible for each cache line is determined. The cache memory or the shared memory that is responsible for a cache line is referred to as the “owner” of the cache line in question, and the right of the owner is referred to as the “ownership”. When requested, the owner supplies the cache line to a requester, and is also responsible for the write back into the shared memory. If the corresponding cache line is removed from the cache memory or is invalidated, the owner is responsible for transferring the ownership to a different cache memory, or for writing back the contents of the cache line to the shared memory and returning the ownership to the shared memory, which is the default owner.
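The ownership bookkeeping described above can be sketched as follows. In actual snoop hardware this information is kept in a distributed fashion by the caches themselves; the centralized table below is only an illustrative stand-in, and all names are assumptions introduced here.

```python
# Illustrative sketch of Berkeley-style ownership tracking. The shared
# memory is the default owner of every cache line; a cache that dirties a
# line takes ownership, must supply the line to requesters, and must write
# it back (or pass the ownership on) when the line is removed.
class OwnershipDirectory:
    SHARED_MEMORY = "shared_memory"   # default owner of every line

    def __init__(self):
        self.owner = {}               # addr -> identifier of current owner

    def owner_of(self, addr):
        return self.owner.get(addr, self.SHARED_MEMORY)

    def take_ownership(self, addr, cache_id):
        # e.g. after a write makes the line Dirty in cache_id
        self.owner[addr] = cache_id

    def release(self, addr):
        # on removal/invalidation, ownership returns to the default owner
        self.owner.pop(addr, None)
```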
Description will now be given of the update-based snoop cache protocols, in which consistency in the contents is achieved by updating.
In an update-based snoop cache protocol, when data is written into a cache line of a certain cache memory, consistency in the contents is achieved by updating the copies in the other cache memories with the written data.
The update-based snoop cache protocols can be classified into a three-state Firefly protocol, a four-state Firefly protocol and a Dragon protocol. The three-state protocol is the simplest among the update-based snoop cache protocols. In the three-state protocol, when the processor writes contents into a cache line bearing the state tag “CS” (Clean Shared), the corresponding copies in the cache memories of the other processors are updated, and the contents of the shared memory are also updated. In this case, therefore, the state tag of the cache line remains “CS”, and cannot become “DS”. When writing is effected on a cache line bearing the state tag “CE”, a “DE” (Dirty Exclusive) state occurs, in which case the writing can be performed without utilizing the shared bus. When a different processor reads out a cache line bearing a state tag of “DE” from the cache memory of a certain processor, the contents of the cache line are read after being temporarily written back into the shared memory, as in the basic protocol already described. Therefore, the state tag of this cache line becomes “CS”. According to this three-state Firefly protocol, consistency in the contents of the shared memory and the cache memories is frequently achieved, so that the Dirty state does not occur in a plurality of cache memories. Therefore, it is not necessary to give consideration to the ownership or the like.
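The three-state Firefly transitions just described can be tabulated for illustration. The state tags (CS, CE, DE) follow the text; the event names and the accompanying action notes are assumptions introduced here.

```python
# Illustrative transition table for the three-state Firefly protocol.
FIREFLY3 = {
    ("CS", "local_write"): ("CS", "update copies and shared memory via bus"),
    ("CE", "local_write"): ("DE", "no shared bus traffic needed"),
    ("DE", "local_write"): ("DE", "no shared bus traffic needed"),
    ("DE", "remote_read"): ("CS", "write back to shared memory, then share"),
}

def firefly3_step(state, event):
    new_state, action = FIREFLY3[(state, event)]
    return new_state, action
```

Because a “DS” state never arises, no ownership information appears in the table, matching the observation above.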
Description will now be given of the four-state Firefly protocol. According to this protocol, when a different processor reads out a cache line bearing a state tag of “DE” from the cache memory of a certain processor, the contents of this cache line are not written back into the shared memory, but are directly transferred to the different processor, similarly to the Berkeley protocol. As a result, a copy in the Dirty state is present in a plurality of cache memories, and the concept of ownership is required. Accordingly, a four-state protocol is required, similarly to the Berkeley protocol.
Description will now be given of the Dragon protocol. According to the Dragon protocol, when data is to be written into a cache line bearing the state tag “CS”, the contents of the shared memory are not updated, and therefore a copy in the Dirty state may be present in a plurality of cache memories. Similarly to the Berkeley protocol, therefore, the cache line may selectively attain the four states “SO”, “EO”, “EN” and “SN”. When data is written into a cache line bearing the state tag “SN” or “SO”, the contents of the shared memory are not updated. Similarly to the four-state Firefly protocol, when a cache mishit occurs in connection with a cache line bearing the state tag “SN” due to an access request by another processor, write-back into the shared memory is not performed. Instead, the contents of this cache line are supplied to the requester from the cache memory having the cache line with the state tag “SO”. According to the Dragon protocol, a change occurs in the state of a cache line bearing the state tag “SO” when, for example, the cache line is removed from one of the cache memories sharing it. Only in this case are the contents of the cache line written back into the shared memory.
The foregoing protocols can be compared as follows. In a multiprocessor system that has a shared bus and a shared memory and operates according to the snoop cache, the manner (protocol) of handling data shared by the processors determines the frequency of use of the shared bus, and significantly affects the performance of the system. Each of the protocols described above has both merits and demerits, and cannot be suitable to all types of shared data. The suitable one of the update-based protocol and the invalidation-based protocol must be determined with consideration given to the properties of the program executed in the multiprocessor.
For example, the following results can be obtained from a comparison made between the update-based protocol and the invalidation-based protocol with particular consideration given to the access patterns of the processors. The invalidation-based protocol is effective for variables, such as local variables, that are very likely to be accessed continuously by one processor. The update-based protocol is effective for variables used in cases where many processors frequently change the data.
The following can be derived from a comparison made between the update-based protocol and the invalidation-based protocol with particular consideration given to, e.g., the hit rate for the cache memory and the sharing rate of data between the processors. The hit rate represents the rate of cache hits. A cache hit is a state in which required data is effectively present in the cache memory; a cache mishit is a state in which required data is not effectively present in the cache memory.
The hit rate for the cache memory and the sharing rate of data between the cache memories (processors) form important factors that determine the efficiency of a multiprocessor system having cache memories. As the hit rate increases, accesses to the shared bus due to cache mishits decrease. As the sharing rate decreases, bus accesses resulting from cache hits (write hits) to the shared data in write operations decrease. In either case, bus contention decreases, and good efficiency is achieved. When comparing the invalidation-based protocol and the update-based protocol from the above viewpoint, it is apparent that the update-based protocol can achieve a high hit rate, but also increases the sharing rate.
The following are results of a comparison made between the update-based protocol and the invalidation-based protocol with consideration given to the ping-pong effect. According to the invalidation-based protocol, two processors may share a cache line on which writing and reading are frequently performed. In this case, when one of the two processors performs writing on the cache line of the corresponding cache memory, the corresponding cache line (copy) of the cache memory of the other processor is invalidated. When the other processor subsequently performs writing on the cache line (copy) in the corresponding cache memory, the contents of the corresponding cache line of the cache memory of the one processor are first written back into the shared memory, and then are transferred to the other processor. The contents thus transferred are written into the cache memory of the other processor, and the other processor performs the writing on the cache line. Thereby, invalidation is performed in connection with the cache line of the one processor. This is the most undesired and inefficient aspect of the invalidation-based protocol.
More specifically, when two processors perform writing or reading on the shared cache line, the above inefficient processing is carried out whenever such writing or reading is performed, and the contents of the cache line to be read or written are moved back and forth between the cache memories corresponding to the two processors, each transfer invalidating the other copy. This behavior is referred to as the ping-pong effect. In the update-based protocol, however, all copies are updated with the data written into the cache line, and therefore the ping-pong effect does not occur.
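The cost of the ping-pong effect can be illustrated with a toy count of shared bus transfers: two processors alternately write one shared line. Under the invalidation-based protocol, each write by the non-holding processor forces a write-back plus a line transfer; under the update-based protocol, each write sends one update. The per-operation costs are illustrative assumptions, not measurements.

```python
# Toy bus-traffic count for the ping-pong pattern described above.
def ping_pong_bus_ops(n_writes, protocol):
    ops = 0
    holder = 0                    # which cache currently holds the dirty line
    for i in range(n_writes):
        writer = i % 2            # the two processors alternate writes
        if protocol == "invalidate":
            if writer != holder:  # the line must migrate before the write
                ops += 2          # write back to shared memory + line transfer
                holder = writer   # the other copy is invalidated in the process
            # writer == holder: exclusive dirty line, no bus traffic
        else:                     # "update"
            ops += 1              # broadcast the written data to the other copy
    return ops
```

With alternating writers, the invalidation-based count grows roughly twice as fast as the update-based count, matching the comparison made in the text.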
The update-based protocol described above suffers from false sharing, which will now be described. It is assumed that data is transmitted or moved between two processors. When a program (task) that has been executed by one of the two processors is moved to the other processor, the one processor no longer uses a variable related to this task. In the update-based protocol, however, the cache line is not invalidated, so that the cache line including the data of the variable remains valid until it is removed from the cache memory. Therefore, when the other processor writes data into the cache line of the above variable during execution of the task, the data (i.e., data not used by the one processor) of the corresponding cache line in the cache memory of the one processor must be updated using the shared bus, although this update is not originally necessary.
A similar situation occurs in the following case. It is assumed that local variables used by one processor are stored in the upper half region of a cache line, and local variables used by the other processor are stored in the lower half region of the same cache line. In this case, according to the update-based protocol, meaningless updating of data must be performed whenever writing is effected on these variables. This meaningless sharing of the cache line is referred to as false sharing. According to the update-based protocol, wasteful traffic on the shared bus increases when false sharing occurs.
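The half-line scenario above can be made concrete with a small check of which addresses fall on the same cache line. The 64-byte line size is an assumption for illustration.

```python
# Two addresses are "falsely shared" when distinct processors' private
# variables land on the same cache line.
LINE_SIZE = 64  # assumed line size in bytes

def same_line(addr_a, addr_b, line_size=LINE_SIZE):
    return addr_a // line_size == addr_b // line_size

# P0's local variable at offset 0 and P1's at offset 32 share one line,
# so every write by either processor triggers a meaningless bus update;
# padding each variable to its own line (offset 64) avoids this.
```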
For the above reasons, it can be considered that the suitable type or kind of cache consistency protocol for the cache memory changes depending on the memory region (i.e., variable or work area). If the traffic on the shared bus can be reduced, bus contention is suppressed, and the number of connectable processors increases, so that the performance of the whole system improves. In many conventional systems, the cache consistency protocol is fixed to only one kind; for the foregoing reasons, however, it is desirable that the cache consistency protocol be controlled and switched dynamically corresponding to the respective memory regions so as to reduce the traffic on the shared bus and thereby improve the performance. For example, manners using the invalidation-based protocol and the update-based protocol in a mixed fashion have been proposed.
For example, a manner referred to as “competitive snoop” has been proposed. According to this manner, the update-based protocol is first applied, and operations of updating the contents of the cache line are counted. When the count exceeds a predetermined number, the protocol changes into the invalidation-based protocol. According to this manner, unnecessary data updating can be avoided to a certain extent, but in some cases the data is invalidated while it is still being frequently exchanged between the processors. Therefore, the performance cannot be improved sufficiently.
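The competitive snoop idea described above can be sketched as a per-line counter, for illustration only: the line starts in update mode and switches to invalidation once the count of remote updates exceeds a threshold. The threshold value and the counter-reset rule on a local access are assumptions introduced here.

```python
# Minimal sketch of competitive snoop switching for one cache line.
class CompetitiveLine:
    def __init__(self, threshold=4):
        self.threshold = threshold    # assumed predetermined number
        self.updates_seen = 0
        self.mode = "update"          # start with the update-based protocol

    def on_remote_write(self):
        if self.mode == "update":
            self.updates_seen += 1
            if self.updates_seen > self.threshold:
                self.mode = "invalidate"   # stop absorbing useless updates
        return self.mode

    def on_local_access(self):
        # a local access suggests the copy is still useful: reset the count
        self.updates_seen = 0
```

The weakness noted in the text is visible here: nothing prevents the switch to invalidation from firing while the line is in fact still being actively exchanged.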
The kind of protocol to be applied should also be selectable for each processor. In this connection, attention has been given to the fact that a better effect can be achieved if the invalidation-based protocol and the update-based protocol can be switched in accordance with the properties of the program during its execution. More specifically, such a manner may be employed that a mechanism for controlling the cache memory is independently provided for each of the update-based protocol and the invalidation-based protocol, and switching is performed to use either of these mechanisms when executing the program. However, this manner has the following disadvantages. It is improper to determine one of the update-based protocol and the invalidation-based protocol as the effective protocol for the whole system at various points in time during execution of the task; rather, an appropriate protocol should be determined for each cache memory (each processor). Further, it is necessary to ensure consistency of write operations effected on shared data during switching between the invalidation-based protocol and the update-based protocol. The latter requirement in particular must be satisfied because control of a decentralized type is performed in the snoop cache. For satisfying the above requirements, the following measures are employed.
One item of attribute information is assigned to each cache memory. This attribute information represents either an “invalidation-based protocol mode” or an “update-based protocol mode”. When a write hit for shared data occurs in a certain cache memory, a copy existing in another cache memory is invalidated when the attribute information of that other cache memory represents the “invalidation-based protocol mode”, and is updated when it represents the “update-based protocol mode”. This manner ensures consistency in arbitrary combinations of the “invalidation-based protocol mode” and the “update-based protocol mode”. If the mechanism for controlling the cache memories can be practically achieved, this manner satisfies the former of the foregoing requirements. Thus, both the invalidation-based protocol and the update-based protocol exist simultaneously. In connection with this mechanism, the latter of the foregoing requirements is satisfied if the operation of switching the attribute information of each cache memory and the writing on the shared data are controlled exclusively with respect to each other.
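The per-cache attribute described above can be sketched as follows, for illustration. On a snooped write hit, each other cache consults its own mode attribute to decide whether its copy is invalidated or updated; all class and function names are assumptions introduced here.

```python
# Sketch of per-cache protocol-mode attributes for a snooped write.
class SnoopingCache:
    def __init__(self, mode):
        assert mode in ("invalidate", "update")
        self.mode = mode       # the attribute information of this cache
        self.lines = {}        # addr -> value (None models an invalid copy)

    def snoop_write(self, addr, value):
        if self.lines.get(addr) is not None:   # we hold a valid copy
            if self.mode == "invalidate":
                self.lines[addr] = None        # drop the copy
            else:
                self.lines[addr] = value       # refresh the copy in place

def broadcast_write(writer, others, addr, value):
    writer.lines[addr] = value
    for cache in others:
        cache.snoop_write(addr, value)  # each cache applies its own mode
```

Consistency holds in any mixture of modes: after the write, every remaining valid copy carries the new value, which is the point made in the text.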
If the above exclusive control can be sufficiently performed in the case of decentralized control such as the snoop cache, the attribute information can be switched without synchronization between the cache memories. Thus, dynamic switching of the attribute can be performed during execution of the program. However, a significant advantage cannot be achieved merely by changing the protocol for the same memory depending on the processor.
Description will now be given of a manner of selecting the kind of cache consistency protocol for each page of the shared memory. “Fine Grain Support Mechanisms” (Takashi Matsumoto, pp. 91–98, Computer Architecture Research Report Meeting No. 77–12, Information Processing Society of Japan, July 1989) discloses that the disadvantage due to fixing the protocol to only one kind can be overcome if the kind of cache consistency protocol can be dynamically switched for each page.
According to the technique disclosed in the above reference, if information representing the type of protocol were added for each address in order to switch the protocol dynamically, the storage region for such information and the hardware for its management would require considerable volume, and the system could not be efficient. Therefore, management is performed by adding information representing the type of protocol to each storage region having a predetermined size. In view of easy achievement, attention is given to a page management mechanism in the following description, and it is assumed that the above information is added to each page of the shared memory. The information added to a page indicates the type of protocol to be used for accessing data belonging to that page. When a processor accesses the shared memory, the type of protocol to be used for this access is indicated, and, for this indication, a signal line for externally sending a bit indicating the type of protocol is provided for each processor. When the cache memory outputs an access request onto the shared bus, i.e., when data communication via the shared bus is required, the signal indicating the type of protocol is output to the shared bus, and each cache memory snoops the shared bus while selecting the protocol based on the signal on this signal line. Based on analysis of access patterns by a compiler and/or designation of the protocol by a programmer, each variable and work area is assigned to a page in the shared memory having the suitable type of protocol.
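The per-page management just described can be sketched, for illustration, as a table mapping page numbers to protocol kinds, with every bus request carrying the protocol indication for its address. The page size and all field names are assumptions introduced here, not details from the reference.

```python
# Sketch of per-page protocol selection with the protocol bit carried
# on each shared-bus request.
PAGE_SIZE = 4096  # assumed page size in bytes

class PageProtocolTable:
    def __init__(self, default="invalidate"):
        self.default = default
        self.table = {}   # page number -> "invalidate" | "update"

    def set_protocol(self, addr, protocol):
        self.table[addr // PAGE_SIZE] = protocol

    def protocol_for(self, addr):
        return self.table.get(addr // PAGE_SIZE, self.default)

def make_bus_request(op, addr, pages):
    # the protocol indication travels with the request on the shared bus,
    # so snooping caches can handle it under the right protocol
    return {"op": op, "addr": addr, "protocol": pages.protocol_for(addr)}
```

Note that the table is indexed only by page, not by processor, which reflects the limitation pointed out next.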
However, according to the technique disclosed in the foregoing “Fine Grain Support Mechanisms”, selection between the invalidation-based protocol and the update-based protocol cannot be performed for each processor, so that finer control of the cache is not possible.
In view of the background described above, the update-based protocol is well suited to cases where data transmission frequently occurs between the processors in a multiprocessor system, but it has the disadvantage relating to false sharing. Therefore, it is desired to eliminate the false sharing while using the update-based protocol as the base protocol. More specifically, if it is possible to determine, in advance, the address of a variable that is frequently accessed by a certain processor, the false sharing can be reduced, and the hit rate for the cache memory can be increased.
The reference “High-Performance Multiprocessor Workstation (TOP-1)” (Atsushi Moriwaki and Shigenori Shimizu, pp. 1456–1457, Information Processing Society of Japan 38th (1989) Meeting Transactions) suggests using both the update-based protocol and the invalidation-based protocol. However, this reference discloses no suggestion for avoiding the foregoing false sharing, so that contention on the shared bus cannot be avoided.
Japanese Patent Laying-Open No. 2001–109662 discloses a manner in which data in a cache memory of a multiprocessor system is managed by subdividing its state for respective blocks in order to improve the performance of access to the cache memory. The manner disclosed in this reference likewise cannot suppress the false sharing, and therefore cannot avoid wasteful use of the shared bus.