The present invention relates to a cache coherency control method and a multi-processor system using the same.
Recently, the construction of multi-processor systems has become popular as a means of improving the data processing throughput of a computer system. In a multi-processor system, each processor commonly has its own cache system. When a plurality of cache systems are provided, a plurality of copies of the same data naturally exist among those systems, and it becomes necessary to maintain coherency of the cache data among the plurality of processors. The minimum unit of information subject to storage management is handled as one block between the plurality of cache systems and the main memories connected thereto, and data is transferred in units of blocks. Cache coherency is maintained as follows: when one processor conducts a write operation to its own cache, the copy of the same block held by any other cache is either invalidated or updated with the latest data written into the writing processor's cache.
A protocol for maintaining the coherency of the cache data among a plurality of processors is commonly referred to as a cache coherency protocol. Such protocols fall into the following two systems. The first is called a directory system, in which information on each block of the main memory is managed at one point in the system. In this system, a logically single directory which describes the status of all blocks in the main memory is provided, and the directory records which caches hold a copy of each block and the state of each copy. The directory is, in many cases, implemented in a physically distributed form on the main memory, but it is managed as a logically single entity. When a cache system executes a write to a block, it first refers to the directory to determine whether the block has been copied to any other cache, and then notifies the cache systems holding that block of the write. A cache system so notified operates to maintain the cache coherency.
However, in this directory system, the directory is always referred to before the cache access. As a result, the time from the issuance of a process request to the completion of the process (the latency) increases.
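The directory lookup described above may be illustrated by the following minimal sketch. It assumes an invalidate-style notification, and the names `Directory`, `record_copy`, `write` and `lookups` are hypothetical and do not appear in the cited art.

```python
class Directory:
    """Logically single table: block address -> ids of caches holding a copy."""

    def __init__(self):
        self.sharers = {}   # block address -> set of cache ids
        self.lookups = 0    # counts directory references (each adds latency)

    def record_copy(self, block, cache_id):
        # Register that cache_id now holds a copy of the block.
        self.sharers.setdefault(block, set()).add(cache_id)

    def write(self, block, writer_id):
        """Return the caches that must be notified of a write by writer_id."""
        self.lookups += 1  # the directory is always consulted before the access
        holders = self.sharers.get(block, set())
        to_notify = sorted(holders - {writer_id})
        # Assumption: notified caches drop their copies, leaving only the writer.
        self.sharers[block] = {writer_id}
        return to_notify
```

In this sketch every write increments `lookups`, including a write to a block that no other cache holds, which corresponds to the latency penalty noted above.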
The second system is referred to as a snoop system. In this system, every cache holds information on the blocks it owns and constantly monitors a shared bus connecting each cache system and the main memory. A cache system which conducts a write operation sends notice of the write to the shared bus. The other cache systems detect the write information on the shared bus and determine whether or not they own that block. A cache system which owns the block conducts the control to maintain the cache coherency. Since all cache systems must be connected to the shared bus, the snoop system is not suitable for a large scale multi-processor system. However, its latency is shorter than that of the directory system because the ownership of a copy is determined in parallel by the individual cache systems, and it has been adopted in a number of multi-processor systems.
Coherency protocols in the snoop system are classified into two types, write invalidate and write multicast, depending on the operation performed on a write, and a number of systems including modifications thereof have been proposed. "Computer Architecture", Chapter 8, by Hennessy and Patterson, discloses cache coherency protocols for a number of multi-processor systems. Many of the articles referred to in that reference are found in "The Cache Coherence Problem In Shared Memory Multiprocessor: Hardware Solutions", IEEE Computer Society Press. One cache coherency protocol implemented in recent microprocessors is that of the Intel Pentium microprocessor, which is disclosed in "Pentium Processor Architecture and Programming", Chapter 18, Intel Japan Co., Ltd. In the Pentium microprocessor, a cache block is managed in four states, Modified, Exclusive, Shared and Invalid (so-called "MESI-values").
Cache coherency protocols for multi-processor systems based on the MESI algorithm include the one adopted by the IBM PowerPC microprocessor. Details of this system are described in "Power and PowerPC", Chapter 9, Morgan Kaufmann Publishers, Inc. FIG. 2 shows a cache coherency control operation by this protocol.
In FIG. 2, "Invalid" indicates that no valid data is present in the cache block. "Shared" indicates that the same data as that of the main memory (clean data) is present in the cache block and that a copy of that data is present (or may be present) in another cache. Namely, it indicates that the clean data of the cache block is shared (or sharable) with other caches. "Exclusive" indicates that the same data as that of the main memory (clean data) is present in the cache block and that no copy of that data is present in any other cache. "Modified" indicates that data which may differ from that of the main memory is stored in the cache block and that no copy of that data is present in any other cache. When data is written into the cache block, the written data becomes dirty data which may differ from that of the main memory. Thus, in "Shared", "Exclusive" and "Modified", unlike "Invalid", valid data to be referred to is present in the cache block.
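The properties of the four states described above may be summarized by the following sketch. The enumeration `State` and the helper functions `is_valid`, `is_clean` and `may_be_copied_elsewhere` are hypothetical names introduced only for illustration.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def is_valid(state):
    # Valid data is present in every state except Invalid.
    return state is not State.INVALID


def is_clean(state):
    # Clean data is the same as the data in the main memory.
    return state in (State.EXCLUSIVE, State.SHARED)


def may_be_copied_elsewhere(state):
    # Only a Shared block may also be held by another cache.
    return state is State.SHARED
```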
When a read request is issued from a processor to a cache system, the cache system, on receiving the request, first refers to a cache tag memory to determine the state of the block. If the state of the block is "Modified", "Exclusive" or "Shared", a cache hit is determined, and the content of the cache memory is read and sent to the processor. The state of the cache block is left unchanged. On the other hand, if the state of the block is "Invalid", a cache miss is determined and a read request transaction is issued to the shared bus. The other cache systems snoop the read request transaction on the shared bus and check the states of their own caches; if the block is "Modified", "Exclusive" or "Shared", its state is changed to "Shared". If the block is "Modified", the Modified data is also written back to the main memory as the latest data, so that the data of the block coincides with the data in the main memory. The data written back to the main memory is read onto the shared bus and transferred to the requesting cache system. The requesting cache system sends the received data to the processor and stores the data as "Shared". In order to improve the latency in reading data, the Modified data may be transferred directly to the requesting cache system concurrently with being written to the main memory.
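The read operation described above may be sketched as follows. This is a simplified model under stated assumptions: the shared bus is represented by a direct loop over the other caches, the main memory by a dictionary, and the names `Cache`, `snoop_read` and `processor_read` are hypothetical.

```python
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"


class Cache:
    def __init__(self):
        self.state = {}  # block address -> state (an absent block is Invalid)
        self.data = {}   # block address -> cached data

    def snoop_read(self, block, memory):
        """React to a read request transaction seen on the shared bus."""
        st = self.state.get(block, I)
        if st == M:
            memory[block] = self.data[block]  # write the latest data back
        if st in (M, E, S):
            self.state[block] = S


def processor_read(cache, others, memory, block):
    """Serve a processor read; 'others' stands in for the shared bus."""
    if cache.state.get(block, I) in (M, E, S):
        return cache.data[block]              # cache hit, state unchanged
    for other in others:                      # cache miss: bus transaction
        other.snoop_read(block, memory)
    cache.data[block] = memory[block]         # data arrives via main memory
    cache.state[block] = S                    # stored as Shared
    return cache.data[block]
```

For example, if one cache holds a block as "Modified" and another cache reads it, the main memory is updated and both caches end in the "Shared" state. The direct cache-to-cache transfer mentioned as a latency improvement is not modeled.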
When a write request is issued from a processor to a cache system, the cache system, on receiving the request, first refers to the cache tag memory to determine the state of the block. If the state of the block is "Modified" or "Exclusive", a cache hit is determined, the data is written into the cache block, and the block state is changed to "Modified". If the state is "Shared" or "Invalid", a cache miss is determined and a write request transaction is issued to the shared bus. The other cache systems snoop the write request transaction and check the states of their own caches; if the block is "Modified", "Exclusive" or "Shared", its state is changed to "Invalid". If the block is "Modified", the Modified data is also written back to the main memory. The data written back to the main memory is read onto the shared bus and transferred to the requesting cache system. The requesting cache system merges the received data with the data contained in the write request and stores the result as "Modified".
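The write operation described above may be sketched in the same manner. The sketch assumes that a block is a list of words so that the merge step can be shown; the names `Cache`, `snoop_write` and `processor_write` are hypothetical, and the shared bus is again represented by a direct loop over the other caches.

```python
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"


class Cache:
    def __init__(self):
        self.state = {}  # block address -> state (an absent block is Invalid)
        self.data = {}   # block address -> list of words

    def snoop_write(self, block, memory):
        """React to a write request transaction seen on the shared bus."""
        st = self.state.get(block, I)
        if st == M:
            memory[block] = list(self.data[block])  # write back dirty data
        if st in (M, E, S):
            self.state[block] = I                   # invalidate own copy


def processor_write(cache, others, memory, block, offset, value):
    """Serve a processor write of one word at 'offset' within 'block'."""
    if cache.state.get(block, I) not in (M, E):
        # Shared or Invalid is treated as a miss: bus transaction first.
        for other in others:
            other.snoop_write(block, memory)
        cache.data[block] = list(memory[block])     # receive the block
    cache.data[block][offset] = value               # merge the written word
    cache.state[block] = M
```

In this sketch a write that hits a "Modified" or "Exclusive" block never reaches the loop over the other caches, so it generates no bus traffic.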
The cache coherency protocol by the MESI algorithm has thus been described. In implementing a cache coherency protocol of the snoop system, the first problem is the throughput of the shared bus. As the performance of the processors connected to the shared bus improves and as the number of connected processors increases, the required throughput increases more and more. It is thus necessary to improve the throughput actually provided by the shared bus while reducing the throughput requested by the processors and the cache systems. The throughput provided by the shared bus is commonly improved by using a high speed operation clock and widening the data path. If implementation by a bus is not feasible, the improvement may be attained by using an interconnecting network which functions in the same manner as the bus. The throughput requested by the processors or the cache systems is, in many cases, reduced by increasing the cache capacity of the cache system or improving the cache structure.
However, those approaches are costly.
A second problem in implementing the snoop system is a shortage of throughput in the state determination for snooping. A write operation notice flowing over the shared bus includes the address of the block, but in order to determine the state in the cache of the block corresponding to the received address, it is necessary to refer to the cache tag memory which stores the tags of the blocks held therein. Namely, the cache system refers to the cache tag memory each time an access request arrives from another cache system. In addition, the cache system also refers to the cache tag memory while supplying data to its processor. Since the state of a cache block in the cache tag memory, which is referred to by both sides, must be managed as a logically single entity, access to the cache tag memory is usually conducted exclusively. Switching between the two kinds of access causes the shortage of throughput.
In the prior art, in order to solve the shortage of throughput, a duplicate of the cache tag is provided, and an access from the shared bus first refers to the duplicate tag.
However, since it is common to use very high speed memory elements for the cache tag memory which stores the cache tags, duplicating the cache tag memory degrades the cost performance. Further, when a large capacity cache is adopted to increase the hit ratio, the capacity of the cache tag also increases, which is a further impediment to the duplication.
As described above, when a processor conducts a write operation to its own cache, an invalidate request is issued to invalidate the block owned by other cache systems. A prior art method to reduce the invalidate requests is disclosed in JP-B-6-64553 "Stack Control Circuit", in which a cache system has a plurality of stacks for temporarily storing invalidate requests (specifically, addresses of blocks to be invalidated) received from other processors, compares the invalidate addresses among the stacks, and, when invalidate addresses coincide, deletes one of them to avoid repeating the invalidate process for the same address.
However, in this prior art method, since duplication is detected only among the invalidate requests stored in the stacks, the reduction can be attained only for invalidate requests received close together in time. Namely, it is not effective unless the same invalidate request is issued repeatedly within a short time. When the stack capacity is increased to extend the time a request stays in the stacks, the duplicate detection effect may be enhanced, but the invalidate requests are delayed. When the cache system owns the dirty data and this must be reported to the requesting system, the transfer of the latest data is also delayed. Since this delay directly affects the access latency, holding a plurality of invalidate addresses in the stacks causes a significant loss in performance.
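The limitation described above may be illustrated by the following simplified sketch. It is not the circuit of JP-B-6-64553: the plurality of stacks is assumed to be modeled as a single bounded queue, and the names `InvalidateStack`, `receive` and `dropped` are hypothetical.

```python
from collections import deque


class InvalidateStack:
    """Bounded store of pending invalidate addresses with duplicate removal."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()  # addresses waiting to be processed
        self.dropped = 0        # duplicates detected and deleted

    def receive(self, address):
        """Accept an invalidate address; return the addresses processed now."""
        if address in self.pending:
            self.dropped += 1   # a duplicate is found only while still pending
            return []
        processed = []
        if len(self.pending) == self.capacity:
            processed.append(self.pending.popleft())  # oldest request leaves
        self.pending.append(address)
        return processed
```

With a capacity of two, a repeated address is deleted only if it arrives while the earlier request is still pending; once that request has left the queue, the same address is accepted again. A larger capacity detects more duplicates, but each request waits longer before it is processed.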
Another prior art technique for reducing the invalidate requests is disclosed in "Issues in Multi-Level Cache Designs", 1994 IEEE International Conference On Computer Design: VLSI in Computer and Processors (ICCD '94). This article introduces a table, called an invalidate history table, for recording invalidate requests. The technology disclosed in this article is briefly explained with reference to FIGS. 12 and 13. FIG. 12 shows four processors (processors 0 to 3), each having a primary cache of 32K-byte capacity, and a 4M-byte secondary cache connected to each of the processors. As shown in FIG. 12, the cache systems form a two-level hierarchy. FIG. 13 shows an example of a history table for sequentially recording the invalidate requests (invalidate addresses) issued by the respective primary caches. The history table is held in a tag memory of the secondary cache. FIG. 13 shows the history table as well as an address register for storing a given invalidate address, a secondary cache tag table, a secondary cache hit determination circuit for comparing the addresses (tags) stored in the secondary cache tag table with the address stored in the address register to determine a hit, and a history table hit determination circuit for determining a hit in the history table.
When an invalidate address is issued from a primary cache, the history table is referred to, and if it hits, it is determined that the invalidation has already been made and the request is deleted. When the first invalidate request to a block is issued, the address of the block is not yet registered in the history table; in this case, the address is registered in the history table and the invalidate request for that address is issued to all other primary caches.
By this arrangement, invalidate requests to the other primary caches, other than the first request for each block, are eliminated, and the processing of invalidate requests in the other primary caches is reduced.
However, in this prior art technology, the states of all primary caches connected to the secondary cache are centrally managed by the history table. For example, when a coherency request is issued from a primary cache, the history table of the secondary cache is referred to first, and if it does not hit, the coherency request is transferred to the corresponding primary caches. Namely, as viewed from the primary caches, this prior art technology amounts to nothing more than placing the directory of the directory system in the secondary cache. In this method, the transfer is conducted twice and the access latency for the coherency request increases.
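The filtering by the invalidate history table may be sketched as follows. The names `SecondaryCache`, `invalidate` and `forwarded` are hypothetical. The manner in which entries are removed from the history table is not described above and is therefore omitted from the sketch.

```python
class SecondaryCache:
    """Secondary cache holding an invalidate history table."""

    def __init__(self, num_primaries):
        self.num_primaries = num_primaries
        self.history = set()   # invalidate addresses already recorded
        self.forwarded = []    # (target primary cache, address) pairs issued

    def invalidate(self, source, address):
        """Handle an invalidate address issued by primary cache 'source'.

        Returns the number of invalidate requests forwarded to other
        primary caches.
        """
        if address in self.history:
            return 0           # hit: already invalidated, request is deleted
        self.history.add(address)
        targets = [p for p in range(self.num_primaries) if p != source]
        self.forwarded.extend((p, address) for p in targets)
        return len(targets)
```

With four primary caches, the first invalidate request for an address is forwarded to the three other primary caches and later requests for the same address are forwarded to none. Every request nevertheless passes through the table in the secondary cache first, which corresponds to the two-stage transfer noted above.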