The present invention relates to an improved high performance multiprocessor computer system, and more particularly to a cache memory coherency control for distributed cache memories to be used therein.
There is significant ongoing research and development on scalable shared-memory multiprocessor systems capable of efficiently operating tens to several thousands of processors. Many of these systems adopt a so-called Non-Uniform Memory Access (NUMA) architecture, which has a distributed memory configuration. That is, when a single memory is shared by several thousand processors, the system cannot achieve its utmost performance due to a bottleneck likely to arise in concurrent accessing of the shared memory. The NUMA architecture is intended to solve this problem by distributing the shared memory.
On the other hand, with the current technical trend of increasing processor operating frequencies, the latency of accessing a main memory has become an important factor in determining system performance. To improve this latency, it is preferable for the main memory to be provided in the vicinity of the processors. In this respect also, a distributed memory configuration (NUMA) having a local memory for each processor is preferable. According to such a configuration, there is room for further significant improvement in latency, since the operating frequency of the local memories can be increased along with the operating frequencies of the processors. Typical examples of such distributed memory systems are listed below.
(1) DASH system at Stanford University: Daniel Lenoski, et al., "The DASH Prototype: Implementation and Performance", Proc. 19th Int. Symp. on Computer Architecture, 1992.
(2) SCI (Scalable Coherent Interface): David B. Gustavson, "The Scalable Coherent Interface and Related Standards Projects", IEEE MICRO, pp. 10-22, 1992.
(3) IBM RP3 (Research Parallel Processor): "The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture", Proc. of the 1985 Int. Conf. on Parallel Processing, pp. 764-771, 1985.
An important problem to be solved in any distributed memory system is that of cache memory coherency control, which must be implemented for the respective cache memories distributed among several thousand processors. This mechanism is required to maintain coherency among the contents of the cached data in the respective cache memories of the respective processors.
Conventionally, in a multiprocessor system consisting of several processors, a cache coherence protocol referred to as the bus-snooping system is generally adopted. In this system, each processor is coupled to a shared bus and implements its cache coherence scheme by monitoring transactions on the shared bus. Namely, when a particular processor wishes to read particular data, it broadcasts the address of that data on the shared bus. Any of the other processors, which are snooping transactions on the shared bus, transfers the associated data to the requesting processor when it finds an updated version of the desired data in its own cache memory.
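Although the snooping mechanism described above is implemented in hardware, its read path can be sketched in software for illustration. The following is a minimal sketch only; the class and method names are hypothetical and the cache-line representation is an assumption.

```python
# Hypothetical sketch of the bus-snooping read described above: every cache
# watches addresses broadcast on the shared bus, and a cache holding an
# updated ("dirty") copy supplies the data to the requesting processor.

class SnoopingCache:
    def __init__(self):
        self.lines = {}  # address -> (data, dirty_flag)

    def snoop_read(self, address):
        """Respond with data if this cache holds an updated copy, else None."""
        entry = self.lines.get(address)
        if entry is not None and entry[1]:   # dirty copy found
            return entry[0]
        return None

class SharedBus:
    def __init__(self, caches):
        self.caches = caches

    def broadcast_read(self, requester, address):
        """Broadcast a read address; any snooping cache with an updated
        copy performs a cache-to-cache transfer to the requester."""
        for cache in self.caches:
            if cache is requester:
                continue
            data = cache.snoop_read(address)
            if data is not None:
                return data          # cache-to-cache transfer
        return None                  # otherwise read from main memory
```

Note that every read requires interrogating every other cache, which is precisely the scaling problem described in the next paragraph.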
However, when this bus-snooping system is applied directly to a shared memory multiprocessor system having as many as several thousand processors, the following problems may occur. A first problem is that it takes too much time from the broadcasting of the data address to the several thousand processors until the reception of reports on cache coherency from all of the processors. In consequence, even if the access latency of a local memory is reduced by the distributed memory configuration, the delay in cache coherency prevents instant utilization of the data. A second problem is that the load on the shared bus becomes excessively great. Namely, every time a processor reads or writes data from or to memory, a broadcast is issued to every other processor. As a result, too many transactions occur on the shared bus when viewed in respect of the overall system. In addition, the frequency of cache coherence procedures executed by the shared-bus snooping unit in each processor increases, resulting in a bottleneck, so that the shared bus system cannot achieve its utmost performance.
As prior art cache coherency protocol methods to solve the problems described above, two approaches are known: the directory-based protocol approach and the software-controlled protocol approach. In the directory-based protocol approach, each distributed memory has a directory which keeps track of the cached data for all of the caches in the system. Use of this directory eliminates the need to provide means for broadcasting to all of the processors, or a bus-snooping mechanism.
Within the directory-based protocol approach, there are two further approaches: the mapping protocol approach and the distributed link protocol approach.
By way of example, the foregoing DASH system adopts a mapping protocol approach. The directory for the mapping protocol approach consists of cache presence bits which indicate the cache memories that have a copy of shared data. Thus, the presence bits must be as numerous as the cache memories provided in the system. As modifications of this mapping method, a limit mapping method and a group mapping method are also known. The limit mapping method reduces the number of bits required for indicating cache presence by limiting the number of cache memories which are allowed to have a copy of data on the shared memory. Further, in the group mapping method, a group including several processors is defined as the unit for setting a cache presence bit, thereby decreasing the number of bits required. Within each group, cache coherence can be implemented by means of the bus-snooping protocol. The above-mentioned DASH system adopts, in practice, the group mapping method.
The distributed link protocol, which is another of the directory-based protocols, has been adopted by the aforementioned SCI system. The distributed link protocol provides each data item on the shared memory and in the cache memories with link information, so that a linked list is formed linking every copy of the data in the cache memories and the shared memory. For example, if a particular processor issues a request to delete the copies of particular shared-memory data from the caches, the cache coherence control traces down the corresponding link information for the shared memory data until it finds an initial copy thereof to delete. When the initial copy has further link information, a subsequent copy can be traced down via the link information and then deleted. According to this method, the directory information can be decreased advantageously in comparison with the mapping protocol method.
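The linked-list tracing described above can be sketched as follows. This is an illustrative simplification (the real SCI sharing list is doubly linked and distributed across nodes); class names are hypothetical.

```python
# Sketch of the distributed link directory: the shared-memory entry links
# to the first cached copy, and each copy links to the next. Invalidation
# must walk the list one copy at a time, which is the source of the delay
# discussed later in the text.

class Copy:
    def __init__(self, cache_id):
        self.cache_id = cache_id
        self.next = None             # link to the next cached copy

class LinkedDirectoryEntry:
    def __init__(self):
        self.head = None             # link kept with the shared-memory data

    def add_copy(self, cache_id):
        copy = Copy(cache_id)
        copy.next = self.head        # prepend the new copy to the list
        self.head = copy

    def invalidate_all(self):
        """Trace the links, invalidating each copy in turn (serial walk)."""
        invalidated = []
        node = self.head
        while node is not None:      # one step per cached copy
            invalidated.append(node.cache_id)
            node = node.next
        self.head = None
        return invalidated
```

Per block, only a single link field is stored in memory rather than a full presence-bit vector, which is why the directory is smaller than in the mapping protocol.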
Another important cache coherence protocol system, different from the directory-based protocol, is the software-controlled protocol system, which is adopted by the above-mentioned IBM RP3 system. The software-controlled protocol system is provided with functions capable of assigning attributes distinguishing between cachable and non-cachable data items per unit of pages, for example, per 4 Kbytes, as well as of invalidating a particular cache memory entry from the user's program. For example, a local data item specific to a particular task is assigned a cachable attribute, while a data item which is shared between tasks is designated with a non-cachable attribute. Then, when a task is transferred from the processor currently executing it to another, the local data cached in the cache memory of the first processor is completely invalidated. Since it is thereby ensured that no copy of the local data is present in the other cache memories, there is no need for a cache coherence protocol mechanism to be installed. In addition, since no copy of shared data is cached, there is no need for the cache coherence protocol itself. Further, according to another example, it may be conceived that, among data which needs to be shared between tasks, read-only shared data is given a cachable attribute. It is also possible to give the whole of a shared data item a cachable attribute. In this case, access to the shared data is limited to one task at a time by using a flag or semaphore. Any task, upon modification of its shared data, must reflect the contents of the modification onto the main memory by means of a cache invalidate function before clearing its flag or semaphore.
According to the software controlled protocol method described above, it is possible to provide a scalable shared memory multiprocessor which does not require hardware for implementing a cache coherence protocol mechanism, such as the bus-snooping mechanism or the directory-based mechanism.
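The page-granular cachability attributes and the explicit invalidation on task migration described above can be sketched as follows. All names are hypothetical; in RP3 these attributes live in the page table, not in a Python dictionary.

```python
# Sketch of the software-controlled scheme: software assigns a cachable /
# non-cachable attribute per page (assumed 4 Kbytes here) and explicitly
# invalidates cached local data before a task migrates, so no stale copy
# remains on the old processor.

PAGE_SIZE = 4096                     # assumed page unit (4 Kbytes)

class PageAttributeTable:
    def __init__(self):
        self.cachable = {}           # page number -> True/False

    def set_cachable(self, addr, flag):
        self.cachable[addr // PAGE_SIZE] = flag

    def is_cachable(self, addr):
        # non-cachable by default, as for task-shared data in the text
        return self.cachable.get(addr // PAGE_SIZE, False)

def migrate_task(cache, task_pages):
    """Invalidate the migrating task's cached local data (the explicit
    cache-invalidate step the software-controlled protocol relies on)."""
    for page in task_pages:
        cache.pop(page, None)
```

The burden this places on software is exactly the drawback raised later: the programmer (or compiler) must track which pages are safe to cache and when to invalidate them.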
One of the problems associated with the mapping protocol, which is one of the prior directory-based protocol systems, is that the size of the directory tends to become excessively large, thus requiring substantial time to read information from it. For example, assuming a system configuration in which a group of 32 processors operates on a shared memory of 512 Mbytes, with 32 bytes making up one block managed by the directory, the size of the directory becomes 512 Mbytes / 32 bytes × 32 bits = 64 Mbytes. Even if, by the group mapping method, four processors are grouped into one group, the size of the directory will still be 16 Mbytes. Further, if the directory is implemented with DRAMs, the access latency becomes large, and if it is implemented with SRAMs, the manufacturing becomes costlier. As the latency of accessing the directory increases, the delay in the cache coherence protocol increases, thus failing to achieve any significant improvement in the latency of the shared memory.
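The directory-size figures above (and the 10-Mbyte figure for the distributed link protocol given below) follow directly from the stated parameters and can be reproduced:

```python
# Reproducing the directory-size arithmetic in the text: the shared memory
# is divided into directory-managed blocks, and each block carries either
# one presence bit per processor (full map), one bit per four-processor
# group (group map), or a 5-bit link field (distributed link).

MEM_BYTES   = 512 * 2**20            # 512 Mbytes of shared memory
BLOCK_BYTES = 32                     # one directory-managed block
PROCESSORS  = 32

blocks = MEM_BYTES // BLOCK_BYTES    # 16 M blocks

full_map_bytes  = blocks * PROCESSORS // 8          # 32 bits per block
group_map_bytes = blocks * (PROCESSORS // 4) // 8   # 8 bits per block
link_bytes      = blocks * 5 // 8                   # 5 bits per block

print(full_map_bytes  // 2**20, "Mbytes")  # 64 Mbytes
print(group_map_bytes // 2**20, "Mbytes")  # 16 Mbytes
print(link_bytes      // 2**20, "Mbytes")  # 10 Mbytes
```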
Problems associated with the distributed link protocol, another example of the prior directory-based protocol systems, are that the size of its directory also tends to become large and, further, that since the distributed link protocol carries out its cache coherence procedure by tracing down the associated link information, the delay in the cache coherence protocol tends to increase. In respect of the size of the directory information, in a system according to the above example, it becomes 512 Mbytes / 32 bytes × 5 bits = 10 Mbytes. Even though this is a smaller capacity than in the mapping protocol, the directory still needs to be implemented with DRAM technology, thereby resulting in an increased access time. Another problem, ascribed to the link information, will be described by way of example as follows. Presume that a particular processor issues a request to invalidate each copy of shared data cached in other cache memories in order to update its own cache memory. At this time, the cache coherence protocol function first reads out the link information of the corresponding data in the shared memory; then, in accordance with its contents, it invalidates the associated entries in the other cache memories. This process must be repeated as long as the associated link exists. Thereby, there arises a problem that it takes significant time until all of the copies in the respective caches are invalidated.
Problems associated with the prior art software-controlled protocol are that the advantages which cache memories would otherwise provide in shared data accessing cannot be expected, resulting in deteriorated access latency, since in this method no copies of shared data are cachable in the cache memories, and traffic concentration on the shared bus cannot be alleviated. Further, under the variant of this protocol whereby a copy of the shared data can be registered in a cache memory by software, the programmer is required to be constantly conscious of cache coherency, thus imposing an excessive burden on the programmer.
The main object of the present invention is to provide a cache coherence protocol system capable of executing cache coherency transactions at high speed and with minimal interprocessor communication in a large scale multiprocessor system, and processors suitable therefor.
A first measure to solve the above-mentioned problems according to the present invention will be described in the following. According to the invention, there is proposed a multiprocessor system architecture comprising a plurality of clusters, a bus for interconnecting said plurality of clusters, a global shared memory, and a system control unit for controlling access from any processor in said plurality of clusters to the global shared memory, each one of said plurality of clusters comprising at least two processors, each having a cache memory and a translation lookaside buffer, a local shared memory, and a memory interface unit which is coupled to said at least two processors and the local shared memory and controls access from said at least two processors to the local shared memory, wherein
said translation lookaside buffer holds area limit attribute information which helps identify whether a cache coherence control is to be executed only for cache memories in one of said plurality of clusters or for every one of the cache memories throughout the system in response to an access request from any one of the processors.
Further, it is arranged according to the present invention that, for every access from any processor, area limit attribute information is retained in its translation lookaside buffer, which helps identify whether a cache coherency protocol should be executed for every one of the cache memories in the system or only for such cache memories as are provided in a limited area of the clusters. Further, there are provided in the memory interface unit cache coherency area determination means for determining a cache coherency area in dependence on the area limit attribute information retained in the translation lookaside buffer, and broadcast means for broadcasting the information to be utilized in the cache coherence protocol to the associated processors within the area specified in accordance with the determination by the cache coherency area determination means. More specifically, the cache coherency area determination means of the invention is provided with a cluster number register for storing information indicative of the identification number of its own cluster, and a comparator for comparing the information retained in the cluster number register with a real address, translated from a virtual address, which was an access address from any one of the processors, wherein the limited area requiring the cache coherency protocol is determined in dependence on the result of comparison by the comparator and the area limit attribute information stored in the translation lookaside buffer.
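The determination performed by the comparator and cluster number register described above can be sketched as follows. The address layout (a cluster-number field in the high bits of the real address) and all names are assumptions for illustration; the invention realizes this in the memory interface unit hardware.

```python
# Sketch of the cache coherency area determination: the TLB entry supplies
# an area limit attribute, and a comparator matches the cluster field of
# the translated real address against the cluster number register.

LOCAL  = "local"    # coherence limited to this cluster's caches
GLOBAL = "global"   # coherence across every cache in the system

CLUSTER_SHIFT = 26  # assumed bit position of the cluster field

def coherence_area(area_attr, real_address, cluster_number_register):
    """Decide the area over which the coherence broadcast must extend."""
    addr_cluster = real_address >> CLUSTER_SHIFT
    if area_attr == LOCAL and addr_cluster == cluster_number_register:
        return LOCAL     # broadcast only within this cluster
    return GLOBAL        # otherwise, system-wide coherence is required
```

The broadcast means would then send the coherence information only to the processors inside the returned area, which is what reduces interprocessor traffic.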
Still further, it is arranged according to the present invention that a processor comprises an instruction cache memory for retaining a portion of the instructions stored in a main memory, a data cache memory for retaining a portion of the data stored in the main memory, an instruction fetch unit for fetching an instruction to be executed from the instruction cache memory or the main memory, an instruction execution unit which interprets the instruction fetched by the instruction fetch unit and then reads out data from the data cache memory or the main memory accordingly to execute the instruction thus interpreted, and a translation lookaside buffer for translating a virtual address issued from the instruction fetch unit or the instruction execution unit into a real address, wherein a plurality of such processors are interconnected to constitute a computer system in which area attribute information, which defines a limited area of the plurality of cache memories of the plurality of processors for which cache coherency must be executed, is retained in each translation lookaside buffer.
A second measure to solve the above-mentioned problems according to the invention will be described in the following. It is proposed, in order to accomplish the second measure of the invention, that a large scale multiprocessor system be divided into a plurality of clusters, each of which consists of a group including a plurality of processors and a main memory, and that each cluster include an export directory. The export directory provided in each cluster is a set-associative directory which registers therein an identifier of any data in the particular cluster to which it is assigned, when copies of that data are cached in cache memories in an external cluster. In this architecture, each cluster includes at least one processor and at least one main memory. Cache memory consistency among the processors of each cluster is maintained through a cache coherency protocol, such as bus snooping or the like. Each entry of the export directory holds the physical address of data whose copy is cached in a remote cluster, and a status bit indicative of its status. The status bit represents one of the three statuses "shared", "dirty" and "invalid". The shared status represents that the corresponding data has copies cached in an external cluster(s) with no modification applied. The dirty status represents that the corresponding data has copies cached in an external cluster(s) with modifications entered in the contents of the data, while the invalid status indicates that the corresponding entry is invalid.
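The export directory described above can be sketched as follows. The set geometry, replacement policy, and all names are assumptions; the point is the set-associative lookup, the three statuses, and the fact that registering into a full set purges a victim entry (the overflow case handled next).

```python
# Sketch of the set-associative export directory: each entry pairs the
# physical address of data exported to a remote cluster with a status
# encoding shared / dirty / invalid.

INVALID, SHARED, DIRTY = 0, 1, 2

class ExportDirectory:
    def __init__(self, num_sets=256, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        # each set holds `ways` entries of [address, status]
        self.sets = [[[None, INVALID] for _ in range(ways)]
                     for _ in range(num_sets)]

    def _set(self, address):
        return self.sets[address % self.num_sets]  # assumed index function

    def lookup(self, address):
        """Return SHARED or DIRTY if the data is cached externally,
        or None if coherence is needed only within this cluster."""
        for entry in self._set(address):
            if entry[0] == address and entry[1] != INVALID:
                return entry[1]
        return None

    def register(self, address, status):
        """Register an exported block; on overflow, return the purged
        victim address, whose copies must be invalidated system-wide."""
        s = self._set(address)
        for entry in s:
            if entry[1] == INVALID or entry[0] == address:
                entry[0], entry[1] = address, status
                return None
        victim = s[0][0]             # arbitrary victim choice for the sketch
        s[0] = [address, status]
        return victim
```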
Further, overflow control means provided for the export directory has the function of invalidating, in every one of the cache memories in the system, the data corresponding to an entry which is purged out of the export directory when an overflow occurs therein.
The operation of the above-mentioned first measure of the invention will be described in the following. When any processor issues a memory access request, the virtual address of the memory being addressed is translated into a real address by the translation lookaside buffer. At this time, in reference to the area attribute information retained in the translation lookaside buffer, which helps identify the area requiring cache coherency, the pertinent area for which the cache coherence protocol is to be executed is determined for this memory access request.
Further, in the memory interface unit of the invention, the cache coherency area determination means determines an appropriate extent of the area for executing cache coherency in dependence on the area limit attribute information held in the translation lookaside buffer and a real address (memory address) which has been translated by the translation lookaside buffer. Subsequently, pertinent information to be utilized in cache coherence procedures is broadcast by broadcast means only to such processors which are directly involved in a limited area determined by the cache coherency area determination means.
Thereby, it becomes possible to define a cache coherence area which can be limited in accordance with various characteristics of the data, such as whether it is local data, shared data, a stack region, etc. In particular, in a very large scale multiprocessor system, since the cache coherence area can be limited as indicated above, it is no longer necessary for every one of the caches in the system to be addressed to ensure cache consistency, thus resulting in a substantial improvement in the latency of the cache coherence protocol. In addition, since the information to be utilized in cache coherency is broadcast only to the processors within the limited area, it is no longer necessary for every one of the processors in the system to be addressed through broadcasting on every occasion of a memory read/write, thereby substantially reducing the amount of processor-to-processor communication.
Further, in a processor system interconnecting a plurality of processors, wherein each processor comprises an instruction cache memory, a data cache memory, an instruction fetch unit for fetching an instruction to be executed from the instruction cache memory or the main memory, an instruction execution unit which interprets the instruction fetched by the instruction fetch unit and reads out corresponding data from the data cache memory or the main memory in order to execute the instruction thus interpreted, and a translation lookaside buffer for translating a virtual address issued from the instruction fetch unit or the instruction execution unit into a real address, since it is arranged that area attribute information which defines the appropriate area of the plurality of cache memories present in the plurality of processors for which cache coherency must be maintained is retained in each translation lookaside buffer, it becomes possible to limit the extent of the area for which cache coherency is to be maintained in dependence on the various characteristics of the data, thereby providing a processor suitable for use in a multiprocessor system interconnecting a plurality of such processors.
The operation of the above-mentioned second measure of the invention will be described in the following. When a read access to the main memory in a given cluster occurs from outside thereof, the address of the subject data is registered in the export directory of the given cluster. On this occasion, the status of the entry registered therein is determined by the type of access from outside the given cluster. Namely, when the data is intended for use as a reference only, it will be registered in the shared state, and when it is a data read for updating, it will be registered in the dirty state. A corresponding entry in the export directory is invalidated when data exported outside its cluster is invalidated, or when the corresponding data is written back to its home cluster upon being purged out of the cache memories.
When a given processor issues a memory access request, a cache coherency transaction is executed within a given cluster which contains the given processor.. At the same time, in this event, an export directory within the given cluster is searched to verify whether or not any copy of the subject data is cached in cache memories in the other clusters outside the given cluster. When it is verified as a result of the search that no copy of the subject data is cached outside its cluster, cache coherency to be maintained is required only within its cluster. On the other hand, when a copy of the subject data is verified to have been cached outside its cluster and its status bit indicates a necessity of cache coherency, its memory address is broadcast to every one of the clusters to execute cache coherency procedures therein. As a result of such cache coherency procedures, if it is required, an inter-cluster cache-to-cache data transfer will be executed.
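The decision described above, of whether a coherence transaction can stay within the cluster or must be broadcast to all clusters, can be sketched as follows. The function name and the interpretation of when an external clean copy forces a broadcast are assumptions; `exported_status` stands for the result of the export-directory search (None, "shared", or "dirty").

```python
# Sketch of the access flow: local coherence always runs inside the
# cluster; the export-directory search decides whether a system-wide
# broadcast is also needed.

def coherence_scope(exported_status, is_write):
    """Decide how far the coherence transaction must reach."""
    if exported_status is None:
        return "cluster-only"        # no copy cached outside this cluster
    if is_write or exported_status == "dirty":
        return "all-clusters"        # broadcast the address to every cluster
    return "cluster-only"            # external clean copies, read access
```

The common case, in which data is not exported, thus avoids any inter-cluster broadcast, which is the source of the latency and traffic improvements claimed below.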
When the export directory overflows, the address of any entry which was purged therefrom is sent to the overflow control means described above. Then, the overflow control means broadcasts the address thereof to every one of the clusters so as to invalidate the copies of the corresponding data.
By adopting such an arrangement of the invention, it becomes possible to limit the area for which cache coherency is required in dependence on the information stored in the export directory. In particular, in any large-scale multiprocessor system, if the cache coherency area can be limited, there will no longer be any need for every one of the caches in the system to be addressed to maintain cache coherency, except for those within the limited area, whereby the latency of the cache coherency control will be greatly improved. Further, since it is no longer required to broadcast to every one of the processors within the system on every occasion of a memory read/write access, the amount of communication between processors can be reduced substantially.