The present invention relates to cache memory systems and more particularly to a hierarchical caching system suitable for use with shared bus multiprocessors.
Over the past few years, many methods have been developed to interconnect the processors and memory units of a tightly coupled multiprocessor system. One such method has been to organize a group of processors on one side of the system and a group of memories on the other side, with a central switching network connecting the processors to the memories. Shared address space machines have also been built or proposed which distribute the memories among the processors using a hierarchical network for interconnection. Since the switching networks impose added delay, and main memory is often not fast enough to keep up with the processors, it becomes necessary to add private caches to each of the processors to allow them to run at full speed. The addition of such private caches raises cache coherency issues which, for a general switching network, are very difficult to resolve. In fact, the systems proposed to date to maintain coherency often become the system bottlenecks.
Another method of interconnecting a group of processors of a multiprocessor system is connection through a shared bus. Since all traffic between the caches and main memory travels on the shared bus, each cache can ensure that it contains only current copies of main memory locations by monitoring the bus for transfers which might result in memory location obsolescence. While shared buses have often been faulted for inadequate bandwidth, the use of large private caches drastically reduces the required bandwidth while providing the processors with short effective access times. In addition, the extra bus traffic generated by the coherence algorithms is negligible, so the full bandwidth of the bus is available to handle real data. As with uniprocessors, the private caches may be of either the write-through or write-deferred type, with corresponding write-through or write-deferred coherency schemes. Because bus bandwidth is the principal limiting factor in expansion of these systems, write-deferred schemes are generally preferred, but the write-through schemes are simpler to implement.
Coherency problems between several private caches in a multiprocessor system develop when several private caches have copies of the contents of a particular memory location. Obviously, it is essential that these copies be identical. If a processor modifies a copy of the memory location in its cache, that modification must migrate either to main memory or to all of the other private caches having a copy of that location. In the alternative, the copies in all of the other caches can be invalidated. In practice, the invalidation method is usually required, as it eliminates a number of race conditions which can develop if updating of copies of the memory location in the other caches is attempted. With non-shared bus switching schemes, traffic due to invalidation messages can become quite large. Shared buses, however, can provide these messages implicitly.
With a write-through caching scheme, each time a private cache's copy of a location is written to by its processor, that write is passed on to the main memory over the shared bus. As indicated in the state diagram for a cache location in the write-through cache based multiprocessor of FIG. 1, all other caches in the system will be monitoring the shared bus for writes, and if any other caches contain a copy of the memory location, they will invalidate that copy. If none of the other caches ever actually use that location again, then the coherency scheme produces no additional bus traffic. If one or more of the processors request that location at a later time, extra main memory reads will result. Simulation experiments have shown that these extra reads are infrequent and contribute very little to extra bus traffic. Because the caches are write-through, the amount of traffic on the bus will never be less than the sum of all the processor-generated memory writes, which typically comprise 15%-20% of memory requests. Write-through caching is highly effective where limited parallelism (on the order of 20 medium speed processors) is required and is simple to implement.
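By way of illustration only, the write-through invalidation behavior described above may be sketched as follows. The sketch is not taken from FIG. 1; the names `Bus`, `WriteThroughCache` and `snoop_write`, the one-word block size, and the choice to keep a local copy on a write miss are all assumptions made for the example.

```python
class Bus:
    """Hypothetical shared bus with main memory attached (illustrative only)."""

    def __init__(self):
        self.memory = {}   # address -> value held in main memory
        self.caches = []   # every private cache attached to the bus
        self.reads = 0     # main memory reads seen on the bus
        self.writes = 0    # main memory writes seen on the bus

    def read(self, addr):
        self.reads += 1
        return self.memory.get(addr, 0)

    def write(self, source, addr, value):
        self.writes += 1
        self.memory[addr] = value
        # Every other cache monitors the bus and reacts to the write.
        for cache in self.caches:
            if cache is not source:
                cache.snoop_write(addr)


class WriteThroughCache:
    """Private write-through cache; a location is valid if it is present."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}    # address -> cached value
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from main memory
            self.lines[addr] = self.bus.read(addr)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value             # assumption: allocate on write
        self.bus.write(self, addr, value)    # every write goes to the bus

    def snoop_write(self, addr):
        self.lines.pop(addr, None)           # invalidate a stale copy, if any
```

In this sketch every processor write produces exactly one bus write, which reflects the lower bound on bus traffic noted above, and a cache whose copy was invalidated pays one extra main memory read only if its processor requests the location again.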
Write-deferred caching schemes have been successfully utilized by uniprocessor designers, who have found that write-deferred caches significantly reduce the amount of main memory traffic from that of write-through caches. In theory, as the cache size approaches infinity, the memory traffic approaches zero. Current practical cache sizes enable a multiprocessor system utilizing write-deferred caching to reduce bus traffic by up to an order of magnitude relative to write-through caches, thereby potentially allowing an order of magnitude or more additional processors on a shared bus multiprocessor system. The necessity of coherency maintenance complicates the situation, and several additions to the basic write-deferred scheme are usually required. The result is a more complicated system with higher bus utilization rates than a pure write-deferred system, but with much lower bus utilization per processor than a write-through scheme.
Several variations of shared bus oriented write-deferred caching systems exist which maintain cache coherency. In "Using Cache Memory to Reduce Processor Memory Traffic" (10th International Symposium on Computer Architecture) by Dr. James R. Goodman, a caching system (hereinafter the "Goodman system") is described which utilizes an initial write-through mode for recently acquired data to invalidate other caches in the event of a local modification to the data. As shown in the state diagram of FIG. 2, in the Goodman system, when a main memory location is initially accessed it enters the cache in either the valid state (if a read) or a reserved state (if a write). A location already in the valid state will enter the reserved state if a processor write access occurs. The write which caused the transition into the reserved state will be passed through the cache and onto the shared bus. Subsequent processor writes to that location will place it in a special state indicating that the cache it is in has the only correct copy of the location. This state is referred to as a "dirty" state.
As in the case of write-through caching, all caches in the Goodman system monitor the bus for writes which affect their own data and invalidate it when such a write is seen. Thus, after sending the initial write-through, a cache is guaranteed to have the only copy of a memory location, and can write at will to it without sending further writes to the shared bus. However, the cache must monitor the shared bus for any reads to memory locations whose copies it has been modifying, for after such a read, it will no longer have an exclusive copy of that location. If only the initial write-through write has occurred, then the only action necessary is for the cache which had done the write to forget that it had an exclusive copy of the memory location. If two or more writes have been performed by the cache's associated processor, then it will have the only correct copy and must therefore update main memory before main memory responds to the read request of the other caches.
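The state transitions of the Goodman system described in the two preceding paragraphs may be sketched, for illustration only, as shown below. The sketch is not a reproduction of FIG. 2. The names `State`, `Bus`, `WriteOnceCache`, `snoop_read`, `snoop_write` and `flush` are hypothetical, each block is assumed to hold a single word, and the return of a reserved or dirty location to the valid state after another cache reads it is an assumption consistent with the loss of exclusivity described above.

```python
from enum import Enum


class State(Enum):
    INVALID = "invalid"
    VALID = "valid"
    RESERVED = "reserved"
    DIRTY = "dirty"


class Bus:
    """Hypothetical shared bus with main memory attached (illustrative only)."""

    def __init__(self):
        self.memory = {}
        self.caches = []
        self.writes = 0    # write-through writes placed on the bus
        self.flushes = 0   # write-backs of dirty data to main memory

    def read(self, source, addr):
        # Other caches see the read first, so that a cache holding a dirty
        # copy can update main memory before main memory responds.
        for cache in self.caches:
            if cache is not source:
                cache.snoop_read(addr)
        return self.memory.get(addr, 0)

    def write(self, source, addr, value):
        self.writes += 1
        self.memory[addr] = value
        for cache in self.caches:
            if cache is not source:
                cache.snoop_write(addr)

    def flush(self, addr, value):
        self.flushes += 1
        self.memory[addr] = value


class WriteOnceCache:
    def __init__(self, bus):
        self.bus = bus
        self.state = {}    # address -> State
        self.data = {}     # address -> cached value
        bus.caches.append(self)

    def state_of(self, addr):
        return self.state.get(addr, State.INVALID)

    def read(self, addr):
        if self.state_of(addr) is State.INVALID:
            self.data[addr] = self.bus.read(self, addr)
            self.state[addr] = State.VALID
        return self.data[addr]

    def write(self, addr, value):
        current = self.state_of(addr)
        self.data[addr] = value
        if current in (State.INVALID, State.VALID):
            # First write: pass it through to the bus, invalidating others.
            self.bus.write(self, addr, value)
            self.state[addr] = State.RESERVED
        else:
            # Later writes stay local; this cache has the only correct copy.
            self.state[addr] = State.DIRTY

    def snoop_write(self, addr):
        if addr in self.state:
            self.state[addr] = State.INVALID

    def snoop_read(self, addr):
        current = self.state_of(addr)
        if current is State.DIRTY:
            self.bus.flush(addr, self.data[addr])
        if current in (State.RESERVED, State.DIRTY):
            self.state[addr] = State.VALID   # copy is no longer exclusive
```

In this sketch a run of writes to one location by a single processor places only the first write on the bus; main memory is brought up to date later, and only if another cache reads the location while it is dirty.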
There are several known systems for updating main memory before main memory responds to the read request of the other caches which ensure that the read request which triggered that update receives the most recent copy of the memory location. These systems generally require the cache containing the modified copy to prevent the memory unit from completing the read transaction until the modified copy has been returned. Another possibility is to issue a special bus request which supersedes the reply from memory. The Goodman system can also be modified to avoid the initial write-through if it is known when the data is first accessed that no other copies exist. There are some race conditions involving simultaneous modification of non-overlapping subpieces of a block by different processors and implementation of read-modify-write locks, but these are solvable with careful design.
Another known system which avoids the problem of sneaking back dirty data to main memory adds a "checked out" bit to each block in main memory. When a private cache discovers it needs exclusive access to a location, it reads the location from main memory with an exclusive access read which sets the checked out bit and transfers ownership of the location to the private cache. Other caches which have copies of the data will invalidate those copies upon sensing the exclusive read on the bus. Thus, the scheme substitutes a second, special read for the initial write-through of the Goodman system. While this scheme avoids the need for the initial write-through, the total amount of traffic provided to the bus by this scheme is comparable to that of the Goodman system. This scheme does provide straightforward solutions to the read-modify-write lock problems mentioned above.
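The exclusive access read of the checked out bit scheme may be sketched, for illustration only, as follows. The names `Bus`, `OwnershipCache`, `exclusive_read` and `snoop_exclusive_read` are hypothetical and each block is assumed to hold a single word. The text above does not describe how a checked out block is returned to main memory or handed to another cache, so the sketch deliberately leaves that path unmodeled rather than guessing at it.

```python
class Bus:
    """Hypothetical shared bus and main memory with a checked out bit per block."""

    def __init__(self):
        self.data = {}             # main memory contents
        self.checked_out = set()   # addresses whose checked out bit is set
        self.caches = []
        self.exclusive_reads = 0

    def read(self, addr):
        if addr in self.checked_out:
            # Not described in the text above, so not modeled here.
            raise NotImplementedError("return of a checked out block")
        return self.data.get(addr, 0)

    def exclusive_read(self, source, addr):
        if addr in self.checked_out:
            raise NotImplementedError("transfer between owners")
        self.exclusive_reads += 1
        self.checked_out.add(addr)           # set the checked out bit
        for cache in self.caches:            # other caches sense the read
            if cache is not source:
                cache.snoop_exclusive_read(addr)
        return self.data.get(addr, 0)


class OwnershipCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}     # address -> cached value
        self.owned = set()  # addresses this cache has checked out
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.bus.read(addr)
        return self.lines[addr]

    def write(self, addr, value):
        if addr not in self.owned:
            # One special read replaces the Goodman initial write-through.
            self.bus.exclusive_read(self, addr)
            self.owned.add(addr)
        self.lines[addr] = value             # later writes stay local

    def snoop_exclusive_read(self, addr):
        self.lines.pop(addr, None)           # invalidate our copy
```

In this sketch the first write to a location costs one exclusive read on the bus and subsequent writes cost nothing, which is in keeping with the observation that the bus traffic of this scheme is comparable to that of the Goodman system.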
In spite of the increased complexity, any of the write-deferred schemes outlined above will generally be preferable to the use of write-through caches. The bus traffic reduction provided by write-deferred caches makes such caches highly desirable for a multiprocessor shared bus system, even though bus saturation still limits the size of these computers.
It is therefore a principal object of the present invention to provide a caching system for a shared bus multiprocessor with multiple hierarchically organized shared buses which reliably maintains cache coherency.
Another object of the present invention is to provide a caching system which allows the partitioning of a multiprocessor into a plurality of multiprocessors utilizing multiple hierarchically organized shared buses.
A further object of the present invention is to provide a caching system for a multiprocessor with multiple hierarchically organized shared buses which requires a minimum use of the shared bus.
Yet another object of the present invention is to provide a caching system for a multiprocessor with multiple hierarchically organized shared buses which can utilize write-through or write-deferred caches.