Snoopy caching has proven itself to be useful in building practical small-scale multiprocessors, such as Sequent's Balance 8000 (Fielland et al. (January 1984) "32-bit computer system shares load equally among up to 12 processors," Electron. Design, 153-168) and DEC's Firefly. Current designs for large-scale multiprocessors, such as BBN's Butterfly, NYU's Ultracomputer, Thinking Machines' Connection Machine, and IBM's RP3, are generally more restrictive than snoopy-cache machines. For instance, they may require all machines to execute the same instruction at the same time, or require memory requests to be issued synchronously, or limit the performance of each processor to the switching delay of the network connecting the processors to memory. In each case, the restrictions imposed are due to the failure of the machine to present an efficient implementation of a shared-memory abstraction. Snoopy caching does, typically, present such an efficient implementation, because most programs exhibit sufficient locality of reference that local caches can satisfy most memory requests (Archibald et al. (November 1986) "Cache coherence protocols: evaluation using a multiprocessor simulation model," ACM Transactions on Computer Systems, 4:273-298; Frank (January 1984) "Tightly coupled multiprocessor system speeds memory access times," Electronics 57:164-169; Goodman (July 1983) "Using cache memory to reduce processor-memory traffic," Proc. 10th Annual IEEE Intl. Symposium on Computer Architecture, Stockholm, 124-131; Katz et al. (July 1985) "Implementing a cache consistency protocol," Proc. 12th Annual IEEE Intl. Symposium on Computer Architecture, 276-283; and Vernon et al. (May 1986) "Performance analysis of multiprocessor cache consistency protocols using generalized timed Petri nets," Proc. of Performance '86 ACM/SIGMETRICS Conf. on Computer Performance Modelling and Evaluation, 11th IFIP Working Group 7.3 Intl. Symposium, NCSU).
Snoopy caching has its problems for large-scale multiprocessing. Foremost among these is the restricted memory bandwidth available on a single bus.
Thus, a problem to be addressed by this invention is establishing techniques for enhancing the available bandwidth to memory, by extending snoopy caching to networks of buses. Snoopy caching on a tree of buses is quite simple; unfortunately, most bus cycles in a snoopy cache go towards servicing read misses from main memory. Since the only path to memory is at the root of the tree, congestion at the root limits the size of the system. Therefore, a problem addressed herein is to extend snoopy caching to hypercubes of buses.
Another problem with existing snoopy algorithms is that they often use the bus inefficiently. That is, the algorithm which maintains consistency uses many more bus cycles than are necessary. The problem here is to compare algorithms with the optimal algorithm, which knows the entire pattern of requests in advance. If this seems overly generous, the optimal algorithm can instead be thought of as one that knows only the past, guesses the next few operations, and happens to be right. By using the bus more efficiently, the overall load on the network of buses is reduced, allowing the addition of more processors.
Existing algorithms waste bus cycles in a number of ways. First, compare the exclusive-write protocol used in the Balance machine with optimal behavior. If a location that is actively shared is written to, the entire cache line is invalidated at low cost. But, since the location is actively shared, as each processor reads the location, the line must be read back into each cache at high cost. This will happen quite often, for example, in programs that have heavy contention for a block that is held briefly. This request sequence could be handled much more cheaply by updating the other caches on each write.
On the other hand, consider the pack-rat protocol used in the Firefly. If a location was shared at some point in the past, but is now active in only one processor, updates to that location must still be transmitted to other caches, until the other cache takes a collision on the cache line containing this location. In the limiting case, if the caches are as large as the virtual address space (not wholly improbable, since the caches serve primarily to reduce main memory contention), eventually every location is shared, and all updates are write-through. This happens as soon as a thread moves from one processor to another. For this request sequence, the optimal response is to invalidate the line in all other caches as soon as the first write comes along.
In contrast to these algorithms, techniques presented herein achieve results which are always within a small constant factor of optimal. Such algorithms are called competitive. In order to be competitive, an algorithm must adapt to the situation so that, in the long run, it gets to the same state that the optimal algorithm uses. Unlike the optimal algorithm, it must hedge its bets, abandoning decisions when, and only when, those decisions become hopeless.
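The trade-off described above can be illustrated with a toy simulation. The following Python sketch is illustrative only and is not the technique presented herein. It assumes a single cache line, a hypothetical cost model in which a bus invalidation or single-word update costs one bus cycle and reading a whole line into a cache costs `line_cost` cycles, and a hypothetical counter-based "hybrid" policy that drops a remote copy once it has absorbed `line_cost` updates without a local access. The names `simulate`, `line_cost`, and the policy labels are assumptions made for the example.

```python
def simulate(requests, policy, num_caches=2, line_cost=4):
    """Count bus cycles spent on one cache line under a toy cost model.

    requests: list of (op, cache_id) pairs, op is "r" (read) or "w" (write).
    policy:   "invalidate" (exclusive-write style), "update" (pack-rat style),
              or "hybrid" (hypothetical counter-based compromise).
    """
    holders = set(range(num_caches))           # assumption: line starts shared by all
    stale = {c: 0 for c in range(num_caches)}  # updates absorbed since last local access
    cycles = 0
    for op, c in requests:
        if c not in holders:
            cycles += line_cost                # miss: read the whole line over the bus
            holders.add(c)
        stale[c] = 0                           # a local access shows the copy is in use
        if op == "w":
            others = holders - {c}
            if not others:
                continue                       # exclusive copy: no bus traffic needed
            cycles += 1                        # one cycle to invalidate or update
            if policy == "invalidate":
                holders = {c}
            elif policy == "hybrid":
                for o in others:
                    stale[o] += 1
                    if stale[o] >= line_cost:  # copy looks abandoned: stop updating it
                        holders.discard(o)
            # policy == "update": remote copies are kept and updated indefinitely
    return cycles


# Actively shared location: cache 0 writes, cache 1 reads, ten times over.
shared = [("w", 0), ("r", 1)] * 10
# Migrated location: only cache 0 still uses the line.
migrated = [("w", 0)] * 20

for name, seq in (("shared", shared), ("migrated", migrated)):
    print(name, {p: simulate(seq, p) for p in ("invalidate", "update", "hybrid")})
# shared   -> invalidate 50, update 10, hybrid 10
# migrated -> invalidate 1,  update 20, hybrid 4
```

Under these assumptions, pure invalidation is cheapest on the migrated sequence but costly on the shared one, pure updating is the reverse, and the hybrid stays within a small factor of the better choice on both sequences, which is the behavior that the term competitive is meant to capture.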
Algorithms that use invalidation also waste bus cycles by ignoring the broadcast capabilities of a bus when reading back previously invalidated blocks. The algorithm presented herein takes advantage of these broadcast capabilities to respond more quickly to changes in usage patterns.