1. Field of the Invention
The present invention generally relates to computer systems having multiprocessor architectures and, more particularly, to a novel multi-processor computer system for processing memory accesses requests and the implementation of cache coherence in such multiprocessor systems.
2. Description of the Prior Art
To achieve high performance computing, multiple individual processors have been interconnected to form multiprocessor computer system capable of parallel processing. Multiple processors can be placed on a single chip, or several chips—each containing one or several processors—interconnected into a multiprocessor computer system.
Processors in a multiprocessor computer system use private cache memories because of their short access time (a cache is local to a processor and provides fast access to data) and to reduce number of memory requests to the main memory. However, managing caches in multiprocessor system is complex. Multiple private caches introduce the multi-cache coherency problem (or stale data problem) due to multiple copies of main memory data that can concurrently exist in the multiprocessor system.
Small scale shared memory multiprocessing system have processors (or groups thereof interconnected by a single bus. However, with the increasing speed of processors, the feasible number of processors which can share the bus effectively decreases.
The protocols that maintain the coherence between multiple processors are called cache coherence protocols. Cache coherence protocols track any sharing of data block between the processors. Depending upon how data sharing is tracked, cache coherence protocols can be grouped into two classes: 1) Directory based and 2) Snooping.
In directory based approach, the sharing status of a block of physical memory is kept in just one location called the coherency directory. Coherency directories are generally large blocks of memory which keep track of which processor in the multiprocessor computer system owns which lines of memory. Disadvantageously, coherency directories are typically large and slow. They can severely degrade overall system performance since they introduce additional latency for every memory access request by requiring that each access to the memory go through the common directory.
FIG. 1 illustrates a typical prior art multiprocessor system 10 using the coherence directory approach for cache coherency. The multiprocessor system 10 includes a number of processors 15a, . . . , 15d interconnected via a shared bus 24 to the main memory 20a, 20b via memory controllers 22a, 22b, respectively. Each processor 15a, . . . , 15d has its own private cache 17a, . . . , 17d, respectively, which is N-way set associative. Each request to the memory from a processor is placed on the processor bus 24 and directed to the coherency directory 26. Frequently, in the coherency controller, a module is contained which tracks the location of cache lines held in particular subsystems to eliminated the need to broadcast unneeded snoop request to all caching agents. This unit is frequently labeled “snoop controller” or “snoop filter”. All memory access requests from the I/O subsystem 28 are also directed to the coherency controller 26. Instead of the main memory, secondary cache connected to the main memory can be used. Processors can be grouped into processor clusters, where each cluster has its own cluster bus, which is then connected to the coherency controller 26. As each memory request goes through the coherence directory, additional cycles are added to each request for checking the status of the requested memory block.
In a snooping approach, no centralized state is kept, but rather each cache keeps the sharing status of data block locally. The caches are usually on a shared memory bus, and all cache controllers snoop (monitor) the bus to determine whether they have a copy of the data block requested. A commonly used snooping method is the “write-invalidate” protocol. In this protocol, a processor ensures that it has exclusive access to data before it writes that data. On each write, all other copies of the data in all other caches are invalidated. If two or more processors attempt to write the same data simultaneously, only one of them wins the race, causing the other processors' copies to be invalidated.
To perform a write in a write-invalidate protocol based system, a processor acquires the shared bus, and broadcasts the address to be invalidated on the bus. All processors snoop on the bus, and check to see if the data is in their cache. If so, these data are invalidated. Thus, use of the shared bus enforces write serialization.
Disadvantageously, every bus transaction in the snooping approach has to check the cache address tags, which could interfere with CPU cache accesses. In most recent architectures, this is typically reduced by duplicating the address tags, so that the CPU and the snooping requests may proceed in parallel. An alternative approach is to employ a multilevel cache with inclusion, so that every entry in the primary cache is duplicated in the lower level cache. Then, snoop activity is performed at the secondary level cache and does not interfere with the CPU activity.
FIG. 2 illustrates a typical prior art multiprocessor system 50 using the snooping approach for cache coherency. The multiprocessor system 50 contains number of processors 52a, . . . , 52c interconnected via a shared bus 56 to the main memory 58. Each processor 52a, . . . , 52c has its own private cache 54a, . . . , 54c which is N-way set associative. Each write request to the memory from a processor is placed on the processor bus 56. All processors snoop on the bus, and check their caches to see if the address written to is also located in their caches. If so, the data corresponding to this address are invalidated. Several multiprocessor systems add a module locally to each processor to track if a cache line to be invalidated is held in the particular cache, thus effectively reducing the local snooping activity. This unit is frequently labeled “snoop filter”. Instead of the main memory, secondary cache connected to the main memory can be used.
With the increasing number of processors on a bus, snooping activity increases as well. Unnecessary snoop requests to a cache can degrade processor performance, and each snoop requests accessing the cache directory consumes power. In addition, duplicating the cache directory for every processor to support snooping activity significantly increases the size of the clip. This is especially important for systems on a single chip with a limited power budget.
What now follows is a description of prior art references that address the various problems of conventional snooping approaches found in multiprocessor systems.
Particularly, U.S. Patent Application US2003/0135696A1 and U.S. Pat. No. 6,704,845B2 both describe replacement policy methods for replacing entries in the snoop filter for a coherence directory based approach including a snoop filter. The snoop filter contains information on cached memory blocks—where the cache line is cached and its status. The U.S. Patent Application US2004/0003184A1 describes a snoop filter containing sub-snoop filters for recording even and odd address lines which record local cache lines accessed by remote nodes (sub-filters use same filtering approach). Each of these disclosures do not teach or suggest a system and method for locally reducing the number of snoop requests presented to each cache in a multiprocessor system. Nor do they teach or suggest coupling several snoop filters with various filtering methods, nor do they teach or suggest providing point-to-point interconnection of snooping information to caches.
U.S. Patent Applications US2003/0070016A1 and US2003/0065843A1 describe a multi-processor system with a central coherency directory containing a snoop filter. The snoop filter described in these applications reduces the number of cycles to process a snoop request, however, does not reduce the number of snoop requests presented to a cache.
U.S. Pat. No. 5,966,729 describes a multi-processor system sharing a bus using a snooping approach for cache coherence and a snoop filter associated locally to each processor group. To reduce snooping activity, a list of remote processor groups “interested” and “not-interested” in particular cache line is kept. Snoop requests are forwarded only to the processor groups marked as “interested” thus reducing the number of broadcasted snoop requests. It does not describe how to reduce the number of snoop requests to a local processor, but rather how to reduce the number of snoop requests sent to other processor groups marked as “not interested”. This solution requires keeping a list with information on interested groups for each line in the cache for a processor group, which is comparable in size to duplicating the cache directories of each processor in the processor group thus significantly increasing the size of chip.
U.S. Pat. No. 6,389,517B1 describes a method for snooping cache coherence to allow for concurrent access on the cache from both the processor and the snoop accesses having two access queues. The embodiment disclosed is directed to a shared bus configuration. It does not describe a method for reducing the number of snoop requests presented to the cache.
U.S. Pat. No. 5,572,701 describes a bus-based snoop method for reducing the interference of a low speed bus to a high speed bus and processor. The snoop bus control unit buffers addresses and data from the low speed bus until the processor releases the high speed bus. Then it transfers data and invalidates the corresponding lines in the cache. This disclosure does not describe a multiprocessor system where all components communicate via a high-speed bus.
A. Moshovos, G. Memik, B. Falsafi and A. Choudhary, in a reference entitled “JETTY: filtering snoops for reduced energy consumption in SMP servers” (“Jetty”) describe several proposals for reducing snoop requests using hardware filter. It describes the multiprocessor system where snoop requests are distributed via a shared system bus. To reduce the number of snoop requests presented to a processor, one or several various snoop filters are used.
However, the system described in Jetty has significant limitations as to performance, supported system and more specifically interconnect architectures, and lack of support for multiporting. More specifically, the approach described in Jetty is based on a shared system bus which established a common event ordering across the system. While such global time ordering is desirable to simplify the filter architecture, it limited the possible system configurations to those with a single shared bus. Alas, shared bus systems are known to be limited in scalability due to contention to the single global resource. In addition, global buses tend to be slow, due to the high load of multiple components attached to them, and inefficient to place in chip multiprocessors.
Thus, in a highly optimized high-bandwidth system, it is desirable to provide alternate system architectures, such as star, or point-to-point implementations. These are advantageous, as they only have a single sender and transmitter, reducing the load, allowing the use of high speed protocols, and simplifying floor planning in chip multiprocessors. Using point to point protocols also allows to have several transmissions in-progress simultaneously, thereby increasing the data transfer parallelism and overall data throughput.
Other limitations of Jetty include the inability to perform snoop filtering on several requests simultaneously, as in Jetty, simultaneous snoop requests from several processors have to be serialized by the system bus. Allowing the processing of several snoop requests concurrently would provide a significant increase in the number of requests which can be handled at any one time, and thus increase overall system performance.
Having set forth the limitations of the prior art, it is clear that what is required is a system incorporating snoop filters to increase overall performance and power efficiency without limiting the system design options, and more specifically, methods and apparatus to support snoop filtering in systems not requiring a common bus.
Furthermore, there is a need for a snoop filter architecture supporting systems using point-to-point connections to allow the implementation of high performance systems using snoop filtering.
There is a further need for the simultaneous operation of multiple snoop filter units to concurrently filter requests from multiple memory writers to increase system performance.
There is further a need to provide novel, high performance snoop filters which can be implemented in a pipelined fashion to enable high system clock speeds in systems utilizing such snoop filters.
There is an additional need for snoop filters with high filtering efficiency transcending the limitations of prior art.