The present invention relates to multiprocessor systems, and more particularly to maintaining a multitude of independent coherence domains in a multiprocessor system.
Advances in semiconductor fabrication technology have given rise to considerable increases in microprocessor clock speeds. Although the same advances have also resulted in improvements in memory density and access times, the disparity between microprocessor clock speeds and memory access times persists. To reduce latency, one or more levels of high-speed cache memory are often used to hold a subset of the data or instructions that are stored in the main memory. A number of techniques have been developed to increase the likelihood that the data/instructions held in the cache are repeatedly used by the microprocessor.
To improve performance at any given operating frequency, microprocessors with a multitude of processing cores that execute instructions in parallel have been developed. The processing cores (hereinafter alternatively referred to as cores) may be integrated within the same semiconductor die, or may be formed on different semiconductor dies coupled to one another within a package, or a combination of the two. Each core typically includes its own level-1 cache and an optional level-2 cache.
A cache coherency protocol governs the traffic flow between the memory and the caches associated with the cores to ensure coherency between them. For example, the cache coherency protocol ensures that if a copy of a data item is modified in one of the caches, copies of the same data item stored in other caches and in the main memory are updated or invalidated in accordance with the modification.
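By way of illustration only, the invalidation behavior described above may be sketched as follows. The sketch assumes a simple write-through, write-invalidate scheme; the class and method names (`Cache`, `CoherentSystem`, `read`, `write`) are hypothetical and are not taken from any particular protocol or from the figures.

```python
# Illustrative write-invalidate sketch. All names are hypothetical, and the
# write-through policy is an assumption made to keep the example short.

class Cache:
    def __init__(self):
        self.lines = {}  # address -> cached value


class CoherentSystem:
    def __init__(self, num_caches):
        self.memory = {}  # main memory: address -> value
        self.caches = [Cache() for _ in range(num_caches)]

    def read(self, cache_id, addr):
        """Return the value at addr, filling the cache from memory on a miss."""
        cache = self.caches[cache_id]
        if addr not in cache.lines:
            cache.lines[addr] = self.memory.get(addr, 0)
        return cache.lines[addr]

    def write(self, cache_id, addr, value):
        """Modify one copy, invalidate all other cached copies, update memory."""
        for i, cache in enumerate(self.caches):
            if i != cache_id:
                cache.lines.pop(addr, None)  # invalidate a stale copy, if any
        self.caches[cache_id].lines[addr] = value
        self.memory[addr] = value  # write-through keeps main memory current


system = CoherentSystem(num_caches=2)
system.memory[0x40] = 7
system.read(0, 0x40)      # both caches now hold a copy of address 0x40
system.read(1, 0x40)
system.write(0, 0x40, 9)  # cache 1's copy is invalidated by this write
```

After the write, a subsequent read through the second cache misses and refetches the modified value from main memory, so no cache returns the stale copy.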
FIG. 1 is a block diagram of a microprocessor 20 (hereinafter alternatively referred to as a processor) having four independent cores 10_1, 10_2, 10_3 and 10_4, as known in the prior art. Each core 10_i is shown as including a level-1 (L1) cache 15_i, where i is an integer varying from 1 to 4. Assume that data requested by core 10_1 is not present in its L1 cache 15_1. To gain access to this data, core 10_1 issues a read request. Arbiter 25 receives and serializes all such read requests and transfers one of the serialized requests to cores 10_1, 10_2, 10_3 and 10_4 during each clock cycle. If the data associated with the read request made by core 10_1 is detected as being present in any of the L1 caches, the requested data is transferred to cache 15_1 from the L1 cache in which the data resides. If none of the L1 caches contains the requested data, a copy of this data is retrieved from main memory 35 via system bus 30 and subsequently stored in cache 15_1.
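The read flow just described may be modeled, purely as a sketch, along the following lines. The names `Core`, `snoop` and `read`, and the per-core lookup counter, are hypothetical additions for illustration; the arbiter's serialization and clock-cycle timing are not modeled.

```python
# Illustrative model of a broadcast read request. All names are hypothetical;
# the lookup counter is an assumption added to make the search cost visible.

class Core:
    def __init__(self, name):
        self.name = name
        self.l1 = {}      # L1 cache: address -> data
        self.lookups = 0  # number of snoop lookups this core has performed

    def snoop(self, addr):
        """Search this core's L1 cache for addr; return the data or None."""
        self.lookups += 1
        return self.l1.get(addr)


def read(requester, cores, main_memory, addr):
    """Resolve a read for the requesting core; return (data, source)."""
    if addr in requester.l1:
        return requester.l1[addr], "local hit"
    # The request is sent to every core, so every L1 cache is searched.
    responses = [(core, core.snoop(addr)) for core in cores]
    for core, data in responses:
        if core is not requester and data is not None:
            requester.l1[addr] = data  # cache-to-cache transfer
            return data, "from " + core.name
    data = main_memory[addr]           # no L1 cache holds the data
    requester.l1[addr] = data
    return data, "from main memory"


cores = [Core("core" + str(i)) for i in range(1, 5)]
main_memory = {0x100: 42, 0x200: 5}
cores[2].l1[0x100] = 42  # only the third core holds address 0x100

first = read(cores[0], cores, main_memory, 0x100)
second = read(cores[0], cores, main_memory, 0x200)
```

In this sketch each miss in the requesting core's cache increments the lookup counter of all four cores, whether or not any of them holds the data.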
As described above, a read request issued by arbiter 25 causes all caches to search for the requested data, thereby consuming power. Furthermore, since the responses from all the cores must be received to determine whether another cache contains the requested data, a core that is engaged in executing other instructions and is unable to process the requested operation slows down the process and increases the time it takes to complete the read request.
As the number of cores in a microprocessor increases, the coherence traffic may become increasingly heavy, resulting in even more power consumption and system bus bandwidth use. The problem is compounded when traffic requests from multiple I/O devices (I/Os) are also sent to the caches.