In a multi-core coherent system, multiple CPUs and other system components share the same memory resources, such as on-chip and off-chip RAMs. Ideally, if all components had the same cache structure and accessed shared resources through cache transactions, all accesses would be identical throughout the entire system and aligned with cache block boundaries. Usually, however, some components have no caches, or different components have different cache block sizes. In such a heterogeneous system, accesses to the shared resources can have different attributes, types and sizes. The shared resources themselves may also differ in banking structure, access size, access latency and physical location on the chip.
To maintain data coherency, a coherence interconnect is usually added between the master components and the shared resources. It arbitrates among multiple masters' requests and guarantees data consistency for each resource slave when data blocks are modified. With various accesses from different components to different slaves, the interconnect usually handles the accesses serially to guarantee atomicity and to meet the slaves' access requirements. This makes the interconnect the access bottleneck in a multi-core, multi-slave coherence system.
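The bottleneck effect of serial handling can be illustrated with a minimal sketch. The function name `serialize`, the request tuples and the cycle counts are hypothetical and chosen only for illustration; they do not describe any particular interconnect.

```python
def serialize(requests):
    """Model an interconnect that serves one access at a time.

    requests: list of (master, slave, service_cycles) tuples, already in
    arbitration order. All names and cycle counts are illustrative assumptions.
    Returns a list of (master, slave, completion_cycle) tuples.
    """
    now = 0
    completed = []
    for master, slave, service_cycles in requests:
        # Each access starts only after the previous one has finished,
        # even when it targets a different slave.
        now += service_cycles
        completed.append((master, slave, now))
    return completed


result = serialize([
    ("cpu0", "ram0", 4),
    ("cpu1", "ram1", 4),  # different slave, but still waits behind cpu0
    ("dma", "ram0", 2),
])
```

In this model the access from `cpu1` completes at cycle 8 although its target slave was idle for the first four cycles, which is the kind of lost throughput the paragraph above refers to.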
To reduce CPU cache miss stall overhead, cache components can issue cache allocate accesses that request the lower-level memory hierarchy to return the “critical line first” to un-stall the CPU, followed by the non-critical line to finish the line fill. In a shared memory system, serving one CPU's “critical line first” request can extend another CPU's stall overhead and reduce shared memory throughput if memory access types and sizes are not considered. The problem to solve, therefore, is how to serve memory accesses from multiple system components so as to provide low overall CPU stall overhead and guarantee maximum memory throughput.
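The “critical line first” return order can be sketched as follows. This is a minimal illustration, not a description of any specific hardware: the function name `fill_order`, the 128-byte block size, the 64-byte line size and the wrap-around order for blocks with more than two lines are all assumptions.

```python
def fill_order(miss_addr, block_size=128, line_size=64):
    """Return the base addresses of the lines in one cache block,
    ordered so that the line containing miss_addr comes first.

    block_size and line_size are illustrative assumptions. For blocks
    with more than two lines, a wrap-around order is assumed.
    """
    if block_size % line_size != 0:
        raise ValueError("block_size must be a multiple of line_size")
    block_base = miss_addr - (miss_addr % block_size)
    lines_per_block = block_size // line_size
    # Index of the line that holds the address the CPU is stalled on.
    critical_index = (miss_addr - block_base) // line_size
    return [
        block_base + ((critical_index + i) % lines_per_block) * line_size
        for i in range(lines_per_block)
    ]


# A miss in the upper half of the block at 0x1000: the upper line is
# returned first, then the lower line completes the fill.
upper_miss = fill_order(0x1048)
lower_miss = fill_order(0x1008)
```

With these assumed sizes, a miss at `0x1048` yields the line at `0x1040` first and the line at `0x1000` second, while a miss at `0x1008` yields the lines in their natural order.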
Because of the increased number of shared components and the expanded shareable memory space, it is a challenge to support data consistency and reduce memory access latency for all cores while maintaining maximum shared memory bandwidth and throughput. Speculative memory access is one of the performance optimization methods adopted in hardware design.