The present invention relates generally to multi-core chips having a parent core and a scout core, and more specifically, to prefetching for a parent core in a multi-core chip.
Single-thread processor performance growth has been limited by the power required to achieve it. Doubling the power consumed by a processor through increased frequency and/or functional features does not necessarily yield a performance gain greater than or equal to the increase in power, because the ratio of performance gain to power gain is significantly skewed. To provide chip performance growth, significant portions of the power budget may instead be devoted to placing additional cores on a chip. While cache and memory sharing prevents the performance increase from being equal to the ratio increase in the number of cores, increasing the core count on the chip may yield a greater performance/watt gain than solely improving the performance of a single core.
In one approach to enhance single-thread performance, a secondary core on the same chip as a primary or parent core may be leveraged as a scout core. Specifically, the scout core may be used to prefetch data from a shared cache into the parent core's private cache. This approach may be especially useful in the event the parent core encounters a cache miss. A cache miss occurs when a request for a particular line of data causes a search of a directory of the parent core, and the requested cache line is not present. One typical approach to obtain the missing cache line is to initiate a fetch operation to a higher level of cache. The scout core provides a mechanism that is used to prefetch data needed by the parent core.
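By way of illustration only, the directory search, the fetch to a higher level of cache on a miss, and a prefetch into the private cache may be sketched as follows. This is a minimal software model under stated assumptions, not a description of any particular hardware: the names `PrivateCache`, `install`, and `read`, the 256-byte line size, and the use of a dictionary to stand in for the shared cache are all hypothetical, and capacity limits and eviction are not modeled.

```python
from typing import Dict

LINE_SIZE = 256  # assumed cache line size in bytes (illustrative only)


class PrivateCache:
    """Toy model of a parent core's private cache and its directory."""

    def __init__(self, shared: Dict[int, bytes]) -> None:
        self.shared = shared                   # stands in for the shared, higher-level cache
        self.directory: Dict[int, bytes] = {}  # line number -> line contents
        self.misses = 0                        # count of demand misses

    def install(self, line: int) -> None:
        """Copy one line from the shared cache into the private cache.

        Used both for demand fetches and for prefetches issued on
        behalf of the parent core (for example, by a scout core).
        """
        self.directory[line] = self.shared[line]

    def read(self, address: int) -> bytes:
        """Search the directory; on a miss, fetch from the higher level."""
        line = address // LINE_SIZE
        if line not in self.directory:
            self.misses += 1       # the requested line is not present
            self.install(line)     # fetch operation to the higher-level cache
        return self.directory[line]
```

In this sketch, a line installed ahead of time through `install` is found by a later `read` without incrementing `misses`, which is the effect a prefetch is intended to have.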
It should be noted that various programs behave differently, and as a result one prefetching algorithm or approach may not always improve the latency of accessing cache content. In one approach to prefetch data to the parent core, a relatively small and simple algorithm, such as a stride engine, may be provided to speculatively prefetch data based on a stride observed between successive cache misses. Covering more complex patterns requires additional hardware, which can be more complex, larger in physical size, and higher in power consumption. However, per the chip trade-offs for balancing throughput, latency, and wattage, the amount of dedicated hardware that is available for prefetching may be limited. Moreover, the amount of area and storage needed to monitor and detect cache misses may be too large to implement solely in hardware.
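As a non-limiting example, a stride engine of the kind described above may be sketched in software as follows. The sketch assumes a single tracked miss stream, a 256-byte line size, and a policy of issuing a prefetch once the same non-zero stride has been observed a configurable number of times in a row. The class name `StrideEngine` and the parameters `line_size`, `confirm`, and `degree` are hypothetical and are chosen only for illustration.

```python
from typing import List, Optional


class StrideEngine:
    """Toy stride detector driven by successive cache-miss addresses."""

    def __init__(self, line_size: int = 256, confirm: int = 2, degree: int = 1) -> None:
        self.line_size = line_size   # assumed cache line size in bytes
        self.confirm = confirm       # repeats of a stride needed before prefetching
        self.degree = degree         # number of lines to prefetch ahead
        self.last_line: Optional[int] = None
        self.stride = 0              # most recently observed stride, in lines
        self.count = 0               # consecutive observations of that stride

    def on_miss(self, address: int) -> List[int]:
        """Record a miss and return the addresses to prefetch, if any."""
        line = address // self.line_size
        prefetches: List[int] = []
        if self.last_line is not None:
            stride = line - self.last_line
            if stride != 0 and stride == self.stride:
                self.count += 1      # the same stride was seen again
            else:
                self.stride = stride # a new pattern: restart the count
                self.count = 1 if stride != 0 else 0
            if self.count >= self.confirm:
                for k in range(1, self.degree + 1):
                    target = line + self.stride * k
                    if target >= 0:  # skip targets below address zero
                        prefetches.append(target * self.line_size)
        self.last_line = line
        return prefetches
```

With the default parameters, misses at addresses 0, 256, and 512 establish a stride of one line, and the third miss returns a prefetch for address 768. A miss that breaks the pattern resets the count, so no prefetch is issued until a stride repeats again. Detecting patterns more complex than a constant stride would require additional state beyond this sketch.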