1. Field of the Invention
This invention relates to the field of data processing systems. More particularly, this invention relates to multiple thread instruction fetching from different cache levels of a cache memory hierarchy in a data processing apparatus.
2. Description of the Prior Art
In a system where processing circuitry executes multiple program threads (for example, a simultaneous multi-threaded (SMT) processor in which multiple threads are executed on a single processor core), performance is generally improved by providing increased instruction fetch bandwidth to supply multiple instructions to the multiple active threads executing in the processing circuitry. Typically, the fetch engine of an SMT processor is shared among the threads in a round-robin manner so that each thread receives an equal share of the instruction fetch bandwidth. This is considered to be a fair scheme when all threads have the same priority.
In a multi-threaded system that is designed for realtime applications, it is common for one thread to have higher priority than the other threads. In such a realtime multi-threaded system, a round-robin policy is unlikely to be the best way of implementing a multiple thread fetch mechanism, since it is likely to degrade the performance of the high priority (HP) thread by lengthening its execution time. Assuming a single HP thread, one simple solution is to assign the full instruction fetch bandwidth to the HP thread at every cycle, so that the low priority (LP) threads can only fetch when the HP thread stalls. However, although this policy keeps the execution time of the HP thread at a minimum, its effect on the performance of the LP threads is undesirable, since they may only infrequently fetch and retire instructions. Consequently, the contribution of the LP threads to the overall instruction throughput of the realtime multi-threaded system may be significantly reduced. In practice the overall instruction throughput may not be significantly different from that of a single threaded processor, since it will be predominantly determined by the HP thread alone.
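By way of illustration only, the two fetch policies discussed above can be compared with a minimal sketch. The sketch assumes a single fetch slot per cycle and assumes that, under the HP-first policy, any LP threads take turns on the cycles in which the HP thread is stalled; neither assumption is taken from a particular prior art design, and the function names are hypothetical.

```python
def round_robin_fetch(num_threads, cycles):
    """Count fetch slots per thread when one slot per cycle is handed
    to the threads in strict rotation (assumed single-slot fetch engine)."""
    slots = [0] * num_threads
    for cycle_index in range(cycles):
        slots[cycle_index % num_threads] += 1
    return slots


def hp_first_fetch(hp_stalled, num_lp_threads):
    """Count fetch slots when the HP thread owns every cycle it can use.

    hp_stalled is a list with one boolean per cycle, True when the HP
    thread is stalled. LP threads fetch only on those cycles and, as an
    assumption of this sketch, share them in rotation.
    """
    hp_slots = 0
    lp_slots = [0] * num_lp_threads
    next_lp = 0
    for stalled in hp_stalled:
        if not stalled:
            hp_slots += 1
        elif num_lp_threads > 0:
            lp_slots[next_lp] += 1
            next_lp = (next_lp + 1) % num_lp_threads
    return hp_slots, lp_slots


if __name__ == "__main__":
    # Ten cycles, two threads of equal priority.
    print(round_robin_fetch(2, 10))
    # Ten cycles in which the HP thread stalls on the last two.
    print(hp_first_fetch([False] * 8 + [True] * 2, 1))
```

In this example the round-robin policy gives each of two threads half of the slots, whereas the HP-first policy leaves a single LP thread with only the two stall cycles, which is the throughput penalty described above.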
Hence it would be preferable to provide a multi-threaded system in which both HP and LP threads can fetch simultaneously, so that on the one hand the HP thread is not delayed by the fetching activities of the LP thread and on the other hand the LP thread can generate a greater instruction throughput by fetching in parallel with the HP thread. One solution would be to replicate the instruction cache for each thread to avoid the continual overwriting in a single instruction cache by competing threads (known as “thrashing”), but this option is not cost-effective. Alternatively, making the instruction cache multi-ported would allow each thread to fetch independently; however, multi-ported caches are known to be very expensive and energy hungry. A further alternative solution would be to partition the instruction cache into several banks such that the HP and LP threads can fetch simultaneously. However, this can have a negative effect on the cache access time, since when two requests try to access the same bank the conflict must be arbitrated before access to the instruction cache is given. Since instruction cache access time is usually critical for processor performance, designers usually avoid lengthy instruction cache access times. Hence, replicated, multi-ported and banked instruction cache designs are all unlikely to be desirable solutions.
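The bank conflict mentioned above can be illustrated with a short sketch. It assumes a line-interleaved mapping of addresses to banks, a 32-byte cache line, and an arbiter that always grants the HP thread on a conflict; these are illustrative assumptions rather than features of any particular banked cache, and the function names are hypothetical.

```python
def bank_of(address, num_banks, line_bytes=32):
    """Map a fetch address to a bank (assumed line-interleaved mapping)."""
    return (address // line_bytes) % num_banks


def arbitrate(hp_address, lp_address, num_banks, line_bytes=32):
    """Return the threads granted instruction cache access in one cycle.

    When both requests map to the same bank, only one can be served;
    this sketch assumes the HP thread wins and the LP thread retries.
    """
    hp_bank = bank_of(hp_address, num_banks, line_bytes)
    lp_bank = bank_of(lp_address, num_banks, line_bytes)
    if hp_bank == lp_bank:
        return ["HP"]
    return ["HP", "LP"]


if __name__ == "__main__":
    # Different banks: both threads fetch in the same cycle.
    print(arbitrate(0x00, 0x20, num_banks=2))
    # Same bank: the LP request is held back by the arbiter.
    print(arbitrate(0x00, 0x40, num_banks=2))
```

The comparison of bank indices has to complete before either request is allowed to proceed, which is why arbitration sits on the instruction cache access path.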