Many data processing systems include processing circuitry that is arranged to execute multiple program threads (i.e. the processing circuitry supports more than one hardware thread context). The multiple program threads may comprise separate applications, or may instead comprise different processes within an individual application that are allowed to execute in parallel. Processing circuitry that can execute multiple program threads is often referred to as a multi-threaded processor, and such multi-threaded processors can take a variety of forms. One example is a simultaneous multi-threaded (SMT) processor, in which the processor can issue operations for multiple different threads at the same time.
It is becoming more commonplace for one or more of the program threads to be considered to be of a higher priority than other program threads. As an example, this will typically be the case in a real-time system where a high priority, real-time, program thread is required to complete operations within certain deadlines, whilst other program threads are lower priority, non-real-time, threads, which do not have any such deadlines and can progress under the shadow of the real-time program thread.
Ideally, progress of a high priority program thread should not be delayed by any lower priority program threads, but at the same time it is desirable for the lower priority program threads to be able to progress as much as they can so as to increase the overall processor throughput. However, in reality, this ideal situation is difficult to achieve because in a multi-threaded processor there are typically several resources shared between the various program threads being executed. In particular, there are typically a number of storage units that are shared between the multiple program threads and which comprise multiple entries for storing information for reference by the processing circuitry when executing those program threads. Examples of such storage units are one or more caches for storing instructions to be executed by the processing circuitry and/or data used by the processing circuitry when executing such instructions, and/or one or more translation lookaside buffers (TLBs) provided for reference by the processing circuitry when performing instruction fetch or data access operations in order to provide access control information relevant to that fetch or access operation. As yet another example of a storage unit that may be shared, branch prediction circuitry used to predict whether branch instructions are taken or not taken will typically include a number of storage structures referenced when making that prediction, for example a branch history buffer (BHB) which retains summary information about the direction a branch took the last few times it was executed, a branch target buffer (BTB) for storing target addresses of branch instructions, etc.
The sharing of such storage units introduces conflicts among different threads. For example, a cache line whose content belongs to a high priority program thread can have that content evicted from the cache when a lower priority program thread causes a linefill operation to be initiated in that cache. Clearly, such activities are not desirable for the high priority program thread, as they may increase its execution time: a subsequent access by the high priority program thread to the evicted content will miss in the cache, requiring a further linefill operation to be performed in order to access the required information from a lower, typically larger and slower, level of the memory hierarchy of which the cache is part.
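The eviction scenario described above can be illustrated with a minimal sketch. It assumes a toy direct-mapped cache with four lines of sixteen bytes each; the class name, the thread labels "HP" and "LP", and the addresses are all hypothetical and chosen only so that the two threads map to the same cache line. It is not a model of any particular processor.

```python
class SharedDirectMappedCache:
    """Toy direct-mapped cache shared by all threads (hypothetical model)."""

    def __init__(self, num_lines=4, line_size=16):
        self.num_lines = num_lines
        self.line_size = line_size
        # Each entry holds (tag, owner_thread), or None while the line is empty.
        self.lines = [None] * num_lines

    def access(self, thread, address):
        """Return True on a hit; on a miss, perform a linefill and return False."""
        block = address // self.line_size
        index = block % self.num_lines
        tag = block // self.num_lines
        entry = self.lines[index]
        if entry is not None and entry[0] == tag:
            return True
        # Miss: the linefill overwrites whatever occupied this line,
        # regardless of which thread owned the previous content.
        self.lines[index] = (tag, thread)
        return False


def demo():
    """Replay a short access sequence and return the hit/miss outcomes."""
    cache = SharedDirectMappedCache()
    return [
        cache.access("HP", 0x000),  # cold miss: HP fills line 0
        cache.access("HP", 0x000),  # hit: HP content is still resident
        cache.access("LP", 0x040),  # LP also maps to line 0 and evicts HP
        cache.access("HP", 0x000),  # miss: HP must perform another linefill
    ]
```

In this sketch the final high priority access misses solely because of the intervening low priority linefill, which is the interference the paragraph above describes.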
In addition to the potential performance degradation of a high priority program thread that can result from the sharing of such storage units, the total energy consumption can also be a significant concern, and the sharing of the storage units can give rise to various energy hungry activities. For example, the energy consumption can increase if a high priority program thread or a lower priority program thread is forced to access a lower level in the memory hierarchy more frequently because another program thread executing on the processing circuitry has interfered with the contents of a cache level higher in the memory hierarchy. This may happen, for example, when a particular executing program thread causes instruction or data cache lines belonging to another program thread to be evicted from a level one cache, or similarly causes eviction of data access information belonging to another executing program thread from entries of a data or instruction TLB. Such evicted information will typically be placed in a lower level of the memory hierarchy, for example in a level two cache. As a level two cache is typically larger than a level one cache, more energy is consumed in accessing the level two cache when compared with the level one cache, due for example to the larger address comparisons required to determine whether a requested piece of information is currently stored in the cache.
A similar situation can arise in the branch prediction circuitry, since collisions between the branch predictor entries stored, for example, in the BHB or BTB structures can reduce the branch prediction accuracy of all program threads. Each branch misprediction causes the processor to flush instructions from its execution pipeline, which in turn increases the overall system energy consumption, as the required instructions then need to be re-fetched and re-executed.
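The effect of such collisions can be sketched with a deliberately simplified predictor. The example below assumes a small shared table of two-bit saturating counters indexed by low-order bits of the branch address; this stands in for the BHB and BTB structures mentioned above and is not a description of either. The table size, the branch addresses and the function names are hypothetical, with the addresses chosen so that both branches alias to the same entry.

```python
class SharedCounterTable:
    """Toy shared table of 2-bit saturating counters (hypothetical model)."""

    def __init__(self, entries=8):
        self.entries = entries
        # Counter values 0..3; values of 2 or 3 predict "taken".
        self.counters = [1] * entries

    def predict_and_update(self, pc, taken):
        """Predict the branch at pc, train on the real outcome, return correctness."""
        index = (pc >> 2) % self.entries
        predicted_taken = self.counters[index] >= 2
        if taken:
            self.counters[index] = min(3, self.counters[index] + 1)
        else:
            self.counters[index] = max(0, self.counters[index] - 1)
        return predicted_taken == taken


def accuracy_alone(iterations=100):
    """High priority branch (always taken) runs with no other thread."""
    table = SharedCounterTable()
    correct = sum(table.predict_and_update(0x100, True) for _ in range(iterations))
    return correct / iterations


def accuracy_shared(iterations=100):
    """Same branch, interleaved with a never-taken branch that aliases to it."""
    table = SharedCounterTable()
    correct = 0
    for _ in range(iterations):
        correct += table.predict_and_update(0x100, True)  # high priority branch
        table.predict_and_update(0x120, False)            # low priority branch, same entry
    return correct / iterations
```

Run alone, the always-taken branch is mispredicted only on its first execution. When interleaved with the aliasing branch, the shared counter is pulled back below the "taken" threshold before each high priority prediction, so in this contrived worst case every high priority prediction is wrong.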
Accordingly, in summary, it will be appreciated that there are a number of potentially adverse effects that can be caused by a lower priority program thread being executed by the processing circuitry, where those adverse effects result from sharing of one or more storage units between the multiple program threads executed by the processing circuitry. For example, as discussed above, the activities of the lower priority program thread may be intrusive to a high priority program thread, and hence impact the performance of the high priority program thread, or alternatively the activities of the lower priority program thread may significantly increase energy consumption within the processing circuitry.
A number of studies have been performed in respect of multi-threaded systems seeking to execute a high priority program thread and at least one lower priority program thread, and the general focus of those studies has been to improve the overall throughput and/or performance of the high priority program thread by allocating shared resources dynamically amongst the program threads. For example, the paper “Transparent Threads: Resource Sharing in SMT Processors for High Single-Thread Performance” by G Dorai et al, Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, 2002, investigates a variety of resource allocation policies aimed at keeping the performance of the high priority program thread as high as possible while performing low priority operations along with the high priority operations required by the high priority program thread. The mechanisms described ensure that low priority program threads never take performance-critical resources away from the high priority program thread, whilst allocating to the low priority program threads those resources that do not contribute to the performance of the high priority program thread. Whilst such techniques can improve the performance of the high priority program thread when executed in such a multi-threaded environment, the low priority program threads have to wait for opportunities to use the various resources that they require, and this can significantly impact the overall processor throughput.
The article “Predictable Performance in SMT Processors” by F Cazorla et al, Proceedings of the First Conference on Computing Frontiers, April 2004, describes an SMT system in which collaboration between the operating system (OS) and the SMT hardware is used to enable the OS to enforce that a high priority program thread runs at a specific fraction of its full speed, i.e. the speed that it would run at if it were the only thread executing. The proposed mechanism has two phases that are executed alternately. During the first phase, referred to as the sample phase, all shared resources are given to the high priority thread, and any low priority threads are temporarily stopped. As a result, an estimate is obtained of the current full speed of the high priority thread during this phase, which is referred to as the local IPC (instructions per cycle). Then, during the second phase, referred to as the tune phase, the amount of resources given to the high priority thread is dynamically varied in order to achieve a target IPC, which is given by the local IPC computed in the last sample phase multiplied by a percentage provided by the operating system to identify the fraction of the full speed at which the OS requires the high priority thread to operate. During this tune phase, the high priority program thread runs along with the low priority program threads, with resources being dynamically allocated in favour of the high priority program thread. If the target IPC is reached before the end of the interval of the tune phase, more resources are allocated to the low priority threads.
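The target IPC calculation described above reduces to a single multiplication, sketched below together with one possible adjustment step for the tune phase. Only the relationship "target IPC = local IPC multiplied by the OS-supplied fraction" is taken from the description; the function names, the representation of the resource allocation as a single share between 0 and 1, and the fixed step size are assumptions made purely for illustration and are not drawn from the cited article.

```python
def target_ipc(local_ipc, os_fraction):
    """Target IPC for the tune phase: local IPC scaled by the OS-supplied fraction."""
    if not 0.0 < os_fraction <= 1.0:
        raise ValueError("os_fraction must be in the range (0, 1]")
    return local_ipc * os_fraction


def tune_step(measured_ipc, target, hp_share, step=0.1):
    """One hypothetical adjustment of the high priority thread's resource share.

    hp_share is an assumed abstraction: the fraction of shared resources
    currently granted to the high priority thread.
    """
    if measured_ipc < target:
        # Below target: shift resources towards the high priority thread.
        return min(1.0, hp_share + step)
    # Target reached: release resources to the low priority threads.
    return max(0.0, hp_share - step)
```

For example, with a local IPC of 2.0 measured in the sample phase and an OS-supplied fraction of 80%, the tune phase would aim for an IPC of about 1.6 for the high priority thread.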
This approach has a number of disadvantages. Firstly, it is necessary to compute the local IPC periodically, during which time none of the low priority program threads can execute. Further, the operating system scheduling mechanism needs to be modified so as to specify a desired target speed for the high priority program thread. Such an approach hence lacks flexibility, particularly in more complex systems where the number of program threads executing may vary over time.
Accordingly, it would be desirable to provide an improved technique for managing execution of multiple program threads in a multi-threaded environment so as to seek to alleviate adverse effects caused by a lower priority program thread and resulting from sharing of at least one storage unit between the multiple program threads being executed.