A thread, in the context of computer science, generally refers to a thread of execution. Threads are a way for a program to divide itself into two or more simultaneously (or near simultaneously) running tasks. On many computer systems, multiple threads can be executed in parallel, a capability often referred to as hardware multithreading. Hardware multithreading is an attractive technology for increasing microprocessor utilization. By interleaving operations from two or more independent threads, functional units can be utilized by other threads even while a particular thread is stalled waiting for a high-latency operation.
As described in Michael Gschwind, “Chip Multiprocessing and the Cell Broadband Engine,” ACM Computing Frontiers 2006, the disclosure of which is hereby incorporated by reference, multithreading, as a design feature, has become particularly attractive in recent years to tolerate the increasing latency of memory operations, to increase the number of parallel memory transactions, and to better utilize the available memory bandwidth offered by a microprocessor.
While hardware multithreading offers attractive aspects of increased memory-level parallelism, thread-level parallelism and better microprocessor utilization, among other benefits, care must be taken to ensure that the design of a multithreaded microprocessor does not degrade overall performance. In particular, additional design complexity can reduce clock frequency or, by introducing additional pipeline stages, increase pipeline latency.
An example of this tradeoff is the scheduling of threads for access to specific resources. On the one hand, full flexibility and dynamic scheduling decisions based on core utilization factors and thread readiness increase the ability to perform useful work. On the other hand, this flexibility increases the control overhead and puts scheduling logic in the critical path of each operation step in the microprocessor front-end.
In one design approach, at least a portion of the microprocessor, such as the microprocessor front-end responsible for fetching instructions, uses one of various static access schemes. In one static access scheme, threads are statically interleaved on alternating cycles. In yet other schemes, other static access patterns, e.g., also including thread priorities and so forth, can be provided. However, when using any statically determined threading scheme, access to resources can suffer when statically determined access patterns do not align with resource availability.
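By way of illustration only, the alternating-cycle scheme described above can be sketched as a simple round-robin slot assignment. The function names, the two-thread configuration and the readiness trace below are hypothetical and are not drawn from any particular microprocessor; the sketch merely shows how a statically assigned fetch slot goes unused when its owning thread cannot use it, even though another thread is ready.

```python
def static_owner(cycle, num_threads=2):
    """Thread that owns the fetch slot in a given cycle (round-robin).

    Hypothetical helper: models threads statically interleaved on
    alternating cycles.
    """
    return cycle % num_threads


def count_wasted_slots(ready, num_threads=2):
    """Count cycles whose statically assigned owner cannot use the slot.

    ready[c] is the set of thread ids able to fetch in cycle c.
    """
    wasted = 0
    for cycle, ready_threads in enumerate(ready):
        if static_owner(cycle, num_threads) not in ready_threads:
            wasted += 1
    return wasted


# Assumed trace: thread 0 is stalled during cycles 0-3, thread 1 is always ready.
ready_per_cycle = [{1}, {1}, {1}, {1}, {0, 1}, {0, 1}]
print(count_wasted_slots(ready_per_cycle))  # prints 2 (cycles 0 and 2)
```

In this assumed trace, thread 1 could have used the slots in cycles 0 and 2, but the static pattern reserves them for the stalled thread 0.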
To mitigate any potential performance degradation based on this limitation, some embodiments for instruction caches may support instruction cache bypass, wherein data being written into the instruction cache can also be simultaneously fetched by a thread. This is advantageous, as a thread having caused an instruction miss is typically idle until said data returns, and providing data corresponding to the address having previously caused an instruction miss will allow the stalled thread to continue fetching, decoding and executing instructions when its queues would otherwise have been drained.
However, when static thread scheduling for instruction fetch is combined with a restricted cache access and bypass architecture as described hereinabove, degradation can ensue when a thread is not scheduled, in accordance with the thread access policy, during the data return cycle. Such a thread misses its instruction fetch access opportunity to bypass the data returned in response to its cache miss. A thread having missed this bypass opportunity must then restart its access after the instruction cache writes have completed, instruction cache writes typically being of higher priority than instruction fetch accesses, and thereby suffers considerable performance degradation.
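The cost of a missed bypass opportunity can be illustrated with a minimal timing sketch. The model below is an assumption made purely for illustration: threads alternate fetch slots round-robin, the cache write occupies a fixed number of cycles beginning at the data return cycle, and instruction fetch is blocked for the duration of the write. The function name and the four-cycle write duration are hypothetical.

```python
def fetch_resume_cycle(thread, return_cycle, write_cycles=4, num_threads=2):
    """Cycle in which a stalled thread obtains its missed instructions.

    Assumed model: if the thread owns the fetch slot in the data return
    cycle, it bypasses the returning data immediately. Otherwise it must
    wait for the cache write to finish and then for its next owned slot.
    """
    if return_cycle % num_threads == thread:
        return return_cycle  # bypass succeeds in the return cycle

    # Bypass missed: the write occupies return_cycle .. return_cycle + write_cycles - 1.
    cycle = return_cycle + write_cycles
    while cycle % num_threads != thread:
        cycle += 1  # wait for this thread's statically assigned slot
    return cycle


print(fetch_resume_cycle(thread=0, return_cycle=6))  # prints 6 (bypass taken)
print(fetch_resume_cycle(thread=0, return_cycle=7))  # prints 12 (bypass missed)
```

Under these assumptions, data returning one cycle out of phase with the thread's slot delays the resumption of fetch by five cycles relative to the return cycle.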
In another aspect of instruction fetch, namely, instruction fetch from caching-inhibited storage, in accordance with the definition of architectures such as the state-of-the-art industry-standard Power Architecture, data from caching-inhibited accesses cannot be stored in, or retrieved from, the cache. Instead, caching-inhibited accesses must always use the bypass path, and hence cannot be reliably performed in the described environment.
Attempts have been made to address these performance issues in a variety of ways, including the use of dual-ported caches, the use of prefetch buffers, and/or the use of dynamic thread access policies. However, each of these conventional techniques suffers from significant problems and is therefore undesirable.
Dual-ported caches offer attractive properties in terms of independent operation of instruction cache reload and instruction fetch, but increase the area of instruction caches significantly. They also do not offer a solution for fetching from caching-inhibited storage, as such data must not be stored in the cache.
The use of prefetch buffers allows the completion of the memory subsystem response to a cache reload request to be decoupled from the actual committing of the data to the cache, by offering the ability to buffer several full cache lines and to defer their writeback to a suitable time with respect to a thread being scheduled. Typically, prefetch buffers also offer bypass capabilities from the prefetch buffer to the instruction fetch logic, without requiring concurrent operation of the cache. However, this design choice increases the cost in terms of area due to the size and number of the prefetch buffers, the extra wiring necessary to bypass the prefetch buffers in an area of great congestion around and above an instruction cache array, and the additional levels of multiplexing needed to select from one of a plurality of prefetch buffers, as well as between prefetch buffers and instruction cache.
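The decoupling provided by a prefetch buffer can be sketched behaviorally as follows. This is a simplified software model under stated assumptions, not a description of any actual hardware implementation: the class name, the two-line capacity and the use of a dictionary to stand in for the instruction cache are all hypothetical.

```python
from collections import OrderedDict


class PrefetchBuffer:
    """Holds returned cache lines until the cache can accept a write."""

    def __init__(self, capacity=2):
        self.capacity = capacity    # number of full cache lines buffered (assumed)
        self.lines = OrderedDict()  # line address -> line data, oldest first

    def fill(self, addr, data):
        """Accept a returned line; report False when the buffer is full."""
        if addr not in self.lines and len(self.lines) >= self.capacity:
            return False
        self.lines[addr] = data
        return True

    def fetch(self, addr):
        """Bypass path: serve a fetch from the buffer, or None if absent."""
        return self.lines.get(addr)

    def drain_one(self, cache):
        """Deferred writeback: move the oldest buffered line into the cache."""
        if self.lines:
            addr, data = self.lines.popitem(last=False)
            cache[addr] = data


icache = {}                      # stand-in for the instruction cache array
buf = PrefetchBuffer(capacity=2)
buf.fill(0x100, "line A")
print(buf.fetch(0x100))          # served from the buffer before any cache write
buf.drain_one(icache)            # writeback deferred to a convenient cycle
print(0x100 in icache)
```

Even in this simplified model, each fetch must select between the buffer entries and the cache, which corresponds to the additional multiplexing cost noted above.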
The use of a dynamic thread access pattern, as previously described, increases design complexity. Such increased design complexity, in turn, leads to increased design cost, longer timing paths and/or deeper pipelining, with the inherent degradation of architectural performance as expressed in CPI (cycles per instruction). In addition, the use of a dynamic thread access pattern increases both verification cost and design error susceptibility, and is therefore undesirable.
Accordingly, there exists a need for techniques for obtaining data in a manner which further increases microprocessor utilization and which does not suffer from one or more of the above-noted problems exhibited by conventional data fetching methodologies.