1. Field of the Invention
The invention described herein relates to efficient usage of processor resources and to reducing average thread latency.
2. Background Art
Consider a multi-threaded multi-program streaming processor where threads must complete in the order they are created and instructions must be loaded into a local instruction cache from a memory device. In order to reduce instruction latency, instruction misses, and thereby the total resources (memory, per-thread buffers, etc.) in use at any time by the set of existing threads, instructions for older threads are typically executed before instructions for newer threads. Before an instruction is executed for a new program, instruction data must be loaded into the instruction cache from the memory device. This is a high latency operation and multi-threaded processors will typically switch to another thread while this load occurs in order to achieve maximum use of processor computational resources. If instructions for older threads running an older program are scheduled before instructions for newer threads with a new program, then the waiting period caused by an instruction fetch may be deferred until operations, such as loading resources required for execution into cache(s), have been completed for these older threads. The result is that when processor resources become free the instructions that will use these resources for newer threads have not yet been loaded into the instruction cache (or constants into data caches, etc.), and the processor resources will go unused until the high-latency instruction fetch has completed. Instruction data is not the only data that a processor may have to wait for after a state transition; constant data shared by threads in a program may also need to be reloaded when the program state changes.
One commonly implemented method to avoid leaving processor resources unused during an instruction or data fetch is to pre-fetch instructions or data into a cache prior to execution. This often involves parsing in advance a program that is to be executed to determine which resources and instructions will be needed at a later time. However, doing so in a brute-force manner for every thread of a program generally requires significant additional hardware complexity and chip area.
In an exemplary scenario, a program running two instructions X and Y is shown in FIG. 1. For the sake of discussion and not as a limitation, it is assumed herein that instruction X needs resource A and instruction Y needs resource B. In a typical scenario, X, Y, A and B can remain the same between subsequent executions of the same thread (“thread 0”), though other inputs to thread 0 may vary between various time intervals, as is well known to those skilled in the art. Under normal circumstances, a fetch A operation for the first thread is performed, then instruction X is executed using A, and subsequently resource B is fetched and instruction Y is executed using B. Once resources A and B are loaded into the cache, they can be used by subsequent threads of the same or any other program, as and when necessary. However, in such a normal scenario, instruction Y cannot be executed immediately after instruction X has been executed, because resource B is not fetched and loaded into the cache until instruction X completes execution. A similar situation exists for subsequent threads (“thread 1” and “thread 2”), which can also start in a staggered fashion, as and when various inputs and resources arrive, as shown in FIG. 1.
In a conventional system, when the first thread 0 is started using a normal program, instructions X and Y and resources A and B are likely not to be present in a cache. Thus, if instructions X and Y are to be executed in order, resource A is first fetched when thread 0 is started. Subsequently, instruction X is executed using resource A. Similarly, resource B needed to execute instruction Y is fetched after instruction X completes execution, and so on for additional instructions. A similar procedure occurs for subsequent threads, threads 1 and 2. Therefore, in the scenario illustrated in FIG. 1, the processor resources go idle because resource B has not yet been loaded into the cache. This leads to undesirable latency.
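By way of illustration only, the conventional sequence described above (fetch A, execute X, fetch B, execute Y) can be sketched as a small timing model. The latency values, the function name run_thread, and the representation of a program as (instruction, resource) pairs are assumptions made for this sketch and do not appear in FIG. 1; the model also assumes a single thread with no other thread available to hide the fetch latency.

```python
# Assumed, illustrative latencies (in cycles); not taken from the specification.
FETCH_LATENCY = 10  # load one resource from the memory device into the cache
EXEC_LATENCY = 1    # execute one instruction once its resource is cached


def run_thread(program, cache, start=0):
    """Run one thread in order and return (finish_time, idle_cycles).

    program is a list of (instruction, resource) pairs. A resource is
    fetched only when the instruction that needs it is reached, so the
    processor sits idle for the full fetch latency on every cache miss.
    """
    now = start
    idle = 0
    for _instruction, resource in program:
        if resource not in cache:
            # Cache miss: nothing can execute until the fetch completes.
            now += FETCH_LATENCY
            idle += FETCH_LATENCY
            cache.add(resource)
        now += EXEC_LATENCY
    return now, idle


program = [("X", "A"), ("Y", "B")]  # X needs resource A, Y needs resource B
cache = set()                       # cold cache: neither A nor B is present

# First run: both A and B miss, so two full fetch latencies are spent idle.
cold_finish, cold_idle = run_thread(program, cache)

# Second run of the same program: A and B are already cached, so no idling.
warm_finish, warm_idle = run_thread(program, cache, start=cold_finish)

print("cold run:", cold_finish, "cycles total,", cold_idle, "idle")
print("warm run:", warm_finish - cold_finish, "cycles total,", warm_idle, "idle")
```

Under these assumed latencies, nearly all of the first run is spent waiting on fetches rather than executing, which is the idle time the foregoing paragraph attributes to resource B not yet being in the cache.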
Accordingly, there is a need for a method and system that allows for minimizing the amount of time a cache is idle. A desired solution would have to avoid the pitfalls of a pre-fetch scheme, while otherwise addressing the above-described latency problems in the caching of instructions and data.
Further embodiments, features, and advantages of the present invention, as well as the operation of the various embodiments of the present invention, are described below with reference to the accompanying drawings.