The present invention relates to microprocessors which execute multi-threaded programs, and in particular to the handling of blocking memory accesses (those which require the processor to wait) in such programs.
Many modern computers support "multi-tasking" in which two or more programs are run at the same time. An operating system controls the alternating between the programs, and a switch between the programs or between the operating system and one of the programs is called a "context switch."
Additionally, multi-tasking can be performed in a single program, and is typically referred to as "multi-threading." Multiple actions can be processed concurrently using multi-threading.
Most modern computers include at least a first-level, and typically a second-level, cache memory for storing frequently accessed data and instructions. With the use of multi-threading, multiple threads share the cache memory, and thus the data or instructions for one thread may overwrite those for another, increasing the probability of cache misses.
The cost of a cache miss, measured in the number of wasted processor cycles, is increasing. This is because processor speeds have increased at a higher rate than memory access speeds over the last several years, a trend expected to continue for the foreseeable future. Thus, more processor cycles are required for memory accesses, rather than fewer, as speeds increase. Accordingly, memory accesses are becoming a limiting factor on processor execution speed.
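The relationship between clock rate and miss cost described above can be illustrated with a short calculation. The following is a minimal sketch only; the function name `miss_penalty_cycles` and the latency and clock figures are hypothetical values chosen for illustration and are not taken from any particular processor.

```python
def miss_penalty_cycles(memory_latency_ns: float, clock_mhz: float) -> float:
    """Return the number of processor cycles that elapse during one memory access.

    One cycle lasts 1000 / clock_mhz nanoseconds, so the access spans
    memory_latency_ns * clock_mhz / 1000 cycles.
    """
    return memory_latency_ns * clock_mhz / 1000.0


# Hypothetical figures: memory latency improves by 20% (100 ns to 80 ns)
# while the clock rate quadruples (100 MHz to 400 MHz).
older = miss_penalty_cycles(memory_latency_ns=100, clock_mhz=100)
newer = miss_penalty_cycles(memory_latency_ns=80, clock_mhz=400)
print(older, newer)
```

Under these assumed figures the memory is faster in absolute terms, yet each miss costs more processor cycles, which is the effect the preceding paragraph describes.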
In addition to multi-threading or multi-tasking, another factor which increases the frequency of cache misses is the use of object-oriented programming languages. These languages allow the programmer to put together a program at a level of abstraction removed from the steps of moving data around and performing arithmetic operations. This limits the programmer's ability to keep a sequence of instructions or data in a contiguous area of memory at the execution level.
One technique for limiting the effect of slow memory accesses is a "non-blocking" load or store (read or write) operation. "Non-blocking" means that other operations can continue in the processor while the memory access is in progress. Other load or store operations are "blocking" loads or stores, meaning that processing of other operations is held up while waiting for the results of the memory access (typically a load will block, while a store will not). Even a non-blocking load will typically become blocking at some later point, since there is a limit on how many instructions can be processed without the needed data from the memory access.
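The difference between the two kinds of load, and the point at which a non-blocking load becomes blocking, can be sketched with a simple cycle count. This is an illustrative model under stated assumptions, not a description of any actual pipeline: the function `total_cycles` is hypothetical, every instruction other than the load is assumed to take one cycle, and `load_latency` is assumed to be at least one cycle.

```python
def total_cycles(num_independent: int, load_latency: int, blocking: bool) -> int:
    """Cycles needed to run one load, `num_independent` one-cycle instructions
    that do not use the loaded value, and one final instruction that does.
    """
    if blocking:
        # Nothing else proceeds until the load returns its data.
        return load_latency + num_independent + 1
    # Non-blocking: the load issues in one cycle and the independent
    # instructions overlap its latency. The consumer starts only once both
    # the data has arrived and the independent work is finished.
    consumer_start = max(load_latency, 1 + num_independent)
    return consumer_start + 1


# With only 5 independent instructions, a 20-cycle non-blocking load still
# stalls the consumer: the load has effectively become blocking.
print(total_cycles(5, 20, blocking=True), total_cycles(5, 20, blocking=False))
# With 30 independent instructions the load latency is fully hidden.
print(total_cycles(30, 20, blocking=True), total_cycles(30, 20, blocking=False))
```

In this model the non-blocking load saves cycles only to the extent that independent instructions are available to overlap with the access.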
Another technique for limiting the effect of slow memory accesses is a thread switch. A discussion of the effect of multi-threading on cache memory systems is set forth in the article "Evaluation of Multi-Threaded Uniprocessors for Commercial Application Environments" by R. Eickemeyer et al. of IBM, 23rd Annual International Symposium on Computer Architecture, May 22-24, 1996. The IBM article shows the beneficial effect of a thread switch in a multi-threaded processor upon a level 2 cache miss. The article points out that the use of separate registers and separate instruction dispatch buffers for each thread will affect the efficiency. The article assumes a non-blocking level 2 cache, meaning that the level 2 cache can continue servicing an access for a first thread while also processing a cache request for a second thread, if necessary.
The IBM article points out that there exist fine-grain multi-threading processors which interleave different threads on a cycle-by-cycle basis. Coarse-grain multi-threading, by contrast, interleaves the instructions of different threads only upon long-latency events.
As pointed out in the IBM article, switching in the Tera supercomputer, which switches every cycle, is done in round-robin fashion. The Alewife project is cited as handling thread switching in software using a fast trap.
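The fine-grain and coarse-grain interleaving policies described above can be contrasted with a small scheduling sketch. The functions `fine_grain` and `coarse_grain`, and the instruction label `"miss"` used to mark a long-latency event, are hypothetical names introduced for illustration; the sketch models only the order in which instructions issue and ignores latency timing, registers, and dispatch buffers.

```python
from typing import List


def fine_grain(threads: List[List[str]]) -> List[str]:
    """Issue one instruction per thread in turn, round-robin, every cycle."""
    queues = [list(t) for t in threads]
    order = []
    while any(queues):
        for tid, queue in enumerate(queues):
            if queue:
                order.append(f"T{tid}:{queue.pop(0)}")
    return order


def coarse_grain(threads: List[List[str]], switch_on: str = "miss") -> List[str]:
    """Run one thread until it issues a long-latency instruction, then switch."""
    queues = [list(t) for t in threads]
    order = []
    tid = 0
    while any(queues):
        queue = queues[tid]
        while queue:
            instr = queue.pop(0)
            order.append(f"T{tid}:{instr}")
            if instr == switch_on:
                break  # long-latency event: give up the processor
        tid = (tid + 1) % len(queues)
    return order


threads = [["a", "miss", "b"], ["c", "d"]]
print(fine_grain(threads))
print(coarse_grain(threads))
```

In the fine-grain order the two threads alternate on every issue slot, whereas in the coarse-grain order thread 0 keeps the processor until its `"miss"` instruction and resumes only after thread 1 has run.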
It would be desirable to have an efficient mechanism for switching between threads upon long-latency events.