The present invention relates to microprocessors which execute multi-threaded programs, and in particular to the handling of blocking memory accesses (accesses that require waiting) in such programs.
Many modern computers support “multi-tasking,” in which two or more programs are run at the same time. An operating system controls the alternating between the programs, and a switch between the programs or between the operating system and one of the programs is called a “context switch.”
Additionally, multi-tasking can be performed in a single program, and is typically referred to as “multi-threading.” Multiple actions can be processed concurrently using multi-threading.
Most modern computers include at least a first level and typically a second level cache memory system for storing frequently accessed data and instructions. With the use of multi-threading, multiple threads share the cache memory, and thus the data or instructions for one thread may overwrite those for another, increasing the probability of cache misses.
The cost of a cache miss, measured in wasted processor cycles, is increasing. This is because processor speeds have increased at a higher rate than memory access speeds over the last several years, a trend expected to continue into the foreseeable future. Thus, more processor cycles, rather than fewer, are required for memory accesses as speeds increase. Accordingly, memory accesses are becoming a limiting factor on processor execution speed.
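The relationship between clock speed and the cycle cost of a memory access can be illustrated with a short calculation. The sketch below is illustrative only: the function name, the 100 ns latency, and the clock rates are hypothetical values chosen for the example and do not come from the description above.

```python
def miss_cost_cycles(memory_latency_ns: int, clock_mhz: int) -> int:
    """Processor cycles spent waiting on one memory access.

    cycles = latency (ns) * clock (MHz) / 1000, rounded up to a whole cycle.
    Both inputs are hypothetical example values, not measured data.
    """
    if memory_latency_ns < 0 or clock_mhz <= 0:
        raise ValueError("latency must be >= 0 and clock must be > 0")
    # Ceiling division using integers only, to avoid floating-point error.
    return -(-(memory_latency_ns * clock_mhz) // 1000)


# Same memory latency, faster processor: the miss costs more cycles.
slow_clock = miss_cost_cycles(100, 100)   # 10 cycles
fast_clock = miss_cost_cycles(100, 500)   # 50 cycles
```

With the memory latency held fixed, raising the clock rate fivefold raises the number of cycles lost per access fivefold, which is the effect described above.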
In addition to multi-threading or multi-tasking, another factor which increases the frequency of cache misses is the use of object oriented programming languages. These languages allow the programmer to put together a program at a level of abstraction removed from the steps of moving data around and performing arithmetic operations, which limits the programmer's control over keeping a sequence of instructions or data in a contiguous area of memory at the execution level.
One technique for limiting the effect of slow memory accesses is a “non-blocking” load or store (read or write) operation. “Non-blocking” means that other operations can continue in the processor while the memory access is being done. Other load or store operations are “blocking” loads or stores, meaning that processing of other operations is held up while waiting for the results of the memory access (typically a load will block, while a store won't). Even a non-blocking load will typically become blocking at some later point, since there is a limit on how many instructions can be processed without the needed data from the memory access.
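The difference between a blocking and a non-blocking load can be sketched with a simple stall model. The model below is an assumption made for illustration: it supposes one instruction completes per cycle and that a known number of instructions following the load do not need the loaded data. The function and parameter names are hypothetical.

```python
def stall_cycles(miss_latency: int, independent_instructions: int = 0) -> int:
    """Cycles the processor sits idle waiting on one load.

    Assumed model (not from the source): one instruction per cycle, and
    `independent_instructions` instructions after the load can execute
    without the loaded data. A blocking load is the special case of zero
    independent instructions.
    """
    if miss_latency < 0 or independent_instructions < 0:
        raise ValueError("arguments must be non-negative")
    # Once the independent work runs out, the non-blocking load
    # effectively becomes blocking for the remaining latency.
    return max(0, miss_latency - independent_instructions)


blocking = stall_cycles(50)              # stalls for the full latency
non_blocking = stall_cycles(50, 8)       # overlaps 8 cycles of useful work
fully_hidden = stall_cycles(50, 60)      # enough work to hide the latency
```

Under these assumptions, a non-blocking load only hides as much latency as there is independent work available, which is why it typically becomes blocking at some later point.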
Another technique for limiting the effect of slow memory accesses is a thread switch. A discussion of the effect of multi-threading on cache memory systems is set forth in the article “Evaluation of Multi-Threaded Uniprocessors for Commercial Application Environments” by R. Eickemeyer et al. of IBM, May 22-24, 1996, 23rd Annual International Symposium on Computer Architecture. The IBM article shows the beneficial effect of a thread switch in a multi-threaded processor upon a level 2 cache miss. The article points out that the use of separate registers for each thread and instruction dispatch buffers for each thread will affect the efficiency. The article assumes a non-blocking level 2 cache, meaning that the level 2 cache can continue an access for a first thread while also processing a cache request for a second thread at the same time, if necessary.
The IBM article points out that there exist fine-grain multi-threading processors which interleave different threads on a cycle-by-cycle basis. Coarse-grain multi-threading interleaves the instructions of different threads on some long-latency event(s).
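The two interleaving styles can be contrasted with a small scheduling sketch. This is a simplified illustration, not a model of any particular processor: the function names are hypothetical, and the round-robin choice of the next thread in the coarse-grain case is an assumption made only to keep the example short.

```python
from typing import List, Set


def fine_grain_schedule(num_threads: int, cycles: int) -> List[int]:
    """Thread id issued on each cycle when switching every cycle."""
    return [cycle % num_threads for cycle in range(cycles)]


def coarse_grain_schedule(
    num_threads: int, long_latency_cycles: Set[int], cycles: int
) -> List[int]:
    """Thread id issued on each cycle when switching only on long-latency events.

    `long_latency_cycles` holds the cycles on which the running thread hits
    a long-latency event; the following cycle runs the next thread
    (round-robin here, as an assumption for the example).
    """
    schedule = []
    current = 0
    for cycle in range(cycles):
        schedule.append(current)
        if cycle in long_latency_cycles:
            current = (current + 1) % num_threads
    return schedule


fine = fine_grain_schedule(2, 6)               # [0, 1, 0, 1, 0, 1]
coarse = coarse_grain_schedule(2, {2, 4}, 6)   # [0, 0, 0, 1, 1, 0]
```

In the fine-grain schedule the thread changes on every cycle, while in the coarse-grain schedule a thread keeps running until one of its instructions encounters a long-latency event.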
As pointed out in the IBM article, switching in the Tera supercomputer, which switches every cycle, is done in round-robin fashion. The Alewife project is cited as handling thread switching in software using a fast trap.
It would be desirable to have an efficient mechanism for switching between threads upon long-latency events.
The present invention provides a method and apparatus for switching between threads of a program in response to a long-latency event. In one embodiment, the long-latency events are load or store operations which trigger a thread switch if there is a miss in the level 2 cache. A miss in a level 1 cache or a hit in a level 2 cache will not trigger a thread switch.
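The switch condition of this embodiment can be expressed as a simple decision. The sketch below is illustrative only; the enumeration and function names are hypothetical and are not taken from the described apparatus, which would implement the decision in hardware.

```python
from enum import Enum


class AccessResult(Enum):
    """Outcome of a load or store as seen by the cache hierarchy."""
    L1_HIT = "l1_hit"     # satisfied by the level 1 cache
    L2_HIT = "l2_hit"     # missed level 1, satisfied by level 2
    L2_MISS = "l2_miss"   # missed both levels: the long-latency case


def triggers_thread_switch(result: AccessResult) -> bool:
    """Return True only for a level 2 cache miss.

    A level 1 miss that hits in level 2 is treated as short enough to
    wait for, so it does not trigger a switch.
    """
    return result is AccessResult.L2_MISS
```

For example, `triggers_thread_switch(AccessResult.L2_HIT)` is false, while `triggers_thread_switch(AccessResult.L2_MISS)` is true.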
In addition to providing separate groups of registers for multiple threads, a group of program address registers pointing to different threads is provided. A switching mechanism switches between the program address registers in response to the long-latency events.
In one embodiment, the next program address register to be switched to is indicated in a thread field within the long-latency instruction itself. In an alternate embodiment, the program address registers are switched in a round-robin fashion.
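The two ways of selecting the next program address register can be sketched together. The code below is a software illustration under stated assumptions: the function name, the representation of the thread field as an optional integer, and the default of four threads are hypothetical choices for the example, and the actual selection would be performed by the switching mechanism in hardware.

```python
from typing import Optional


def next_thread(
    current: int, thread_field: Optional[int] = None, num_threads: int = 4
) -> int:
    """Pick the program address register to switch to.

    If the long-latency instruction carries a thread field, that field
    names the next thread directly. Otherwise the registers are visited
    in round-robin order. `num_threads=4` is an assumed default.
    """
    if not 0 <= current < num_threads:
        raise ValueError("current thread out of range")
    if thread_field is not None:
        if not 0 <= thread_field < num_threads:
            raise ValueError("thread field out of range")
        return thread_field
    # Round-robin: advance to the next register, wrapping at the end.
    return (current + 1) % num_threads
```

For example, `next_thread(3)` wraps around to thread 0 under round-robin selection, while `next_thread(0, thread_field=2)` follows the field in the instruction and selects thread 2.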
Preferably, in addition to the program address registers for each thread and the register files for each thread, instruction buffers are provided for each thread. In a preferred embodiment, there are up to four sets of registers to support four threads.
For a further understanding of the nature and advantages of the invention, reference should be made to the following description taken in conjunction with the accompanying drawings.