The present invention relates to processor design. More specifically, the present invention relates to a system that reduces the number of hardware components necessary for instruction pointer generation in a simultaneous multithreaded processor.
Multithreaded processors have become increasingly popular in the art as a way to minimize unproductive processor time. Multithreading enables a processor to perform tasks for a given thread until a specific event occurs, such as a certain number of execution cycles passing, a higher priority thread requiring attention, or the current thread being forced to stall while waiting for data, and then to begin processing another thread.
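The switch events listed above can be illustrated with a minimal behavioral sketch. It is not taken from any particular processor: the names `SwitchReason` and `switch_reason`, the `quantum` parameter, and the order in which the events are checked are all assumptions made for illustration.

```python
from enum import Enum, auto


class SwitchReason(Enum):
    NONE = auto()             # keep running the current thread
    STALL = auto()            # current thread is waiting for data
    PRIORITY = auto()         # a higher priority thread requires attention
    QUANTUM_EXPIRED = auto()  # the allotted number of cycles has passed


def switch_reason(cycles_run: int, quantum: int,
                  higher_priority_pending: bool, stalled: bool) -> SwitchReason:
    """Return why the processor would leave the current thread, if at all.

    The check order (stall, then priority, then cycle count) is an
    assumption of this sketch, not something the text prescribes.
    """
    if stalled:
        return SwitchReason.STALL
    if higher_priority_pending:
        return SwitchReason.PRIORITY
    if cycles_run >= quantum:
        return SwitchReason.QUANTUM_EXPIRED
    return SwitchReason.NONE
```

For example, `switch_reason(3, 10, False, False)` reports no switch, while the same call with `stalled=True` reports a stall-driven switch.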
To facilitate multiple program threads being actively executed, simultaneous multithreaded (SMT) implementations require that multiple threads be fetched and readied for execution. Different methods exist in the art for fetching instructions from the various active threads. One approach is to utilize multiple ‘front ends’, one for each thread, to fetch and fill the de-coupling buffers that feed the ‘back end’ execution pipes. This approach requires a large amount of hardware, because each of the multiple front ends also includes its own level one instruction cache.
An alternate, more common, method of facilitating multithreading is to time-multiplex between various threads in a ‘round-robin’ fashion. FIG. 2 provides an illustration of simple instruction pointer logic utilizing such a method of time-multiplexing between two threads. A problem with this method is that not only do the multiplexers 218, 220 need to be duplicated, but multiple storage elements (flip flops) 248, 250, 252, 254, 256 also need to be introduced. These storage elements (one for each re-steer logic path) are required to capture the re-steer information (for re-direction to point to the proper instruction) that arrives for a thread while that thread is inactive. Without them, the re-steer information would be lost during inactivity of the thread. Further, with this method of SMT, the number of inputs to each multiplexer 218, 220 doubles. The additional inputs make the multiplexers 218, 220 larger and lengthen the critical timing path, which is already constrained by the addition of a next level multiplexer 246 to select between the multiple threads.
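Why the storage elements are needed can be seen in a simplified behavioral model of round-robin instruction pointer selection between two threads. This is a sketch under stated assumptions, not the circuit of FIG. 2: the names `ThreadIP`, `run`, and `FETCH_WIDTH`, the starting addresses, and the use of a single pending slot per thread (rather than one flip flop per re-steer path) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

FETCH_WIDTH = 16  # hypothetical number of bytes fetched per cycle


@dataclass
class ThreadIP:
    ip: int                                # next instruction pointer for the thread
    pending_resteer: Optional[int] = None  # models a per-thread storage element


def run(cycles: int,
        resteers: Dict[int, Tuple[int, int]],
        latch_resteers: bool = True) -> List[Tuple[int, int]]:
    """Alternate between two threads, one fetch per cycle.

    resteers maps a cycle number to (thread_id, target_ip).
    Returns the (thread_id, ip) pair fetched in each cycle.
    """
    threads = [ThreadIP(0x1000), ThreadIP(0x2000)]  # assumed start addresses
    fetched = []
    for cycle in range(cycles):
        active = cycle % 2  # round-robin thread select
        t = threads[active]
        # A re-steer captured while this thread was inactive takes effect now.
        if t.pending_resteer is not None:
            t.ip = t.pending_resteer
            t.pending_resteer = None
        if cycle in resteers:
            tid, target = resteers[cycle]
            if tid == active:
                t.ip = target                          # applied immediately
            elif latch_resteers:
                threads[tid].pending_resteer = target  # held for later
            # Otherwise the re-steer for the inactive thread is lost.
        fetched.append((active, t.ip))
        t.ip += FETCH_WIDTH  # default sequential increment
    return fetched
```

With a re-steer for thread 1 arriving in cycle 0, while thread 0 is active, `run(4, {0: (1, 0x3000)})` redirects thread 1 on its next turn. The same call with `latch_resteers=False` leaves thread 1 fetching from its old sequential address, which is the loss the storage elements exist to prevent.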
It is therefore desirable to have a system for a simultaneous multithreaded processor that minimizes both the number of hardware components required and the complexity of the design.