The present invention relates generally to the field of computer processors and more specifically to increasing throughput in simultaneously multithreaded processors.
A computer processor is the heart of any computer system. It is responsible for processing the instructions that make all of the functions of the computer possible. Computer processors are also called central processing units (CPUs) and microprocessors. A pipelined computer processor has multiple stages that each instruction must traverse during processing. An exemplary five-stage pipelined processor contains the following stages: fetch, decode, memory access, execute and write-back. During the fetch stage, an instruction is fetched from a register or buffer. The instruction is decoded in the decode stage to determine the type of operation to be conducted and what operand(s) are needed to complete the operation. The required operands are retrieved during the memory access stage, and execution of the instruction occurs during the execute stage. The result of the executed instruction is then written back to memory during the write-back stage. Many processors have more than five stages, and some processors have more than one pipeline. However, some features of pipelined processors are the same for all pipelines. Specifically, once an instruction enters a pipeline, any stall caused by the instruction will cause the entire pipeline to stall. When the pipeline is stalled, no output is produced and performance drops. Thus, preventing pipeline stalls is an important factor in achieving optimal performance in microprocessors.
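The effect of a stall on the exemplary five-stage pipeline can be illustrated with a minimal sketch. The function name run_pipeline, the stalls list and the choice of the execute stage as the stalling stage are assumptions made only for illustration; the model simply freezes every stage whenever the instruction in the stalling stage is not yet able to proceed.

```python
# Stage order follows the exemplary five-stage pipeline described above.
STAGES = ["fetch", "decode", "memory access", "execute", "write-back"]


def run_pipeline(stalls, stall_stage=3):
    """Return the number of clock cycles needed to complete all instructions.

    stalls[i] is a hypothetical number of extra cycles that instruction i
    holds the pipeline once it reaches stage index `stall_stage`
    (index 3 is "execute" in STAGES; this choice is an assumption).
    """
    remaining = list(stalls)
    pipe = [None] * len(STAGES)   # pipe[s] = index of the instruction in stage s
    next_instr = 0
    completed = 0
    cycles = 0
    while completed < len(stalls):
        cycles += 1
        occupant = pipe[stall_stage]
        if occupant is not None and remaining[occupant] > 0:
            # The stalled instruction freezes the entire pipeline:
            # nothing advances and no output is produced this cycle.
            remaining[occupant] -= 1
            continue
        # Every instruction advances one stage; a new one enters fetch.
        entering = next_instr if next_instr < len(stalls) else None
        if entering is not None:
            next_instr += 1
        pipe = [entering] + pipe[:-1]
        if pipe[-1] is not None:
            completed += 1        # instruction finishes write-back this cycle
    return cycles


if __name__ == "__main__":
    print(run_pipeline([0, 0, 0]))   # no stalls
    print(run_pipeline([0, 2, 0]))   # second instruction stalls for two cycles
```

In this model, three instructions with no stalls complete in seven cycles, and a two-cycle stall on the second instruction adds two cycles to the total because every stage waits behind it.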
Microprocessors run on a timing schedule that is coordinated by a clock. The clock provides timing signals referred to as cycles. Movement of instructions, operands and results is preferably completed upon each clock cycle. A given stage within a pipeline, such as the execute stage, may take more than one clock cycle to complete. However, the execute stage is preferably broken into multiple sub-stages so that at the end of each clock cycle some output is produced and allowed to enter the next stage. In this way, the microprocessor produces some output at the end of each clock cycle. Both clock cycle time and clock frequency can be used to describe the speed of the processor. A computer with a short clock cycle will have a high clock frequency. Generally, the higher the clock frequency, the faster the computer or, more accurately, the faster the computer is able to process instructions.
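The inverse relationship between clock cycle time and clock frequency can be shown with a short sketch. The function names and the choice of nanoseconds and gigahertz as units are assumptions for illustration only.

```python
def clock_frequency_ghz(cycle_time_ns):
    """Clock frequency in GHz for a clock cycle time given in nanoseconds."""
    # Frequency is the reciprocal of the cycle time; 1 / ns gives GHz.
    return 1.0 / cycle_time_ns


def execution_time_ns(cycles, cycle_time_ns):
    """Elapsed time in nanoseconds for a given number of clock cycles."""
    return cycles * cycle_time_ns


if __name__ == "__main__":
    # Halving the cycle time doubles the frequency.
    print(clock_frequency_ghz(0.5))    # 2.0
    print(clock_frequency_ghz(0.25))   # 4.0
    # The same seven cycles take less time at the shorter cycle time.
    print(execution_time_ns(7, 0.5))   # 3.5
    print(execution_time_ns(7, 0.25))  # 1.75
```

The same number of cycles takes less elapsed time when the cycle is shorter, which is why a higher clock frequency generally allows instructions to be processed faster.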
A thread is a line, or stream, of computer instructions that, when processed, achieves some objective of the computer or the computer user. Simultaneously multithreaded processors allow for the execution of two or more potentially independent instruction streams concurrently. While only one instruction can occupy any one stage of a pipeline at a time, having instructions from other threads ready for processing increases system performance. To make the most efficient use of the available hardware and avoid duplication of function, some pipeline resources are shared among all threads. For a given thread to occupy a shared resource, its instruction stream must at some point be merged with the instruction streams of the other threads. For the purpose of this application, the act of an instruction from any given thread merging into a shared pipeline resource is defined as “issue”. After an instruction issues, a data dependency could cause it to stall in a shared resource until the dependency is resolved, stalling all threads that require the same resource. The impact of this problem is magnified in high-frequency designs because the pipeline depth requires that the decision to issue a particular instruction be made one or more cycles before operand availability is known. This increases the chance of a dependent instruction stalling in a shared resource awaiting required operands. If, instead, issue were delayed until operand availability was known, overall system performance would be negatively affected in cases where the operands would have been ready at the time the dependent instruction required them. Single-threaded performance would also suffer due to the increased latency, while multithreaded performance and/or efficiency would suffer due to not utilizing every possible opportunity to issue an instruction from a given thread.
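The way a stalled instruction in a shared resource holds up every thread can be sketched as follows. This is a simplified model under stated assumptions, not a description of any particular processor: a single shared resource is assumed, threads are assumed to issue in round-robin order, and the names simulate_shared_issue, threads and blocked are hypothetical.

```python
from collections import deque


def simulate_shared_issue(threads):
    """Model threads issuing into one shared pipeline resource.

    threads maps a thread name to a list of hypothetical operand wait
    times, one per instruction: the number of cycles the instruction
    stalls in the shared resource after issue (0 = operands ready).

    Returns (total_cycles, blocked), where blocked[name] counts the
    cycles in which thread `name` still had instructions to issue but
    the shared resource was held by a stalled instruction from
    another thread.
    """
    queues = {name: deque(waits) for name, waits in threads.items()}
    order = deque(queues)                 # round-robin order (an assumption)
    blocked = {name: 0 for name in queues}
    cycles = 0
    while any(queues.values()):
        # Skip threads that have nothing left to issue.
        while not queues[order[0]]:
            order.rotate(-1)
        owner = order[0]
        order.rotate(-1)
        wait = queues[owner].popleft()
        # The instruction occupies the shared resource for one cycle,
        # plus any cycles spent stalled awaiting its operands.
        cycles += 1 + wait
        for name, queue in queues.items():
            if name != owner and queue:
                blocked[name] += wait     # other threads cannot issue meanwhile
    return cycles, blocked


if __name__ == "__main__":
    total, blocked = simulate_shared_issue({"T0": [0, 3, 0], "T1": [0, 0, 0]})
    print(total, blocked)
```

In this example the second instruction of thread T0 stalls for three cycles after issue, and thread T1, which has no dependencies of its own, loses those same three cycles.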
Prior attempts to remedy this problem involved blocking a thread with the dependency from issuing until its operand data was ready for forwarding. This was a suitable solution for lower-frequency designs, but it is not optimal for high-frequency designs, which must make the issue decision one or more cycles before operand availability is known due to the pipeline depth. Using this prior method in processors with a high clock frequency introduces penalty cycles to a specific thread's overall latency each time a dependency is encountered, since operand data cannot be used as soon as it becomes available.
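The penalty introduced by the prior method can be illustrated with a small sketch. The parameter issue_lead, meaning the number of cycles between the issue decision and the point at which the operand is actually needed, is a hypothetical modeling assumption, as are the function names; real designs may differ.

```python
def operand_use_cycle(ready_cycle, issue_lead, block_until_ready):
    """Cycle at which a dependent instruction consumes its operand.

    ready_cycle: cycle at which the operand becomes available.
    issue_lead:  hypothetical number of cycles between the issue
                 decision and the point where the operand is needed
                 (0 models a design with no such gap).
    block_until_ready: True models the prior method, which holds the
                 thread until the operand is known to be ready.
    Assumes ready_cycle >= issue_lead.
    """
    if block_until_ready:
        issue_cycle = ready_cycle                # issue only once data is ready
    else:
        issue_cycle = ready_cycle - issue_lead   # ideally timed early issue
    return issue_cycle + issue_lead


def penalty_cycles(ready_cycle, issue_lead):
    """Extra latency of blocking compared with ideally timed issue."""
    blocked = operand_use_cycle(ready_cycle, issue_lead, True)
    ideal = operand_use_cycle(ready_cycle, issue_lead, False)
    return blocked - ideal


if __name__ == "__main__":
    print(penalty_cycles(10, 0))   # no gap between issue and operand use
    print(penalty_cycles(10, 2))   # issue decided two cycles ahead
```

Under this model, blocking costs nothing when there is no gap between issue and operand use, but costs one penalty cycle for every cycle by which the issue decision must precede operand use.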