The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent the work is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
FIG. 1A is a graphical illustration of execution opportunities in a superscalar processor over five consecutive cycles. Each cycle corresponds to one clock period of the processor clock. Because the processor is superscalar, multiple instructions can be executed in each clock cycle, as indicated in FIG. 1A by the four columns. However, in virtually all modern processors, the datapath is pipelined, meaning that some or all instructions require multiple clock cycles to complete.
In typical software, an instruction may rely on the result of the instruction that precedes it. This may force the processor to wait to execute the later instruction until the earlier instruction is partially or fully completed. Some instructions, however, do not depend on each other; for example, they may depend only on instructions that were completed earlier. Such instructions can theoretically be executed in parallel, since neither instruction requires the output of the other. This is called instruction level parallelism.
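For purposes of illustration only, the independence condition described above can be sketched in Python. The `Instr` record, the register names, and the `independent` helper are hypothetical and do not appear in the figures. The sketch checks only whether either instruction reads the other's output, and ignores other hazards that a real processor would also have to consider.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, several sources."""
    dest: str
    srcs: Tuple[str, ...]


def independent(a: Instr, b: Instr) -> bool:
    """Return True if neither instruction requires the output of the other."""
    return a.dest not in b.srcs and b.dest not in a.srcs


# Illustrative instructions (register names are assumptions).
i1 = Instr(dest="r1", srcs=("r2", "r3"))
i2 = Instr(dest="r4", srcs=("r1", "r5"))  # reads r1, which i1 writes
i3 = Instr(dest="r6", srcs=("r7", "r8"))  # touches nothing i1 writes or reads

print(independent(i1, i2))  # False: i2 must wait for i1
print(independent(i1, i3))  # True: i1 and i3 may be issued in the same cycle
```

Under this simplified check, i1 and i3 exhibit instruction level parallelism, while i2 must wait for i1.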
In the example of FIG. 1A, three of the four execution slots in cycle 1 are used for execution. This means that the processor was able to identify instruction level parallelism and issue the three instructions at once. However, note that the processor was unable to identify a fourth instruction to execute in parallel. The unused slot represents wasted processing capability, which impacts performance and may also impact power consumption. In cycle 2, the processor is unable to issue any instructions; for example, subsequent instructions may require information from the instructions issued in cycle 1, from instructions issued in previous cycles, or from a storage location, such as level two cache, that has a multi-cycle latency.
In cycle 3, a single instruction is issued, and in cycle 4, two instructions are issued. Once again, in cycle 5, the processor is unable to issue any instructions. As can be seen in FIG. 1A, for this particular example, the limited instruction level parallelism causes many of the execution opportunities to be wasted: only six of the twenty execution slots across the five cycles are used.
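For purposes of illustration only, the slot usage of FIG. 1A can be tallied with a short Python sketch. The per-cycle issue counts (three, zero, one, two, zero) and the issue width of four are taken from the description above; the names `ISSUE_WIDTH`, `FIG_1A_ISSUED`, and `slot_usage` are hypothetical.

```python
ISSUE_WIDTH = 4                  # four columns in FIG. 1A
FIG_1A_ISSUED = [3, 0, 1, 2, 0]  # instructions issued in cycles 1 through 5


def slot_usage(issued_per_cycle, width=ISSUE_WIDTH):
    """Return (used, total) execution slots over the given cycles."""
    used = sum(issued_per_cycle)
    total = width * len(issued_per_cycle)
    return used, total


used, total = slot_usage(FIG_1A_ISSUED)
print(f"{used} of {total} slots used, {total - used} wasted")
# prints "6 of 20 slots used, 14 wasted"
```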
In FIG. 1B, an example processor offering fine-grained multithreading is shown. A first thread in this example executes the same instructions as in FIG. 1A, which therefore have the same dependencies. However, a second thread may be executed in cycles where the first thread is not executing. As a result, in cycle 2, two instructions are issued from the second thread, while in cycle 5, three instructions are issued from the second thread. The second thread may be another program or may be a second thread of the same program.
This fine-grained multithreading exploits thread level parallelism, in which instructions from multiple threads do not depend on each other's outputs, and therefore other threads may continue executing while a first thread is waiting for instructions to complete. However, there is still a significant amount of waste in terms of execution opportunities.
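For purposes of illustration only, the fine-grained scheme of FIG. 1B can be sketched in Python as a rule in which only one thread issues per cycle and the second thread issues only when the first thread cannot. The first thread's counts follow FIG. 1A, and the second thread's counts for cycles 2 and 5 follow the description above. The second thread's counts for cycles 1, 3, and 4 are hypothetical, as are all names. As a simplification, instructions that are ready but not issued do not carry over to later cycles.

```python
ISSUE_WIDTH = 4
THREAD_1_READY = [3, 0, 1, 2, 0]  # per-cycle issue counts from FIG. 1A
THREAD_2_READY = [2, 2, 1, 1, 3]  # cycles 2 and 5 per the text; rest assumed


def fine_grained(t1_ready, t2_ready, width=ISSUE_WIDTH):
    """Pick at most one thread per cycle; thread 1 has priority."""
    schedule = []
    for r1, r2 in zip(t1_ready, t2_ready):
        if r1 > 0:
            schedule.append((1, min(r1, width)))
        elif r2 > 0:
            schedule.append((2, min(r2, width)))
        else:
            schedule.append((None, 0))
    return schedule


schedule = fine_grained(THREAD_1_READY, THREAD_2_READY)
used = sum(count for _, count in schedule)
print(schedule)  # [(1, 3), (2, 2), (1, 1), (1, 2), (2, 3)]
print(f"{used} of {ISSUE_WIDTH * len(schedule)} slots used")  # 11 of 20
```

Under these assumptions, usage rises from six to eleven of the twenty slots, yet nine slots remain unused because a cycle is never shared between threads.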
In FIG. 1C, a processor offering simultaneous multithreading is shown with the same example instructions from the first thread. In FIG. 1C, note that instructions from the second thread can be issued in the same cycles that instructions from the first thread are issued, which is why such a scheme is called simultaneous multithreading. The instructions for the second thread may also have interdependencies, and therefore some execution opportunities may be missed while instructions from both the first and second threads are waiting for instructions to complete.
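For purposes of illustration only, simultaneous multithreading can be sketched in Python as a rule in which the second thread fills whatever slots the first thread leaves free in the same cycle. The first thread's counts follow FIG. 1A. The second thread's ready counts and all names are hypothetical, and the resulting schedule is not intended to reproduce FIG. 1C. As a simplification, instructions that are ready but not issued do not carry over to later cycles.

```python
ISSUE_WIDTH = 4
THREAD_1_READY = [3, 0, 1, 2, 0]  # per-cycle issue counts from FIG. 1A
THREAD_2_READY = [2, 2, 1, 1, 3]  # hypothetical ready counts for thread 2


def simultaneous(t1_ready, t2_ready, width=ISSUE_WIDTH):
    """Each cycle, thread 1 issues first and thread 2 fills the free slots."""
    schedule = []
    for r1, r2 in zip(t1_ready, t2_ready):
        n1 = min(r1, width)
        n2 = min(r2, width - n1)
        schedule.append((n1, n2))
    return schedule


schedule = simultaneous(THREAD_1_READY, THREAD_2_READY)
used = sum(n1 + n2 for n1, n2 in schedule)
print(schedule)  # [(3, 1), (0, 2), (1, 1), (2, 1), (0, 3)]
print(f"{used} of {ISSUE_WIDTH * len(schedule)} slots used")  # 14 of 20
```

In this sketch, slots still go unused in cycles where neither thread has enough ready instructions, consistent with the missed execution opportunities described above.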
Adding threads may allow more execution opportunities to be used. However, with more simultaneous threads, complexity increases. Complexity may result in increased design effort, increased die area, and increased power consumption. To reduce complexity, various resources of the processor are partitioned. For example only, an instruction cache may be partitioned into sections each corresponding to one of the threads.
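For purposes of illustration only, one possible way to partition an instruction cache among threads is sketched below in Python. The cache geometry, the even split of sets among threads, and all names are assumptions made for the sketch and are not taken from the disclosure.

```python
NUM_SETS = 64     # hypothetical number of cache sets
NUM_THREADS = 4   # hypothetical number of hardware threads
LINE_BYTES = 64   # hypothetical cache line size in bytes
SETS_PER_THREAD = NUM_SETS // NUM_THREADS


def set_index(thread_id, address):
    """Map an instruction address to a set inside the thread's own section."""
    line = address // LINE_BYTES
    base = thread_id * SETS_PER_THREAD
    return base + line % SETS_PER_THREAD


# The same address lands in a different section for each thread.
for thread_id in range(NUM_THREADS):
    print(thread_id, set_index(thread_id, 0x1000))
```

Because each thread indexes only its own section, one thread cannot evict another thread's cached instructions, at the cost of each thread seeing a smaller effective cache.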