A Central Processing Unit (CPU) is the computing and controlling core of the computer. The CPU's basic operating process comprises the following phases: instruction fetching, instruction decoding, instruction execution, and writing back. In the instruction fetching phase, instructions are extracted from storage or a cache. In the instruction decoding phase, different control signals are generated according to the types of fetched instructions. In the instruction execution phase, operands are used to execute instructions in functional components (execution units) in accordance with the control signals generated in the decoding phase. Lastly, in the write-back phase, execution results are written back into storage or a register.
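The four phases can be illustrated with a minimal sketch. The instruction format, the opcode names "add" and "sub", and the function `run` are hypothetical and chosen only for illustration; a real CPU overlaps these phases in hardware rather than looping over them in software.

```python
import operator
from typing import NamedTuple

class Instr(NamedTuple):
    op: str    # hypothetical opcode: "add" or "sub"
    dst: str   # destination register name
    src1: str  # first source register name
    src2: str  # second source register name

# Stand-in for the control signals produced by decoding:
# each opcode selects the operation the execution unit performs.
ALU_OPS = {"add": operator.add, "sub": operator.sub}

def run(program, regs):
    """Execute a list of Instr one at a time, phase by phase."""
    pc = 0
    while pc < len(program):
        instr = program[pc]                               # fetch
        pc += 1
        alu = ALU_OPS[instr.op]                           # decode
        result = alu(regs[instr.src1], regs[instr.src2])  # execute
        regs[instr.dst] = result                          # write back
    return regs

if __name__ == "__main__":
    program = [Instr("add", "r3", "r1", "r2"), Instr("sub", "r4", "r3", "r1")]
    print(run(program, {"r1": 5, "r2": 7}))
```

In this toy program the second instruction reads r3, which the first instruction writes, so the two cannot be executed independently of each other.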
Several techniques have been developed to improve CPU throughput. Examples of such performance-improving techniques include pipelining, superscalar techniques, and superscalar-pipelining techniques. These techniques have the following in common: they increase the concurrency of instruction execution by increasing the number of instructions executed within a single clock cycle and therefore increase CPU execution efficiency. However, in reality, a CPU generally makes use of a limited number of system architecture registers (also called “ISA registers” or “general registers”) to save the operands and the results of executed instructions. Consequently, dependent relationships (also called “data dependency”) may exist between instructions. For example, two instructions are dependent when they use the same register. Such dependency between the instructions restricts their parallel execution. To mitigate this problem, a register renaming phase is introduced between the decoding and execution phases of the CPU operating process. The main task of the renaming phase is to eliminate false dependence (also called “erroneous dependence”) between instructions with respect to register use. It is also necessary to screen for true dependence (also called “data dependency”) between instructions. Data dependency occurs, for example, if the value of a source operand to be used in the execution of a subsequent instruction originates from a destination operand that is produced by a previously executed instruction. Register renaming can be performed through renaming list mapping. Screening for data dependency can be performed through comparative assessment using a renaming comparator.
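As a rough illustration of renaming list mapping, the sketch below renames a short instruction sequence. It assumes a hypothetical instruction format of (destination, source 1, source 2), architectural registers named r0 to r7, and an unbounded supply of physical registers p0, p1, and so on; real hardware uses a finite free list and recycles physical registers.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, str]  # (dst, src1, src2); hypothetical format

def rename(instrs: List[Instr], arch_regs: List[str]) -> List[Instr]:
    """Rename architectural registers to physical registers in program order."""
    # Renaming list: architectural register -> current physical register.
    mapping: Dict[str, str] = {r: f"p{i}" for i, r in enumerate(arch_regs)}
    next_free = len(arch_regs)  # assumption: physical registers never run out
    renamed = []
    for dst, src1, src2 in instrs:
        # Sources read the newest mapping, which preserves true dependence.
        p_src1, p_src2 = mapping[src1], mapping[src2]
        # Every destination gets a fresh physical register,
        # which removes false dependence on earlier uses of the same name.
        p_dst = f"p{next_free}"
        next_free += 1
        mapping[dst] = p_dst
        renamed.append((p_dst, p_src1, p_src2))
    return renamed

if __name__ == "__main__":
    regs = [f"r{i}" for i in range(8)]
    program = [
        ("r1", "r2", "r3"),  # i0 writes r1
        ("r4", "r1", "r5"),  # i1 reads r1: true dependence on i0
        ("r1", "r6", "r7"),  # i2 rewrites r1: false dependence on i0 and i1
    ]
    for line in rename(program, regs):
        print(line)
```

After renaming, i1 still reads the physical register written by i0 (p8), while i2 writes a different physical register (p10), so i2 no longer has to wait for i0 or i1.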
In addition, to improve the utilization of CPU execution units, modern CPUs have been configured to perform simultaneous multithreading (SMT), which combines instruction-level parallelism with thread-level parallelism. By duplicating the architectural state of the processor, a single physical CPU may simultaneously execute two or more independent threads that share the processor's execution units. Since instruction streams coming from two or more threads contain more independent instructions capable of parallel execution, the execution units can be more effectively used and shared, which increases CPU throughput.
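The effect on execution-unit utilization can be seen in a deliberately simplified model. It assumes that each thread is a single serial dependency chain, so a thread can issue at most one instruction per cycle, and that every instruction takes one cycle; the function name `cycles_to_finish` and these assumptions are illustrative only.

```python
def cycles_to_finish(threads, num_units):
    """Count cycles needed to drain all threads on shared execution units.

    threads: list of instruction counts, one per thread. Each thread is
    assumed to be one serial dependency chain (at most one issue per cycle).
    num_units: number of execution units shared by all threads.
    """
    remaining = list(threads)
    cycles = 0
    while any(remaining):
        units = num_units
        for t in range(len(remaining)):
            # A thread issues one instruction if it has work and a unit is free.
            if units and remaining[t]:
                remaining[t] -= 1
                units -= 1
        cycles += 1
    return cycles

if __name__ == "__main__":
    # One thread at a time on two units: the second unit sits idle.
    print(2 * cycles_to_finish([4], 2))
    # Two threads sharing the same two units simultaneously.
    print(cycles_to_finish([4, 4], 2))
```

Under these assumptions, running two four-instruction threads one after the other takes 8 cycles, while running them simultaneously on the same two units takes 4.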
Because a CPU that incorporates an SMT mechanism has more independent instruction streams, expanding the instruction width of the front-end stages (instruction fetching, decoding, and renaming) makes it possible to obtain even more instructions for parallel processing and thus make even fuller use of multiple execution units. Existing solutions are available to increase the throughput of the instruction fetching and decoding stages. For example, Intel CPUs incorporate level 0 caches that are used to store instructions that have already been decoded. In this way, when a CPU needs instructions, it can acquire them directly from the level 0 cache. At the same time, the width for acquiring instructions can be changed from 16 bytes to 32 bytes.
Conventionally, the number of hardware comparing units required for data dependency detection in executing instructions grows with the square of the number of instructions renamed in each cycle. Put another way, if a comparing unit is implemented using a set of hardware comparators and n is the number of instructions that must be renamed during each clock cycle, then the number of hardware comparators needed to perform data dependency detection is n×n−n. As such, conventionally, increasing the instruction width of the renaming phase requires a large increase in the number of hardware comparators. Without the addition of more hardware, conventionally, the renaming phase in Intel CPUs that are configured with the SMT functionality is limited to four instructions per clock cycle, which could decrease the throughput of parallel instruction computing. As for IBM CPUs with the SMT functionality, in order to increase the renaming width from four instructions to six instructions per clock cycle, conventionally, the number of hardware comparators would need to be increased from (4×4−4=12) to (6×6−6=30). This increases not only CPU hardware cost but also hardware complexity. A more efficient technique for expanding the instruction width in the renaming phase is needed.
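One way to arrive at the n×n−n figure is to assume that each instruction has two source operands and that each source must be compared against the destination of every earlier instruction renamed in the same cycle. The sketch below counts those comparisons; the two-source assumption, the (destination, source 1, source 2) format, and the function names are illustrative and not taken from any particular CPU design.

```python
def comparator_count(n):
    """Comparators needed to rename n instructions per cycle: n*n - n."""
    return n * n - n

def intra_group_dependencies(group):
    """Compare every source with every earlier destination in the group.

    group: list of (dst, src1, src2) tuples renamed in the same cycle.
    Returns (deps, comparisons), where each dep is a tuple
    (consumer index, source position, producer index).
    """
    deps = []
    comparisons = 0
    for j, (_, *srcs) in enumerate(group):
        for i in range(j):                 # every earlier instruction
            for k, src in enumerate(srcs):
                comparisons += 1           # one hardware comparator each
                if src == group[i][0]:
                    deps.append((j, k, i))
    return deps, comparisons

if __name__ == "__main__":
    group = [
        ("r1", "r2", "r3"),
        ("r4", "r1", "r5"),
        ("r6", "r7", "r0"),
        ("r2", "r4", "r6"),
    ]
    deps, comparisons = intra_group_dependencies(group)
    print(deps, comparisons, comparator_count(4), comparator_count(6))
```

For this group of four instructions the loop performs 12 comparisons, matching comparator_count(4), and widening the group to six instructions raises the count to 30.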