The present invention relates generally to processors, and, more particularly, to a processor having an instruction set architecture (ISA) with decomposing operands.
In the field of processors it is common to execute instructions in an “in-order” sequence. That is, the instruction is fetched, and if the input operands are all available, e.g., in registers, the instruction is dispatched to the appropriate functional unit of the processor for execution thereby. If one or more of the operands are unavailable during the current clock cycle, e.g., because they are being fetched from memory, the processor pauses operation or stalls until all of the operands are available. Once all operands are available, the instruction is executed by the appropriate functional unit, which then writes the results back to the register file.
It is also known to execute instructions in an “out-of-order” sequence. That is, after the instruction is fetched, the instruction waits in a queue until all of the input operands are available. When available, the instruction is allowed to leave the queue and is issued to the appropriate functional unit where it is executed.
Out-of-order processing allows the processor to avoid a class of processor stalls that occur when the data (i.e., operands) needed to perform an operation are not all available to the processor. An out-of-order processor fills the processor stall periods with other instructions that are ready to be executed, then re-orders the results to make it appear that the instructions were processed in their original program order. The benefits of out-of-order processing increase as the instruction pipeline deepens and the speed difference between main memory or cache memory and the processor widens. On a typical modern computer, the processor runs many times faster than the memory. Thus, during the time that an in-order processor spends waiting for the operand data to arrive, an out-of-order processor could instead have processed a larger number of instructions.
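The difference between the two issue policies described above can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not a description of any actual processor: at most one instruction issues per cycle, each register is written only once (so register renaming is not modeled), and the `Instr` type, the `cycles_to_finish` function, the register names, and the latencies are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dest: str       # register written by the instruction (hypothetical name)
    srcs: tuple     # registers read by the instruction
    latency: int    # cycles from issue until the result is available


def cycles_to_finish(program, out_of_order):
    """Return the cycle at which the last result becomes available.

    Toy model: at most one instruction issues per cycle. In-order issue
    may only consider the oldest waiting instruction; out-of-order issue
    picks the oldest waiting instruction whose operands are all ready.
    """
    ready_at = {}            # register -> cycle at which its value is ready
    pending = list(program)  # instructions not yet issued, oldest first
    cycle = 0
    finish = 0
    while pending:
        window = len(pending) if out_of_order else 1
        for idx in range(window):
            ins = pending[idx]
            # A source still to be produced by an older, unissued
            # instruction is not ready, whatever ready_at says.
            older_dests = {p.dest for p in pending[:idx]}
            if all(s not in older_dests and ready_at.get(s, 0) <= cycle
                   for s in ins.srcs):
                ready_at[ins.dest] = cycle + ins.latency
                finish = max(finish, ready_at[ins.dest])
                del pending[idx]
                break
        cycle += 1
    return finish


# Hypothetical program: a slow load, one dependent add, two independent adds.
PROGRAM = [
    Instr("r1", (), 4),            # load r1 from memory (assumed 4 cycles)
    Instr("r2", ("r1", "r1"), 1),  # needs the loaded value
    Instr("r3", ("r4", "r5"), 1),  # independent of the load
    Instr("r6", ("r3", "r4"), 1),  # depends only on r3
]

if __name__ == "__main__":
    print("in-order:", cycles_to_finish(PROGRAM, out_of_order=False))     # 7
    print("out-of-order:", cycles_to_finish(PROGRAM, out_of_order=True))  # 5
```

In this example the in-order model stalls for three cycles behind the load, while the out-of-order model uses those cycles to issue the two independent additions.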
In addition, it is known for processors to support simultaneous multithreading (SMT), which is a technique for improving the overall efficiency of processors. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures. In SMT, instructions from more than one thread can be executed in any given pipeline stage at a time. This is done without relatively large changes to the basic processor architecture. The main additions needed are the ability to fetch instructions from multiple threads in a cycle, and a larger register file to hold data from multiple threads. Most SMT implementations support two threads.
A state-of-the-art in-order or out-of-order processor typically utilizes a fixed number of architectural, or architected, operands, for example, 64 architected registers for each thread. This leads to 256 architected registers in the case of a four-way SMT processor with an instruction set that defines 64 registers per thread. In an out-of-order processor, the rename space also increases the demand for registers. This leads to a register file in the processor with a relatively large number of entries. The register file bandwidth, both read and write, is limited; thus, instructions in the issue queues that are ready for execution may be held back rather than issued. As the number of entries in a register file increases, the number of read and write ports that its design can provide may have to be limited for the register file to remain functional. The compiler generally has relatively good knowledge of register usage. For example, a “memfree” command may be used to free up system memory.
However, with these techniques there is no way to let the hardware know that a register will no longer be needed. The available hardware is typically sized for a worst-case scenario, but in general it is not used as efficiently as possible. That is, every instruction writes its results back into the register file. Issue slots may be wasted because of register file limitations, in addition to limited issue or read bandwidth. It is assumed that all threads executing simultaneously will need 64 architected registers plus rename registers. Intermediate results and operands are typically used only once, yet they are saved until overwritten, while rename buffers are kept until completion.
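The register counts discussed above follow from simple arithmetic, sketched here in Python. The function names are hypothetical, and the rename-register count of 64 is an assumed figure chosen only for illustration; the text above does not state how many rename registers a given design provides.

```python
def architected_registers(threads: int, regs_per_thread: int) -> int:
    """Architected registers needed when every thread gets a full set."""
    return threads * regs_per_thread


def register_file_entries(threads: int, regs_per_thread: int,
                          rename_regs: int) -> int:
    """Total entries: architected registers plus rename registers.

    rename_regs is an assumed parameter for an out-of-order design.
    """
    return architected_registers(threads, regs_per_thread) + rename_regs


if __name__ == "__main__":
    # Four-way SMT with 64 registers per thread, as in the text above.
    print(architected_registers(4, 64))        # 256
    # Adding an assumed 64 rename registers.
    print(register_file_entries(4, 64, 64))    # 320
```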