The invention generally relates to processors, and in particular to multi-threading techniques for a processor utilizing a replay queue.
The primary function of most computer processors is to execute a stream of computer instructions that are retrieved from a storage device. Many processors are designed to fetch an instruction and execute that instruction before fetching the next instruction. Therefore, with these processors, there is an assurance that any register or memory value that is modified or retrieved by a given instruction will be available to instructions following it. For example, consider the following set of instructions:
1) Load memory-1xe2x86x92register-X;
2) Add1 register-X register-Yxe2x86x92register-Z;
3) Add2 register-Y register-Zxe2x86x92register-W.
The first instruction loads the content of memory-1 into register-X. The second instruction adds the content of register-X to the content of register-Y and stores the result in register-Z. The third instruction adds the content of register-Y to the content of register-Z and stores the result in register-W. In this set of instructions, instructions 2 and 3 are considered xe2x80x9cdependentxe2x80x9d instructions that are dependent on instruction 1. In other words, if register-X is not loaded with valid data in instruction 1 before instructions 2 and 3 are executed, instructions 2 and 3 will generate improper results. With the traditional xe2x80x9cfetch and executexe2x80x9d processors, the second instruction will not be executed until the first instruction has properly executed. For example, the second instruction may not be dispatched to the processor until a cache hit/miss signal is received as a result of the first instruction. Further, the third instruction will not be dispatched until an indication that the second instruction has properly executed has been received. Therefore, it can be seen that this short program cannot be executed in less time than T=L1+L2+L3, where L1, L2 and L3 represent the latency of the three instructions. Hence, to ultimately execute the program faster, it will be necessary to reduce the latencies of the instructions.
Therefore, there is a need for a computer processor that can schedule and execute instructions with improved speed to reduce latencies.
According to an embodiment of the present invention, a processor is provided that includes an execution unit to execute instructions and a replay system coupled to the execution unit to replay instructions which have not executed properly. The replay system includes a checker to determine whether each instruction has executed properly and a plurality of replay queues. Each replay queue is coupled to the checker to temporarily store one or more instructions for replay.