1. Field of the Invention
This invention relates in general to the field of instruction execution in a pipeline processing system, and more particularly to a method and apparatus which fast fills an instruction queue.
2. Description of the Related Art
Modern computer systems utilize a number of different processor architectures to perform program execution. In conventional microprocessor-based systems, a computer program is made up of a number of macro instructions that are provided to the microprocessor for execution. The microprocessor decodes each macro instruction into a sequence of micro instructions, i.e., simple machine instructions that the hardware in the microprocessor can understand, and executes all of the micro instructions in the sequence before decoding another macro instruction.
In more advanced computer systems, another type of microprocessor, called a "pipeline" processor, is used. A pipeline processor decodes macro instructions, similar to those of a conventional microprocessor, into a sequence of micro instructions. However, the micro instructions in the sequence are overlapped during execution to improve performance. Such overlapping of micro instructions during execution is known as "pipelining". Pipelining is a key implementation technique used to make fast microprocessors.
A pipeline is like an assembly line. Each step in a pipeline operates in parallel with other steps, though on a different micro instruction. Like the assembly line, different steps are completing different parts of a macro instruction in parallel. Each of these steps is called a pipe stage or a pipe segment. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end. For a more detailed discussion of pipelining, see Computer Architecture: A Quantitative Approach, by John L. Hennessy and David A. Patterson, 2nd ed.
The beginning stage of a pipeline processor is known as the "fetch" stage. In this stage, macro instructions are fetched from memory and placed into a buffer which feeds the next stage in the pipeline, typically the "translate/decode" stage. The translate/decode stage translates the macro instructions, one at a time, into a sequence of micro instructions, which are provided, one at a time, to an instruction register. The instruction register is a register which provides temporary storage for micro instructions. The instruction register provides the micro instructions, one at a time, to later stages in the pipeline, for execution.
Flow of instructions through a pipeline is typically controlled by a system clock, or processor clock signal. For example, during a first clock cycle, a first macro instruction may be fetched from memory. By the end of the clock cycle, the first macro instruction is placed into the buffer which feeds the translate/decode stage. During a second clock cycle, a second macro instruction may be fetched and placed into the buffer. In addition, and in parallel to the second macro instruction fetch, the first macro instruction is "read" by the translate/decode logic, and translated into a sequence of micro instructions. By the end of the second clock cycle, the first micro instruction in the sequence is provided to the instruction register. During a third clock cycle, the first micro instruction is provided to later stages in the pipeline, and a second micro instruction is stored in the instruction register. This pipeline process continues indefinitely.
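The cycle-by-cycle flow described above can be sketched in a few lines of Python. This is only an illustrative model, not part of the described apparatus: the function name `trace_pipeline` and the labels `macroN`/`microN` are hypothetical, and the sketch assumes, for simplicity, that every macro instruction translates into exactly one micro instruction and that no stage ever takes more than one clock cycle.

```python
def trace_pipeline(num_cycles):
    """Return one row per clock cycle:
    (cycle, macro being fetched, macro being translated,
     micro held in the instruction register at the end of the cycle,
     micro handed to later stages during the cycle).

    Assumption (not from the source): one micro instruction per macro
    instruction, and every stage finishes in a single cycle.
    """
    rows = []
    for cycle in range(1, num_cycles + 1):
        # Fetch stage: a new macro instruction every cycle.
        fetched = f"macro{cycle}"
        # Translate/decode stage: reads the macro fetched in the previous cycle.
        translated = f"macro{cycle - 1}" if cycle >= 2 else None
        # Instruction register: loaded by the end of the translate cycle.
        in_register = f"micro{cycle - 1}" if cycle >= 2 else None
        # Later stages: receive what the register held one cycle earlier.
        issued = f"micro{cycle - 2}" if cycle >= 3 else None
        rows.append((cycle, fetched, translated, in_register, issued))
    return rows


if __name__ == "__main__":
    for row in trace_pipeline(3):
        print(row)
```

In the third row of the trace, the first micro instruction is leaving the instruction register while the second is being stored in it, which matches the third clock cycle of the example.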
As long as the fetch stage of the pipeline continues to fetch macro instructions during each clock cycle, and as long as the translate/decode stage is able to translate the macro instructions into micro instructions, and provide a micro instruction to the instruction register during each clock cycle, then the pipeline of the processor stays full, at least with respect to the fetch and translate/decode stages. However, in many instances, the fetch stage or the translate/decode stage is not able to perform its task within the allotted time, i.e., within a single clock cycle. For example, the fetch stage may be required to fetch a macro instruction from a memory location which is not readily accessible. For reasons known to one skilled in the art, the macro instruction may not exist in the instruction cache (e.g., fast memory which temporarily stores instructions for the processor), but rather, the macro instruction may be located in system memory, or possibly even on permanent memory such as a hard disk. Thus, the fetch stage of the pipeline may require many clock cycles to retrieve the needed macro instruction. In a similar fashion, the translate/decode logic may not be able to completely decode a macro instruction and provide a micro instruction to the instruction register within a single clock cycle.
When stages in a pipeline are not able to complete their tasks within a single processor cycle, "holes" in the pipeline are created. As in an assembly line, when one of the stages in the line halts, it backs up all earlier stages in the line. However, later stages in the line continue to completion. In a pipeline processor, if the translate/decode stage cannot provide a micro instruction to the instruction register during a single clock cycle, but can provide the micro instruction to the instruction register during a second clock cycle, then a hole of one clock cycle now exists between the instruction register and later stages in the pipeline. When holes are created in a pipeline, performance of the processor is degraded accordingly.
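As a rough illustration of how a slow translate/decode stage produces holes, the following Python sketch lists what leaves the instruction register on each clock cycle. The function name `issue_stream` and its input format are assumptions made for this example only; the model has no instruction queue and assumes each entry in the input is at least 1.

```python
def issue_stream(translate_cycles):
    """translate_cycles[i] is the number of clock cycles the translate/decode
    stage needs to produce micro instruction i (assumed to be >= 1).

    Returns the sequence handed to later stages, one entry per clock cycle.
    None marks a hole: a cycle in which no micro instruction was available.
    """
    stream = []
    for index, cycles in enumerate(translate_cycles):
        # Every cycle beyond the first is a cycle with nothing to issue.
        stream.extend([None] * (cycles - 1))
        stream.append(f"micro{index + 1}")
    return stream


if __name__ == "__main__":
    # The second micro instruction takes two cycles to translate.
    stream = issue_stream([1, 2, 1])
    print(stream)
    print("holes:", stream.count(None))
```

With the second micro instruction needing two cycles, three micro instructions occupy four cycles of the stream, and the extra cycle is the one-cycle hole described above.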
To overcome the performance problem associated with holes in a pipeline, a number of improvements have been made. One such improvement, alluded to above, is to place a high speed instruction cache close to the fetch stage of the processor. By utilizing sophisticated caching schemes, and by providing sufficient memory for the instruction cache, chances of needing an instruction that is not in the cache are reduced.
Another improvement for reducing holes in pipeline processors is to include an instruction queue within the translate/decode stage of a pipeline processor. The instruction queue is positioned in between the translate/decode logic and the instruction register, and is used to temporarily store more than one micro instruction at a time. Typical instruction queues may hold four, or even eight micro instructions. If we assume, for example, that an instruction queue contains micro instructions, and that during a particular clock cycle, the translate/decode logic is not able to provide a micro instruction to the instruction register, then the next micro instruction can be provided by the instruction queue. Then, during the next clock cycle, the translate/decode logic can provide the micro instruction to either the instruction register, or to the instruction queue, as needed. Thus, by using an instruction queue in between the translate/decode logic and the instruction register, a hole in the pipeline is prevented.
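The role of an instruction queue that already holds micro instructions can be sketched as follows. This is a simplified software model under stated assumptions, not the circuit itself: the name `run_with_queue` is hypothetical, the queue is unbounded for brevity, and a micro instruction produced while the queue is empty is modeled as passing straight through to the instruction register in the same cycle.

```python
from collections import deque


def run_with_queue(preloaded, translator_output):
    """Model an instruction queue sitting between the translate/decode logic
    and the instruction register.

    preloaded: micro instructions already in the queue before the first cycle.
    translator_output: one entry per clock cycle; a micro instruction name, or
        None when the translator could not produce one in that cycle.

    Returns what the instruction register receives each cycle (None = hole).
    """
    queue = deque(preloaded)
    issued = []
    for produced in translator_output:
        if produced is not None:
            # New micro instruction goes to the back of the queue. If the
            # queue was empty, it is popped again immediately below, which
            # models delivery directly to the instruction register.
            queue.append(produced)
        # The instruction register takes the oldest waiting micro instruction.
        issued.append(queue.popleft() if queue else None)
    return issued


if __name__ == "__main__":
    # The translator misses the second cycle, but the queue covers for it.
    print(run_with_queue(["m1"], ["m2", None, "m3"]))
```

In this run the translator produces nothing in the second cycle, yet the instruction register still receives a micro instruction every cycle, because the queue was one instruction ahead when the miss occurred.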
What has not yet been discussed, however, is what happens when the instruction queue is empty. As discussed above, the fetch stage is typically able to provide a macro instruction to the translate/decode stage within a single processor cycle. And, the translate/decode stage is able to decode the macro instruction and provide a micro instruction to the instruction register during a single processor cycle. As this process continues, there is no opportunity for an instruction queue to get ahead, so to speak, to be of any value to the pipeline. This is because later stages in the pipeline are continuing to demand the most recent micro instruction from the instruction register. And, the instruction register is continuing to demand the most recent micro instruction out of the translator. Unless the instruction queue can somehow jump ahead of the instruction register, by at least one clock cycle, its contents are of no use.
To get ahead of the instruction register, and the later stages in the pipeline, the only way heretofore used to fill an instruction queue is to wait for a "stall" to occur in later stages of the pipeline. For example, one of the later stages in a pipeline is the data stage. During the data stage, either an ALU operation is executed, or data is retrieved from memory. If a memory retrieval operation occurs, and this operation requires more than one clock cycle to execute, then all earlier stages in the pipeline are stalled, or halted. When the data stage creates a stall in the pipeline, the instruction queue takes advantage of the stall by filling one of its instruction buffers with a micro instruction from the translate/decode logic. If the data stage requires two clock cycles for execution, the instruction queue can get ahead of the pipeline by one instruction. If the data stage requires three clock cycles for execution, the instruction queue can get ahead of the pipeline by two instructions, etc. When the stall condition ceases, micro instructions may be provided by the instruction queue, and the translate/decode logic can continue to fill the queue. If a hole occurs in the fetch stage or the translate/decode stage, the hole is then filled by the instruction queue, as discussed above.
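The relationship between data-stage latency and how far the queue gets ahead can be sketched with a small model. The function name `fill_during_stalls`, the default capacity of four entries, and the assumption that the translate/decode logic always has a micro instruction ready during a stall (and that nothing drains the queue in between) are all choices made for this illustration.

```python
def fill_during_stalls(data_stage_cycles, capacity=4):
    """data_stage_cycles[i] is the number of clock cycles the data stage
    spends on micro instruction i (1 means no stall).

    Returns the number of micro instructions held in the queue after each
    micro instruction finishes the data stage.

    Assumptions (not from the source): the translator supplies one micro
    instruction on every stalled cycle, and no upstream hole drains the
    queue during the run.
    """
    depth = 0
    depths = []
    for cycles in data_stage_cycles:
        stalled = cycles - 1  # cycles in which earlier stages are halted
        # One micro instruction is banked per stalled cycle, up to capacity.
        depth = min(capacity, depth + stalled)
        depths.append(depth)
    return depths


if __name__ == "__main__":
    # No stall, no stall, a two-cycle operation, no stall, a three-cycle one.
    print(fill_during_stalls([1, 1, 2, 1, 3]))
```

A two-cycle data-stage operation adds one entry and a three-cycle operation adds two, as in the text. The first two values in the result stay at zero, which shows that the queue gains nothing until a stall actually occurs.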
A problem with the above technique is that it is dependent on stalls in the pipeline to allow opportunity for the instruction queue to be filled. Thus, in instances where the instruction queue is empty, either at the start of a program, or on program branches, the queue is useless at filling holes until stalls occur in later stages of the pipeline. The instruction queue must be filled to be of any benefit in preventing holes in a pipeline. What is needed is a method for filling the instruction queue without having to wait for stalls in later pipeline stages.