1. Field of the Invention
This invention relates in general to the field of instruction execution in computers, and more particularly to an apparatus in a pipeline microprocessor for providing results of recently executed micro instructions to dependent micro instructions.
2. Description of the Related Art
A microprocessor has an instruction pipeline that sequentially executes instructions from an application program in synchronization with a microprocessor clock. The instruction pipeline is divided into stages, each of which performs a specific task that is part of an overall operation that is directed by a programmed instruction. The programmed instructions in a software application program are executed in sequence by the microprocessor. As an instruction enters the first stage of the pipeline, certain tasks are accomplished. The instruction is then passed to subsequent stages for accomplishment of subsequent tasks. Following completion of a final task, the instruction completes execution and exits the pipeline. Execution of programmed instructions by a pipeline microprocessor is very much analogous to the manufacture of items on an assembly line.
Early pipeline microprocessors were not sophisticated enough to have multiple instructions executing in different pipeline stages at the same time. Consequently, a given instruction would be fetched from memory and would proceed through the various pipeline stages until it completed execution. Following this, a next instruction would proceed through the various stages through completion. And because early pipeline microprocessors had only a few pipeline stages, the inefficient utilization of stage resources was not deemed to be significant.
As uses for microprocessors started to proliferate, more stringent requirements began to be imposed, particularly with respect to speed. And the obvious approach for increasing processing speed was to allow multiple instructions to proceed down the pipeline at the same time. Clearly, this change increased instruction throughput because resources within each pipeline stage were more efficiently used. But with this change came a problem: What if one instruction executing in an early pipeline stage required a result that was yet to be provided by another instruction executing in a later pipeline stage? This situation is common in software programs; instructions that are close in proximity tend to perform tasks using the same operand. For example, a control algorithm may compute a true error signal value by adding a small number to a current error signal value and then comparing this value to some other signal that is input to the microprocessor. The structure of the algorithm is to add a first operand to a second operand to produce a result. The result is then tested to see if the computed value is tolerable. If not, then the first operand is added to the computed result to obtain a second result. The second result is tested. And so on. Even in this simple algorithm it is evident that every other instruction utilizes the last computed result. When a given instruction executing in one stage of the pipeline requires a result that is to be provided by another instruction executing in a subsequent stage of the pipeline, the given instruction is referred to as a dependent instruction. This is because the operand required by the dependent instruction depends upon generation of the result by the other instruction.
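The control algorithm described above can be sketched in a few lines to make the dependency chain visible. This is an illustrative sketch only: the function name `refine_error`, its parameters, and the iteration limit are hypothetical and do not come from the description.

```python
def refine_error(current_error, step, reference, tolerance, max_adds=100):
    """Add `step` to a running result until it is within `tolerance`
    of `reference`. Returns (result, number_of_adds).

    All names here are hypothetical; only the add/test/add pattern
    follows the description.
    """
    result = current_error + step      # first operand added to second operand
    adds = 1
    # Each test reads the result produced by the add just before it,
    # and each following add reads that same result again.
    while abs(result - reference) > tolerance and adds < max_adds:
        result = result + step         # dependent on the previous result
        adds += 1
    return result, adds


print(refine_error(0, 1, 5, 0))  # (5, 5)
```

Every add and every test in the loop consumes the value written by the instruction immediately ahead of it, which is the situation the term "dependent instruction" names.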
To deal with dependent instructions, microprocessor designers added interlock logic to existing pipeline designs. The interlock logic spanned the stages of a microprocessor where the dependency situation could occur. During normal operation, non-dependent instructions were successively advanced through the pipeline stages in synchronization with the clock. When the interlock logic encountered a dependent instruction, it simply stalled execution of the dependent instruction by inserting slips into the pipeline, until the result required by the dependent instruction was generated and made available for retrieval as an operand.
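The cost of an interlock can be estimated with a toy timing model. The sketch below assumes, purely for illustration, that one instruction enters the pipeline per clock, that stages are numbered from zero, that a result becomes readable one clock after the producing instruction occupies the write-back stage, and that the dependent instruction reads its operand in a fixed earlier stage. None of these numbers or names come from the description.

```python
def interlock_slips(read_stage, writeback_stage, distance):
    """Number of slips an interlock inserts ahead of a dependent instruction.

    read_stage      -- stage index where operands are read (assumed)
    writeback_stage -- stage index where the result is written (assumed)
    distance        -- how many instructions behind the producer the
                       dependent instruction entered the pipeline
    """
    ready_cycle = writeback_stage + 1      # first clock the result is readable
    arrival_cycle = distance + read_stage  # clock the consumer reaches read_stage
    return max(0, ready_cycle - arrival_cycle)


print(interlock_slips(read_stage=2, writeback_stage=4, distance=1))  # 2
print(interlock_slips(read_stage=2, writeback_stage=4, distance=3))  # 0
```

Under these assumptions an immediately following dependent instruction is held for two clocks, while one that trails by three instructions is not held at all.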
In spite of the advances provided through interlock logic, demands for faster throughput continued to press microprocessor designers. Consequently, an alternative to interlock logic was developed that allowed dependent instructions to proceed down the pipeline without incurring slips. This alternative is known as a result distribution bus, or a bypass bus, or a result forward bus. In essence, the result distribution bus originates at the stage of the pipeline in which results are generated. When a result is generated, it is copied to the bus and then routed to all of the earlier stages that are affected by dependent instructions. If a dependent instruction is present within any of the earlier stages, then logic within that stage performs all of the operations necessary to properly configure the required operand from the provided result. The result distribution approach can be thought of as a one-to-many distribution scheme because one result can be distributed to several dependent instructions at the same time.
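The one-to-many character of a result distribution bus can be illustrated with a small sketch. The stage names, the dictionary layout, and the function `broadcast_result` are hypothetical; the sketch also simplifies operand configuration to a plain copy of the result.

```python
def broadcast_result(dest_reg, value, earlier_stages):
    """Copy a newly generated result to every earlier stage that holds an
    instruction whose source register matches `dest_reg`.

    earlier_stages maps a stage name to either None (empty stage) or a
    dict with keys 'src' and 'operand'. Returns the names of the stages
    that captured the result.
    """
    captured = []
    for name, instr in earlier_stages.items():
        if instr is not None and instr["src"] == dest_reg:
            instr["operand"] = value   # per-stage logic configures the operand
            captured.append(name)
    return captured


stages = {
    "translate": {"src": "r1", "operand": None},
    "register": {"src": "r2", "operand": None},
    "address": {"src": "r1", "operand": None},
}
print(broadcast_result("r1", 7, stages))  # ['translate', 'address']
```

Note that the loop visits every earlier stage, so each stage needs its own matching and capture logic. This is the per-stage cost that grows as stages are added.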
The result distribution scheme has prevailed as the principal technique for expediting the execution of dependent instructions, that is, until more recent times when demands for further throughput increases have compelled microprocessor designers to substantially alter the design of stages within the pipeline. These alterations to the pipeline can be comprehended through use of an assembly line analogy. Suppose an assembly line is set up with three stages, where each of the three stages is required to insert two screws in a product that flows down the line, for a total of six screws. Further suppose that the time required to insert a screw is one minute. To send a product through the assembly line, then, requires six minutes. If multiple products are sent down the line, then it follows that one product rolls off the line every two minutes.
A simple enhancement to the line will double the production throughput: Reconfigure the line into six stages, where each stage is required to insert only one screw. While with this architectural change it still takes six minutes to pass a product through the line, the improvement now is that one product rolls off of the line every minute. The speed and throughput are doubled by doubling the number of stages and halving the operations performed in each stage.
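The arithmetic of the assembly line analogy can be checked with a short calculation. The function name `line_timing` is hypothetical, and the sketch assumes the work divides evenly among the stages, as it does in both configurations described above.

```python
def line_timing(total_screws, stages, minutes_per_screw=1):
    """Return (latency, interval) in minutes for an assembly line.

    latency  -- time for one product to pass through the whole line
    interval -- time between finished products once the line is full
    """
    if total_screws % stages != 0:
        raise ValueError("work must divide evenly among the stages")
    stage_time = (total_screws // stages) * minutes_per_screw
    latency = stage_time * stages
    interval = stage_time              # the line advances once per stage time
    return latency, interval


print(line_timing(6, 3))  # (6, 2): six minutes per product, one every two minutes
print(line_timing(6, 6))  # (6, 1): six minutes per product, one every minute
```

The latency is unchanged at six minutes in both cases; only the interval between finished products is halved.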
To improve the throughput of current microprocessors, designers are taking the very same approach: pipeline stages are being added and the functional requirements for each stage are being decreased. Thus, faster clock speeds can be applied and instruction throughput is increased.
But increasing the number of pipeline stages has highlighted a deficiency with the result distribution technique for dealing with dependent instructions. Because early microprocessor pipelines consisted of only a few stages, the attendant logic required to implement a result distribution bus was not much of an issue. But for every added stage in the execution pipeline, an additional set of result distribution logic must be provided. In other words, the logic required to implement a result distribution bus is directly proportional to the number of stages that require distribution of the result. Add to this the fact that more stages require more powerful logic elements to drive the result signals. Moreover, the timing to distribute results to multiple pipeline stages is not only a function of the number of stages, but is also based upon the location of the stage that is physically farthest from the origination stage. Hence, circuit complexity, power, and timing problems arise when a result distribution scheme is applied to more advanced pipeline architectures that have more pipeline stages.
Therefore, what is needed is an apparatus in a pipeline microprocessor that allows dependent instructions to execute without delay, but which is not adversely affected by the number of stages in the microprocessor pipeline.
In addition, what is needed is a mechanism to provide dependent micro instructions with generated results that does not require additional sets of logic, or exhibit timing problems when employed in a pipeline microprocessor having multiple pipeline stages.
Furthermore, what is needed is an apparatus in a pipeline microprocessor for temporarily storing several intermediate results that can be accessed in a single pipeline stage by a dependent micro instruction.
To address the above-detailed deficiencies, it is an object of the present invention to provide a mechanism for expeditiously executing dependent instructions that can adapt, without adverse hardware, power consumption, or timing consequences, to advanced pipeline architectures having more pipeline stages.
Accordingly, in the attainment of the aforementioned object, it is a feature of the present invention to provide a microprocessor apparatus for providing intermediate results to a dependent micro instruction. The microprocessor apparatus includes a result forwarding cache, result update logic, and operand configuration logic. The result forwarding cache stores the intermediate results, where the intermediate results are results that have been generated by preceding micro instructions, but which have not yet been written to corresponding operand registers in a register file for retrieval by following micro instructions. The result forwarding cache has cache buffers, where each of the cache buffers corresponds to one of the intermediate results. Each of the cache buffers has a result field for storing a corresponding intermediate result generated by execution of a corresponding preceding micro instruction, a tag field for identifying a corresponding operand register in the register file into which the corresponding intermediate result is to be entered, and a byte enable field for designating bytes in the result field that are valid when the corresponding intermediate result is entered into the result forwarding cache. The result update logic is coupled to the result forwarding cache and enters the intermediate results into the result forwarding cache. The operand configuration logic is coupled to the result forwarding cache. The operand configuration logic accesses the intermediate results and provides an operand required by the dependent micro instruction, where the operand configuration logic configures the operand from the intermediate results in the result forwarding cache and provides the operand to the dependent micro instruction prior to execution of the dependent micro instruction.
An advantage of the present invention is that only minimal changes are required to provide results to dependent instructions when pipeline stages are added.
Another object of the present invention is to provide an apparatus for executing dependent instructions without delay that is less complex than has heretofore been provided.
In another aspect, it is a feature of the present invention to provide an intermediate result cache for storing intermediate results of previously executed micro instructions, the intermediate results being entered into the intermediate result cache prior to being entered into a register file. The intermediate result cache has result buffers, tag buffers, valid byte indicators, a result selector, and operand configuration logic. The result buffers store the intermediate results. The tag buffers are coupled to the result buffers, each of the tag buffers designating a corresponding register in the register file within which a corresponding intermediate result is to be entered. The valid byte indicators are coupled to the result buffers, each of the valid byte indicators indicating which bytes in a corresponding intermediate result buffer are valid upon entry of the corresponding intermediate result. The result selector is coupled to the tag buffers and the valid byte indicators. The result selector determines selected byte locations within selected result buffers that are used to configure an operand for a dependent micro instruction. The operand configuration logic is coupled to the result selector and the result buffers. The operand configuration logic retrieves bytes from the selected byte locations within the selected result buffers to configure the operand.
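The buffer layout and byte selection described above can be modeled in software as a rough sketch. Several details below are assumptions made only for illustration and are not stated in the description: a 32-bit result field, a 4-bit byte enable mask in which bit i marks byte i as valid, buffers ordered newest first, register names such as `eax`, and a fallback to the register file value for any byte that no buffer supplies. The names `CacheBuffer` and `configure_operand` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CacheBuffer:
    tag: str          # destination register the result will be written to
    result: int       # result field (assumed 32 bits wide)
    byte_enable: int  # assumed 4-bit mask: bit i set means byte i is valid


def configure_operand(buffers, reg, regfile_value, width_bytes=4):
    """Build an operand for `reg` byte by byte.

    `buffers` is assumed to be ordered newest first. For each byte
    position, the newest buffer whose tag matches and whose byte enable
    bit is set supplies the byte; otherwise the register file value does.
    """
    operand = 0
    for i in range(width_bytes):
        byte = (regfile_value >> (8 * i)) & 0xFF        # default source
        for buf in buffers:                             # result selection
            if buf.tag == reg and (buf.byte_enable >> i) & 1:
                byte = (buf.result >> (8 * i)) & 0xFF
                break
        operand |= byte << (8 * i)
    return operand


buffers = [
    CacheBuffer("eax", 0x000000AA, 0b0001),  # newest: only the low byte is valid
    CacheBuffer("eax", 0x11223344, 0b1111),  # older: all four bytes are valid
]
print(hex(configure_operand(buffers, "eax", 0)))  # 0x112233aa
```

In this example the low byte comes from the newest buffer and the upper three bytes come from the older one, so a single lookup assembles an operand from two intermediate results.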
Another advantage of the present invention is that a mechanism is provided that is suited for multi-stage pipeline designs.
A further object of the invention is to provide a mechanism for accessing results in a single pipeline stage whereby a dependent instruction can execute without incurring slips.
In a further aspect, it is a feature of the present invention to provide an apparatus in a pipeline microprocessor for providing an operand to a dependent micro instruction. The apparatus includes execution logic, result cache entries, result update logic, and operand configuration logic. The execution logic generates intermediate results of micro instructions that precede the dependent micro instruction. The result cache entries store the intermediate results. The result update logic is coupled to the execution logic and the result cache entries. The result update logic enters a specific intermediate result into a specific result cache entry following generation of the specific intermediate result by the execution logic and prior to the specific intermediate result being written to a specific destination register in a register file. The operand configuration logic is coupled to the result cache entries and selects bytes from selected result cache entries to configure the operand.
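The ordering described above, in which a result enters a result cache entry after it is generated but before it is written to the register file, can be sketched as a toy model. The class `ToyPipeline`, the two-clock write-back delay, the four-entry cache, and the whole-register granularity are all assumptions for illustration; byte selection is deliberately left out.

```python
from collections import deque


class ToyPipeline:
    def __init__(self, writeback_delay=2, cache_entries=4):
        self.writeback_delay = writeback_delay
        self.register_file = {}
        # Newest entry sits at the left; the oldest falls off the right.
        self.result_cache = deque(maxlen=cache_entries)
        self.pending = deque()  # [clocks_left, dest, value] awaiting write-back

    def execute(self, dest, value):
        """Execution logic generated `value`; result update logic caches it."""
        self.result_cache.appendleft((dest, value))
        self.pending.append([self.writeback_delay, dest, value])

    def tick(self):
        """Advance one clock and retire results whose delay has elapsed."""
        for entry in self.pending:
            entry[0] -= 1
        while self.pending and self.pending[0][0] <= 0:
            _, dest, value = self.pending.popleft()
            self.register_file[dest] = value

    def read_operand(self, reg):
        """Prefer the newest matching cache entry over the register file."""
        for tag, value in self.result_cache:
            if tag == reg:
                return value
        return self.register_file.get(reg, 0)


pipe = ToyPipeline()
pipe.register_file["r1"] = 10
pipe.execute("r1", 11)
print(pipe.read_operand("r1"), pipe.register_file["r1"])  # 11 10
```

A dependent instruction reading `r1` immediately after the producer executes sees 11 from the cache, even though the register file still holds 10 until two clocks have passed.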
A further advantage of the present invention is that provision of results to dependent instructions can be accomplished in more advanced pipeline microprocessors without incurring problems related to circuit complexity, routing, power, or timing.