1. Field of the Present Invention
The present invention generally relates to the field of microprocessors and more particularly to a microprocessor architecture supporting a variable cycle instruction reject delay to improve processor performance.
2. History of Related Art
The speed of high performance superscalar microprocessors (processors), measured in terms of the frequency of the processor's clock signal, is rapidly migrating from the MHz range to the GHz range. As cycle times decrease with ever increasing clock rates, the number of levels of logic allowable in the design of any pipeline stage is extremely limited. This limited number of logic levels must be optimized to accomplish the most common tasks within the time limits imposed by the operating frequency. As an example, the pipeline of a processor's load/store unit (LSU) must be capable of successfully completing a load instruction in each cycle as long as the load instructions hit in the processor's L1 cache. Inevitably, however, less frequently occurring conditions cannot be resolved within the timing constraints imposed by the system. In a conventional processor, the determination of whether to reject an instruction is made when the instruction is in a final stage (the finish stage) of the pipeline. If, for any number of reasons, the functional unit in which the instruction is executing lacks sufficient information to determine that the instruction should be completed when the instruction reaches the finish stage, the instruction must be rejected. Thus, it will be appreciated that conventionally designed processors typically employ a fixed timing reject mechanism in which the reject decision is made a predetermined and non-varying number of cycles after the instruction issues.
Turning to FIG. 3, a timing diagram illustrating the operation of a fixed timing reject mechanism of a conventional processor is presented. In cycle 1 of the timing diagram, an instruction indicated by reference numeral 301 is issued and begins to flow through the pipeline. If the instruction contains a reference to a location in memory, the processor must initiate the process of determining whether valid data for the referenced memory address is available in the processor's L1 data cache. This process may include an address translation component, in which the address recited in the instruction (the effective address) is translated to an address corresponding to a physical memory location (the real address), and an L1 cache retrieval component, in which the address tags of the L1 cache are compared against the address of the memory reference and data is returned from the L1 cache. In the depicted example, a miss signal 303 is asserted to indicate that the data retrieval process failed to complete successfully. The miss signal 303 may reflect a variety of conditions that caused the instruction not to complete successfully. In one case, as an example, miss signal 303 may indicate that the effective to real address translation (ERAT) process could not complete in the time it takes instruction 301 to propagate through the pipeline. When this occurs, the processor must initiate a relatively time consuming retrieval of address translation information. Because the address translation information is not available when instruction 301 arrives at the finish stage in cycle 6, a reject signal indicated in FIG. 3 by reference numeral 307 is asserted. In response to reject signal 307, the processor reissues instruction 301 in the next cycle (cycle 7) and the instruction begins to propagate through the pipeline again.
If the number of cycles required to retrieve the address translation information initiated by miss signal 303 is greater than the depth of the pipeline (in stages), the address translation information will not be available when instruction 301 reaches the finish stage for a second time in cycle 12. Accordingly, the instruction is rejected in cycle 12 and issued for a third time in cycle 13. When instruction 301 reaches the finish stage in cycle 18, the necessary translation information has had sufficient time to be retrieved and the instruction can complete successfully. Because a reject decision had to be made as soon as the instruction reached the finish stage of the pipeline, instruction 301 was rejected twice and was required to travel the LSU pipeline three times. More generally it can be said that the fixed timing reject mechanism of conventional processors forces an all-or-nothing decision when an instruction reaches the finish stage of a pipeline. If any information or resource necessary to complete the instruction is unavailable in the cycle that the instruction reaches the finish stage, the instruction is rejected. Moreover, whenever an instruction is rejected, completion of that instruction will be delayed by at least the number of stages in the pipeline. If a pipeline includes six stages, an instruction that is rejected in cycle X cannot complete until, at the earliest, cycle X+6. If the instruction is rejected again in cycle X+6, the next earliest cycle in which the instruction could complete would be cycle X+12 and so forth. In other words, one can think of the processor as having an "instruction period" or "instruction cycle" that is equal to the number of pipeline stages in the processor. In a conventional, fixed timing reject processor, the reject decision is made at the end of each instruction period.
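The fixed timing behavior described above can be illustrated with a short sketch. It is a simplified model, not the circuit of FIG. 3: the function name, its parameters, and the assumption that an instruction completes whenever the missing information is available no later than the finish cycle are all illustrative choices made here.

```python
def fixed_reject_completion(ready_cycle, depth=6, first_issue=1):
    """Model a fixed timing reject mechanism (illustrative assumption).

    ready_cycle: cycle in which the missing information becomes available.
    depth:       number of pipeline stages (the "instruction period").
    first_issue: cycle in which the instruction first issues.

    Returns (completion_cycle, reject_count).
    """
    finish = first_issue + depth - 1   # finish stage of the first pass
    rejects = 0
    # A reject decision is forced every time the finish stage is reached
    # before the information is available; the instruction reissues in the
    # next cycle, so the next finish stage is exactly `depth` cycles later.
    while finish < ready_cycle:
        rejects += 1
        finish += depth
    return finish, rejects


# Six-stage pipeline, information available in cycle 15:
# rejected in cycles 6 and 12, completes in cycle 18.
print(fixed_reject_completion(15))
```

With the default six-stage pipeline, any `ready_cycle` from 13 through 18 yields the same completion cycle, which is the all-or-nothing effect the text describes.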
It will be appreciated, however, that in some cases, the information or resource that is lacking at the time an instruction reaches its decision point (i.e., the finish stage) may be available before the end of the next instruction period. In this case, performance is negatively impacted because the architecture inhibits completion of the instruction until the end of the next instruction period. As an example, consider a processor with a six cycle instruction period in which the retrieval of address translation information (when the information is not immediately available in an address translation cache) requires ten cycles and the retrieval process is not initiated until the fifth cycle of the instruction period, when the processor determines that the address translation information is not locally available (i.e., is not cached). If the retrieval of the address translation information is initiated in cycle 5, the information will not be available until cycle 15, which falls in the middle of an instruction period. In this case, completion of the instruction is delayed further by the number of cycles between the time when all information is available to complete the instruction (cycle 15 in the example) and the end of that instruction period (cycle 18). Therefore, it would be beneficial to implement an architecture that eliminated the performance penalty resulting from the constraint of requiring a reject decision in the cycle when an instruction reaches the finish stage.
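The arithmetic of this example can be checked with a minimal sketch. The function name and parameters are hypothetical, and the model assumes the instruction first issues in cycle 1, so that instruction periods end in cycles that are multiples of the period length.

```python
def fixed_reject_penalty(start_cycle, latency, period=6):
    """Return (ready_cycle, completion_cycle, wasted_cycles).

    start_cycle: cycle in which the retrieval is initiated.
    latency:     number of cycles the retrieval requires.
    period:      instruction period (pipeline depth), assumed to start
                 in cycle 1 so that periods end at multiples of `period`.
    """
    ready = start_cycle + latency
    # Round up to the end of the instruction period containing `ready`,
    # since a fixed timing mechanism can only complete at a finish stage.
    completion = -(-ready // period) * period
    return ready, completion, completion - ready


# Retrieval initiated in cycle 5 with a ten cycle latency:
# information ready in cycle 15, completion in cycle 18, 3 cycles lost.
print(fixed_reject_penalty(5, 10))
```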
The problems identified above are in large part addressed by a processor implementing a delayed reject mechanism. The processor includes an issue unit suitable for issuing an instruction in a first cycle and a load store unit (LSU). The LSU includes an extend reject calculator circuit configured to receive a set of completion information signals and to generate a delay value based thereon. The LSU is adapted to determine whether to reject the instruction in a determination cycle. The number of cycles between the first cycle and the determination cycle is a function of the delay value such that reject timing is variable with respect to the first cycle. In one embodiment, the processor is further configured to reissue the instruction after the determination cycle if the instruction was rejected in the determination cycle. The delay value is conveyed via a 2-bit bus in one embodiment. The 2-bit bus permits delaying the determination cycle from 0 to 3 cycles after the finish cycle. In one embodiment, the number of cycles between the first cycle and the determination cycle includes the number of cycles required to travel a pipeline of the microprocessor plus the number of cycles indicated by the delay value.
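The relationship between the delay value and the determination cycle can be sketched as follows. This is a hedged illustration only: `choose_delay` is a hypothetical stand-in for the extend reject calculator, whose actual inputs and logic are not specified in this section, and the convention that an instruction issued in cycle 1 of a six-stage pipeline finishes in cycle 6 is carried over from the FIG. 3 example.

```python
MAX_DELAY = 3  # largest value a 2-bit delay bus can convey


def determination_cycle(first_cycle, depth, delay):
    """Cycle in which the reject decision is made.

    The finish cycle is first_cycle + depth - 1; the delay value moves
    the decision 0 to 3 cycles past the finish cycle.
    """
    if not 0 <= delay <= MAX_DELAY:
        raise ValueError("delay must fit on a 2-bit bus (0..3)")
    return first_cycle + depth - 1 + delay


def choose_delay(finish_cycle, ready_cycle):
    """Hypothetical delay policy (an assumption, not the patented circuit).

    Wait just long enough for the missing information if that takes at
    most MAX_DELAY cycles; otherwise decide at the finish cycle.
    """
    wait = max(0, ready_cycle - finish_cycle)
    return wait if wait <= MAX_DELAY else 0


# Instruction reissued in cycle 7 of a six-stage pipeline finishes in
# cycle 12. If the information is ready in cycle 15, a delay of 3 moves
# the decision to cycle 15, so no further pass through the pipeline is needed.
delay = choose_delay(finish_cycle=12, ready_cycle=15)
print(delay, determination_cycle(7, 6, delay))
```

Under this simplified model, the instruction from the earlier ten cycle retrieval example could be resolved in cycle 15 rather than cycle 18.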