1. Field of the Invention
This invention relates in general to the field of pipelined microprocessors, and more particularly to forwarding storehit data between stages of microprocessor pipelines.
2. Description of the Related Art
A microprocessor has an instruction pipeline that sequentially executes instructions from an application program in synchronization with a microprocessor clock. The instruction pipeline is divided into stages, each of which performs a specific task that is part of an overall operation that is directed by a programmed instruction. The programmed instructions in a software application program are executed in sequence by the microprocessor. As an instruction enters the first stage of the pipeline, certain tasks are accomplished. The instruction is then passed to subsequent stages for accomplishment of subsequent tasks. Following completion of a final task, the instruction completes execution and exits the pipeline. Execution of programmed instructions by a pipeline microprocessor is very much analogous to the manufacture of items on an assembly line.
Early pipeline microprocessors were not sophisticated enough to have multiple instructions executing in different pipeline stages at the same time. Consequently, a given instruction would be fetched from memory and would proceed through the various pipeline stages until it completed execution. Following this, a next instruction would proceed through the various stages through completion. And because early pipeline microprocessors had only a few pipeline stages, the inefficient utilization of stage resources was not deemed to be significant.
As uses for microprocessors started to proliferate, more stringent requirements began to be imposed, particularly with respect to speed. The obvious approach for increasing processing speed was to allow multiple instructions to proceed down the pipeline at the same time. Clearly, this change increased instruction throughput because resources within each pipeline stage were more efficiently used. But with this change came a problem: What if one instruction executing in an early pipeline stage required a result that was yet to be provided by another instruction executing in a later pipeline stage?
This situation is common in software programs; instructions that are in close proximity tend to perform tasks using the same operand. For example, a control algorithm may compute a true error signal value by adding a small number to a current error signal value and then comparing this value to some other signal that is input to the microprocessor. The structure of the algorithm is to add a first operand to a second operand to produce a result. The result is then tested to see if the computed value is tolerable. If not, then the first operand is added to the computed result to obtain a second result. The second result is tested, and so on. Even in this simple algorithm, it is evident that every other instruction utilizes the last computed result. When a given instruction executing in one stage of the pipeline requires a result that is to be provided by another instruction executing in a subsequent stage of the pipeline, the given instruction is referred to as a dependent instruction. This is because the operand required by the dependent instruction depends upon generation of the result by the other instruction.
To deal with dependent instructions, microprocessor designers added interlock logic to existing pipeline designs. The interlock logic spanned the stages of a microprocessor where the dependency situation could occur. During normal operation, non-dependent instructions were successively advanced through the pipeline stages in synchronization with the clock. When the interlock logic encountered a dependent instruction, it simply stalled execution of the dependent instruction by inserting slips into the pipeline, until the result required by the dependent instruction was generated and made available for retrieval as an operand.
In spite of the advances provided through interlock logic, demands for faster throughput continued to press microprocessor designers. Consequently, an alternative to interlock logic was developed that allowed dependent instructions to proceed down the pipeline without incurring slips. This alternative is known as a result distribution bus, or a bypass bus, or a result forward bus. In essence, the result distribution bus originates at the stage of the pipeline in which results are generated. When a result is generated, it is copied to the bus and then routed to all of the earlier stages that are affected by dependent instructions. If a dependent instruction is present within any of the earlier stages, then logic within that stage performs all of the operations necessary to properly configure the required operand from the provided result. The result distribution approach can be thought of as a one-to-many distribution scheme because one result can be distributed to several dependent instructions at the same time.
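The difference between the interlock approach and the result distribution approach can be illustrated with a toy model that counts slips. The sketch below is illustrative only: the five-stage layout, the stage numbers, and the function name slips_needed are assumptions made for this example and do not describe any particular microprocessor.

```python
def slips_needed(distance, need_stage, ready_stage):
    """Count the slips a dependent instruction incurs in a toy in-order pipeline.

    distance    -- how many instructions behind the producer the consumer is (>= 1)
    need_stage  -- stage (1-based) in which the consumer must have its operand
    ready_stage -- stage (1-based) the producer must complete before its
                   result can be obtained by the consumer
    """
    # Without slips, the producer is `distance` stages ahead of the consumer,
    # so it occupies stage need_stage + distance when the operand is needed.
    producer_stage = need_stage + distance
    # The result is usable only once the producer has moved past ready_stage.
    return max(0, ready_stage + 1 - producer_stage)


# Assumed five-stage pipeline: 1 fetch, 2 register read, 3 execute,
# 4 memory, 5 write-back.
# Interlock only: the operand is read in stage 2 and the result can be
# retrieved only after the producer completes stage 5.
interlock_slips = slips_needed(distance=1, need_stage=2, ready_stage=5)

# Result distribution (assumed): the operand is needed in stage 3 and the
# result is placed on the bus as soon as the producer completes stage 3.
forwarding_slips = slips_needed(distance=1, need_stage=3, ready_stage=3)

print(interlock_slips, forwarding_slips)
```

Under these assumptions, a back-to-back dependent instruction incurs three slips with interlock logic alone and none when the result is distributed.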
The result distribution scheme has prevailed as the principal technique for expediting the execution of dependent instructions, until more recent times, when demands for further throughput increases have compelled microprocessor designers to substantially alter the design of stages within the pipeline. These alterations to the pipeline can be comprehended through use of an assembly line analogy. Suppose an assembly line is set up with three stages, where each of the three stages is required to insert two screws in a product that flows down the line, for a total of six screws. Further suppose that the time required to insert a screw is one minute. To send a product through the assembly line, then, requires six minutes. If multiple products are sent down the line, then it follows that one product rolls off the line every two minutes.
A simple enhancement to the line will double the production throughput: Reconfigure the line into six stages, where each stage is required to insert only one screw. With this architectural change it still takes six minutes to pass a product through the line, but one product now rolls off the line every minute. Throughput is doubled by doubling the number of stages and halving the operations performed in each stage.
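The arithmetic of the assembly line analogy can be checked with a short calculation. The function name line_metrics and its parameters are introduced here only for illustration; the sketch assumes the screws are divided evenly among the stages.

```python
def line_metrics(total_screws, stages, minutes_per_screw=1):
    """Return (latency, interval) in minutes for an evenly divided line.

    latency  -- time for one product to pass through the whole line
    interval -- time between finished products once the line is full
    """
    screws_per_stage = total_screws // stages
    stage_time = screws_per_stage * minutes_per_screw
    latency = stage_time * stages
    interval = stage_time  # one product leaves each time a stage completes
    return latency, interval


three_stage = line_metrics(total_screws=6, stages=3)
six_stage = line_metrics(total_screws=6, stages=6)

print(three_stage)  # (6, 2)
print(six_stage)    # (6, 1)
```

In both configurations a product spends six minutes on the line; only the interval between finished products changes, from two minutes to one.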
To improve the throughput of current microprocessors, designers are taking the very same approach: pipeline stages are being added and the functional requirements for each stage are being decreased. Thus, faster clock speeds can be applied and instruction throughput is increased.
But increasing the number of pipeline stages has highlighted a deficiency with the result distribution technique for dealing with dependent instructions. When microprocessor pipelines consisted of only a few stages, the attendant logic required to implement a result distribution bus was not much of an issue. But for every added stage in the execution pipeline, an additional set of result distribution logic must be provided. In other words, the logic required to implement a result distribution bus is directly proportional to the number of stages that require distribution of the result. Add to this the fact that the presence of more stages requires logic elements that drive the result signals to be more powerful. Moreover, the timing to distribute results to multiple pipeline stages is not only a function of the number of stages, but is also based upon the location of the stage that is physically farthest from the origination stage. Hence, circuit complexity, power, and timing problems arise when a result distribution scheme is applied to more advanced pipeline architectures that have more pipeline stages.
One situation in which one instruction executing in a higher pipeline stage requires a result generated by a preceding instruction executing in a lower pipeline stage is referred to as a storehit condition. A storehit condition exists when a load instruction requests data generated as the result of a store instruction executing ahead of the load instruction in the pipeline. That is, the load instruction specifies a load address for load data, wherein the load address matches a store address for store data specified by a previous store instruction, and the store data is still in the microprocessor pipeline, i.e., has not yet been updated in the microprocessor data cache.
In a storehit condition, the load instruction must be provided with coherent data, i.e., the newest data associated with the load address. Thus, the microprocessor cannot supply the data from its data cache or go to system memory to get the data since the newest data is within the pipeline and not in the data cache or system memory. Hence, either the microprocessor stalls to wait for the storehit data to be updated in the data cache, or it forwards the data using one of the techniques described above. Neither alternative is desirable for the reasons described above.
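A storehit check of the kind described above can be sketched as a search of the stores still in the pipeline, newest first, so that the load receives the newest data for its address. The sketch is a simplified software model, not a description of any actual circuit: the names PendingStore, find_storehit, and load_data are hypothetical, and it assumes that loads and stores have the same size and alignment so that a plain address comparison suffices.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class PendingStore:
    """A store whose data has not yet been written to the data cache."""
    address: int
    data: int


def find_storehit(pending: List[PendingStore], load_address: int) -> Optional[PendingStore]:
    """Return the newest pending store to load_address, or None.

    `pending` is ordered oldest to newest, so it is searched in reverse.
    """
    for store in reversed(pending):
        if store.address == load_address:
            return store
    return None


def load_data(pending: List[PendingStore], cache: Dict[int, int],
              load_address: int) -> Tuple[Optional[int], bool]:
    """Return (data, was_storehit) for a load."""
    hit = find_storehit(pending, load_address)
    if hit is not None:
        return hit.data, True          # newest data is still in the pipeline
    return cache.get(load_address), False  # no storehit: use the cache


pending_stores = [PendingStore(0x100, 1), PendingStore(0x200, 2), PendingStore(0x100, 3)]
data_cache = {0x100: 0, 0x300: 9}

print(load_data(pending_stores, data_cache, 0x100))  # (3, True)
print(load_data(pending_stores, data_cache, 0x300))  # (9, False)
```

The first load matches two pending stores and receives the data of the newer one; the data cache still holds a stale value for that address.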
It has been observed that storehit conditions occur relatively frequently in modern microprocessors, particularly in x86 microprocessors. This phenomenon is largely attributed to the fact that modern compilers recognize the relatively small number of registers available in the x86 register file and the fact that virtually every contemporary x86 processor has a large built-in data cache that is essentially accessible at the same speed as the register file. Therefore, when the compilers run out of registers in the register file, they use the data cache as a huge register file. In particular, compilers have been observed to generate code that causes storehit conditions in the following situations: where a loop counter variable is stored in a memory location; where a memory location is used as a temporary location for a sequence of arithmetic operations; and where a stack location is accessed within a very short instruction sequence due to the calling of a very short subroutine, i.e., a push of the return address, followed by a jump to the subroutine, followed by a very small number of instructions of the subroutine, followed by a pop of the return address generating a storehit on the location of the return address.
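The loop counter case can be illustrated by counting storehits in a short instruction trace. The sketch below is a toy model under stated assumptions: a store is assumed to remain in the pipeline for a fixed number of following instructions (in_flight_window), and the trace format, the function name count_storehits, and the address COUNTER are hypothetical.

```python
def count_storehits(trace, in_flight_window):
    """Count loads that match a store issued at most in_flight_window
    instructions earlier (that is, a store assumed to be still in the pipeline)."""
    last_store_index = {}  # address -> index of the most recent store to it
    hits = 0
    for index, (op, address) in enumerate(trace):
        if op == "store":
            last_store_index[address] = index
        elif op == "load":
            stored_at = last_store_index.get(address)
            if stored_at is not None and index - stored_at <= in_flight_window:
                hits += 1
    return hits


COUNTER = 0x1000  # hypothetical memory location of a loop counter

# Three loop iterations, each loading the counter and storing it back.
loop_trace = [("load", COUNTER), ("store", COUNTER)] * 3

print(count_storehits(loop_trace, in_flight_window=4))  # 2
```

Every load after the first iteration follows a store to the same location by a single instruction, so each of those loads is a storehit in this model.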
The storehit data may be forwarded from within the microprocessor pipeline if the data is cacheable. However, not all data is cacheable. Microprocessors provide a means for accessing I/O devices, such as DMA controllers, disk controllers, modems, etc. I/O devices typically include status and control registers that are read and written by the microprocessor. Some microprocessors, such as x86 processors, have dedicated instructions for accessing the registers of I/O devices, such as the x86 “in” and “out” instructions. These instructions address a separate address space of the processor bus, namely the I/O space.
The other way I/O devices are accessed is by mapping them into the memory address space of the processor. Such an I/O device is referred to as a memory-mapped I/O device and the region in which the I/O device is mapped is referred to as a memory-mapped I/O region. That is, an I/O device resides within a portion of the microprocessor memory address space, thereby enabling system software, such as device drivers, to access the I/O device.
Typically, memory-mapped I/O regions are non-cacheable. That is, stores to a memory-mapped I/O region may not be cached to satisfy a load from the same location in the memory-mapped I/O region. An example of a non-cacheable memory-mapped I/O access occurs when a first instruction writes a value to an I/O register and the next instruction reads from an I/O register on the same device, such as a store to a memory-mapped I/O region followed by a load from the same memory-mapped I/O region. Due to the nature of I/O devices, in order to ensure proper operation of the I/O device, the two instructions must be guaranteed to execute in order. That is, the read cannot be executed until the write has completed to the I/O device. Thus, a microprocessor must ensure that it does not forward storehit data within the microprocessor where the load/store address is in a non-cacheable region.
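The restriction on non-cacheable regions can be sketched as an additional condition on the storehit check. The representation of non-cacheable regions as a list of address ranges, the function names, and the example addresses are assumptions made for illustration; how a given microprocessor records cacheability is not specified here.

```python
def is_noncacheable(address, noncacheable_regions):
    """Return True if address lies in any (start, end) range; end is exclusive."""
    return any(start <= address < end for start, end in noncacheable_regions)


def may_forward(load_address, store_address, noncacheable_regions):
    """Return True only for a storehit on a cacheable address."""
    if load_address != store_address:
        return False  # no storehit, nothing to forward
    # A storehit in a non-cacheable region must not be forwarded: the load
    # has to wait until the store has completed to the I/O device.
    return not is_noncacheable(load_address, noncacheable_regions)


# Hypothetical memory-mapped I/O region occupying one 4 KB range.
MMIO_REGIONS = [(0xFEC00000, 0xFEC01000)]

print(may_forward(0x00001000, 0x00001000, MMIO_REGIONS))  # True
print(may_forward(0xFEC00010, 0xFEC00010, MMIO_REGIONS))  # False
```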
A final situation relevant to storehit forwarding is referred to as virtual aliasing. Virtual aliasing arises from the fact that most modern microprocessors support the notion of virtual memory. That is, the microprocessor supports a virtual address space that is much larger than the physical address space that the processor is capable of addressing. Consequently, a situation may arise where two different virtual addresses map, or translate, to a single physical address. In particular, a virtual aliasing condition occurs when a match is detected based on a comparison of the physical load and store addresses, but a match is not detected based on a comparison of the virtual load and store addresses.
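The virtual aliasing condition defined above can be sketched by comparing a load address and a store address both before and after translation. The 4 KB page size, the dictionary used as a page table, and the function names are assumptions made for this example only.

```python
PAGE_SIZE = 4096  # assumed page size


def translate(page_table, virtual_address):
    """Translate a virtual address using a {virtual page: physical page} map."""
    virtual_page, offset = divmod(virtual_address, PAGE_SIZE)
    return page_table[virtual_page] * PAGE_SIZE + offset


def is_virtual_alias(page_table, load_vaddr, store_vaddr):
    """True when the physical addresses match but the virtual addresses do not."""
    physical_match = translate(page_table, load_vaddr) == translate(page_table, store_vaddr)
    virtual_match = load_vaddr == store_vaddr
    return physical_match and not virtual_match


# Hypothetical mapping: virtual pages 0x10 and 0x20 both map to physical page 0x5.
page_table = {0x10: 0x5, 0x20: 0x5, 0x30: 0x7}

print(is_virtual_alias(page_table, 0x10040, 0x20040))  # True
print(is_virtual_alias(page_table, 0x10040, 0x10040))  # False
print(is_virtual_alias(page_table, 0x10040, 0x30040))  # False
```

In the first case a comparison of virtual addresses alone would miss the storehit, even though the load and the store refer to the same physical location.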
Therefore, what is needed is an apparatus in a pipelined microprocessor that provides high performance storehit detection and forwarding, taking into account non-cacheable region accesses and virtual aliasing conditions.