1. Field of the Invention
This invention relates in general to the field of pipelined microprocessors, and more particularly to forwarding storehit data between stages of microprocessor pipelines.
2. Description of the Related Art
A microprocessor has an instruction pipeline that sequentially executes instructions from an application program in synchronization with a microprocessor clock. The instruction pipeline is divided into stages, each of which performs a specific task that is part of an overall operation that is directed by a programmed instruction. The programmed instructions in a software application program are executed in sequence by the microprocessor. As an instruction enters the first stage of the pipeline, certain tasks are accomplished. The instruction is then passed to subsequent stages for accomplishment of subsequent tasks. Following completion of a final task, the instruction completes execution and exits the pipeline. Execution of programmed instructions by a pipeline microprocessor is very much analogous to the manufacture of items on an assembly line.
Early pipeline microprocessors were not sophisticated enough to have multiple instructions executing in different pipeline stages at the same time. Consequently, a given instruction would be fetched from memory and would proceed through the various pipeline stages until it completed execution. Following this, a next instruction would proceed through the various stages through completion. And because early pipeline microprocessors had only a few pipeline stages, the inefficient utilization of stage resources was not deemed to be significant.
As uses for microprocessors started to proliferate, more stringent requirements began to be imposed, particularly with respect to speed. The obvious approach for increasing processing speed was to allow multiple instructions to proceed down the pipeline at the same time. Clearly, this change increased instruction throughput because resources within each pipeline stage were more efficiently used. But with this change came a problem: What if one instruction executing in an early pipeline stage required a result that was yet to be provided by another instruction executing in a later pipeline stage? This situation is common to software programs; instructions that are in close proximity tend to perform tasks using the same operand. For example, a control algorithm may compute a true error signal value by adding a small number to a current error signal value and then comparing this value to some other signal that is input to the microprocessor. The structure of the algorithm is to add a first operand to a second operand to produce a result. The result is then tested to see if the computed value is tolerable. If not, then the first operand is added to the computed result to obtain a second result. The second result is tested, and so on. Even in this simple algorithm, it is evident that every other instruction utilizes the last computed result. When a given instruction executing in one stage of the pipeline requires a result that is to be provided by another instruction executing in a subsequent stage of the pipeline, the given instruction is referred to as a dependent instruction. This is because the operand required by the dependent instruction depends upon generation of the result by the other instruction.
To deal with dependent instructions, microprocessor designers added interlock logic to existing pipeline designs. The interlock logic spanned the stages of a microprocessor where the dependency situation could occur. During normal operation, non-dependent instructions were successively advanced through the pipeline stages in synchronization with the clock. When the interlock logic encountered a dependent instruction, it simply stalled execution of the dependent instruction by inserting slips into the pipeline, until the result required by the dependent instruction was generated and made available for retrieval as an operand.
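By way of illustration only, the stalling behavior of interlock logic may be sketched in Python as follows. The function name `count_slips`, the register names, and the two-cycle result latency are assumptions made for the example; none of them is taken from the description above.

```python
def count_slips(program, result_latency=2):
    """Count the slips a simple interlock would insert.

    program: list of (dest, sources) tuples in program order.
    result_latency: cycles after issue before a result can be read
    (a hypothetical figure chosen for the example).
    """
    ready_at = {}  # register name -> first cycle in which its value is readable
    cycle = 0      # cycle in which the next instruction may issue
    slips = 0
    for dest, sources in program:
        # Earliest cycle at which every source operand is available.
        needed = max((ready_at.get(src, 0) for src in sources), default=0)
        if needed > cycle:
            # The instruction is dependent: hold it until the result exists.
            slips += needed - cycle
            cycle = needed
        ready_at[dest] = cycle + result_latency
        cycle += 1
    return slips
```

In this model, a program in which the second instruction reads the result of the first incurs one slip, while two independent instructions incur none.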
In spite of the advances provided through interlock logic, demands for faster throughput continued to press microprocessor designers. Consequently, an alternative to interlock logic was developed that allowed dependent instructions to proceed down the pipeline without incurring slips. This alternative is known as a result distribution bus, or a bypass bus, or a result forward bus. In essence, the result distribution bus originates at the stage of the pipeline in which results are generated. When a result is generated, it is copied to the bus and then routed to all of the earlier stages that are affected by dependent instructions. If a dependent instruction is present within any of the earlier stages, then logic within that stage performs all of the operations necessary to properly configure the required operand from the provided result. The result distribution approach can be thought of as a one-to-many distribution scheme because one result can be distributed to several dependent instructions at the same time.
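The one-to-many character of the result distribution bus may be sketched, purely as an illustration, in Python. The stage representation (a dictionary of operand names mapped to values, with `None` marking an operand that is still awaited) and the name `broadcast_result` are assumptions made for the example.

```python
def broadcast_result(dest, value, earlier_stages):
    """Copy one result to every earlier stage that is waiting on it.

    earlier_stages: list of dicts, each with an "operands" mapping of
    operand name -> value, where None marks an operand not yet available.
    Returns the number of operands that were filled in.
    """
    updated = 0
    for stage in earlier_stages:
        operands = stage["operands"]
        for name in operands:
            # A dependent instruction is waiting on exactly this result.
            if name == dest and operands[name] is None:
                operands[name] = value
                updated += 1
    return updated
```

A single call may satisfy several dependent instructions at once, which is the property that makes the bus costly to extend: every additional stage adds another consumer that must be examined and driven.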
The result distribution scheme has prevailed as the principal technique for expediting the execution of dependent instructions, until more recent times, when demands for further throughput increases have compelled microprocessor designers to substantially alter the design of stages within the pipeline. These alterations to the pipeline can be comprehended through use of an assembly line analogy. Suppose an assembly line is set up with three stages, where each of the three stages is required to insert two screws in a product that flows down the line, for a total of six screws. Further, suppose that the time required to insert a screw is one minute. To send a product through the assembly line, then, requires six minutes. If multiple products are sent down the line, then it follows that one product rolls off the line every two minutes.
A simple enhancement to the line will double the production throughput: Reconfigure the line into six stages, where each stage is required to insert only one screw. With this architectural change it still takes six minutes to pass a product through the line, but one product now rolls off the line every minute. Throughput is doubled by doubling the number of stages and halving the operations performed in each stage.
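The arithmetic of the assembly line analogy can be checked with a short Python sketch. The function name `line_metrics` is an assumption made for the example; the figures are those given in the analogy.

```python
def line_metrics(total_screws, stages, minutes_per_screw=1):
    """Return (minutes to pass one product, minutes between finished products)."""
    screws_per_stage = total_screws / stages
    stage_time = screws_per_stage * minutes_per_screw
    latency = stage_time * stages  # time for one product to traverse the line
    interval = stage_time          # a product leaves the line once per stage time
    return latency, interval
```

For the three-stage line, `line_metrics(6, 3)` gives a latency of six minutes and one product every two minutes; for the six-stage line, `line_metrics(6, 6)` gives the same six-minute latency and one product every minute.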
To improve the throughput of current microprocessors, designers are taking the very same approach: pipeline stages are being added and the functional requirements for each stage are being decreased. Thus, faster clock speeds can be applied and instruction throughput is increased.
But increasing the number of pipeline stages has highlighted a deficiency with the result distribution technique for dealing with dependent instructions. Because early microprocessor pipelines consisted of only a few stages, the attendant logic required to implement a result distribution bus was not much of an issue. But for every added stage in the execution pipeline, an additional set of result distribution logic must be provided. In other words, the logic required to implement a result distribution bus is directly proportional to the number of stages that require distribution of the result. Add to this the fact that the presence of more stages requires logic elements that drive the result signals to be more powerful. Moreover, the timing to distribute results to multiple pipeline stages is not only a function of the number of stages, but is also based upon the location of the stage that is physically farthest from the origination stage. Hence, circuit complexity, power, and timing problems arise when a result distribution scheme is applied to more advanced pipeline architectures that have more pipeline stages.
One situation in which one instruction executing in a higher pipeline stage requires a result generated by a preceding instruction executing in a lower pipeline stage is referred to as a storehit condition. A storehit condition exists when a load instruction requests data generated as the result of a store instruction executing ahead of the load instruction in the pipeline. That is, the load instruction specifies a load address for load data, wherein the load address matches a store address for store data specified by a previous store instruction, and the store data is still in the microprocessor pipeline, i.e., has not yet been updated in the microprocessor data cache.
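Detection of a storehit condition may be sketched in Python as an address comparison against the stores still in the pipeline. The sketch assumes, for simplicity, that every load and store accesses one aligned word, so that an address match is an exact equality; the name `find_storehit` and the list representation of pending stores are likewise assumptions made for the example.

```python
def find_storehit(load_addr, pending_stores):
    """Check a load address against stores not yet written to the data cache.

    pending_stores: list of (store_addr, store_data) tuples, oldest first.
    Returns (True, data) for a storehit, otherwise (False, None).
    """
    # Search newest first so the load observes the most recent store data.
    for store_addr, store_data in reversed(pending_stores):
        if store_addr == load_addr:
            return True, store_data
    return False, None
```

When two pending stores specify the same address, the sketch returns the data of the later store, since that is the newest data associated with the load address.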
In a storehit condition, the load instruction must be provided with coherent data, i.e., the newest data associated with the load address. Thus, the microprocessor cannot supply the data from its data cache or go to system memory to get the data since the newest data is within the pipeline and not in the data cache or system memory. Hence, either the microprocessor stalls to wait for the storehit data to be updated in the data cache, or it forwards the data using one of the techniques described above. Neither alternative is desirable for the reasons described above.
It has been observed that storehit conditions occur relatively frequently in modern microprocessors, particularly in x86 microprocessors. This phenomenon is largely attributed to the fact that modern compilers recognize the relatively small number of registers available in the x86 register file and the fact that virtually every contemporary x86 processor has a large built-in data cache that is essentially accessible at the same speed as the register file. Therefore, when the compilers run out of registers in the register file, they use the data cache as a huge register file. In particular, compilers have been observed to generate code that causes storehit conditions in the following situations: where a loop counter variable is stored in a memory location; where a memory location is used as a temporary location for a sequence of arithmetic operations; and where a stack location is accessed within a very short instruction sequence due to the calling of a very short subroutine, i.e., a push of the return address, followed by a jump to the subroutine, followed by a very small number of instructions of the subroutine, followed by a pop of the return address generating a storehit on the location of the return address.
When a store instruction executes, the microprocessor determines whether the data address specified by the store instruction, i.e., the store address, is present in the data cache of the microprocessor. That is, the microprocessor determines whether a cache miss occurred. If a cache miss occurs, then the microprocessor will schedule a write of the data to system memory. If the store address is to a cacheable line of memory, the microprocessor may perform a write-allocate operation, if the microprocessor is configured to do so. To perform a write-allocate operation, the microprocessor first fetches from the system memory the cache line of data implicated by the store data, i.e., the cache line in which the store address is included. The store data is then merged with the fetched cache line and the cache is updated with the merged data. Although the read of the cache line from memory may take longer than simply writing the store data to memory, performance gains may be realized by the fact that subsequent stores to the cache line will hit in the cache as a result of the write-allocate operation. Subsequent cache hits are probable because software memory accesses tend to occur within proximity of each other, commonly referred to as the locality of reference principle.
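The fetch, merge, and update steps of a write-allocate operation may be sketched in Python as follows. The 32-byte line size, the dictionary used to stand in for the cache, and the name `write_allocate` are assumptions made for the example, and the sketch assumes that the store data does not cross a cache line boundary.

```python
LINE_SIZE = 32  # hypothetical cache line size in bytes


def write_allocate(cache, memory, store_addr, store_data):
    """Fetch the line containing store_addr, merge the store data, update the cache.

    cache: dict mapping line address -> bytes of one cache line.
    memory: bytes-like object standing in for system memory.
    Returns the address of the allocated line.
    """
    line_addr = store_addr - (store_addr % LINE_SIZE)
    offset = store_addr - line_addr
    if offset + len(store_data) > LINE_SIZE:
        raise ValueError("store crosses a cache line boundary")
    # Fetch the cache line implicated by the store address.
    line = bytearray(memory[line_addr:line_addr + LINE_SIZE])
    # Merge the store data into the fetched line.
    line[offset:offset + len(store_data)] = store_data
    # Update the cache with the merged line.
    cache[line_addr] = bytes(line)
    return line_addr
```

After the call, a subsequent store or load to any address within the same line finds the line in the cache, which is the source of the performance gain noted above.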
Therefore, what is needed is an apparatus in a pipeline microprocessor that provides high performance storehit detection and forwarding.
To address the above-detailed deficiencies, it is an object of the present invention to provide an apparatus that reduces the number of storehit pipeline stalls. Accordingly, in attainment of the aforementioned object, it is a feature of the present invention to provide an apparatus for forwarding storehit data within a pipelined microprocessor having a cache memory and coupled to a system memory. The apparatus includes at least one response buffer and forwarding logic. The response buffer receives data from the system memory and updates the cache in response to a store instruction miss of the cache. The response buffer also receives store results of the store instruction in response to the cache miss. The forwarding logic forwards the store results from the response buffer to a load instruction in the pipeline in response to a storehit condition between the load and store instructions.
An advantage of the present invention is that only minimal changes are required to forward storehit data from a response buffer. The present invention utilizes response buffers that already exist within the microprocessor to forward storehit data. The present invention also utilizes byte aligning logic associated with the response buffers, which typically already exists in the microprocessor, to forward storehit data. By using already existing logic, the present invention advantageously allows for scaling of the number of pipeline stages without significantly increasing the microprocessor die size or introducing timing problems.
In another aspect, it is a feature of the present invention to provide an apparatus within a pipelined microprocessor for forwarding storehit data between stages of the microprocessor pipeline, the microprocessor being coupled to a bus for transferring data between the microprocessor and a memory. The apparatus includes at least one response buffer and forwarding logic. The response buffer has at least first and second inputs and an output. The first input is coupled to the bus. The second input is coupled to a first stage of the pipeline that provides store results. The output is coupled to a second stage of the pipeline that receives load instruction data. The forwarding logic forwards the store results from the response buffer to the second pipeline stage in response to detection of a storehit condition.
In yet another aspect, it is a feature of the present invention to provide an apparatus for forwarding storehit data from a response buffer within a pipelined microprocessor, wherein the response buffer is configured to receive data from a bus coupled to the microprocessor in response to store instruction misses of a data cache within the microprocessor. The apparatus includes a first multiplexer and a second multiplexer. The first multiplexer has an output coupled to an input of the response buffer, a first input for receiving the data from the bus, and a second input for receiving store instruction results from the microprocessor pipeline. The second multiplexer has a first input coupled to an output of the response buffer, a second input operatively coupled to the data cache, and an output coupled to the pipeline.
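The data path formed by the two multiplexers and the response buffer may be sketched in Python for purposes of illustration. The class and function names and the boolean select signals `take_store` and `storehit` are assumptions made for the example and are not elements recited above.

```python
class ResponseBuffer:
    """Holds one item of data received from the first multiplexer."""

    def __init__(self):
        self.data = None


def mux(select, input0, input1):
    """Two-input multiplexer: pass input1 when select is true, else input0."""
    return input1 if select else input0


def fill_buffer(buf, bus_data, store_result, take_store):
    # First multiplexer: chooses between data from the bus and
    # store instruction results from the pipeline.
    buf.data = mux(take_store, bus_data, store_result)


def load_data(buf, cache_data, storehit):
    # Second multiplexer: chooses between the data cache and the
    # response buffer as the source of data sent to the pipeline.
    return mux(storehit, cache_data, buf.data)
```

With a store result held in the buffer, `load_data` returns the buffered store result when `storehit` is true and the data cache output otherwise.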
In yet another aspect, it is a feature of the present invention to provide a pipelined microprocessor coupled to a system memory. The microprocessor includes a data cache, at least one response buffer and forwarding logic coupled to the response buffer. The data cache caches data from the system memory. The response buffer receives data from the system memory for updating the cache and for storing at least one store instruction result. The forwarding logic forwards the store instruction result from the response buffer to a load instruction in the microprocessor pipeline in response to a storehit condition, generated by the load instruction, with the store instruction result.
In yet another aspect, it is a feature of the present invention to provide a method for forwarding storehit data in a pipelined microprocessor having a plurality of response buffers for receiving data from a system memory coupled to the microprocessor, the microprocessor having a cache memory. The method includes determining whether one of the plurality of response buffers is idle in response to detecting a miss of the cache generated by a store instruction, storing the store instruction results in the response buffer if the response buffer is idle, and forwarding the store results to a load instruction in the pipeline in response to detecting a storehit condition, generated by the load instruction, with the store results.
In yet another aspect, it is a feature of the present invention to provide a method for forwarding storehit data in a pipelined microprocessor having a plurality of response buffers for receiving data from a system memory coupled to the microprocessor, the microprocessor having a cache memory. The method includes writing a store instruction result to one of the plurality of response buffers, detecting a storehit condition with respect to the store instruction result and a load instruction executing in the pipeline, and forwarding the store instruction result to the load instruction in response to detecting the storehit condition.
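The method steps recited above may be sketched in Python as follows. The class `TaggedResponseBuffer`, its `idle`, `address`, and `data` fields, and the functions `on_store_miss` and `on_load` are assumptions made for the example. The sketch further assumes that at most one response buffer holds store results for a given address and that each access is one aligned word; it does not model what occurs when no response buffer is idle.

```python
class TaggedResponseBuffer:
    """A response buffer tagged with the address of the data it holds."""

    def __init__(self):
        self.idle = True
        self.address = None
        self.data = None


def on_store_miss(buffers, store_addr, store_result):
    """On a store miss of the cache, place the store result in an idle buffer.

    Returns the buffer used, or None when no buffer is idle.
    """
    for buf in buffers:
        if buf.idle:
            buf.idle = False
            buf.address = store_addr
            buf.data = store_result
            return buf
    return None


def on_load(buffers, load_addr):
    """Forward buffered store results to a load on a storehit, else return None."""
    for buf in buffers:
        if not buf.idle and buf.address == load_addr:
            return buf.data  # storehit: forward from the response buffer
    return None
```

A load whose address matches a buffered store receives the store result directly from the response buffer, without waiting for the data cache to be updated.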