This invention relates in general to the field of store forwarding, and more particularly to store forwarding in microprocessors supporting paged memory.
It is common for modern microprocessors to operate on several instructions at the same time, within different blocks or pipeline stages of the microprocessor. Hennessy and Patterson define pipelining as "an implementation technique whereby multiple instructions are overlapped in execution." Computer Architecture: A Quantitative Approach, 2nd edition, by John L. Hennessy and David A. Patterson, Morgan Kaufmann Publishers, San Francisco, Calif., 1996. The authors go on to provide the following excellent illustration of pipelining:
"A pipeline is like an assembly line. In an automobile assembly line, there are many steps, each contributing something to the construction of the car. Each step operates in parallel with the other steps, though on a different car. In a computer pipeline, each step in the pipeline completes a part of an instruction. Like the assembly line, different steps are completing different parts of the different instructions in parallel. Each of these steps is called a pipe stage or a pipe segment. The stages are connected one to the next to form a pipe--instructions enter at one end, progress through the stages, and exit at the other end, just as cars would in an assembly line."
Thus, as instructions are fetched, they are introduced into one end of the pipeline. They proceed through pipeline stages within a microprocessor until they complete execution. However, as instructions proceed through the pipeline stages, an instruction executing in an early, or upper, pipeline stage may require a result of another instruction executing ahead of it in a later, or lower, pipeline stage.
One situation in which an instruction executing in an upper pipeline stage requires a result generated by a preceding instruction executing in a lower pipeline stage is referred to as a storehit condition. A storehit condition exists when a load instruction requests store data specified by a store instruction executing ahead of the load instruction in the pipeline. That is, the load instruction specifies a load address for load data, wherein the load address matches a store address for the store data specified by a store instruction issued before the load instruction, and the store data is still in the microprocessor pipeline, i.e., has not yet been updated in the microprocessor data cache or written to system memory.
It has been observed that storehit conditions occur relatively frequently in modern microprocessors, particularly in x86 microprocessors. This phenomenon is largely attributed to the fact that modern compilers recognize the relatively small number of registers available in the x86 register file and the fact that virtually every contemporary x86 processor has a large built-in data cache that is essentially accessible at the same speed as the register file. Therefore, when the compilers run out of registers in the register file, they use the data cache as a huge register file. In particular, compilers have been observed to generate code that causes storehit conditions in the following situations.
First, a loop counter variable is stored in a memory location. Second, a memory location is used as a temporary location for a sequence of arithmetic operations. Third, a stack location is accessed within a very short instruction sequence due to the calling of a very short subroutine. That is, a return address is pushed, followed by a jump to the subroutine, followed by a very small number of instructions of the subroutine, followed by a pop of the return address generating a storehit on the location of the return address.
In a storehit condition, the load instruction must be provided with coherent data, i.e., the newest data associated with the load address. Thus, the microprocessor cannot supply the data from its data cache or go to system memory to get the data since the newest data is within the pipeline and not in the data cache or system memory. One solution is for the microprocessor to stall and wait for the storehit data to be updated in the data cache or system memory, and then provide the data to the load instruction from the data cache or system memory. However, this solution has obvious performance disadvantages. A higher performance solution is to determine the newest data matching the load address, and to forward the newest data from the stage in which the store is pending to the load instruction stage.
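The forwarding behavior described above can be illustrated with a minimal software sketch. It is not the hardware of the invention; the names `PendingStore` and `load`, the dictionary standing in for the data cache, and the oldest-to-newest ordering of the pending stores are all assumptions made for illustration. The sketch also ignores operand sizes and partial overlaps.

```python
from dataclasses import dataclass


@dataclass
class PendingStore:
    address: int  # address the store will eventually write
    data: int     # store data still held in the pipeline


def load(address, pending_stores, data_cache):
    """Return (value, forwarded) for a load at `address`.

    `pending_stores` is assumed to be ordered oldest to newest, so the
    search runs newest-first to find the most recent matching store.
    """
    for store in reversed(pending_stores):
        if store.address == address:
            return store.data, True        # storehit: forward pending data
    return data_cache.get(address, 0), False  # no storehit: read the cache


# The cache still holds stale data for 0x1000; two stores to it are pending.
data_cache = {0x1000: 11, 0x2000: 22}
pending = [PendingStore(0x1000, 33), PendingStore(0x1000, 44)]

hit = load(0x1000, pending, data_cache)   # forwarded from the newest store
miss = load(0x2000, pending, data_cache)  # supplied by the data cache
```

In this sketch the load of 0x1000 receives the value of the newest pending store rather than the stale cached value, which is the coherency requirement stated above.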
Forwarding storehit data is complicated by the fact that many microprocessors use a paged memory scheme. In a paged memory scheme, virtual addresses of load and store instructions must be translated into physical addresses in order to access memory properly. In order to detect a storehit and to forward the proper data, the physical address of the load must be compared with the physical addresses of the stores pending in the processor. Comparing virtual addresses will not suffice since the load and stores could have different virtual addresses and yet still be referring to the same physical address in a paged memory system.
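The aliasing problem described above can be shown with a small sketch. The 4 KB page size, the dictionary used as a page table, and the example addresses are assumptions chosen for illustration only. Two different virtual pages are mapped to the same physical page, so the virtual addresses differ while the physical addresses match.

```python
PAGE_SHIFT = 12                    # assumed 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1  # low-order bits: index within the page


def translate(virtual_address, page_table):
    """Translate a virtual address using a hypothetical one-level table."""
    virtual_page = virtual_address >> PAGE_SHIFT
    index = virtual_address & PAGE_MASK  # unchanged by translation
    return (page_table[virtual_page] << PAGE_SHIFT) | index


# Two virtual pages that map to the same physical page (an alias).
page_table = {0x00400: 0x12345, 0x7FFF0: 0x12345}

load_va = 0x00400ABC
store_va = 0x7FFF0ABC

virtual_match = load_va == store_va
physical_match = translate(load_va, page_table) == translate(store_va, page_table)
```

Here `virtual_match` is false while `physical_match` is true, so a comparison of virtual addresses alone would miss this storehit.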
Paging microprocessors typically employ a translation-lookaside buffer (TLB) to cache physical addresses previously translated from virtual addresses. The virtual address is provided to the TLB, which looks up the virtual address and provides the translated physical address of the virtual address if the physical address is cached in the TLB. The TLB improves data access time by avoiding having to repeat the lengthy task of translating a virtual address to its physical address for recently accessed data.
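The caching role of the TLB can be sketched as follows. This is a simplified model under stated assumptions: 4 KB pages, a dictionary standing in for the page tables, and no capacity limit or replacement policy, none of which are specified in the text above. The class name `TLB` and its counters are hypothetical.

```python
PAGE_SHIFT = 12                    # assumed 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class TLB:
    """Caches virtual-page to physical-page translations."""

    def __init__(self, page_table):
        self.page_table = page_table  # stands in for the slow translation
        self.entries = {}             # cached translations
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        virtual_page = virtual_address >> PAGE_SHIFT
        index = virtual_address & PAGE_MASK
        if virtual_page in self.entries:
            self.hits += 1
        else:
            # Miss: perform the lengthy translation once and cache it.
            self.misses += 1
            self.entries[virtual_page] = self.page_table[virtual_page]
        return (self.entries[virtual_page] << PAGE_SHIFT) | index


tlb = TLB({0x00400: 0x12345})
first = tlb.translate(0x00400ABC)   # miss: translation is looked up
second = tlb.translate(0x00400DEF)  # hit: same page, cached translation
```

The second access to the same page is served from the cached entry, which is the time saving the TLB provides.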
In order to detect a storehit condition, the physical address of the load instruction is compared with the physical address of the pending stores in the pipeline. If a storehit occurs, the newest storehit data is forwarded to the load instruction. Presently, the TLB lookup, the physical address comparison and the data forwarding are serialized. The serialized time of these operations may be the critical path for processor cycle timing purposes. Therefore, what is needed is a method for reducing the serialized time in order to reduce processor cycle time and thereby improve processor performance.
The present invention provides a method and apparatus in a paging microprocessor for reducing store forwarding time by speculatively forwarding based on a physical page index comparison of a load and pending stores rather than waiting to compare the full physical addresses. Accordingly, in attainment of the aforementioned object, it is a feature of the present invention to provide a speculative store forwarding apparatus in a microprocessor pipeline. The pipeline includes first and second stages. The first stage receives load data specified by a load virtual address. The second stage stores store data pending in the pipeline for writing to a store physical address. The load virtual address includes a load virtual page number and a load physical page index. The store physical address includes a store physical page address and a store physical page index. The apparatus includes an index comparator that compares the load physical page index with the store physical page index. The apparatus also includes forwarding logic, coupled to the index comparator, which forwards the store data from the second stage to the first stage if the index comparator indicates the load physical page index matches the store physical page index.
In another aspect, it is a feature of the present invention to provide a microprocessor supporting paged virtual memory. The microprocessor includes an index match indicator that indicates whether a physical page index of load data specified by a load instruction matches a physical page index of store data pending in the microprocessor. The microprocessor also includes forwarding logic, coupled to the index match indicator, which forwards the store data to the load instruction if the index match indicator indicates that the load and store data physical page indexes match. The forwarding logic forwards the store data to the load instruction based on the index match indicator prior to determining whether the load and store data reside in a same physical memory page.
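The speculative forwarding described in the preceding aspects can be sketched in software as two separate steps: an early forward based only on the physical page index, and a later check of whether the load and store reside in the same physical page. The 4 KB page size, the function names, and the example addresses are assumptions for illustration. This section does not state what the microprocessor does when the later check fails, so the sketch only reports the outcome.

```python
PAGE_SHIFT = 12                    # assumed 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def speculative_forward(load_va, store_pa, store_data):
    """Forward on a page index match alone; return None if no match.

    The index bits of the load are taken directly from its virtual
    address, so no translation is needed before this comparison.
    """
    if (load_va & PAGE_MASK) == (store_pa & PAGE_MASK):
        return store_data
    return None


def same_physical_page(load_pa, store_pa):
    """Later check, made once the load physical address is known."""
    return (load_pa >> PAGE_SHIFT) == (store_pa >> PAGE_SHIFT)


store_pa, store_data = 0x12345ABC, 99

# Case 1: index matches and the load translates to the same page.
true_hit = speculative_forward(0x00400ABC, store_pa, store_data)
true_hit_confirmed = same_physical_page(0x12345ABC, store_pa)

# Case 2: index matches but the load translates to a different page.
false_hit = speculative_forward(0x00500ABC, store_pa, store_data)
false_hit_confirmed = same_physical_page(0x54321ABC, store_pa)

# Case 3: index differs, so nothing is forwarded.
no_match = speculative_forward(0x00400DEF, store_pa, store_data)
```

In the second case the data is forwarded speculatively even though the later page comparison shows the forward was incorrect, which is why the forwarding is termed speculative.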
In another aspect, it is a feature of the present invention to provide a method for speculatively forwarding storehit data in a microprocessor pipeline. The method includes comparing a physical page index of data specified by a load instruction with a physical page index of store data pending in the pipeline. The method also includes forwarding the store data from a first stage of the pipeline in which the store data is pending to a second stage of the pipeline for receiving the data specified by the load instruction, if the physical page index of data specified by a load instruction matches the physical page index of said store data.
In another aspect, it is a feature of the present invention to provide a pipelined microprocessor. The microprocessor includes a plurality of store buffers that store a plurality of store datum waiting to be written to memory. The microprocessor also includes a plurality of address registers, coupled to the plurality of store buffers, which store a corresponding plurality of physical page indexes of the plurality of store datum. The microprocessor also includes a plurality of index comparators, coupled to the plurality of address registers, which compare the plurality of physical page indexes with a physical page index of data requested by a load instruction. The microprocessor also includes control logic, coupled to the plurality of index comparators, which causes one of the plurality of store datum to be forwarded from the plurality of store buffers to the load instruction, if the plurality of index comparators indicates a corresponding one of the physical page indexes matches the physical page index of the data requested by the load instruction.
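The arrangement of store buffers, address registers, index comparators, and control logic described above can be modeled roughly as follows. The names `StoreBuffer`, `compare_indexes`, and `select_forward` are hypothetical, as is the `valid` flag. Selecting the newest matching buffer is an assumption drawn from the earlier requirement that the load receive the newest data; the buffers are assumed to be ordered oldest to newest.

```python
from dataclasses import dataclass


@dataclass
class StoreBuffer:
    index: int          # physical page index held in the address register
    data: int           # store data waiting to be written to memory
    valid: bool = True  # hypothetical flag: buffer holds a pending store


def compare_indexes(buffers, load_index):
    """One comparator per buffer: does its index match the load index?"""
    return [b.valid and b.index == load_index for b in buffers]


def select_forward(buffers, matches):
    """Control logic: pick the newest matching buffer, or None."""
    for buffer, matched in zip(reversed(buffers), reversed(matches)):
        if matched:
            return buffer.data
    return None


buffers = [StoreBuffer(0xABC, 1), StoreBuffer(0x123, 2), StoreBuffer(0xABC, 3)]

matches = compare_indexes(buffers, 0xABC)
forwarded = select_forward(buffers, matches)

no_matches = compare_indexes(buffers, 0x555)
nothing = select_forward(buffers, no_matches)
```

In hardware the comparisons would occur in parallel; the list comprehension here only models their combined result.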
An advantage of the present invention is that, in most cases, it reduces the amount of time required to perform a store forward compared with the prior method. In particular, the present invention alleviates the need to forgo store forwarding when the microprocessor clock cycle timing does not allow sufficient time to forward based on the full physical address compare of the prior method. Additionally, the present invention alleviates the need to add another pipeline stage to accommodate store forwarding based on the full physical compare. This is beneficial because addition of another pipeline stage is detrimental to processor performance in the event of a mispredicted branch, since another stage of branch penalty would be introduced. Finally, the present invention alleviates the need to increase the clock cycle time of the processor to accommodate forwarding based on the full physical compare of the prior method.
Other features and advantages of the present invention will become apparent upon study of the remaining portions of the specification and drawings.