1. Field of the Invention
This invention relates to computing systems, and more particularly, to efficient storage of pending writes to memory corresponding to multiple threads.
2. Description of the Relevant Art
Modern microprocessors typically buffer retired store instructions that have yet to write data to a memory subsystem. A store queue (SQ) is a hardware structure configured to buffer retired store instructions, or write operations. A particular store instruction is generally held in this structure from the point-in-time the store instruction is retired to the point-in-time it is known that the store instruction has been processed by the memory subsystem such that the corresponding data of the store instruction is globally visible to all processors and threads within the system.
A large SQ may allow for a sufficient number of store instructions to be buffered in the event a store instruction misses in a local cache. When a cache miss occurs, many clock cycles, potentially hundreds of clock cycles considering dynamic random-access memory (DRAM) access latencies, may transpire before the missed store instruction is serviced by a cache fill transaction. If the SQ becomes full, then execution of the corresponding processor may halt. Therefore, it is desirable that a sufficient number of store instructions are buffered in order to handle the case of at least one cache miss.
Generally, modern microprocessors implement out-of-order instruction issue, out-of-order instruction execution, and in-order commit or retirement. Therefore, due to in-order retirement, the data of the store instructions buffered in the SQ may need to be conveyed in-order to a memory subsystem. Since the store instructions of a particular thread are allocated in-order in the SQ, the data of the store instructions are conveyed from the SQ to the memory subsystem in the order they are received, or in program order. Therefore, the SQ logically acts as a first-in-first-out (FIFO) buffer on a thread basis.
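The per-thread FIFO behavior described above may be illustrated with a minimal software sketch. The class name `FifoStoreQueue`, its methods, and the two-entry capacity are hypothetical and chosen only for illustration; an actual SQ is a hardware structure, not a software queue.

```python
from collections import deque


class FifoStoreQueue:
    """Toy single-thread store queue: retired stores drain in program order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # the oldest retired store sits at the left end

    def retire_store(self, address, data):
        """Buffer a retired store; return False if the queue is full (a stall)."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append((address, data))
        return True

    def drain_one(self, memory):
        """Commit the oldest buffered store to memory and return its address."""
        if not self.entries:
            return None
        address, data = self.entries.popleft()
        memory[address] = data
        return address


memory = {}
sq = FifoStoreQueue(capacity=2)  # hypothetical, deliberately tiny capacity
sq.retire_store(0x100, 1)        # older store to address 0x100
sq.retire_store(0x100, 2)        # younger store to the same address
stalled = not sq.retire_store(0x200, 3)  # queue is full, so this store must wait

drained = [sq.drain_one(memory), sq.drain_one(memory)]
```

Because the entries drain oldest first, the younger store to address 0x100 is the last to update memory, which preserves program order for that thread.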
A read-after-write (RAW) hazard may occur when a load instruction, or a read operation, attempts to read a memory location that has been modified by an older (in program order) store instruction, which has retired. This older retired store instruction is resident in the SQ, but it has not yet committed its results to the memory location. Therefore, in order to prevent the load instruction from reading a stale value of the memory location contents, some action needs to be taken. For example, the load instruction may need to be stalled until the store instruction commits. Alternatively, the load instruction may have the modified memory location contents bypassed, or forwarded, from the SQ. Regardless of the chosen technique, a search within the SQ may need to be performed in order to detect this RAW hazard.
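The search and bypass described above may be sketched in software as follows. The names `StoreEntry`, `search_store_queue`, and `execute_load` are hypothetical. The sketch assumes the entries are ordered oldest to youngest and that every buffered store is older than the load, and it models only the forwarding alternative, not the stall alternative.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StoreEntry:
    valid: bool
    address: int
    data: int


def search_store_queue(entries: List[StoreEntry], load_address: int) -> Optional[int]:
    """Compare the load address against every valid entry.

    Returns the data of the youngest matching store, or None if no entry
    matches. `entries` is assumed to be ordered oldest to youngest.
    """
    forwarded = None
    for entry in entries:
        if entry.valid and entry.address == load_address:
            forwarded = entry.data  # a younger match overrides an older one
    return forwarded


def execute_load(entries: List[StoreEntry], memory: Dict[int, int], load_address: int) -> int:
    """Forward from the store queue on a RAW hazard; otherwise read memory."""
    forwarded = search_store_queue(entries, load_address)
    if forwarded is not None:
        return forwarded
    return memory.get(load_address, 0)


memory = {0x40: 7}  # stale value: the stores below have not committed yet
entries = [
    StoreEntry(valid=True, address=0x40, data=8),
    StoreEntry(valid=True, address=0x80, data=5),
    StoreEntry(valid=True, address=0x40, data=9),
    StoreEntry(valid=False, address=0x40, data=99),  # invalid entry is ignored
]
value = execute_load(entries, memory, 0x40)
```

In this example the load of address 0x40 receives the value 9 from the youngest valid matching store rather than the stale value 7 held in memory. The loop over all entries stands in for the parallel comparison that hardware would perform.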
A search for the above RAW hazard may need to occur before the load instruction issues for execution. The search may be implemented by circuitry that performs a content-addressable-memory (CAM) comparison of the address and valid status information of all entries within the SQ. Circuitry for CAM match comparisons typically utilizes dynamic logic that consumes a relatively high amount of power. An access time of an array utilizing CAM comparison circuitry may be a factor in determining a processor's clock cycle duration. For example, as the number of entries in the array increases, the read, write, and CAM word line drivers need to charge and discharge a greater amount of electrical charge due to the gate and diffusion capacitances of each additional memory cell connected to these lines. In addition, the wire capacitance of each read, write, and CAM line increases with the greater wire length and cross capacitance. Further still, each memory cell may include additional power and ground lines for shielding of these read, write, and CAM lines, which further increases the size of each additional array entry. Each additional array entry affects on-die real estate, power consumption, and timing, and the effect on the latter two does not scale linearly with the number of entries. Therefore, the size of the SQ has an upper limit.
In addition, a processor may be multi-threaded, which may place further constraints on the SQ. For a multi-threaded processor, on-chip real estate constraints may prevent a single-threaded SQ from being replicated once for each thread. In one example, a store queue may have 64 entries, and these 64 entries may provide a desirable trade-off between performance and cost when the SQ is running in single-thread mode. Decreasing the number of entries may have a significant negative impact on performance. However, in a processor core that supports 8 threads, a replicated SQ would need 64×8, or 512, entries, which may be far too large. Such a large SQ would consume too much on-chip real estate, and its access times and CAM comparisons would drastically increase the clock cycle time of the processor.
Also, a multi-threaded processor may not comprise a SQ that is statically divided into sections, wherein each section corresponds to a particular thread, because such a division is an inefficient use of SQ entries. For example, one thread may not utilize the SQ as frequently as a second thread. Alternatively, the one thread may not be executing at all, yet the second thread is unable to utilize the idle SQ entries since these entries are not assigned to the second thread. Therefore, a multi-threaded processor may utilize a SQ with dynamic allocation of its entries. In addition, with dynamic allocation, the SQ entries may be used in both single-threaded and multi-threaded modes of operation.
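The difference between a statically divided SQ and a dynamically allocated SQ may be illustrated with a minimal sketch. The class names, the `count_accepted` helper, and the burst of 20 stores are hypothetical; the 64-entry, 8-thread configuration follows the example figures used above.

```python
class PartitionedStoreQueue:
    """Each thread owns a fixed, equal share of the entries."""

    def __init__(self, total_entries, num_threads):
        self.per_thread = total_entries // num_threads
        self.used = [0] * num_threads

    def allocate(self, thread_id):
        """Return True if the thread's own section still has a free entry."""
        if self.used[thread_id] >= self.per_thread:
            return False
        self.used[thread_id] += 1
        return True


class DynamicStoreQueue:
    """Any free entry may be given to any thread."""

    def __init__(self, total_entries):
        self.free_count = total_entries

    def allocate(self, thread_id):
        """Return True if any entry in the whole queue is free."""
        if self.free_count == 0:
            return False
        self.free_count -= 1
        return True


def count_accepted(queue, thread_id, attempts):
    """Count how many of `attempts` stores the queue can buffer for one thread."""
    return sum(1 for _ in range(attempts) if queue.allocate(thread_id))


# Only thread 1 is active and it retires a burst of 20 stores.
partitioned_accepted = count_accepted(PartitionedStoreQueue(64, 8), 1, 20)
dynamic_accepted = count_accepted(DynamicStoreQueue(64), 1, 20)
```

With these assumed figures, the partitioned queue accepts only 8 of the 20 stores although 56 entries sit idle, whereas the dynamically allocated queue accepts all 20.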
However, a caveat with dynamic allocation is that no relationship, implied or otherwise, exists between a SQ entry and the order of a corresponding store instruction with respect to other store and load instructions in the pipeline. Accordingly, the determination of load-store RAW hazards becomes more complex, as logic needs to ascertain which SQ entries are older (in program order) than a particular load instruction, given that the SQ index of a buffered store instruction does not provide age ordering information. Also, recall that the data of the retired store instructions are to be conveyed in-order to a memory subsystem. However, with dynamic allocation, there is no indication of which entry holds the next store instruction in program order after the entry currently updating the memory subsystem.
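The loss of age ordering under dynamic allocation may be illustrated with a minimal sketch. The class name, the lowest-numbered-free-entry allocation policy, and the `seq` program-order labels are hypothetical. The labels are recorded here only so that the mismatch can be observed; the entry index itself carries no such information.

```python
class DynamicStoreQueue:
    """Toy dynamically allocated store queue shared by several threads."""

    def __init__(self, num_entries):
        self.entries = [None] * num_entries  # None marks a free entry

    def allocate(self, thread_id, seq):
        """Place a store in the lowest-numbered free entry; return its index."""
        for index, entry in enumerate(self.entries):
            if entry is None:
                self.entries[index] = (thread_id, seq)
                return index
        return None  # queue is full

    def free(self, index):
        """Release an entry once its store has updated memory."""
        self.entries[index] = None

    def seqs_in_index_order(self, thread_id):
        """List a thread's program-order labels as they appear by entry index."""
        return [e[1] for e in self.entries if e is not None and e[0] == thread_id]


sq = DynamicStoreQueue(4)
a0 = sq.allocate("A", 0)  # thread A, first store
b0 = sq.allocate("B", 0)  # thread B, first store
a1 = sq.allocate("A", 1)  # thread A, second store
sq.free(a0)               # thread A's oldest store drains to memory
a2 = sq.allocate("A", 2)  # thread A, third store reuses the freed entry

index_order = sq.seqs_in_index_order("A")
```

Here thread A's third store lands in a lower-numbered entry than its second store, so scanning the entries by index yields the order 2, 1 rather than the program order 1, 2. Neither the RAW hazard search nor the in-order drain can therefore rely on the index alone.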
In view of the above, efficient methods and mechanisms for storage of pending writes to memory corresponding to multiple threads are desired.