1. Field of the Invention
This invention relates to computing systems, and more particularly, to increasing the throughput of a processor during cache misses.
2. Description of the Relevant Art
Pipelining is used to increase the throughput of processor cores, or processors, measured in instructions per clock cycle (IPC). However, the throughput may still be reduced by certain events such as pipeline stalls. A stall may be caused by a branch misprediction, a cache miss, a data dependency, or another event, and during a stall no useful work may be performed for a particular instruction in a given clock cycle.
Different techniques are used to fill these unproductive cycles in a pipeline with useful work. Some examples include loop unrolling of instructions by a compiler, branch prediction mechanisms within a core, and out-of-order execution within a core. An operating system may divide a software application into processes and further divide processes into threads. A thread, or strand, is a sequence of instructions that may share memory and other resources with other threads and may execute in parallel with other threads. A processor core may be constructed to execute more than one thread per clock cycle in order to increase efficient use of the hardware resources and reduce the effect of stalls on overall throughput. A microprocessor may include multiple processor cores to further increase parallel execution of multiple instructions per clock cycle.
The above techniques may hide some of the unproductive clock cycles due to cache misses by overlapping them with useful work of other instructions. However, if the latencies of L1 and L2 cache misses are large, some unproductive cycles may still occur in the pipeline and the IPC may still decrease. Some techniques to decrease the stall cycles due to cache misses include using larger caches, using higher associativity in the caches, speculatively prefetching instructions and data, using non-blocking caches, using early restart or critical word first, using compiler optimizations, and others.
Some scientific applications, such as high performance computing (HPC) software applications, are memory intensive. A few application examples include climate simulations of the world's oceans, computational fluid dynamics (CFD) problems such as a tunnel model of an aircraft wing using the Navier-Stokes equations, computational chemistry, and an air quality model used by the U.S. Environmental Protection Agency (EPA). These applications may have ratios of memory instructions per single floating-point instruction as high as 400 to 1,400. Also, the codes tend to be loop-intensive and benefit from architectures that offer single-instruction-multiple-data (SIMD) operations, in which a loop is able to operate on multiple data elements in a data set with a single operation.
Therefore, a critical performance bottleneck for a processor executing code as described above is the processor's forwarding-store buffer and cache design. The stall cycles from cache misses need to be reduced in order to efficiently supply data to the operations. A non-blocking cache may be used in order to perform hit-under-miss accesses, in which cache hits are serviced while an earlier miss is still outstanding, and thereby increase the IPC.
A problem may arise with scientific applications, such as computational chemistry, that lack data locality and have high data dependency. A non-blocking cache may not help, since the data from hit-under-miss accesses may not be used until the data from the cache miss is returned. A blocking cache may ensure in-order supply of the data, but the latencies from the cache miss and the subsequent cache hits accumulate and reduce the IPC.
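The trade-off described above may be illustrated with a minimal timing sketch. The following model is an assumption made for illustration only and is not taken from any particular processor: the latencies `HIT_LATENCY` and `MISS_LATENCY`, the one-access-per-cycle issue rate, and the function names are all hypothetical. It reports the cycle at which the data of each access becomes usable for one miss followed by three hits.

```python
from typing import List

# Hypothetical latencies, chosen only for illustration.
HIT_LATENCY = 2    # cycles to service a cache hit
MISS_LATENCY = 20  # cycles to service a cache miss


def latency(is_miss: bool) -> int:
    """Return the assumed service latency of one access."""
    return MISS_LATENCY if is_miss else HIT_LATENCY


def blocking_ready_times(misses: List[bool]) -> List[int]:
    """Blocking cache: each access starts only after the previous one
    completes, so the latencies accumulate."""
    ready, now = [], 0
    for is_miss in misses:
        now += latency(is_miss)
        ready.append(now)
    return ready


def nonblocking_ready_times(misses: List[bool], in_order: bool) -> List[int]:
    """Non-blocking cache: one access issues per cycle and accesses overlap.
    When in_order is True, data is usable only after all earlier data is
    usable, which models a chain of dependent consumers."""
    ready: List[int] = []
    for issue_cycle, is_miss in enumerate(misses):
        done = issue_cycle + latency(is_miss)
        if in_order and ready:
            done = max(done, ready[-1])  # wait for the earlier access
        ready.append(done)
    return ready


if __name__ == "__main__":
    pattern = [True, False, False, False]  # one miss, then three hits
    print(blocking_ready_times(pattern))
    print(nonblocking_ready_times(pattern, in_order=False))
    print(nonblocking_ready_times(pattern, in_order=True))
```

Under these assumptions, the hits of the non-blocking cache complete early when the accesses are independent, but with in-order consumption their data is not usable before the miss returns. The blocking cache supplies data in order, but its last access completes only after the sum of all of the latencies.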
In view of the above, efficient methods and mechanisms for increasing the throughput of processors are desired. 