A known way to increase the performance of a computer system is to include a local high-speed memory known as a cache. A cache increases system performance in part because there is a high probability that once the central processing unit (CPU) accesses data at a particular address, it will soon access an adjacent address. A well-designed cache typically fetches, from slower main memory or from a lower level cache, a quantity of data commonly referred to as a line, which includes data from the desired memory address as well as data from addresses in its vicinity, and stores that line. In very high performance computer systems, several caches may be placed in a hierarchy. The cache closest to the CPU, known as the upper level or L1 cache, is the highest level cache in the hierarchy and is generally the fastest. Other, generally slower caches are then placed in descending order in the hierarchy, starting with the L2 cache and continuing down to the lowest level cache, which is connected to main memory. Typically, the L1 cache is located on the same integrated circuit as the CPU, whereas the L2 cache may be located off chip.
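The benefit of fetching a whole line can be illustrated with a minimal sketch of a direct-mapped cache. The 64-byte line size, the 256-line capacity, the direct-mapped organization, and the function name simulate are all assumptions chosen for illustration; none of them is taken from the systems described above.

```python
LINE_SIZE = 64   # bytes per cache line (assumed for illustration)
NUM_LINES = 256  # number of lines in the cache (assumed for illustration)


def simulate(addresses, line_size=LINE_SIZE, num_lines=NUM_LINES):
    """Count (hits, misses) for a hypothetical direct-mapped cache.

    Each byte address maps to exactly one cache slot. On a miss, the
    whole line containing the address is brought in, so later accesses
    to neighboring addresses in the same line hit.
    """
    tags = [None] * num_lines  # tag held in each slot; None means empty
    hits = 0
    misses = 0
    for addr in addresses:
        line_number = addr // line_size   # which memory line holds addr
        index = line_number % num_lines   # slot that line maps to
        tag = line_number // num_lines    # distinguishes lines sharing a slot
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag             # fill the slot with the new line
    return hits, misses


# 1024 sequential 4-byte accesses covering 4096 bytes, i.e. 64 lines.
sequential_hits, sequential_misses = simulate(range(0, 4096, 4))

# 64 accesses spaced one full line apart: no two share a line.
strided_hits, strided_misses = simulate(range(0, 4096, 64))
```

Under these assumptions the sequential pattern misses only once per line (64 misses, 960 hits), whereas the line-strided pattern misses on every access, which is why a cache rewards accesses to adjacent addresses.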
Recently, microprocessors designed for desktop applications such as personal computers (PCs) have been modified to increase processing efficiency for multimedia applications. For example, a video program may be stored in a compression format defined by the Moving Picture Experts Group, such as the MPEG-2 format. When processing the MPEG-2 data, the microprocessor must create frames of decompressed data quickly enough for display on the computer screen in real time. However, the MPEG-2 data set may be large enough to cause high cache miss rates, with each miss incurring a fetch latency that can be as long as 100 to 150 processor clock cycles.
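The cost of such misses can be estimated with the standard average memory access time relation, in which the average cost of an access equals the hit time plus the miss rate multiplied by the miss penalty. The following sketch applies that relation to the 100 to 150 cycle latencies mentioned above; the one-cycle hit time, the 10 percent miss rate, and the function name average_access_cycles are hypothetical values chosen only for illustration.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average cycles per memory access: hit time + miss rate * miss penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be a fraction between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty_cycles


HIT_CYCLES = 1    # assumed cache hit time, in processor clock cycles
MISS_RATE = 0.10  # assumed fraction of accesses that miss

# Miss penalties at both ends of the 100 to 150 cycle range.
low_estimate = average_access_cycles(HIT_CYCLES, MISS_RATE, 100)
high_estimate = average_access_cycles(HIT_CYCLES, MISS_RATE, 150)
```

With these assumed figures, the average access costs roughly 11 to 16 cycles rather than 1, so even a modest miss rate lets the miss penalty dominate the memory access time.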
Even with aggressive out-of-order processor micro-architectures, it is difficult for the processor to make forward progress in program execution while waiting for data from long latency memories when cache miss rates are significant. The difficulty is greater still in data processing systems that require coherent data sharing between a processor and a peripheral device such as a graphics card, or between multiple processors. Accordingly, a need exists for processors and processing systems that allow for efficient use of memory subsystem resources and prevent memory stalls on cache misses.