A known way to increase the performance of a computer system is to include a local, high-speed memory known as a cache. A cache increases system performance because there is a high probability that once the central processing unit (CPU) has accessed a data element at a particular address, its next access will be to an adjacent address. The cache fetches and stores data that is located adjacent to the requested piece of data from a slower main memory or lower-level cache. In very high performance computer systems, several caches may be placed in a hierarchy. The cache closest to the CPU, known as the upper-level or "L1" cache, is the highest-level cache in the hierarchy and is generally the fastest. Other, generally slower caches are then placed in descending order in the hierarchy, starting with the "L2" cache and continuing down to the lowest-level cache, which is connected to main memory. Note that typically the L1 cache is located on the same integrated circuit as the CPU and the L2 cache is located off-chip. However, as time passes, it is reasonable to expect that lower-level caches will eventually be combined with the CPU on the same chip.
Recently, microprocessors designed for desktop applications such as personal computers (PCs) have been modified to increase processing efficiency for multimedia applications. For example, a video program may be stored in a compression format known as the Moving Picture Experts Group MPEG-2 format. When processing the MPEG-2 data, the microprocessor must create frames of decompressed data quickly enough for display on the PC screen in real time. However, the MPEG-2 data set may be large enough to cause high cache miss rates, resulting in fetch latencies that may be as long as 100 to 150 processor clock cycles.
Even with aggressive out-of-order processor microarchitectures, it is difficult for the processor to make forward progress in program execution while it waits for data from long-latency memories, particularly when cache miss rates are significant.
To help hide this long main memory latency, many instruction set architectures have added instructions that serve only to prefetch data from memory into the processor's cache hierarchy. If software can predict far enough in advance the memory locations that the program will subsequently use, these instructions can effectively hide the cache miss latency. This is possible because the software-directed prefetch mechanism uses only the resources that service cache misses and does not tie up other valuable resources such as completion buffer entries and register renames.
One way of providing software prefetching has been classified as synchronous software-directed prefetching. The prefetching is synchronous because the prefetch hint usually specifies a small amount of memory, such as a single cache line, and can be executed in program order like any other load instruction. In architectures such as the PowerPC architecture, available from Motorola, Inc. of Austin, Texas, the instructions called data cache block touch (dcbt) and data cache block touch for store (dcbtst) are examples of synchronous software prefetch instructions.
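As a rough illustration of synchronous software prefetching, the sketch below sums an array while hinting, in program order, that an element a fixed distance ahead will be needed soon. It assumes a GCC- or Clang-compatible compiler, whose `__builtin_prefetch` builtin is typically lowered to the target's touch-style prefetch instruction. The function name `sum_with_prefetch`, the macro `PREFETCH_FOR_READ`, and the distance `PREFETCH_DISTANCE` are hypothetical and would need tuning for a real machine.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed tuning constant: how many elements ahead to prefetch. */
#define PREFETCH_DISTANCE 16

/* Use the compiler builtin where available; otherwise do nothing.
 * Arguments to the builtin: address, 0 = read access, 3 = high temporal locality. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_FOR_READ(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_FOR_READ(p) ((void)(p))
#endif

/* Sum n integers. Each iteration issues one prefetch hint for a single
 * future element, in program order, alongside the ordinary load. */
long sum_with_prefetch(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            PREFETCH_FOR_READ(&data[i + PREFETCH_DISTANCE]);
        }
        total += data[i];
    }
    return total;
}
```

The hint changes only timing, never the result: the function returns the same sum whether or not the prefetch is honored.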
Another class of prefetch instructions is called data stream touch (DST). DST instructions are classified as asynchronous because a single instruction can specify a very large amount of memory, which a DST controller then prefetches in increments of cache blocks. The DST controller runs independently of normal load and store instructions. That is, the controller runs in the background while the processor continues normally with the execution of other instructions. DST instructions are useful where memory accesses are predictable and can be used to speed up many applications, such as multimedia applications.
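The behavior of an asynchronous stream prefetch can be sketched in software. The model below is an illustrative assumption, not the actual DST controller: one call to the hypothetical `dst_start` describes a whole memory region, and each call to the hypothetical `dst_step` stands for the controller issuing one cache-block prefetch in the background. The 32-byte block size is also assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_BLOCK_BYTES 32u  /* assumed cache block size in bytes */

/* Hypothetical state of one prefetch stream. */
typedef struct {
    uintptr_t next_block;  /* block-aligned address of the next prefetch */
    size_t blocks_left;    /* number of blocks still to be prefetched */
} dst_stream;

/* Describe a region of `bytes` bytes starting at `addr`. The region is
 * widened to whole cache blocks, since prefetching works block by block. */
void dst_start(dst_stream *s, uintptr_t addr, size_t bytes)
{
    assert(s != NULL);
    if (bytes == 0) {
        s->next_block = 0;
        s->blocks_left = 0;
        return;
    }
    uintptr_t mask = ~(uintptr_t)(CACHE_BLOCK_BYTES - 1);
    uintptr_t first = addr & mask;
    uintptr_t last = (addr + bytes - 1) & mask;
    s->next_block = first;
    s->blocks_left = (size_t)((last - first) / CACHE_BLOCK_BYTES) + 1;
}

/* Issue one block prefetch. Stores the block address in *block_out and
 * returns true, or returns false once the stream is exhausted. */
bool dst_step(dst_stream *s, uintptr_t *block_out)
{
    assert(s != NULL && block_out != NULL);
    if (s->blocks_left == 0) {
        return false;
    }
    *block_out = s->next_block;
    s->next_block += CACHE_BLOCK_BYTES;
    s->blocks_left--;
    return true;
}
```

In this sketch the program would call `dst_start` once and continue with its own work, while something playing the role of the controller calls `dst_step` whenever it is able to issue another block prefetch.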
However, to perform even a single cache block prefetch, the DST controller must use the memory management unit (MMU) and other data cache resources, whether the access hits or misses. If the same MMU and data cache are used both for normal load and store instructions and for the DST controller, the problem arises of how to divide the use of these resources between the DST controller and normal loads and stores for best overall performance.