A microprocessor is a digital device that executes instructions specified by a computer program. A typical computer system includes a microprocessor coupled to a system memory that stores program instructions and data to be processed by the program instructions. The performance of such a system is hindered by the fact that the time required to fetch data from the system memory into the microprocessor, referred to as memory fetch latency, is typically much greater than the time required for the microprocessor to execute the instructions that process the data. The difference is often between one and two orders of magnitude. Thus, the microprocessor may sit idle while waiting for the needed data to be fetched from memory.
However, microprocessor designers recognized long ago that programs tend to access a relatively small proportion of the data a relatively large proportion of the time, such as frequently accessed program variables. Programs with this characteristic are said to display good temporal locality, and the propensity for this characteristic is referred to as the locality of reference principle. To take advantage of this principle, modern microprocessors typically include one or more cache memories. A cache memory, or cache, is a relatively small memory electrically close to the microprocessor core that temporarily stores a subset of data that normally resides in the larger, more distant memories of the computer system, such as the system memory. A cache memory may be internal or external, i.e., may be on the same semiconductor substrate as the microprocessor core or may be on a separate semiconductor substrate. When the microprocessor executes a memory access instruction, the microprocessor first checks to see if the data is present in the cache. If not, the microprocessor fetches the data from system memory into the cache in addition to loading it into the specified register of the microprocessor. Because the data is then in the cache, the next time an instruction is encountered that accesses the data, the data can be fetched from the cache into the register rather than from system memory, and the instruction can be executed essentially immediately, thereby avoiding the memory fetch latency.
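The check-then-fill behavior described above can be illustrated with a minimal software model. The sketch below assumes a direct-mapped organization with 64-byte lines and 256 lines; those parameters, and the names toy_cache, toy_cache_init, and toy_cache_access, are hypothetical and chosen only for illustration. Real caches are implemented in hardware and are commonly set-associative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64u   /* assumed cache line size in bytes */
#define NUM_LINES  256u  /* assumed line count (16 KiB of data in total) */

/* Hypothetical model of a direct-mapped cache: tags only, no data. */
typedef struct {
    bool          valid[NUM_LINES];
    uint64_t      tag[NUM_LINES];
    unsigned long hits;
    unsigned long misses;
} toy_cache;

/* Mark every line invalid and clear the counters. */
static void toy_cache_init(toy_cache *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof *c);
}

/* Model one memory access. Returns true on a hit. On a miss, the line is
 * filled, so a later access to the same line hits. */
static bool toy_cache_access(toy_cache *c, uint64_t addr)
{
    assert(c != NULL);
    uint64_t line  = addr / LINE_BYTES;          /* which memory line */
    size_t   index = (size_t)(line % NUM_LINES); /* which cache slot */
    uint64_t tag   = line / NUM_LINES;           /* identifies the line in that slot */

    if (c->valid[index] && c->tag[index] == tag) {
        c->hits++;
        return true;
    }
    c->valid[index] = true;   /* miss: fetch from memory and fill the slot */
    c->tag[index]   = tag;
    c->misses++;
    return false;
}
```

In this model, the first access to an address misses and pays the memory fetch latency, while a second access to the same address, or to any other address within the same 64-byte line, hits.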
However, some software programs executing on a microprocessor manipulate large blocks of data in a relatively regular and linear fashion, which may be referred to as processing of data streams. Examples of such programs are multimedia programs that process a stream of audio or video data. Typically, the data stream is present in an external memory, such as in system memory or a video frame buffer. Generally speaking, these programs do not demonstrate good temporal locality, since the data streams tend to be large, and the individual data elements in the stream are accessed very few times. For example, some programs read in the data stream only once, manipulate it, and write the results back out to another location, without ever referencing the original data stream again. Consequently, the benefits of the cache are lost, since the memory fetch latency must still be incurred on the first read of the data stream.
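As an illustration of the read-once access pattern described above, the following sketch applies a fixed-point gain to a buffer of 16-bit audio samples. The function name apply_gain_q8 and the Q8 gain format (256 represents 1.0) are assumptions made for this example and are not drawn from any particular program.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical gain stage for 16-bit samples. gain_q8 is a fixed-point
 * factor in which 256 represents 1.0. Each source sample is read exactly
 * once, scaled, clamped to the 16-bit range, and written to a separate
 * destination buffer. */
static void apply_gain_q8(const int16_t *src, int16_t *dst,
                          size_t n, int32_t gain_q8)
{
    for (size_t i = 0; i < n; i++) {
        int32_t v = ((int32_t)src[i] * gain_q8) / 256;
        if (v > INT16_MAX) v = INT16_MAX;
        if (v < INT16_MIN) v = INT16_MIN;
        dst[i] = (int16_t)v;   /* src[i] is never referenced again */
    }
}
```

Because no element of src is touched a second time, caching a previously read sample provides no later benefit, and each newly reached cache line of the stream incurs the memory fetch latency.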
To address this problem, several modern microprocessors exploit the fact that the programmer often knows that data will be needed well before execution of the instructions that actually process the data, such as arithmetic instructions. Consequently, modern microprocessors have added to or included in their instruction sets prefetch instructions to prefetch data into a cache of the processor before the data is needed. Some processors have even included prefetch instructions that enable the programmer to define a data stream to be prefetched. Other microprocessors have added hardware to detect a pattern of a data stream being accessed and begin prefetching into the data cache automatically. Prefetching enables the microprocessor to perform other useful work while the data is being prefetched from external memory, in hopes that the data will be in the cache by the time the instruction that needs the data is executed.
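A software prefetch of the kind described above might be expressed as in the following sketch, which uses the __builtin_prefetch built-in provided by the GCC and Clang compilers. The function name sum_with_prefetch and the prefetch distance of 16 elements are assumptions for illustration; a suitable distance depends on the memory latency and on the work performed per element, and on some processors a hardware prefetcher may already cover a simple linear pattern such as this one.

```c
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_DISTANCE 16u  /* elements ahead; assumed value, needs tuning */

/* Sum a stream of 32-bit values while hinting upcoming elements into the
 * cache. The prefetch is only a hint and does not change the result. */
static int64_t sum_with_prefetch(const int32_t *data, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Arguments: address, 0 = prefetch for read, 0 = data has no
         * temporal locality. The bounds check keeps the address
         * expression inside the array. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 0);
#endif
        sum += data[i];   /* useful work overlaps with the pending fetch */
    }
    return sum;
}
```

The arithmetic on the current element proceeds while the fetch for a later element is outstanding, which is the overlap that prefetching is intended to provide.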
However, current prefetching techniques still suffer drawbacks, and the need for improved prefetching performance is constantly increasing due to the proliferation of multimedia data streams and because memory latency is becoming longer relative to microprocessor execution speed.