Throughout the development of modern computer systems, one important facet of system performance has been memory access time. While reducing access times to the actual memory devices, such as dynamic random access memories (DRAMs), has had a profound effect on system throughput, architectural advances have also increased memory system performance. Perhaps the most important of these advances is the use of cache memory residing between a processor and a main memory of the computer system.
Generally, a cache memory resides in a memory system hierarchy between one or more processors and a main memory. The cache is a relatively small and fast memory compared to the main memory and holds copies of a portion of the data residing within the main memory address space. Since the cache is smaller than the main memory, it cannot hold all of the data that may reside in the main memory. Instead, the cache typically is designed to hold the data that the processor accesses most often. Moreover, multiple cache levels are often employed between the processor and the main memory, with higher levels of cache (i.e., those cache levels located closer to the processor) being relatively smaller and faster than lower cache levels. As an example, the use of three or four cache levels in commercial computing systems is now commonplace.
Typically, data is stored within the cache memory in response to the processor reading data from the main memory. As the data passes through the cache, the data may be stored therein so that subsequent requests for the same data may be satisfied via the cache instead of the slower main memory. In other cases, data written by the processor to the main memory may be stored in the cache as it passes to the main memory. Given the limited amount of storage space within the cache, several caching algorithms, such as “least recently used” (LRU) and “least frequently used” (LFU), have been devised to determine which data is to be stored in the cache and which is to be discarded. The primary goal of such an algorithm is to maximize the cache “hit ratio,” or the percentage of processor read requests for data that the cache can satisfy.
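As an illustration only, the LRU policy and the hit ratio described above may be sketched in software as follows. The class name `LRUCache`, its methods, the two-line capacity, and the address trace are all hypothetical choices made for this sketch; an actual cache implements replacement in hardware and is rarely fully associative.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache that counts read hits (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # keys are addresses, oldest use first
        self.hits = 0
        self.reads = 0

    def read(self, address):
        """Record one processor read; return True if the cache satisfied it."""
        self.reads += 1
        if address in self.lines:
            self.lines.move_to_end(address)  # now the most recently used
            self.hits += 1
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # discard the least recently used
        self.lines[address] = None  # store the data as it passes through
        return False

    def hit_ratio(self):
        """Fraction of read requests satisfied by the cache."""
        return self.hits / self.reads if self.reads else 0.0


# Hypothetical trace of read addresses against a two-line cache.
demo = LRUCache(capacity=2)
outcomes = [demo.read(a) for a in [1, 2, 1, 3, 1, 2]]
```

In this trace the third and fifth reads hit, so two of the six reads are satisfied by the cache. An LFU policy would differ only in which line is discarded when the cache is full.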
To further increase computer system performance, some caching memory systems utilize “pre-fetching.” More specifically, rather than wait for the processor to request data before retrieving that data from the main memory and storing it in the cache, the memory system may retrieve the data from the main memory and store it in the cache prior to the processor requesting the data, thus eliminating the latency between the request and the storing of the data in the cache. To implement pre-fetching, caching memory systems often presume data requests will follow in a linear or sequential fashion, continuing with the next memory address following the most recent data request.
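A minimal sketch of such sequential pre-fetching follows, assuming a cache of unbounded size and a pre-fetch depth of one address. The function name `count_hits` and the address range are hypothetical and serve only to show the effect on a purely sequential request stream.

```python
def count_hits(addresses, prefetch_next=False):
    """Count the reads satisfied by a toy cache of unbounded size."""
    cached = set()
    hits = 0
    for address in addresses:
        if address in cached:
            hits += 1
        cached.add(address)  # data is stored as it passes through the cache
        if prefetch_next:
            # Presume the next request targets the following address and
            # fetch it before the processor asks for it.
            cached.add(address + 1)
    return hits


sequential = list(range(100, 108))  # eight consecutive addresses
hits_without = count_hits(sequential)
hits_with = count_hits(sequential, prefetch_next=True)
```

Without pre-fetching, every first-time read in this stream misses. With pre-fetching, only the initial read misses, since each later address was fetched while the preceding one was being served.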
Unfortunately, many data access patterns are neither linear nor sequential. For example, multiple software threads may be executing on one or more processors coupled with the memory system. Under that scenario, each thread may be requesting data in a sequential fashion, but when the requests are received concurrently and collectively at the memory system, the sequential nature of the memory accesses of each separate thread is not apparent. One example of a system executing several such software threads is a relational database decision-support server. Queries to a relational database are often processed by multiple software threads executing concurrently, with each thread accessing a separate database “relation,” or table, often combining the data from the tables in an operation called a “join.” However, while each thread may retrieve data sequentially from the system address space, the memory system may see only memory requests that appear spatially to be at least somewhat random, thus defeating any potential benefit from a standard pre-fetching algorithm.
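The effect may be illustrated with a short sketch under simplifying assumptions: a pre-fetcher that holds only the single address following the most recent request, two threads, and strict round-robin arrival of their requests at the memory system. The function name `prefetch_hits` and the address ranges are hypothetical.

```python
from itertools import chain


def prefetch_hits(addresses):
    """Count requests satisfied by a one-entry next-address pre-fetch buffer."""
    prefetched = None
    hits = 0
    for address in addresses:
        if address == prefetched:
            hits += 1
        # Presume the next request follows the most recent one sequentially.
        prefetched = address + 1
    return hits


thread_a = list(range(0, 8))        # one thread scanning one table
thread_b = list(range(1000, 1008))  # another thread scanning a second table

# Arrival order seen by the memory system: 0, 1000, 1, 1001, 2, 1002, ...
interleaved = list(chain.from_iterable(zip(thread_a, thread_b)))

hits_a_alone = prefetch_hits(thread_a)
hits_b_alone = prefetch_hits(thread_b)
hits_interleaved = prefetch_hits(interleaved)
```

Each thread taken alone is served from the pre-fetch buffer on every request after its first, yet the interleaved stream never finds its next address in the buffer, because each request from one thread overwrites the address pre-fetched for the other.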