As integrated circuit technology progresses to smaller feature sizes, faster central processing units (CPUs) are being developed. Unfortunately, access times of memory subsystems, such as main memory in the form of random access memory (RAM), where instructions and data are typically stored, have not yet matched those of the CPU. The CPU must access these slower devices in order to retrieve instructions and data therefrom for processing. In retrieving these instructions and data, a bottleneck arises between the CPU and the slower memory subsystem. Typically, in order to reduce the effect of this bottleneck, a cache memory is implemented between the memory subsystem and the CPU to provide most recently used (MRU) instructions and data to the processor with lower latency. The purpose of the cache memory is to reduce the latency of instructions and data flowing from the memory subsystem to the CPU. The latency is measured by the number of clock cycles required to transfer a predetermined amount of information from main memory to the CPU. The fewer clock cycles required, the lower the latency.
During CPU execution of instructions, both the memory subsystem and cache memory are accessed. Cache memory is accessed first to see whether it holds the data bytes that fulfill the memory access request. If the memory access request is fulfilled, a cache “hit” results; otherwise, if the memory access request is unfulfilled, the memory subsystem is accessed to retrieve the data bytes. Having to access the memory subsystem to retrieve the required data bytes is termed a cache “miss.” When a cache miss occurs, the processor incurs stall cycles while the required data bytes are transferred from the memory subsystem to the processor and cache memory.
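The hit and miss behavior described above can be illustrated with a minimal sketch of a direct-mapped cache. The cache organization, the function name `simulate`, and all parameter values (4 lines of 16 bytes, a 20-cycle miss penalty) are assumptions chosen for illustration only; they are not taken from the text.

```python
def simulate(addresses, num_lines=4, line_size=16, miss_penalty=20):
    """Count hits, misses, and stall cycles for a hypothetical direct-mapped cache."""
    tags = [None] * num_lines  # one stored tag per cache line; None means empty
    hits = misses = stall_cycles = 0
    for addr in addresses:
        block = addr // line_size    # memory block that holds this byte
        index = block % num_lines    # cache line the block maps to
        tag = block // num_lines     # identifies which block occupies the line
        if tags[index] == tag:
            hits += 1                # request fulfilled by the cache
        else:
            misses += 1              # request goes to the memory subsystem
            stall_cycles += miss_penalty
            tags[index] = tag        # fetched block replaces the old one
    return hits, misses, stall_cycles


if __name__ == "__main__":
    trace = [0, 4, 8, 16, 0, 64, 0]
    print(simulate(trace))
```

In this trace, address 64 maps to the same cache line as address 0 and evicts it, so the final access to address 0 misses again even though it was used recently.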
A process of pre-fetching data bytes from the memory subsystem is performed to reduce processor stall cycles. By anticipating future use of instructions and data, this anticipated information is pre-fetched from the memory subsystem such that it can be provided to the cache memory faster when the use actually takes place. As a result, the number of processor stall cycles can be reduced, since the data was pre-fetched and does not need to be fetched from the memory subsystem at the time of use.
The process of pre-fetching data blocks uses a data bus, which provides for communication between the memory subsystem and cache memory. As a result of the pre-fetching operation, the data bus bandwidth available for other transfers is decreased. In some cases the process of pre-fetching retrieves data blocks from the memory subsystem that will not be used by the processor. This adds an unnecessary load to the bus. Fetching a data block into a certain level of the cache memory hierarchy requires replacing an existing cache data block, and replacing such a data block may result in extra bus utilization, since another fetch operation is generated if the replaced data is later needed and must again be provided from the memory subsystem. Often, the cache data blocks are re-organized such that the block being replaced is moved to a lower level of the cache memory hierarchy. Furthermore, if the moved data block is no longer available at the highest level of the cache memory hierarchy for future reference, then a cache miss may result.
Furthermore, pre-fetching of extra data blocks in anticipation of their future use by the processor may also be inefficient for the following reason. Having a number of pre-fetches occur one after another may cause bursty bus utilization, which decreases the data bus bandwidth available to other transfers. Bursty bus utilization may cause temporary starvation of other components using the shared data bus resource, which may result in other types of processor stall cycles, thus having a degrading effect on processor and system performance.
When transferring data bytes from the memory subsystem to the cache memory, the unit of transfer is known as a cache line. Once a cache line has been transferred from the memory subsystem, a subsequent cache line is pre-fetched from memory. The pre-fetching process is based on the assumption that pre-fetching the next sequential line from the memory subsystem will improve processor performance. When a cache miss on the pre-fetched cache line occurs, that cache line has already been fetched from the memory subsystem, effectively reducing cache line fetch latency. In this case, a pre-fetched cache line is put into the cache memory only when a cache miss on this line occurs; before the cache miss occurs, the pre-fetched cache line resides in a pre-fetch cache line buffer. This prevents useful cache lines from being victimized by unused pre-fetched cache lines.
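The next sequential line scheme with a pre-fetch buffer, as described above, might be sketched as follows. This is a simplified model under stated assumptions: the buffer holds a single line, cache capacity is ignored, a new pre-fetch is also issued after a buffered line is promoted into the cache, and the function name `run_prefetch` and the statistic names are hypothetical.

```python
def run_prefetch(addresses, line_size=16):
    """Model next-sequential-line pre-fetching with a one-entry pre-fetch buffer."""
    cache = set()        # line numbers resident in the cache (capacity ignored)
    buffer_line = None   # the single pre-fetched line, held outside the cache
    stats = {"hits": 0, "buffer_hits": 0, "memory_fetches": 0,
             "prefetches": 0, "unused_prefetches": 0}
    for addr in addresses:
        line = addr // line_size
        if line in cache:
            stats["hits"] += 1
            continue
        # Cache miss: the line comes from the buffer if it was pre-fetched.
        if line == buffer_line:
            stats["buffer_hits"] += 1          # latency hidden by the pre-fetch
        else:
            stats["memory_fetches"] += 1       # full memory latency is paid
            if buffer_line is not None:
                stats["unused_prefetches"] += 1  # buffered line discarded unused
        cache.add(line)                        # promoted only on an actual miss
        buffer_line = line + 1                 # assumed: always pre-fetch next line
        stats["prefetches"] += 1
    return stats


if __name__ == "__main__":
    print(run_prefetch([0, 16, 32, 48]))   # sequential access pattern
    print(run_prefetch([0, 64, 128]))      # strided access pattern
```

With the sequential pattern, every miss after the first is served from the buffer. With the strided pattern, each pre-fetched line is discarded without being used, so the bus transfers spent on those pre-fetches bring no benefit. The line still pending in the buffer at the end of a trace is not counted as unused in this sketch.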
This process of performing next sequential cache line pre-fetching is a well-known technique that is used to reduce the effects of the slow memory subsystem access times, i.e. memory latency, visible to the processor. By hiding this latency using pre-fetching, the processor incurs fewer stall cycles because a potentially anticipated cache line is already present in the pre-fetch buffer close to the cache memory. The processor can thus access this pre-fetched cache line more efficiently, potentially improving processor performance.
Cache line pre-fetching may, however, degrade system performance. Pre-fetched cache lines that are not subsequently required by the processor are nevertheless transferred from the memory subsystem to the processor, thereby consuming memory bandwidth and preventing other memory transactions from taking place. This negative side effect becomes more apparent in a unified memory system-on-chip architecture in which multiple processing elements have to share the critical memory bandwidth resource. Therefore, pre-fetching is a process which needs to be addressed from a system architecture perspective as well as from a processor point of view.
Typically, the size of the cache lines used in cache memories is determined so as to allow optimal performance of the processor; this requires the cache line size to be optimized for the most common situation, a cache hit. The size of a cache line also relates to whether the number of cache misses increases or decreases. If a cache line is too large, then cache pollution results, since the cache memory contains too much data and a majority of this data is unusable by the processor because it is not the data the processor requires. If the cache line is too small, then the cache does not contain sufficient amounts of data to prevent most cache misses, and the processor needs to pre-fetch data from the memory subsystem to facilitate processing operation.
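The cache line size trade-off can be illustrated with a small sketch that counts misses and bytes transferred for a fixed-capacity direct-mapped cache. The 64-byte capacity, the two access traces, and the function name `misses_and_traffic` are assumptions made purely for illustration.

```python
def misses_and_traffic(addresses, capacity=64, line_size=16):
    """Return (misses, bytes transferred) for a hypothetical direct-mapped cache."""
    num_lines = capacity // line_size
    resident = [None] * num_lines    # block number held by each cache line
    misses = 0
    for addr in addresses:
        block = addr // line_size
        index = block % num_lines
        if resident[index] != block:
            misses += 1
            resident[index] = block  # a whole line is fetched on every miss
    return misses, misses * line_size


if __name__ == "__main__":
    sequential = list(range(0, 64, 4))   # 16 word accesses to adjacent data
    scattered = [0, 256, 512, 768]       # 4 accesses far apart in memory
    for size in (4, 16, 64):
        print(size, misses_and_traffic(sequential, 64, size),
              misses_and_traffic(scattered, 64, size))
```

For the sequential trace, larger lines reduce the miss count while the bytes transferred stay the same. For the scattered trace, the miss count is unchanged but the larger line transfers far more bytes, most of which are never used by the processor.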
The above problems become even more significant in multiprocessor systems where a plurality of processors share the same memory and bus resources. In such a system, cache misses affect the performance of all the processors, and cache misses must be resolved effectively and in a timely manner in order to maintain overall system performance. Efficient pre-fetching can therefore significantly improve processor performance.