A very large number of systems involve the retrieval of data from a system memory by a device such as a processor. Many of these systems employ a technique known as data caching which exploits a property of data access known as temporal locality. Temporal locality means data that has been accessed recently is the data most likely to be accessed again in the near future. Data caching involves storing, or caching, a copy of recently accessed data in a cache memory that is accessible more quickly and efficiently than the system memory. If the same data is requested again in the future, the cached copy of the data can be retrieved from the cache memory rather than retrieving the original data from the system memory. As the cache memory can be accessed more quickly than the system memory, this scheme generally increases the overall speed of data retrieval.
To implement caching techniques, processor circuitry typically includes an internal cache memory which is located physically closer to the CPU than the system memory, so can be accessed more quickly than the system memory. When the processor requests data from the system memory a copy of the retrieved data is stored in the cache memory, if it is not stored there already. Some systems provide two or more caches arranged between the CPU and the system memory in a hierarchical structure. Caches further up the hierarchy are typically smaller in size, but can be accessed more quickly by the CPU than caches lower down the hierarchy. Caches within such a structure are usually referred to as level 1 (L1), level 2 (L2), level 3 (L3), . . . caches with the L1 cache usually being the smallest and fastest.
A typical cache memory comprises a series of cache lines, each storing a predetermined sized portion of data. For example, a typical cache memory is divided into 1024 cache lines, each 32 bytes in size, giving a total capacity of 32 kB. Data is usually cached in portions equal to the size of a whole number of cache lines. When an item of data smaller than a cache line is cached, a block of data equal to the size of one or more cache lines containing the data item is cached. For example, the data item may be located at the beginning of the cache line sized portion of data, at the end or somewhere in the middle. Such an approach can improve the efficiency of data accesses exploiting a principle known as spatial locality. The principle of spatial locality means that addresses referenced by programs in a short space of time are likely to span a relatively small portion of the entire address space. By caching one or more entire cache lines, not only is the requested data item cached, but also data located nearby, which, by the principle of spatial locality is more likely to be required in the near future than other data.
Each cache line of the cache memory is associated with address information, known as tags, identifying the region of the system memory from which the data stored in each cache line was retrieved. For example, the tag associated with a particular cache line may comprise the address of the system memory from which the cache line sized portion of data stored in that cache line was retrieved. The cache lines may be stored in a data memory portion of the cache, while the tags may be stored in a tag memory portion of the cache.
When a processor requests data from the system memory, the address of the requested data is first compared to the address information in the tag memory to determine whether a copy of the requested data is already located in the cache as the result of a previous data access. If so, a cache hit occurs and the copy of the data is retrieved from the cache. If not, a cache miss occurs, in which case the data is retrieved from the system memory. In addition, a copy of the retrieved data may be stored in the cache in one or more selected cache lines and the associated tags updated accordingly. In a system comprising a cache hierarchy, when data is requested from the system memory, the highest level cache is first checked to determine if a copy of the data is located there. If not, then the next highest level cache is checked, and so on, until the lowest level cache has been checked. If the data is not located in any of the caches then the data is retrieved from the system memory. A copy of the retrieved data may be stored in any of the caches in the hierarchy.
When applying caching techniques, it is important to ensure that the data stored in a cache represents a true copy of the corresponding data stored in the system memory. This requirement may be referred to as maintaining coherency between the data stored in the system memory and the data stored in the cache. Data coherency may be destroyed, for example, if data in one of the system memory and cache is modified or replaced without modifying or replacing the corresponding data in the other. For example, when the processor wishes to modify data, a copy of which is stored in the cache, the processor will typically modify the cached copy without modifying the original data stored in the system memory. This is because it is the cached copy of the data that the processor would retrieve in future accesses and so, for efficiency reasons, the original data stored in the system memory is not modified. However, without taking steps to maintain coherency, any other devices which access the data from the system memory would access the unmodified, and therefore out of date, data.
Various techniques may be applied to maintain data coherency in cache memory systems. For example, one process, referred to as write-back or copy-back, involves writing or copying data stored in one or more cache lines back to the region of system memory from which the cache lines were originally retrieved (as specified in the address information). This process may be performed in a variety of circumstances. For example, when data stored in a cache line has been modified, the cache line may be copied back to the system memory to ensure that the data stored in the cache line and the corresponding data in the system memory are identical. In another example, when data is copied into the cache as a result of a cache miss, an existing cache line of data may need to be removed to make space for the new entry. This process is known as eviction and the cache line of data that needs to be removed is known as the victim. If the victim comprises modified data, then the victim would need to be written back to the system memory to ensure that the modifications made to the data are not lost when the victim is deleted from the cache.
In some systems, special data coherency routines implemented in software are executed to maintain data coherency. Such routines may periodically sweep the cache to ensure that data coherency is maintained, or may act only when specifically required, for example when data is modified or replaced. These routines may include write-back or copy-back processes.
Some systems employ a technique known as data pre-fetching in which data may be retrieved, possibly speculatively, before it is actually needed in order to increase the overall speed of memory access. Data pre-fetches may be speculative in the sense that the pre-fetched data may not eventually be required. In one example of data pre-fetching, when executing a code loop in which an item of data needs to be retrieved within each iteration of the loop, the data required for a particular iteration may be pre-fetched during the preceding iteration. In this way, at the point the data is actually required, it does not need to be retrieved at that time. In another example, in highly integrated multimedia systems, very large quantities of data are manipulated, typically in a linear fashion, in a technique known as data streaming. In such applications, the future access patterns of data may be known some time in advance. In this case, data required in the future may be pre-fetched so that it is immediately available when eventually required.
Typically, pre-fetched data is stored in a cache and treated as cached data. In this way, when the pre-fetched data is actually requested, the cache will be checked to determine whether the requested data is located there. Due to the earlier data pre-fetch, a copy of the data can be retrieved from the cache, rather than accessing the system memory. Pre-fetching data into a cache is useful even in applications involving data accesses where the property of temporal locality do not apply. For example, in data streaming applications, data may only be used a single time, so temporal locality does not apply in this case. However, for the reasons given above caching pre-fetched data is advantageous.
Many processor architectures provide special pre-fetch instructions which allow software to cause data to be pre-fetched into a cache in advance of its use. Examples of such instructions include pre-fetch, preload or touch instructions. In such cases a cache normally communicate via a special interface which allows the cache to perform actions when a special instruction is executed by the processor. Data may be pre-fetched into any cache present in a cache hierarchy, such as a level 1 cache or level 2 cache. In some systems, pre-fetching data into a level 2 cache may be performed as a consequence of issuing a request to pre-fetch data into the level 1 cache.
A limiting factor in the performance of many systems is the delay between a CPU requesting data from memory and the data actually being supplied to it. This delay is known as memory latency. For example, the memory latency of highly integrated systems is typically 10-100 times the duration of the execution of a single instruction by the CPU. With the continuing development of processors, CPU clock rates are increasing rapidly, resulting in increasing demand for higher rates of data access. Even with improvements in the speed of memory access, the effects of memory latency are becoming more significant as a result.
In many applications, data is transferred to a system memory for use by a processor at a later time. In order to reduce memory latency, this data may be pre-fetched into a cache once the transfer is complete by executing appropriate instructions. High level languages do not have special pre-fetch instructions or commands built in and so CPU architecture dependent assembler inserts must be used. One problem is that the use of these inserts makes code non-portable.
A piece of software being executed by a processor may be made aware of the availability of data by means of an interrupt generated by the part of the system that has performed the data transfer. The software is made aware that the data is ready by handling the interrupt. However, the time between the data being available in the system memory and the time a the processor actually requires it may be short. One problem is that there is often a significant delay between the interrupt being generated and the software acting upon it to initiate a data pre-fetch using one or more instructions. Furthermore, the execution of instructions may take a significant period of time to complete. For this reason, even if an instruction is executed to pre-fetch data, the data may not be available in the cache at the time the data is required because the pre-fetch was not initiated sufficiently in advance for the pre-fetch operation to be completed in time.
There is a need, therefore, for a system and method for pre-fetching data which is as fast and efficient as possible and in which a data pre-fetch is initiated as early as possible.