1. Field of the Invention
The present disclosure relates generally to a cache memory, and in particular but not exclusively, relates to its ability to deal with pre-fetch requests, and to methods of dealing with pre-fetch requests.
2. Description of the Related Art
It is known in the art to provide a cache memory as a level of memory hierarchy between a central processing unit (CPU) or other main processor or memory master, and a main memory (or other memory-mapped device). A cache memory duplicates commonly-used locations in a main memory for the purpose of speeding up accesses to these locations. In general it stores the most recently used instructions or data from the larger but slower main memory. This means that when the CPU wishes to access data, the access request can be made to the cache instead of to the main memory. This takes far less time than an access to a main memory, so the CPU can read or write data more quickly and consequently runs more efficiently than if a cache memory were not used. The cache also updates the main memory with the duplicated locations when required, explicitly or implicitly.
Since computer programs frequently use a subset of instructions or data repeatedly, the cache is a cost-effective means of enhancing the memory system in a “statistical” manner, without having to resort to the expense of making all of the memory system faster. Currently the gap between CPU and memory clocks is widening. For example, a 1.2 GHz Athlon may have only a 133 MHz memory system, making caching even more important.
The cache is usually smaller than the main memory, which means that it cannot provide a duplicate of every location. Therefore, when an access request in the form of an address is made to the cache, it needs to determine if that particular location currently being requested is one of those duplicated locally or whether it needs to be fetched from the main memory, i.e., it performs a “tag compare” to see if that item of data is present in the cache. If the location is already stored in the cache, the access is termed a “hit” and if it is not it is termed a “miss”. The determining of whether an access is a hit or a miss takes an amount of time, t_hit. This time is normally the main factor in the amount of time that the cache takes to return a frequently used location and since speed is the purpose of such operations, this is designed to be as short as possible.
If the data is present (“hit”), it is returned quickly to the requesting CPU or suchlike. If, however, the item is not found (“miss”), it is fetched from the main memory and stored in the cache.
When a cacheable request enters the cache, the address of the request is split into three fields. These are the tag, the line and the word fields. The tag field is the top part of the address that is compared with the addresses stored in the cache to determine whether the request is a hit or a miss. The line field is the part of the address that is used to locate the tag and data in a RAM array within the cache memory. The line is a collection of words, all of which are moved in and out of the cache at once. Thus the tag field shows for which location in memory the data for a given line is cached. The word field is the part of the address that specifies which word within a line is being accessed.
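The splitting of a request address into its three fields can be illustrated by the following minimal sketch. The field widths (8 line bits, 2 word bits), the use of word addresses rather than byte addresses, and the names `AddressFields` and `split_address` are assumptions chosen for illustration only and are not taken from any particular cache.

```python
from typing import NamedTuple


class AddressFields(NamedTuple):
    tag: int   # top part of the address, compared against the stored tags
    line: int  # selects the entry in the tag RAM and data RAM
    word: int  # selects the word within the line


def split_address(addr: int, line_bits: int = 8, word_bits: int = 2) -> AddressFields:
    """Split a word address into tag, line and word fields.

    The widths are illustrative assumptions: 2 word bits give four words
    per line and 8 line bits give 256 lines.
    """
    word = addr & ((1 << word_bits) - 1)
    line = (addr >> word_bits) & ((1 << line_bits) - 1)
    tag = addr >> (word_bits + line_bits)
    return AddressFields(tag, line, word)


fields = split_address(0x12345)
print(fields)
```

Under these assumed widths, the address 0x12345 yields tag 0x48, line 0xD1 and word 1.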
The line field is used to address two RAM arrays within the cache memory, one of which contains data (the data RAM) and the other of which contains tags (the tag RAM). In order to determine whether the request is a hit or a miss, the line field of the request is looked up so that the one or more tags in the tag RAM associated with that line can each be compared with the tag of the request. If the memory location shown by the tag in the tag RAM and the memory location shown by the tag of the request match, the request is a hit. If they do not match, the request is a miss.
Within the tag RAM, each tag location has a bit called “valid”. If this bit is set low, the tag of the tag RAM is ignored because this bit indicates that there is no data in the cache for the line associated with the tag. On the other hand, if a tag is stored in the cache for that line, the line is valid. The valid bit is set low for all tag locations within the tag RAM by, for example, a reset of the cache. In a write back cache, each tag location also contains a “dirty” bit. This dirty bit is set when the line is written to in response to a request from the CPU or suchlike. This line is then termed a “dirty line”. When a dirty line is replaced with new data, its contents must be written back to the main memory so as to preserve coherency between the cache and the main memory. The dirty bit is then reset.
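The interaction of the tag RAM, the data RAM and the valid and dirty bits described above may be sketched as follows for a small direct-mapped write back cache. This is a simplified model under stated assumptions and not a definitive implementation: the class name `DirectMappedCache`, the sizes, the word-addressed backing memory, and the policy of filling the line on a write miss are all hypothetical choices made for the example.

```python
class DirectMappedCache:
    """Toy direct-mapped, write back cache over a word-addressed memory."""

    def __init__(self, num_lines=4, words_per_line=4, memory_words=64):
        self.num_lines = num_lines
        self.words_per_line = words_per_line
        self.memory = [0] * memory_words            # main memory
        self.valid = [False] * num_lines            # all low after reset
        self.dirty = [False] * num_lines
        self.tags = [0] * num_lines                 # tag RAM
        self.data = [[0] * words_per_line for _ in range(num_lines)]  # data RAM
        self.write_backs = 0

    def _split(self, addr):
        word = addr % self.words_per_line
        line = (addr // self.words_per_line) % self.num_lines
        tag = addr // (self.words_per_line * self.num_lines)
        return tag, line, word

    def _base(self, tag, line):
        # First main memory address covered by the given tag and line.
        return (tag * self.num_lines + line) * self.words_per_line

    def _fill(self, tag, line):
        # A dirty line being replaced is first written back to main memory.
        if self.valid[line] and self.dirty[line]:
            old = self._base(self.tags[line], line)
            self.memory[old:old + self.words_per_line] = self.data[line]
            self.write_backs += 1
        new = self._base(tag, line)
        self.data[line] = self.memory[new:new + self.words_per_line]
        self.tags[line] = tag
        self.valid[line] = True
        self.dirty[line] = False

    def _lookup(self, addr):
        tag, line, word = self._split(addr)
        # Tag compare; an entry whose valid bit is low is ignored.
        hit = self.valid[line] and self.tags[line] == tag
        if not hit:
            self._fill(tag, line)
        return hit, line, word

    def read(self, addr):
        hit, line, word = self._lookup(addr)
        return hit, self.data[line][word]

    def write(self, addr, value):
        hit, line, word = self._lookup(addr)  # assumed: fill on write miss
        self.data[line][word] = value
        self.dirty[line] = True               # line is now a "dirty line"
        return hit
```

In this sketch, writing to address 2 and then reading address 16 (which maps to the same line with a different tag) causes the dirty line to be written back before it is replaced.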
Normally, when an access is made to the cache and the data is not already present (a miss), a stall occurs until the line can be filled from the main memory. Clearly this has a negative impact on the efficient running of the program. Increasing the size of the cache or the size of each line can reduce the number of cache misses and hence the number of stalls, because data corresponding to a larger number of addresses in the main memory can be stored at any one time in the cache. There is however a minimum number of misses that the cache cannot avoid (termed “compulsory misses”) because the lines concerned have never been accessed before.
A cache miss can be classified as one of the following types:
(i) Compulsory Misses
If the data has never been accessed before then it will not be present in the cache. In this case the miss is classified as “compulsory”.
(ii) Capacity Misses
As a cache is of a finite size, eventually old data will have to be replaced with new data. If the data requested from the cache would have been available in an infinite-sized cache, then the miss is classified as “capacity”.
(iii) Conflict Misses
A cache is made up of one or more banks. When an address is presented to the cache it uses some of the address bits to determine which row to look in. It then searches this row to see if any of the banks contain the data it requires, by matching the tags. This type of miss can be understood by considering the common organizational types of cache, as follows.
A cache memory is usually organized as one of three types. The first type is a direct-mapped cache in which each location in the cache corresponds to one location in the main memory. However, since the cache memory is smaller than the main memory, not every address in the main memory will have a corresponding address mapped in the cache memory. The second type is a fully-associative cache in which data is stored in any location in the cache together with all or part of its memory address. Data can be removed to make space for data in a different location in the main memory which has not yet been stored. The third type is an n-way associative cache, essentially a combination of the first and second types.
When a request is made to a fully associative cache, the whole cache is searched to see if the data is present, as if the cache had only one row but a large number of banks. A conflict miss in a different type of cache occurs when the requested data would have been present in a fully associative cache but is not present in the actual cache. In this case the data must have been discarded due to a bank conflict. That is, for a particular row more items of data need to be stored than there are banks available.
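The three miss types can be distinguished in simulation by comparing the actual cache against two reference models, as in the following sketch. It is illustrative only: the function name `classify_misses`, the use of line addresses as input, and the least-recently-used replacement policy in both the actual cache and the fully associative reference are assumptions, since no replacement policy is specified above.

```python
from collections import OrderedDict


def classify_misses(line_addresses, num_rows, num_banks=1):
    """Count hits and compulsory, capacity and conflict misses.

    Reference models (assumed LRU replacement):
      seen        - an infinite cache: every line ever accessed
      fully_assoc - a fully associative cache of the same total size
      rows        - the actual cache: num_rows rows of num_banks banks
    """
    total_lines = num_rows * num_banks
    seen = set()
    fully_assoc = OrderedDict()
    rows = [OrderedDict() for _ in range(num_rows)]
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for line in line_addresses:
        # Actual cache: some address bits select the row, tags are matched.
        row = rows[line % num_rows]
        actual_hit = line in row
        if actual_hit:
            row.move_to_end(line)
        else:
            if len(row) >= num_banks:
                row.popitem(last=False)      # evict least recently used
            row[line] = True

        # Fully associative reference: one row, total_lines banks.
        fa_hit = line in fully_assoc
        if fa_hit:
            fully_assoc.move_to_end(line)
        else:
            if len(fully_assoc) >= total_lines:
                fully_assoc.popitem(last=False)
            fully_assoc[line] = True

        if actual_hit:
            counts["hit"] += 1
        elif line not in seen:
            counts["compulsory"] += 1        # never accessed before
        elif not fa_hit:
            counts["capacity"] += 1          # lost even with full associativity
        else:
            counts["conflict"] += 1          # lost only due to a bank conflict
        seen.add(line)
    return counts


# Lines 0 and 4 compete for the same row of a 4-row direct-mapped cache.
print(classify_misses([0, 4, 0, 4], num_rows=4))
```

For the trace shown, the first two accesses are compulsory misses and the last two are conflict misses, since a fully associative cache of four lines would have retained both lines.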
It would be desirable to provide a scheme which allows the cache to predict what data will be required next and thus reduce the number of compulsory misses to a minimum.
It is well known that most data access patterns have locality of reference. That is, if a particular address is accessed then there is a high probability that another location nearby will also be required. For example, certain applications (such as an MPEG decoder) tend to read their input in a stream, perform some computation and produce an output stream. In other words, they have sequential data access patterns. Sequential data access patterns have a high locality of reference because they always access the next adjacent location.
Given the locality of reference often present when executing programs, one known way to exploit the locality of reference is to have a cache line that is larger than a single data word. In this way when a data word is accessed its neighbors are also fetched into the cache. As the cache line gets larger there is a greater chance of exploiting the locality of reference. The disadvantage of making the cache line too big is that the number of cache conflicts increases and the miss penalty is made larger. In other words, if the line is too big, most of the data fetched is not required and therefore the cache becomes inefficient. Fetching more data also increases the bandwidth demand on the main memory system.
An alternative to increasing the cache line size is to pre-fetch data. This means the cache predicts what data is required next and fetches it before it is requested. One known system is that of application/compiler driven pre-fetching. In such a software driven pre-fetch scheme the assembler code contains instructions/hints that let the cache know it should ensure the specified data is in the cache. This means the cache can start fetching the data before it is required and thus reduce the number of stalls. While this scheme should work in theory, in practice it does not always perform as well as expected. The main reason for this is that memory latencies are large. For example, 30 cycles would not be an uncommon cache fill time. If the application wished to prevent a cache stall it would therefore have to issue a pre-fetch 30 cycles before the data is required. Assuming use of a modern processor that can issue up to four instructions per cycle, this would imply that the pre-fetch would have to be placed up to 120 instructions in advance. Performing a pre-fetch this far in advance is very hard to achieve.
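The arithmetic behind the figure of 120 instructions can be expressed as a short calculation. The function name `prefetch_lead_instructions` is hypothetical, and the result is an upper bound that assumes the processor issues at its full width on every cycle.

```python
def prefetch_lead_instructions(fill_cycles: int, issue_width: int) -> int:
    """Upper bound on how many instructions ahead a software pre-fetch
    must be placed to hide a cache fill of fill_cycles cycles, assuming
    issue_width instructions are issued on every cycle."""
    return fill_cycles * issue_width


# 30-cycle fill time on a processor issuing up to four instructions per cycle.
print(prefetch_lead_instructions(30, 4))
```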
The second problem arising in such a system is that pre-fetch instructions consume instruction bandwidth and potential issue slots. This means each pre-fetch instruction takes up a slot that could otherwise be used for computation. It is therefore possible that adding pre-fetch instructions will actually slow down an application.
Another known scheme to pre-fetch data by exploiting locality of reference is to fetch a number of lines ahead. This scheme is known as fixed distance pre-fetch. In this scheme, when line ‘N’ is accessed as a result of a fetch request to the cache, the cache then pre-fetches line ‘N+d’ (where d is the pre-fetch distance), if it is not already present in the cache. For this scheme to work efficiently the cache must support up to ‘d’ outstanding memory requests and the value of ‘d’ needs to be set high enough to overcome the memory latency. For example, if it takes 32 cycles to fetch a 16 byte cache line from memory and the processor can read one four byte data word per cycle, then ‘d’ should be set to 8 (d = cycles/(linesize/datasize) = 32/(16/4) = 8).
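The choice of pre-fetch distance and the pre-fetch decision itself may be sketched as follows. The function names `prefetch_distance` and `line_to_prefetch` are hypothetical, and rounding the distance up to a whole number of lines is an assumption that goes beyond the formula given above.

```python
import math


def prefetch_distance(fill_cycles: int, line_bytes: int, word_bytes: int) -> int:
    """d = cycles / (linesize / datasize), for a processor that reads one
    data word per cycle. Rounded up (an assumption) so that the memory
    latency is fully covered when the division is not exact."""
    cycles_to_consume_line = line_bytes / word_bytes
    return math.ceil(fill_cycles / cycles_to_consume_line)


def line_to_prefetch(accessed_line: int, d: int, cached_lines: set):
    """Return line N+d if it is not already present in the cache,
    otherwise None (no pre-fetch is needed)."""
    target = accessed_line + d
    return None if target in cached_lines else target


d = prefetch_distance(fill_cycles=32, line_bytes=16, word_bytes=4)
print(d)
print(line_to_prefetch(10, d, cached_lines={10, 11}))
```

With the figures from the example above, the distance evaluates to 8, so an access to line 10 triggers a pre-fetch of line 18 unless that line is already cached.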
The biggest problem with this fixed distance pre-fetch scheme is knowing what value to choose for ‘d’. If it is too small then the pre-fetch will not prevent the processor from stalling on the cache. If it is too large then the cache will pre-fetch too much data, consuming extra bus bandwidth and potentially evicting useful data from the cache.