A processor comprises a memory from which code is fetched for execution, and an instruction cache where a copy of recently fetched code is stored in case that code is needed for execution again soon after. This exploits temporal locality. In addition, as will be discussed in more detail below, in order to exploit spatial locality the memory space is divided into regions, sometimes referred to as segments, from which code is fetched into correspondingly sized cache lines in the cache, such that code near to a recently used portion of code will also be cached.
The instruction cache is smaller and is coupled directly to the processor's central processing unit (CPU), in contrast with the memory which is larger and must be accessed via a bus (and memory controller in the case of an external memory). Thus the access time involved in retrieving instructions from the cache is less than that involved in fetching the same instructions from memory, and so it is preferable to retrieve instructions from the cache whenever possible.
Of course, since the cache is of a limited size, not all the code of a program of any realistic size can be held there simultaneously. Therefore sometimes when new code is fetched from memory and cached, some other code in the cache must be displaced and discarded from the cache in order to make room for the new code.
When fetching an instruction of code, the processor will refer to a caching mechanism. This will then check the cache to see if a copy of the required instruction is already there. If so, the cache can provide the instruction to the processor immediately instead of having to fetch it again from memory. This may be referred to as a cache “hit”. If not however, the caching mechanism must fetch the instruction from memory, which will take longer. This may be referred to as a cache “miss”.
The efficacy of the caching can be measured in terms of its cache hit rate, i.e., the proportion of times that an instruction is found in the cache when required for execution and so need not be fetched from memory; or conversely the cache miss rate, i.e., the proportion of times that an instruction is not found in the cache when required for execution and so must be fetched from memory. The more efficient the caching, the more time will be saved in accessing instructions for execution.
The caching mechanism on chip is typically fixed, i.e., so that it cannot vary its behaviour for different programs or circumstances to try to improve the cache hit rate. Indeed, typically the caching mechanism is completely hardwired. Nonetheless, some attempts have been made to improve cache hit rate by optimising the code itself during compilation in order to make best use of the caching mechanism.
FIG. 1 is a schematic representation of an instruction cache 6 and a memory 19 storing code to be executed on a processor. FIG. 1 also shows an example mapping between the cache 6 and the memory 19 (although other possible mappings exist). The cache 6 comprises a plurality n of cache lines L=0 to n−1, each being the same number m of instruction words long. The cache lines L are mapped to the memory 19 such that it is divided into a plurality of regions (or segments) R, each region being the same size as a cache line, i.e., m words. Each cache line L is mapped by the processor's caching mechanism to a different respective plurality of non-contiguous regions R. For instance, in the example shown, the n cache lines L=0 . . . n−1 are mapped sequentially in address space to the first n regions of memory respectively, and then sequentially to the next n regions respectively, and so on, with the pattern repeating in the memory's address space throughout the memory 19. So here, the first cache line L=0 is mapped to the memory regions R=0, R=n, R=2n, etc., corresponding to memory address ranges 0 to m−1, nm to (n+1)m−1, 2nm to (2n+1)m−1, etc. respectively; and the second cache line L=1 is mapped to the memory regions R=1, R=n+1, R=2n+1, etc., corresponding to memory address ranges m to 2m−1, (n+1)m to (n+2)m−1, (2n+1)m to (2n+2)m−1, etc. respectively; and so on.
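By way of illustration only, the example mapping of FIG. 1 may be sketched as follows. The sketch assumes word addressing and uses hypothetical values n=4 and m=8; the function names are illustrative and form no part of the described caching mechanism.

```python
# Hypothetical example parameters (not taken from FIG. 1):
N_LINES = 4      # n: number of cache lines
LINE_WORDS = 8   # m: instruction words per cache line / memory region


def region_of(address, m=LINE_WORDS):
    """Index R of the m-word memory region containing a word address."""
    return address // m


def line_of(address, n=N_LINES, m=LINE_WORDS):
    """Cache line L to which the address's region is mapped (L = R mod n)."""
    return region_of(address, m) % n


def regions_for_line(line, count, n=N_LINES):
    """First `count` regions mapped to a given cache line: L, L+n, L+2n, ..."""
    return [line + k * n for k in range(count)]
```

With these assumed values, regions_for_line(1, 3) gives regions 1, 5 and 9, i.e., R=1, R=n+1 and R=2n+1, and line_of(32) gives 0 because address 32 is the first word of region R=n.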
Because each cache line L may be mapped to multiple possible regions R of memory, it cannot of course simultaneously cache code from all the memory addresses to which it may be mapped. Therefore the cache is arranged so that each line L can only be used to cache code from one of its mapped regions R at any given time. The processor maintains a record for each cache line L=0 . . . n−1 with an entry recording the base address (“tag”) of the region R of memory to which that line is currently assigned. The processor also maintains in each record a valid flag for each cache line L recording whether that line is valid, i.e. whether the code from the assigned region R has been fetched to that cache line L.
When the processor is to fetch an instruction from an address as specified in its program counter, the caching mechanism consults the relevant record to determine whether that instruction is validly cached, and so whether or not it needs to fetch it from memory 19 or whether it can be retrieved from the cache 6 instead. If the required instruction is located in a different region R than currently mapped to the appropriate cache line L, then there must be a cache miss because the cache line L can only validly cache instructions from one region R at a time. The caching mechanism then reassigns the cache line in question to the new region and fills the cache line with the code from that region, asserting the corresponding valid flag once it has done so. If on the other hand the required instruction is of the same region as currently mapped to a cache line L, then there may either be a cache hit or a cache miss depending on whether the respective valid flag of that line L is asserted. Cache fills are done at the granularity of cache lines in order to exploit spatial locality (and burst accesses to memory).
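The look-up just described may be modelled, purely as a simplified sketch, by the following software simulation. It assumes word addressing and the example mapping L = R mod n; the class and attribute names are hypothetical, and a real caching mechanism would typically be implemented in hardware rather than in code.

```python
class DirectMappedCache:
    """Toy model of the per-line tag and valid-flag records."""

    def __init__(self, n_lines, line_words):
        self.n = n_lines                   # number of cache lines
        self.m = line_words                # words per line / region
        self.tags = [None] * n_lines       # base address of assigned region
        self.valid = [False] * n_lines     # has the line been filled?

    def access(self, address):
        """Return True on a cache hit, False on a miss (filling the line)."""
        region = address // self.m
        line = region % self.n
        tag = region * self.m              # base address of the region
        if self.valid[line] and self.tags[line] == tag:
            return True                    # hit: instruction served from cache
        # Miss: reassign the line to the new region, fill the whole line
        # from memory (modelled as instantaneous), then assert the valid flag.
        self.tags[line] = tag
        self.valid[line] = True
        return False
```

For example, with a hypothetical cache of 4 lines of 8 words, accessing address 0 misses, a subsequent access to address 3 hits (same region, line now valid), and an access to address 32 misses because region R=4 competes with region R=0 for line L=0.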
The program to be executed is stored in the memory 19 and takes up a certain area thereof, typically a contiguous or largely contiguous range of addresses. The program will comprise a plurality of functions arranged in some order in address space. Functions may be stored in regions of memory R potentially mapped to the same cache line L and/or regions mapped to different cache lines. At least some of the functions when executed will call other functions in the program. However, this can cause cache misses, because the program “jumps around” in address space in a manner not coherent with the mapping of cache lines L to memory regions R.
As an illustration of this problem, referring again to FIG. 1, consider for example a program having a function A located in region R=0 and another function B located in region R=n, both of which regions can be mapped to the same cache line L=0. Now, if function A calls function B then there will be a cache miss because the cache line L=0 cannot validly cache the regions of both functions A and B at the same time. Thus the caching mechanism will have to fetch function B from memory 19, and will have to reassign the cache line L=0 to region R=n of memory in place of region R=0: thus in effect function B displaces function A from the cache 6.
Worse, if function B returns to function A, this will cause another cache miss because again the cache line L=0 cannot validly cache the regions of both functions A and B at the same time. The processor will then have to re-fetch function A from memory 19, and the caching mechanism will have to reassign the cache line L=0 back to region R=0 in place of region R=n so that function A displaces function B.
Worse still, if function A calls function B multiple times, the processor will have to repeatedly fetch functions A and B as they repeatedly displace one another back and forth from the cache 6. This phenomenon is sometimes known as “thrashing”. It would be desirable to reduce the chance of thrashing in order to avoid unnecessary instruction fetches and thus improve processor speed.
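The effect may be illustrated by a simplified miss-counting sketch. It assumes the example mapping L = R mod n with word addressing, that each function fits within a single region, and that each call or return is modelled as a single fetch at the function's address; the function names and the values n=4, m=8 are hypothetical.

```python
def count_misses(addresses, n_lines, line_words):
    """Count misses in a direct-mapped cache for a sequence of fetches."""
    assigned = [None] * n_lines   # region held by each line (None = invalid)
    misses = 0
    for address in addresses:
        region = address // line_words
        line = region % n_lines
        if assigned[line] != region:
            misses += 1               # line holds another region or is empty
            assigned[line] = region   # displace the previous occupant
    return misses


def call_trace(addr_a, addr_b, calls):
    """Fetch sequence for A calling B `calls` times, returning each time."""
    trace = [addr_a]
    for _ in range(calls):
        trace += [addr_b, addr_a]     # call into B, then return to A
    return trace


# A in region R=0 and B in region R=n (address 32): both map to line L=0.
thrashing = count_misses(call_trace(0, 32, 3), n_lines=4, line_words=8)
# A in region R=0 and B in region R=1 (address 8): different lines.
grouped = count_misses(call_trace(0, 8, 3), n_lines=4, line_words=8)
```

Under these assumptions, every one of the seven fetches in the first layout misses, whereas the second layout incurs only the two initial fills.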
To reduce thrashing or other unnecessary fetches, the goal is therefore to generate the program in such a way that functions like A and B will appear in the same cache line L (in an ideal case), or otherwise to at least try to ensure that they do not occupy different regions R that are mapped to the same cache line L (a more likely case). The probability of either of these goals being achieved can be increased by, for as many functions as possible, arranging the code in address space such that a dependent function (i.e., a descendant or child function) is grouped together into the same region as a function from which it is called (i.e., its parent function), or if that is not possible, by at least trying to ensure those functions are grouped close by so as not to be located in different regions mapped to the same line (which would result in thrashing). This may not be possible for all functions, and it may not be practical to explicitly consider which functions will be located in which regions; but, generally speaking, the more closely functions are grouped together with their dependent functions, the more likely it is that they will be found within the same region and so be cached into the same cache line when executed, or failing that, the more likely it is that they will not fall within different regions mapped to the same cache line. That is, the more likely it is that related functions will be cached into the same cache line or at least another non-thrashing cache line.
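One conceivable way of grouping functions with their dependent functions, given purely as a hedged sketch, is to emit the functions in a depth-first order over the call graph so that each child function is placed soon after its parent. The text above does not prescribe any particular ordering algorithm; the depth-first heuristic, the function name layout_order and the example call graph are all assumptions made for illustration.

```python
def layout_order(call_graph, root):
    """Depth-first ordering: each callee is emitted soon after its caller.

    call_graph maps a function name to the list of functions it calls.
    This is an illustrative heuristic only, not a prescribed method.
    """
    order = []
    seen = set()

    def visit(function):
        if function in seen:
            return                     # already placed (shared or recursive)
        seen.add(function)
        order.append(function)
        for callee in call_graph.get(function, []):
            visit(callee)

    visit(root)
    return order


# Hypothetical program: main calls A and C, and A calls B.
example_graph = {"main": ["A", "C"], "A": ["B"], "C": []}
example_order = layout_order(example_graph, "main")
```

For the hypothetical graph shown, the resulting order is main, A, B, C, so that function B is placed adjacent to its parent function A. Such an ordering does not take region boundaries into account, consistent with the observation that it may not be practical to explicitly consider which functions will be located in which regions.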
Note that in some processors, the cache may be an associative cache whereby each different plurality of regions R is mapped to multiple respective cache lines L, usually a small number of cache lines such as two and typically no more than eight. So for example in a two-way associative cache, there are 2n cache lines, such that each plurality of regions comprising a set R, R+n, R+2n, etc. is mapped to two respective cache lines L and L+n. Or in general for a p-way associative cache, there are pn cache lines, such that each plurality of regions comprising a set R, R+n, R+2n, etc. is mapped to p respective cache lines L, L+n, . . . L+(p−1)n. Associative caches reduce the chance that code cached from one region of memory needs to be displaced in order to cache code from another region of memory. However, because the associativity p is limited, the above-described problems of cache misses and thrashing can still occur, and the cache can still benefit from grouping functions with their dependent functions.
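The associative mapping may be sketched as follows, extending the two-way pattern of lines L and L+n to p candidate lines spaced n apart. The function name and the example values are hypothetical, and the choice of which candidate line to displace on a miss (the replacement policy) is not addressed here.

```python
def candidate_lines(region, n, p):
    """Cache lines that may hold a region in a p-way associative cache.

    There are p*n lines in total; region R may be cached in any of the
    p lines L, L+n, ..., L+(p-1)n, where L = R mod n.
    """
    base = region % n
    return [base + k * n for k in range(p)]


# Hypothetical two-way cache with n=4: regions R=0 and R=n share the same
# pair of candidate lines, so both may be held at once without displacement.
lines_for_a = candidate_lines(0, n=4, p=2)
lines_for_b = candidate_lines(4, n=4, p=2)
```

In this example both regions may occupy lines 0 and 4, so functions A and B of the earlier illustration need not displace one another; a third region such as R=2n mapped to the same pair would, however, still force a displacement.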
Also, note that if the fetch unit 5 has a pre-fetch mechanism whereby it automatically pre-fetches code into the next cache line L+1 in advance, then grouping functions together with their dependents may also improve the chance that the required function is cached by the pre-fetch mechanism by the time it is required.