1. Field of the Invention
The present invention generally relates to high performance processors that use hardware managed caches to enhance their performance and, more particularly, to engineering and scientific vector processors, especially those that feed the vector through the cache.
2. Description of the Prior Art
In high performance computers, caches serve to reduce the observed latency to memory. The cache provides a relatively small but very high performance memory very close to the processor. Data from the much larger but slower main memory is automatically staged into the cache by special hardware on a demand basis, typically in units of transfer called "lines" (ranging, for example, from 32 to 256 bytes). If the program running on the computer exhibits good locality of reference, most of the accesses by the processor are satisfied from the cache, and the average memory access time seen by the processor will be very close to that of the cache; e.g., on the order of one to two cycles. Only when the processor does not find the required data in cache does it incur the "cache miss penalty", which is the longer latency to the main memory; e.g., on the order of twenty to forty cycles in computers with short cycle times. For a given cache structure, a program can be characterized by its "cache hit ratio" (CHR) which is the fraction of the accesses that are satisfied from the cache and hence do not suffer the longer latency to main memory.
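The relationship between the cache hit ratio and the average access time described above can be illustrated with a minimal sketch. It assumes a simple model, not stated in this form in the text, in which every hit costs the cache latency and every miss costs the full main memory latency. The function name and parameter names are hypothetical.

```python
def average_access_time(hit_ratio, cache_cycles, memory_cycles):
    """Average memory access time in cycles.

    Assumed model (illustrative only): a hit costs cache_cycles and a
    miss costs memory_cycles, the longer latency to main memory.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must lie between 0 and 1")
    return hit_ratio * cache_cycles + (1.0 - hit_ratio) * memory_cycles


if __name__ == "__main__":
    # Latencies taken from the ranges quoted in the text: a two-cycle
    # cache and a forty-cycle main memory.
    for chr_value in (1.0, 0.95, 0.5, 0.0):
        print(chr_value, average_access_time(chr_value, 2, 40))
```

Under this model, the average access time stays close to the cache latency only while the hit ratio remains near one, and it approaches the main memory latency as the hit ratio falls toward zero.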
Given the size of the cache, the structure of the cache has to be decided in terms of line size (in bytes), the number of lines, and the set associativity. Numerous design trade-off considerations go into these decisions. For example, the line size is chosen to be sufficiently large, since most references are sequential and therefore make efficient use of prefetched data. If the line size is small, more line misses occur, and hence more miss penalty is incurred, for the same amount of data that currently defines the program locality. Further, smaller lines mean more lines in the cache, which has cost, complexity and performance implications for the cache directory design.
At the same time, the line size must not be too large, since that may result in too few lines and hence restrict the number of disjoint regions over which the locality of reference may be distributed. Further, if the line size is large, each line miss will bring in a large number of data elements, not all of which may be used during the line's residency in the cache. This results in time and available main memory bandwidth being spent unnecessarily on data that will not be referenced.
The set associativity of the cache is selected to reduce the probability of cache line thrashing. Line thrashing occurs when the current locality of reference includes more lines from one congruence class, that is, lines that map into the same set, than the level of associativity provided. The lines then constantly displace each other from the cache, driving down the CHR. The set associativity, on the other hand, cannot be arbitrarily large, since it has a bearing on the cost and complexity of the cache look-up mechanism.
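Line thrashing can be illustrated with a small simulation. The following is a minimal sketch and not a description of any particular cache design: it assumes a least-recently-used replacement policy, 64 sets and 128-byte lines, none of which are specified in the text, and the class and function names are hypothetical.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy set-associative cache with LRU replacement (an assumption)."""

    def __init__(self, num_sets, ways, line_bytes):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # One ordered dictionary per set; the oldest entry is the LRU line.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.accesses = 0

    def access(self, address):
        """Reference one byte address; return True on a cache hit."""
        line = address // self.line_bytes
        entries = self.sets[line % self.num_sets]
        self.accesses += 1
        if line in entries:
            entries.move_to_end(line)  # mark as most recently used
            self.hits += 1
            return True
        if len(entries) >= self.ways:
            entries.popitem(last=False)  # displace the LRU line
        entries[line] = True
        return False

    def hit_ratio(self):
        return self.hits / self.accesses if self.accesses else 0.0


def thrash_hit_ratio(ways, distinct_lines, rounds=100,
                     num_sets=64, line_bytes=128):
    """Cycle repeatedly over lines that all map into the same set."""
    cache = SetAssociativeCache(num_sets, ways, line_bytes)
    set_span = num_sets * line_bytes  # addresses this far apart share a set
    for _ in range(rounds):
        for i in range(distinct_lines):
            cache.access(i * set_span)
    return cache.hit_ratio()


if __name__ == "__main__":
    print(thrash_hit_ratio(ways=2, distinct_lines=3))
    print(thrash_hit_ratio(ways=4, distinct_lines=3))
```

With three lines of one congruence class cycling through a two-way set, every reference in this model displaces a line that is needed again shortly, so no reference hits. With four ways, the same three lines stay resident after the first pass.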
References by an instruction may exhibit a poor cache hit ratio for several reasons. For example, the instruction may be in a loop and reference the elements of a data structure with a non-unit stride. Classic examples are references to elements along various directions of a multi-dimensional matrix and references to a single column in a table of data. If the line size is L elements and the stride s is greater than L, each line fetched brings in L elements, only one of which will be utilized in the immediate future. If the data structure is large and/or several other data structures are being accessed by other instructions in the loop, these references will tend to flush the cache, so that by the time another element in the same line is referenced, the line will already have been displaced. This leads to situations where the cache hit ratio degrades to close to zero and the latency approaches main store access time, resulting in poor performance. Performance is degraded further because, for each element utilized, the cache mechanism fetches L-1 additional elements that are never referenced while in the cache. This incurs the delay of the additional fetches and also deprives the other processors in the system of available main memory bandwidth. Moreover, the increased cache coherence traffic can cause further degradation in all processors in the system.
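The cost of a non-unit stride can be made concrete by counting how many lines a single strided pass fetches and what fraction of the fetched elements is actually referenced. This is a minimal sketch under the assumption that each line is fetched once per pass and is displaced before any later reuse; the function name and the example sizes are hypothetical.

```python
def stride_utilization(num_accesses, stride, line_elems):
    """Return (lines_fetched, utilization) for one strided pass.

    Assumption (illustrative only): element i * stride is referenced for
    i = 0 .. num_accesses - 1, and each distinct line is fetched once.
    utilization is the fraction of fetched elements that are referenced.
    """
    touched_lines = {(i * stride) // line_elems for i in range(num_accesses)}
    lines_fetched = len(touched_lines)
    utilization = num_accesses / (lines_fetched * line_elems)
    return lines_fetched, utilization


if __name__ == "__main__":
    # L = 16 elements per line is an arbitrary example value.
    for stride in (1, 4, 16, 32):
        print(stride, stride_utilization(1024, stride, 16))
```

In this model, once the stride reaches the line size L, every reference fetches a new line and only 1/L of the fetched elements is used, which corresponds to the L-1 wasted elements per reference noted above.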
Another situation that causes a poor cache hit ratio is where the instructions in a loop reference several data objects, or several areas of the same data object, that all fall in the same congruence class. This can occur more often than one might anticipate if the dimensions of the data objects are powers of two. One can expect to see more and more of this, since in a parallel processing system the number of available processors is typically a power of two. The natural tendency, then, is to have data objects whose dimensions are also powers of two, so as to make it easy to partition them across the processors.
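The effect of power-of-two dimensions on congruence classes can be sketched by counting how many distinct cache sets are touched when one column of a row-major matrix is walked. The element size, line size and number of sets below are assumed example values, and the function name is hypothetical.

```python
def sets_touched(row_elems, num_rows, elem_bytes=8,
                 line_bytes=128, num_sets=64):
    """Count distinct cache sets touched while walking one matrix column.

    Assumptions (illustrative only): the matrix is stored row by row
    starting at address 0, and the set index of a line is its line
    number modulo num_sets.
    """
    touched = set()
    for row in range(num_rows):
        address = row * row_elems * elem_bytes  # first element of each row
        touched.add((address // line_bytes) % num_sets)
    return len(touched)


if __name__ == "__main__":
    # A row length of 1024 elements is a power of two; 1040 is not.
    print(sets_touched(1024, 64))
    print(sets_touched(1040, 64))
```

With these example values, a row length of 1024 elements places every element of the column in the same set, whereas a row length of 1040 spreads the same 64 references over 64 different sets.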
Additionally, striding through large data objects in a non-unit stride direction not only causes the particular instruction to experience a poor hit ratio, but can also cause the code surrounding that instruction to suffer. This is because the instruction with bad locality may have flushed useful data from the cache.