The invention relates to information processors involving cache memory systems, and specifically to cache memory systems for use in vector processing. The research for this invention was supported, in part, by National Science Foundation Grant No. CCR-8909672.
Cache memories have been successfully used to improve the system performance of general purpose computers (see A. J. Smith, "Cache memories," Computing Surveys, vol. 14, pp. 472-530, September 1982). Such general purpose computers typically involve scalar processors which, although capable of processing vector data, do so in a relatively time-consuming fashion.
The processing of vector data (vector processing) can be performed more efficiently by processors designed for that purpose (called vector processors). The overall operational speed of vector processors is limited, however, in part, by the time required to locate and move vector data to and from main memory. Although cache systems have been used in connection with vector processors, such systems are not efficient and typically do not increase the operating efficiency of the vector processor system. This is because present cache systems are not designed to cooperate efficiently with vector processors.
Current vector processors typically do not have cache memories because of resulting decreases in vector processing performance. These decreases in performance are due primarily to three factors. First, numerical programs generally have data sets that are too large for the present cache sizes. Vector data accesses of a large vector may result in complete reloading of the cache before the processor reuses the data. Second, address sequentiality is typically required by conventional caches. However, vectorized numerical algorithms usually access data at certain unit or nonunit intervals (referred to in the art as strides). Third, register files and highly interleaved memories have been commonly used to achieve high memory bandwidth required by vector processing. Cache memories may not significantly improve the performance of such systems. Due to the rapid advances in device technology and increased gap between processor speeds and memory speeds (see J. L. Hennessy and D. A. Patterson, Computer Architecture, A Quantitative Approach, Morgan Kaufmann, 1990), it has become increasingly important to study the performance of cache memories for vector processors (see H. S. Stone, High Performance Computer Architecture, Addison-Wesley, 1990).
The poor performance of cache memories for processors due to the size of data sets has been studied by a number of researchers (see M. S. Lam, E. E. Rothberg, and M. E. Wolf, "The cache performance and optimizations of blocked algorithms," Proc. of Arch. Supp. for Prog. Lang. and Opr. Sys., pp. 63-74, April 1991; K. So and V. Zecca, "Cache performance of vector processors," Proc. 15th. Int'l Symp. on Comp. Arch., pp. 261-268, 1988; and D. Gannon, W. Jalby, and K. Gallivan, "Strategies for cache and local memory management by global program transformation," Int'l Conf. on Supercomputing, 1987). It is known that the memory hierarchy can be better utilized if numerical algorithms are blocked. In the case of cache memories in vector processors (vector caches), the vector data can be blocked into several segments and computations can be performed on the segments, instead of operating on an entire large vector. Blocking is a general program optimization technique that promotes data reuse in high speed memories. It has been shown that blocking is effective for many algorithms in linear algebra (see J. Dongarra et al., "A set of level 3 basic linear algebra subprograms," ACM Transactions on Mathematical Software, vol. 16-1, pp. 1-17, March 1990).
The cache performance of a blocked matrix multiplication algorithm has been studied, and it has been found that the blocking factor (the size of the "inner loop") has a significant impact on cache performance (see Lam et al. supra). Performance studies have also been done for vector caches by means of trace driven simulations, showing that although the program locality of vector executions is significantly different from that of scalar executions, the cache hit ratio can be high enough to take advantage of having a cache (see So et al. supra). These studies are based on traces of sets of fixed size programs which are either cache-optimized subroutines from the machine library or are highly vectorized for vector machines. Such a cache may not, however, be as efficient for generally blocked programs with different problem sizes.
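As an illustration of what blocking means here, the following is a minimal Python sketch of a generic tiled matrix multiplication. It is not the specific algorithm studied by Lam et al.; the function name `blocked_matmul` and the parameter `block` (standing in for the blocking factor) are assumptions introduced only for this example.

```python
def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) tile by tile.

    `block` plays the role of the blocking factor: each tile of
    a, b and c is at most block x block, so the data touched by the
    three innermost loops can stay resident in a fast memory.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):            # tile row of c
        for kk in range(0, n, block):        # tile along the reduction axis
            for jj in range(0, n, block):    # tile column of c
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a[i][k]       # reused across the whole j loop
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

The result does not depend on `block`; only the order in which elements are visited changes, which is what allows the blocking factor to be tuned to a cache.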
Sequentiality of vector addresses depends on the vector access stride, which varies widely in numerical algorithms. Since the basic storage unit in a cache is a cache line, which consists of a group of consecutive memory words, cache pollution may result if the access stride is not one. Large cache lines may exploit the spatial locality of vector accesses with small strides but may lead to poor cache performance for large strides. Small cache lines, on the other hand, may increase the number of cache misses depending on the vector stride.
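The pollution effect can be made concrete with a small Python sketch. It assumes a vector that starts at word address 0, a positive stride, and line-aligned fetches; the helper name `line_utilization` is hypothetical and used only for illustration.

```python
def line_utilization(stride, line_words, n_elements):
    """Return (lines_fetched, fraction_of_fetched_words_used).

    Assumes the vector starts at word address 0 and that every
    access brings in one whole cache line of `line_words` words.
    """
    addresses = [i * stride for i in range(n_elements)]
    # Each distinct line number corresponds to one line fetched from memory.
    lines = {addr // line_words for addr in addresses}
    fetched_words = len(lines) * line_words
    return len(lines), n_elements / fetched_words
```

Under these assumptions, 64 elements read with 8-word lines use every fetched word at stride 1, half of them at stride 2, and only one word in eight at stride 8; the unused words are the pollution described above.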
A comprehensive study has been presented on the effects of cache line sizes on the performance of vector caches (see J. W. C. Fu and J. H. Patel, "Data prefetching in multiprocessor vector cache memories," Proc. 18th. Int'l Symp. on Comp. Arch., pp. 54-63, 1991). The study proposes two prefetching schemes, sequential-prefetching and stride-prefetching, for vector caches to reduce the influence of long stride vector accesses. The two prefetching schemes yielded certain performance improvements. However, the cache miss ratios for some of the applications considered still exceeded 40%. This is due, in part, to the fact that the poor sequentiality of vector data results not only in cache pollution but also in a large number of interference misses.
Conventional supercomputer vector machines use interleaved memory and large register files to increase memory access speeds. However, due to the increasing technological gap between processor speeds and memory speeds, there is a need to increase memory speeds by involving a cache memory system in such supercomputer vector machines. A cache memory system is needed because not only is a register file relatively small as compared with the working set of a program, but a register file system also requires the software programmer to take extra steps to manage the data. Cache memory, on the other hand, is transparent to programmers. Highly interleaved memory may provide enough bandwidth for single stream vector accesses, but the memory speed has to be extremely fast and the number of interleaved memory modules has to be very large in order to provide enough bandwidth for multiple stream vector accesses. It has been shown that hundreds and even thousands of interleaved memory modules are needed to achieve a reasonable memory performance for multiple stream vector accesses (see D. H. Bailey, "Vector computer memory bank contention," IEEE Trans. on Computers, vol. C-36, pp. 293-298, March 1987). Furthermore, vector processing has become a mainstream form of computing ranging from superminis to workstations. Vector processors have also been incorporated into mainframes as built-in accelerators for computation-intensive applications. For these types of machines, a cache memory can be a cost-effective enhancement towards a smooth memory hierarchy (see So et al. supra; Fu et al. supra; and W. Abu-Sufah and A. D. Malony, "Vector processing on the ALLIANT FX/8 multiprocessor," Int. Conf. on Parallel Processing, pp. 559-566, August 1986). Several vector computers feature a cache-based memory hierarchy, such as IBM 3090 (see So et al. supra), Alliant FX/8 (see Abu-Sufah et al. supra) and Vax/600 (see D. Bandarkar and R. Brunner, "VAX Vector architecture," Proc. 17th. Int'l Symp. on Comp. Arch., pp. 204-215, 1990).
Although cache memories have potential for improving the performance of future vector processors, there are practical reasons why such systems have not yet been satisfactorily efficient. A single miss in the vector cache results in a number of processor stall cycles equal to the entire memory access time, while the memory accesses of a vector processor without cache are fully pipelined. In order to benefit from a vector cache, the miss ratio must be kept extremely small. In general, cache misses can be classified into three categories (see Hennessy et al. supra): compulsory, capacity, and conflict. The compulsory misses are the misses in the initial loading of data, which can be properly pipelined in a vector computer. The capacity misses are due to the size limitations of a cache on holding data between references. If application algorithms are properly blocked as discussed above, the capacity misses can be attributed to the compulsory misses for the initial loading of each block of data, provided that the block size is less than the cache size. The last category, conflict misses, plays a key role in the vector processing environment. Conflicts can occur when two or more elements of the same vector are mapped to the same cache line or when elements from two different vectors compete for the same cache line. The former is called "self-interference" whereas the latter is called "cross-interference" (see Lam et al. supra). A study on blocked matrix multiplication algorithms shows that the self-interference misses increase drastically and dominate cache misses after the fraction of a 16K-word cache being used exceeds 3% (see Lam et al. supra). It is also shown that an algorithm with one problem size can run at twice the speed of the same algorithm with a different size.
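Self-interference can be illustrated with a toy Python model of a direct-mapped cache. The sketch assumes one-word cache lines, a vector starting at word address 0, and a conventional modulo index; the function name `sweep_misses` is hypothetical. The vector is swept twice: misses on the first sweep are compulsory, and because the vector is smaller than the cache, any miss on the second sweep is a conflict miss caused by elements of the same vector evicting one another.

```python
def sweep_misses(stride, n_elements, n_lines):
    """Sweep a strided vector twice through a direct-mapped cache.

    Returns (first_sweep_misses, second_sweep_misses). Lines hold a
    single word, so the stored tag is simply the word address.
    """
    tags = [None] * n_lines          # one entry per cache line

    def sweep():
        misses = 0
        for i in range(n_elements):
            addr = i * stride
            index = addr % n_lines   # conventional modulo placement
            if tags[index] != addr:  # empty line or a different word: miss
                misses += 1
                tags[index] = addr
        return misses

    first = sweep()
    second = sweep()
    return first, second
```

In this model a 64-element vector in a 1024-line cache is fully reused on the second sweep at stride 1 or 3, but misses on every access at stride 512 or 1024, even though it occupies only a small fraction of the cache.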
Since conflict misses that significantly degrade vector cache performance are related to vector access stride, it may be desirable to adjust the size of an application problem to provide a beneficial access stride for a given machine. However, this approach not only burdens the programmer with knowing the architectural details of a machine but is also impractical for many applications. It is known that the number of conflicts is minimized if the stride of accessing a vector is relatively prime to the number of cache lines (or sets, for a set-associative cache), which is a power of 2 for a direct-mapped or set-associative cache. Note that the stride required to access the major diagonal of a matrix is one greater than the stride required to access a row of the matrix stored in column-major order. Therefore, it is not possible to make both row access and major diagonal access efficient, because one stride or the other is even and hence not relatively prime to the cache size of any direct-mapped or set-associative cache.
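The relatively-prime condition can be checked numerically. The Python sketch below assumes one-word cache lines and modulo placement, in which case a stride s reaches exactly C / gcd(s, C) of the C lines; the names `lines_reachable` and `row_and_diagonal` are hypothetical helpers, and `leading_dim` denotes the number of rows of a column-major matrix (the row-access stride).

```python
from math import gcd


def lines_reachable(stride, n_lines):
    """Distinct cache lines a strided vector can map to (one-word lines)."""
    return n_lines // gcd(stride, n_lines)


def row_and_diagonal(leading_dim, n_lines):
    """Lines reachable by a row access and by a major-diagonal access.

    In column-major storage a row is read with stride `leading_dim`
    and the major diagonal with stride `leading_dim + 1`.
    """
    return (lines_reachable(leading_dim, n_lines),
            lines_reachable(leading_dim + 1, n_lines))
```

For a 1024-line cache, `row_and_diagonal(100, 1024)` gives `(256, 1024)` and `row_and_diagonal(101, 1024)` gives `(1024, 512)`: whichever of the two consecutive strides is even can reach at most half of the lines.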
It is therefore an object of the present invention to provide a cache memory system suitable for use in vector processing and in particular for use with vector processors.
It is a further object of the present invention to provide a vector cache memory indexing system capable of efficiently cooperating with numerical programs having data sets of various sizes and various access strides.