Not applicable.
Not applicable.
1. Field of the Invention
The present invention generally relates to a computer system that includes one or more processors each containing a vector execution unit and a bank-interleaved cache. More particularly, the invention relates to a processor that is able to access a bank-interleaved cache containing relatively large strided vectors of data. Still more particularly, the present invention relates to a system that provides high cache bandwidth and low access times for memory accesses to large strided data vectors.
2. Background of the Invention
Most modern computer systems include at least one processor and a main memory. Multiprocessor systems include more than one processor and each processor typically has its own memory that may or may not be shared by other processors. The speed at which the processor can decode and execute instructions and operands depends upon the rate at which the instructions and operands can be transferred from main memory to the processor. In an attempt to reduce the time required for the processor to obtain instructions and operands from main memory, many computer systems include a cache memory coupled between the processor and main memory.
A cache memory is a relatively small, high-speed memory buffer (compared to main memory) which is used to temporarily hold those portions of the contents of main memory which it is believed will be used in the near future by the processor. The main purpose of a cache memory is to shorten the time necessary to perform memory accesses, for both data and instructions. Cache memory typically has access times that are several or many times faster than a system's main memory. The use of cache memory can significantly improve system performance by reducing data access time, therefore permitting the CPU to spend far less time waiting for instructions and operands to be fetched and/or stored.
Processors in computer systems access data in words from the cache memory or physical main memory. In any given processor architecture a "word" may include one or more bytes, such as one, two, four, eight, sixteen or preferably any power of two. For some applications involving large amounts of data, a "vector" of data words may be required by the application. A vector is an ordered set of words stored in memory. The addresses of the vector's words form a consecutive sequence in which each term after the first is formed by adding a constant value to the preceding term. Thus, if a two-dimensional array (i.e., a rectangular arrangement of words in rows and columns) is stored in a computer memory, rows, diagonals, and columns are vectors.
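The observation that rows, columns, and diagonals of a stored array are all vectors can be illustrated with a minimal sketch. It assumes a row-major layout and word-granular addresses, neither of which is stated above, and the function name `array_vectors` is hypothetical.

```python
def array_vectors(base, n_rows, n_cols):
    """Word addresses of the first row, first column, and main diagonal
    of an n_rows x n_cols array stored row-major starting at `base`.

    Assumption (not from the text): row-major storage, one address per word.
    """
    n_diag = min(n_rows, n_cols)
    return {
        # Row elements are adjacent: constant difference of 1.
        "row": [base + c for c in range(n_cols)],
        # Column elements are one full row apart: constant difference of n_cols.
        "column": [base + r * n_cols for r in range(n_rows)],
        # Diagonal elements are one row plus one word apart: n_cols + 1.
        "diagonal": [base + d * (n_cols + 1) for d in range(n_diag)],
    }


if __name__ == "__main__":
    for name, addresses in array_vectors(base=0, n_rows=3, n_cols=4).items():
        print(name, addresses)
```

Under these assumptions each of the three address lists is formed by repeatedly adding one constant, which is exactly the property that defines a vector in the paragraph above.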
A "strided" vector is a vector of data that can be characterized by a base address A, a stride S, and strided vector length L. A stride S can be defined as the difference between successive addresses in a pattern of address accesses. A "simple stride" has a constant value, wherein each successive address in the consecutive sequence of addresses is the same constant value away from its previous address. A "unit stride" is a simple stride with a constant value of one, that has data in each of a series of consecutive memory addresses (e.g., memory addresses 4, 5, 6, 7, 8, etc.). Each consecutive memory address fetches consecutive memory words contained in the unit stride vector. A "non-unit stride" is a simple stride with a constant value other than one. A vector of data with a non-unit stride stored in memory contains data that skips at least some memory addresses of a series of consecutive memory addresses (e.g., a stride with a constant value of 3 accesses memory addresses 3, 6, 9, 12, 15, etc.). A more complex stride has a repeating pattern of addresses between the required strided vector data addresses. An even more complex stride has a non-repeating, but predictable or specifiable, pattern of addresses between successive addresses of the required strided data vector.
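The characterization of a strided vector by (A, S, L), and the distinction between simple and more complex strides, can be sketched as follows. The function names are hypothetical and addresses are treated as plain word indices, which is an assumption made only for illustration.

```python
def strided_addresses(base, stride, length):
    """Addresses of a simple-strided vector with base A, stride S, length L."""
    return [base + i * stride for i in range(length)]


def simple_stride(addresses):
    """Return the constant stride of an address sequence, or None if the
    difference between successive addresses is not constant."""
    differences = {b - a for a, b in zip(addresses, addresses[1:])}
    return differences.pop() if len(differences) == 1 else None


if __name__ == "__main__":
    print(strided_addresses(4, 1, 5))          # unit stride
    print(strided_addresses(3, 3, 5))          # non-unit stride of 3
    print(simple_stride([0, 1, 4, 5, 8, 9]))   # repeating pattern, not simple
```

A sequence such as 0, 1, 4, 5, 8, 9 has alternating differences of 1 and 3, so it falls under the repeating-pattern case rather than the simple-stride case.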
A "vector computer" containing a vector execution unit performs operations on vectors of data instead of on single words as in a conventional scalar computer containing a scalar processor unit. Vector computers can efficiently execute software applications requiring large amounts of data. Large dense data structures manipulated by scientific applications can be processed quickly by a vector computer. Because of the iterative nature of software application loops and their relative independence in comparison to other portions of application code, loops in a vector computer can be executed in parallel.
Vector computers have been built since the beginning of the 1960's to exploit application code and data parallelism to reduce program execution time. Vector computers often use "bank-interleaved" memories which include multiple, independently accessible banks of storage. In a bank-interleaved memory, each bank is independent of all other banks and each bank has separate ports to transmit and receive data and addresses. A vector computer also includes a vector execution unit capable of processing data vectors. Vector computers have used bank-interleaved memories that store the data vector and a vector execution unit to process the data vectors. Vector execution units directly access the bank-interleaved memory for data or instructions without first sending the request to a smaller faster cache memory.
The caching of vectors of data in a processor has been considered in F. Quintana, J. Corbal, R. Espasa and M. Valero, "Adding a Vector Unit to a Superscalar Processor," International Conference on Supercomputing (ICS), ACM Computer Society Press, Rhodes, Greece, June 1999. This publication discusses use of only unit stride vectors of data stored in cache memories of a processor.
The SV1 processor series manufactured by SGI-Cray also caches vectors of data in a processor. The SV1 processor architecture implements a bank-interleaved cache memory with each bank being eight bytes wide. The architecture permits simultaneous parallel accesses with different addresses to all banks, allowing parallel access to all odd strided vectors, but cache blocks must be one quadword (eight bytes) wide (thus a bank contains one cache block).
The approach developed by Quintana et al. has the advantage that there is no constraint on the cache block width; however, only unit stride vectors may be accessed in parallel. Most applications cannot be executed on a vector execution unit if only unit stride vectors are permitted; that is, data and instructions of an application cannot easily be converted into a unit stride vector. The solution implemented for the SV1 processor series permits full cache bandwidth for all odd strided vectors, but requires the use of an eight-byte cache block size and therefore the use of one address tag per eight bytes.
Advances in chip fabrication technology allow a vector execution unit (e.g., a unit with 16 or 32 identical scalar functional units) to fit on a single processor chip along with a scalar processor unit and a cache memory. In such a processor, both the vector execution unit and the scalar processor unit use the cache memory to access instructions and data. Thus, the cache memory must be able to provide high access bandwidth for the large vector data sets needed by the vector execution unit. Bank-interleaved caches can be used to provide high access bandwidth. Similar to bank-interleaved memories, bank-interleaved caches include banks that operate independently of each other. Each bank has separate data and address ports; however, accesses to data words within the same bank may result in intrabank conflicts caused by both of the data words simultaneously requiring the same data port and address port, significantly reducing overall system performance and severely impacting the useful cache memory bandwidth. Moreover, hardware solutions to reduce intrabank conflicts can be very complex and expensive to implement.
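The effect of intrabank conflicts on a strided access can be illustrated with a minimal sketch. It assumes a simple low-order interleaving in which the bank index is the word address divided by the bank width, modulo the number of banks; that mapping, and the names `bank_of` and `max_bank_conflicts`, are assumptions for illustration and are not taken from any particular design described above.

```python
from collections import Counter


def bank_of(address, num_banks, words_per_bank=1):
    """Bank holding a word address under low-order interleaving (assumed)."""
    return (address // words_per_bank) % num_banks


def max_bank_conflicts(base, stride, length, num_banks, words_per_bank=1):
    """Largest number of vector words that map to any single bank.

    A result of 1 means every word lies in a different bank, so all of
    them could be requested at once without an intrabank conflict.
    """
    hits = Counter(
        bank_of(base + i * stride, num_banks, words_per_bank)
        for i in range(length)
    )
    return max(hits.values())


if __name__ == "__main__":
    for stride in (1, 3, 4, 8):
        worst = max_bank_conflicts(base=0, stride=stride, length=8, num_banks=8)
        print(f"stride {stride}: up to {worst} word(s) per bank")
```

With eight one-word banks in this model, odd strides spread eight consecutive vector words over eight distinct banks, while strides that share a power-of-two factor with the bank count pile several words into the same bank and would have to be serialized on that bank's ports.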
It would be advantageous if a simple technique could be devised to reduce intrabank conflicts occurring for accesses to vector data sets that guarantees maximum cache bandwidth. Despite the apparent performance advantages of such a system, to date no such system has been implemented.
The problems noted above are solved in large part by a computer system that contains a processor including a vector execution unit, scalar processor unit, cache controller and bank-interleaved cache memory. The vector execution unit retrieves strided vectors of data and instructions stored in the bank-interleaved cache memory in a plurality of cache banks to prevent intrabank conflicts.
Given a stride S of a vector, the strided vectors of data and instructions stored in the bank-interleaved cache memory are retrieved by determining R and T using the equation $S = 2^{T} \cdot R$. In one embodiment, if $T \le W$, W defining a cache bank $2^{W}$ words wide, then, for $0 \le i < 2^{W-T}$, $0 \le j < 2^{P}$, and $0 \le k < 2^{N}$, words addressed $i + 2^{W-T+N} j + 2^{W-T} k$ are accessed on the same cycle. P defines the bank-interleaved cache memory to contain $2^{P}$ sets and N defines $2^{N}$ cache banks in one set of the bank-interleaved cache memory. If $W < T < N$, then for $0 \le j < 2^{P}$ and $0 \le k < 2^{N-T}$, the words addressed $2^{N-T} j + k$ are accessed on the same cycle. Finally, if $T \ge N$, then the vector words are accessed sequentially at different cycles.
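The stride decomposition and the three cases above can be sketched in a few lines. The sketch assumes R is the odd factor of S (so that T is the number of trailing factors of two), reads the index expressions literally as word indices, and tests the cases in the order they are listed; the function names are hypothetical.

```python
def decompose_stride(stride):
    """Return (T, R) such that stride == 2**T * R, with R odd (assumed)."""
    if stride <= 0:
        raise ValueError("stride must be a positive integer")
    t = 0
    while stride % 2 == 0:
        stride //= 2
        t += 1
    return t, stride


def same_cycle_words(stride, W, N, P):
    """Word indices accessed on the same cycle, or None if sequential.

    W: banks are 2**W words wide; N: 2**N banks per set; P: 2**P sets.
    """
    T, _ = decompose_stride(stride)
    if T <= W:
        return sorted(
            i + 2 ** (W - T + N) * j + 2 ** (W - T) * k
            for i in range(2 ** (W - T))
            for j in range(2 ** P)
            for k in range(2 ** N)
        )
    if T < N:  # here W < T < N
        return sorted(
            2 ** (N - T) * j + k
            for j in range(2 ** P)
            for k in range(2 ** (N - T))
        )
    return None  # T >= N: words are accessed one per cycle


if __name__ == "__main__":
    print(decompose_stride(12))
    print(same_cycle_words(stride=4, W=2, N=2, P=1))
    print(same_cycle_words(stride=8, W=1, N=4, P=1))
    print(same_cycle_words(stride=16, W=1, N=4, P=1))
```

Because i, k, and j act as digits of a mixed-radix number, each parallel case enumerates a contiguous block of indices with no repeats: $2^{W-T+N+P}$ of them when $T \le W$ and $2^{N-T+P}$ when $W < T < N$.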