The present invention relates to cache memories for high-speed computers and more specifically to cache memories for vector and scalar data in a computer having vector/scalar processors.
A high-speed computer needs fast access to data in memory. The largest and fastest of such computers are known as supercomputers. One method of speeding up a computer is by “pipelining,” wherein the computer's digital logic between an input and an output is divided into several serially connected successive stages. Data are fed into the computer's input stage before data previously input are completely processed through the computer's output stage. There are typically many intermediate stages between the input stage and the output stage. Each stage performs a portion of the overall function desired, adding to the functions performed by previous stages. Thus, multiple pieces of data are in various successive stages of processing at each successive stage of the pipeline between the input and output stages. Preferably, each successive system clock propagates the data one stage further in the pipeline.
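The benefit of overlapping stages can be illustrated with an idealized cycle count. This is a simplified sketch and not part of the described system: it assumes every stage takes exactly one clock and that the pipeline never stalls, and the function names are illustrative.

```python
def pipelined_clocks(items: int, stages: int) -> int:
    """Clocks to push `items` data through an ideal pipeline of `stages` stages.

    Assumption: one clock per stage and no stalls. The first item needs
    `stages` clocks; each later item completes one clock after the previous.
    """
    if items == 0:
        return 0
    return stages + (items - 1)


def unpipelined_clocks(items: int, stages: int) -> int:
    """Clocks needed if each item must finish all stages before the next starts."""
    return items * stages


print(pipelined_clocks(100, 5))    # 104
print(unpipelined_clocks(100, 5))  # 500
```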
As a result of pipelining, the system clock can operate at a faster rate than the system clocks of non-pipelined machines. In some of today's computers, the system clock cycles in as little as one nanosecond (“ns”) or less, allowing up to a billion operations per second or more through a single functional unit. Parallel functional units within each processor, and parallel processors within a single system, allow even greater throughput. Achieving high-performance throughputs is only possible, however, if data are fed into each pipeline at close to the system clock rate.
As processor speeds have increased, the size of memory in a typical computer has also increased drastically. In addition, error-correction circuitry is now placed in the memory path to increase reliability. Memory-access speeds have improved over time, but the increased size of memory and the complexity of error-correction circuitry have meant that memory-access time has remained approximately constant. For example, a typical supercomputer system clock period may have improved from roughly 8 ns to 4 ns to 2 ns to 1 ns over four generations. Over the same time period, memory-access times may have remained at approximately 60 to 100 ns. These times mean that with a 96-ns memory, the 8-ns processor accesses memory in 12 clocks, the 4-ns processor in 24 clocks, and the 2-ns processor in 48 clocks. As a result, a computer which randomly accessed data throughout memory would see almost no overall data-processing-speed improvement even if the system clock rate were increased dramatically.
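The clock counts quoted above follow from dividing the memory-access time by the clock period. The short sketch below reproduces that arithmetic; the function name is illustrative, and the 1-ns row is simply the same calculation applied to the fourth generation mentioned in the text.

```python
def clocks_per_access(memory_ns: int, clock_ns: int) -> int:
    """Whole clock periods spanned by one memory access (rounded up)."""
    return -(-memory_ns // clock_ns)  # ceiling division on integers


MEMORY_NS = 96  # memory-access time used in the example above

for clock_ns in (8, 4, 2, 1):
    clocks = clocks_per_access(MEMORY_NS, clock_ns)
    print(f"{clock_ns}-ns clock: {clocks} clocks per access")
```

The access time in nanoseconds is the same in every row; only the number of processor clocks spent waiting grows as the clock period shrinks.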
One solution has been to organize data into vectors, each including a plurality of data elements, and where, during processing, each element of a vector has similar operations performed on it. Computer designers schedule various portions of the memory to simultaneously fetch various elements of a vector, and these fetched elements are fed into one or more parallel pipelines on successive clock cycles. Within a processor, the vector is held in a vector register having a plurality of vector register elements. Each successive vector-register element holds a successive element of the vector. A “vector-load” operation transfers a vector from memory into a vector register. For example, a vector in memory may be held as a vector image wherein successive elements of the vector are held in successive locations in memory. A vector-load operation moves elements which include a vector into pipelines which couple memory to the vector registers. Overlapped with these vector-load operations, there could be two other pipelines taking data from two other vector registers to feed a vector processor, with the resultant vector fed through a pipeline into a third vector register. Examples of such designs are described in U.S. Pat. No. 4,661,900 issued Apr. 28, 1987 to Chen et al. and U.S. Pat. No. 5,349,667 issued Sep. 20, 1994 to Cray et al., which are hereby incorporated by reference. For example, in a well-tuned system using 2-ns pipeline clocks, the throughput can approach 500 million operations per second for a single vector processor, even with relatively slow memory-access times.
On the other hand, a scalar processor operating in such a system on somewhat randomly located data must deal with a 48-clock to 70-clock pipelined-memory access time, and must often wait for the results from one operation before determining which data to request next.
In very-high-speed vector processors, such as the Cray Y-MP C90 manufactured by Cray Research Inc., the assignee of the present invention, a computer system contains a number of central processing units (“CPUs”), each of which may have more than one vector processor and more than one scalar processor. The computer system also contains a number of common memories which store the programs and data used by the CPUs. Vector data are often streamed or pipelined into a CPU from the memories, and so a long access time may be compensated for by receiving many elements on successive cycles as the result of a single request. In contrast, scalar data read by one of the CPUs from one of the common memories may take an inordinate amount of time to access.
A cache is a relatively fast small storage area inserted between a relatively slow bulk memory and a CPU to improve the average access time for loads and/or stores. Caches are filled with data which, it is predicted, will be accessed more frequently than other data. Accesses from the cache are typically much faster than accesses from the common memories. A “cache hit” is when requested data are found in the data already in the cache. A “cache miss” is when requested data cannot be found in the data already in the cache, and must therefore be accessed more slowly from the common memories. A “cache-hit ratio” is the number of requests which result in cache hits divided by the total of cache hits and cache misses. A system or program which has a high cache-hit ratio will usually have better performance than a machine without cache. On the other hand, a poor cache-hit ratio may result in much poorer performance, since much of the memory bandwidth is used up fetching data into the cache which will never be used.
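The cache-hit ratio defined above, and its effect on average access time, can be sketched as follows. The weighted-average model and the example latencies are assumptions for illustration only; in particular, the model ignores the memory bandwidth consumed by fetching unused data, which the text identifies as a cost of a poor hit ratio.

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Hits divided by the total of hits and misses, as defined in the text."""
    total = hits + misses
    if total == 0:
        raise ValueError("no requests recorded")
    return hits / total


def average_access_clocks(hit_ratio: float, hit_clocks: float,
                          miss_clocks: float) -> float:
    """Simple weighted average of hit and miss latencies (illustrative model)."""
    return hit_ratio * hit_clocks + (1.0 - hit_ratio) * miss_clocks


ratio = cache_hit_ratio(hits=90, misses=10)
# Hypothetical latencies: 2 clocks for a hit, 48 clocks for a miss.
print(ratio, average_access_clocks(ratio, hit_clocks=2, miss_clocks=48))
```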
A method and apparatus for a common scalar/vector data cache apparatus for a scalar/vector computer.
One aspect of the present invention provides a computer system. The computer system includes a common memory. The memory includes a plurality of sections. The computer system also includes a scalar/vector processor coupled to the memory using a plurality of separate address busses and a plurality of separate read-data busses wherein at least one of the sections of the memory is associated with each address bus and at least one of the sections of the memory is associated with each read-data bus. The processor further includes a plurality of scalar registers and a plurality of vector registers, and operates on instructions which provide a reference address to a data word. The processor includes a scalar/vector cache unit that includes a cache array, and a FIFO unit that tracks (a.) an address in the cache array to which a read-data value will be placed when the read-data value is returned from the memory, and (b.) a destination code that specifies into which of the scalar registers and vector registers the read-data value is to be loaded when the read-data value is returned from the memory.
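The tracking role of the FIFO unit can be illustrated with a minimal software sketch. This is not the hardware design: the class and field names, the string form of the destination code, and the assumption that memory returns read data in the order requested are all hypothetical choices made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingRead:
    cache_address: int  # (a) where in the cache array the returned value goes
    destination: str    # (b) destination code, e.g. "S1" or "V2[0]" (hypothetical)


class ReadTracker:
    """Tracks outstanding reads so several can be in flight at once."""

    def __init__(self, depth: int):
        self.depth = depth      # maximum number of outstanding reads
        self.pending = deque()  # FIFO of PendingRead entries
        self.cache = {}         # stand-in for the cache array
        self.registers = {}     # stand-in for scalar/vector registers

    def issue(self, cache_address: int, destination: str) -> bool:
        """Record a read request; return False if the FIFO is full."""
        if len(self.pending) >= self.depth:
            return False
        self.pending.append(PendingRead(cache_address, destination))
        return True

    def data_returned(self, value: int) -> PendingRead:
        """Match returned data to the oldest request (assumes in-order return)."""
        entry = self.pending.popleft()
        self.cache[entry.cache_address] = value
        self.registers[entry.destination] = value
        return entry


tracker = ReadTracker(depth=48)
tracker.issue(0x10, "S1")
tracker.issue(0x11, "V2[0]")
tracker.data_returned(111)  # fills cache[0x10] and register S1
tracker.data_returned(222)  # fills cache[0x11] and register V2[0]
```

Because each entry carries both the cache position and the destination code, the returning data word needs no tag of its own in this sketch; its position in the return stream identifies it.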
In some embodiments, fetched instructions are also passed through the cache. In some such embodiments, the system allows instruction fetching through the cache to be selectably disabled. In some embodiments, the system allows data fetching (i.e., both scalar fetching and vector fetching) through the cache to be selectably disabled. In some embodiments, the selective enabling/disabling of fetches through the cache is separately and independently specified for instructions and for data.
In one embodiment, the cache unit fetches a different amount of data based on whether a read-data operation is for a scalar register or a vector register. In another embodiment, the FIFO unit provides a plurality of FIFOs, each FIFO associated with one or more of the sections of the memory. In one such embodiment, the memory includes about eight sections, the FIFO unit includes an equal number of FIFOs, one of the FIFOs associated with each one of the sections, and each FIFO including about forty-eight positions.
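The idea of fetching a different amount of data depending on the destination register type can be sketched as below. The specific amounts (eight words for a scalar destination, one word for a vector destination) and the alignment rule are hypothetical values chosen for illustration; the text does not specify them.

```python
from enum import Enum
from typing import List


class Dest(Enum):
    SCALAR = "scalar"
    VECTOR = "vector"


# Hypothetical fetch sizes in words; these values are NOT taken from the source.
FETCH_WORDS = {Dest.SCALAR: 8, Dest.VECTOR: 1}


def addresses_to_fetch(address: int, dest: Dest) -> List[int]:
    """Word addresses to request for one reference, aligned to the fetch size."""
    n = FETCH_WORDS[dest]
    base = address - (address % n)  # align down to a multiple of the fetch size
    return list(range(base, base + n))


print(addresses_to_fetch(13, Dest.SCALAR))  # [8, 9, 10, 11, 12, 13, 14, 15]
print(addresses_to_fetch(13, Dest.VECTOR))  # [13]
```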
In another embodiment, the scalar/vector processor is further coupled to the memory using one or more separate write-data busses, and wherein the write-data busses are fewer in number than the read-data busses.
In yet another embodiment, the cache unit includes a plurality of caches including a first cache and a second cache, and wherein a first subset of the sections is associated with the first cache and a different subset of the sections is associated with the second cache.
Another aspect of the present invention provides a method for caching data in a computer system such as that described above. In one embodiment, the method includes transmitting a series of addresses on each of the plurality of address busses requesting that a plurality of read-data values be placed on each of the plurality of read-data busses, and for each address on each address bus, tracking both (a.) an address in the cache array to which a read-data value will be placed when the read-data value is returned from the memory, and (b.) a destination code that specifies into which of the scalar registers and vector registers the read-data value is to be loaded when the read-data value is returned from the memory.
In one embodiment, the method further includes fetching a different amount of data based on whether a read-data operation is for a scalar register or a vector register. In another embodiment, the method further includes dividing read requests into groups of requests based on which section each read request is directed towards, and the step of tracking further includes separately tracking each of the groups of requests. In one such embodiment, the memory includes about eight sections, one of the groups associated with each one of the sections, and each group including up to about forty-eight requests.
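Dividing read requests into per-section groups can be sketched as follows. The section count and group capacity use the "about eight" and "about forty-eight" figures from the text as exact values; mapping an address to a section by its low-order bits is an assumption, as are the function names.

```python
from collections import deque

NUM_SECTIONS = 8     # "about eight sections" in the text
GROUP_CAPACITY = 48  # "up to about forty-eight requests" per group


def section_of(address: int) -> int:
    """Assumed interleave: the section is the address modulo the section count."""
    return address % NUM_SECTIONS


def group_requests(addresses):
    """Split read requests into one FIFO per section.

    Returns (groups, stalled), where `stalled` holds requests that did not
    fit because their section's group was already full.
    """
    groups = [deque() for _ in range(NUM_SECTIONS)]
    stalled = []
    for address in addresses:
        group = groups[section_of(address)]
        if len(group) < GROUP_CAPACITY:
            group.append(address)
        else:
            stalled.append(address)
    return groups, stalled


groups, stalled = group_requests(range(16))
print(list(groups[3]))  # [3, 11]
print(stalled)          # []
```

Tracking each group separately means a full group for one section does not prevent requests to other sections from being recorded.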
In one embodiment of the method, the step of transmitting addresses includes transmitting a read request or a write request on each address bus, and wherein the number of write requests which can be transmitted in a given period of time is fewer than the number of read requests.
In another embodiment, the cache unit includes a plurality of caches including a first cache and a second cache and the method further includes associating a first subset of the sections with the first cache and a different subset of the sections with the second cache.
Thus the present invention provides a scalar/vector cache that can transmit a series of requests on each of a plurality of busses, each bus connected to a separate section of memory. The address of the position in the cache, as well as the destination register for each data value, is tracked, for example in a FIFO, such that a plurality of requests can be outstanding at any one time. Different parameters can be used for prefetching based on whether the request is for a scalar register or a vector register, thus optimizing the amount of prefetching done.