In a computer intended for high-performance computation, the memory must read and write data at a bandwidth that matches the bandwidth of processing. One prior art approach to matching the processing bandwidth has been the cache concept, that is, using a fast memory to hold a working set of data so that the processor has quick access to currently active data.
In large scale scientific computations, especially those using pipeline techniques (see, for example, the review by T. C. Chen in Chapter 9 of Introduction to Computer Architecture, second edition, H. Stone, Editor, Science Research Associates, Chicago, 1980), even the fast cache is taxed to the limit. Take, for instance, the case of handling a three-address floating point code: A op B = C,
where A and B are the addresses of the operands and C is the address of the result of the operation op on the operands A and B. If the operation is performed in a pipeline of n stages, then once the computation reaches a steady state, at every machine cycle two operands enter the computation pipe while the result of the operation started n cycles earlier is stored. The total demand is thus three memory operations per cycle.
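The steady-state demand described above can be sketched as a small model (hypothetical and for illustration only; the stage count n and cycle numbering are assumptions, not part of the original description):

```python
# Hypothetical model of the memory traffic of an n-stage arithmetic
# pipeline executing the three-address operation  A op B = C.
# Every cycle, two operand fetches (A and B) feed the pipe; once the
# pipe is full (after n cycles), one result per cycle emerges and
# must be stored back to memory.

def memory_ops_per_cycle(n_stages, cycle):
    """Memory operations demanded in a given cycle (cycles numbered from 0)."""
    fetches = 2                                 # operands A and B enter every cycle
    stores = 1 if cycle >= n_stages else 0      # result of the op started n cycles earlier
    return fetches + stores

n = 4
traffic = [memory_ops_per_cycle(n, c) for c in range(8)]
print(traffic)   # [2, 2, 2, 2, 3, 3, 3, 3] -- steady state is 3 ops per cycle
```

The model shows the fill phase (two memory operations per cycle) giving way, after n cycles, to the steady-state demand of three memory operations per cycle noted above.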
As processor cycle times grow shorter and shorter, the time required for these three memory operations becomes the bottleneck limiting the construction of faster processors.
Traditional pipeline "vector" designs use very high-speed memories and/or impose strong constraints on memory size and freedom of addressing. The CRAY 1, for example, uses 8 vector registers of 64 words each, for a total of 512 floating point words, and every vector operation starts at the beginning of a register and runs through consecutive elements. The CRAY 1 is thus constrained to a relatively small memory and a rigid addressing scheme.
One prior art approach to reducing the time required for three memory operations per cycle is the replicated memory approach. If a single conventional memory bank were used for the three-address computation, that bank would have to perform two fetches and one store, a total of three units of gainful work per cycle. The replicated memory approach instead supplies two identical memory banks and stores everything in duplicate, one copy in each bank. The fetch bandwidth is thereby relieved: one operand is fetched from each bank in parallel, and the result is stored back to both banks in parallel. Each bank in such a system performs only two units of work per cycle, one fetch and one store, rather than three. The replicated memory approach of the prior art thus appears to be 1.5 times as fast as the single bank memory: an instruction cycle need accommodate only two memory access cycles rather than the three of the single bank memory.
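The bank-load comparison between the two prior art schemes can be sketched as follows (a hypothetical accounting model; the scheme names and bank labels are illustrative assumptions):

```python
# Hypothetical per-cycle access count for each memory bank under the
# two prior art schemes.  In the single-bank scheme, one bank must
# fetch A, fetch B, and store C.  In the replicated scheme, two banks
# each hold a full copy: bank 0 fetches A, bank 1 fetches B, and both
# banks store the result C in parallel.

def accesses_per_bank(scheme):
    """Return a dict mapping bank name to memory accesses per cycle."""
    if scheme == "single":
        return {"bank0": 2 + 1}                  # two fetches + one store
    if scheme == "replicated":
        return {"bank0": 1 + 1, "bank1": 1 + 1}  # one fetch + one store each
    raise ValueError(scheme)

single = accesses_per_bank("single")
replicated = accesses_per_bank("replicated")
print(max(single.values()), max(replicated.values()))   # prints: 3 2
```

The busiest bank drops from three accesses per cycle to two, which is the source of the apparent 1.5x speedup of the replicated memory approach.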
Both the single bank memory and the replicated memory approach of the prior art are limited in performance by the possibility of conflict between a read access and a write access to the memory. The read operation must therefore be performed at a time apart from the write operation.