Memory interleaving and multiple memory access ports are the key to high memory bandwidth (i.e., the number of words that can be accessed per second) in vector processor systems. Memory interleaving partitions a memory system into independent memory modules such that sequential memory addresses fall into sequential memory modules. The bandwidth of an interleaved memory system can be increased over that of a single memory module by overlapping accesses to multiple memory modules.
A vector processor system may have one or more processors operating concurrently. Each processor may have one or more memory ports of a multi-port interleaved memory system and can request a vector of data via a memory port. Each processor in the system may include a local memory which serves as an intermediate high-speed buffer so that both the memory port and the arithmetic pipeline can access data in the local memory without destructive interference.
Each memory port of a multi-port interleaved memory system serves a request for a vector of data by accessing elements (words) of the vector sequentially, one word at a time. A vector of data is typically specified by three parameters: the starting address SA, the stride S (the address offset between successive elements of the vector) and the length VL, for a vector of VL words with the address sequence <SA, SA+S, SA+2S, . . . , SA+(VL-1)S>. When the memory system is accessed via multiple concurrently operating memory ports, each requesting access to a memory location every clock cycle, memory access conflicts may arise. Such access conflicts lead to a decrease in memory bandwidth.
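The address sequence defined by the three parameters can be illustrated with a short sketch. The function name vector_addresses and the sample values are hypothetical and chosen only for illustration; they do not appear in the cited references.

```python
def vector_addresses(sa, stride, vl):
    """Return the address sequence <SA, SA+S, ..., SA+(VL-1)S>.

    sa     -- starting address SA
    stride -- stride S (address offset between successive elements)
    vl     -- vector length VL in words
    """
    return [sa + i * stride for i in range(vl)]


# Illustrative values only: a 5-word vector starting at address 100, stride 3.
print(vector_addresses(sa=100, stride=3, vl=5))  # [100, 103, 106, 109, 112]
```

A memory port serving this request would access the five addresses one word at a time, in the order listed.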
Numerous commercial vector supercomputers employ variations of the multiple memory port interleaved memory design paradigm; for example, see Stone, H. S., "High-Performance Computer Architecture (Second Edition)", Addison Wesley Publ. Co. (1990), and Hennessy, J. L. and Patterson, D. A., "Computer Architecture: A Quantitative Approach", Morgan Kaufmann Publ. Inc. (1990).
As used herein, an N-way interleaved memory system is said to be comprised of N memory modules such that each memory module comprises $2^m$ memory banks, wherein m is any integer. A memory module comprising one memory bank, i.e., when m=0, will be called a single bank memory module. The single memory bank contains a fixed number of memory cells, each of which contains a single word of data, and only one word can be accessed at a time. Please see the section entitled SYMBOLS AND NOTATION CONVENTIONS hereinafter for a full discussion of the symbols and conventions used herein.
For a typical N-way interleaved memory system, it is known that N successive memory addresses with stride S are directed to N/GCD(N,S) memory modules, wherein GCD(N,S) denotes the greatest common divisor of the integers N and S; see Stone. Hence, the maximum number of usable memory modules is N/GCD(N,S) when serving a vector with stride S in an N-way interleaved memory system. When N and S are relatively prime (i.e., GCD(N,S)=1), the N successive memory accesses are directed to N distinct memory modules, so the maximum number of usable memory modules is N; see, for example, Kogge, P. M., "The Architecture of Pipelined Computers", Hemisphere Publishing Corporation (1981). Maximizing the number of usable memory modules in an N-way interleaved memory system for each vector access helps reduce memory access conflicts among concurrent accesses of multiple vectors to memory and, therefore, helps sustain high memory bandwidth. Stride, however, is application dependent, and in any given problem many values of stride are employed to access different vectors.
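The N/GCD(N,S) relationship can be checked numerically with a minimal sketch. It assumes the conventional low-order interleaving in which the module index of an address is the address modulo N; the function names modules_touched and usable_modules are hypothetical.

```python
from math import gcd


def modules_touched(n, stride, start=0):
    """Set of module indices hit by n successive accesses with the given stride.

    Assumes module index = address mod n (low-order interleaving).
    """
    return {(start + i * stride) % n for i in range(n)}


def usable_modules(n, stride):
    """Closed-form count of usable modules: N / GCD(N, S)."""
    return n // gcd(n, stride)


# Illustrative case: an 8-way interleaved memory and a few strides.
for stride in (1, 2, 3, 4, 6, 8):
    touched = modules_touched(8, stride)
    assert len(touched) == usable_modules(8, stride)
    print(f"stride {stride}: {len(touched)} modules {sorted(touched)}")
```

In this example, odd strides reach all 8 modules, stride 6 reaches only 4, and stride 8 reaches a single module.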
Commercial N-way interleaved memory systems generally suffer unavoidable memory access conflicts because N is generally chosen to be a power of 2, that is, of the form $N = 2^k$, where k is an integer. This choice for the value of N is dictated by the structure of the two equations used to map any memory address A to location L in the memory module B. The two equations follow:
$L = \lfloor A/N \rfloor$ (1)
$B = A \bmod N$ (2)
Whenever $N = 2^k$, the computation of equations (1) and (2) is greatly simplified and executes in O(1) gate delays (i.e., a constant number of gate delays): the division and mod N operations reduce to trivial operations, namely selecting the high-order bits of A for L and the k low-order bits of A for B.
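The simplification for a power-of-2 module count can be sketched in software as a shift and a mask. This is only an illustration of the arithmetic, not a description of any particular hardware; the function names are hypothetical and addresses are assumed to be non-negative integers.

```python
def map_address_general(a, n):
    """Map address a to (location, module) by explicit division and modulo by n."""
    return a // n, a % n


def map_address_pow2(a, k):
    """Same mapping for n = 2**k, using only bit selection.

    location = high-order bits of a (right shift by k)
    module   = k low-order bits of a (mask with 2**k - 1)
    """
    return a >> k, a & ((1 << k) - 1)


# Illustrative check for a 16-way system (k = 4).
for a in (0, 15, 16, 37, 1023):
    assert map_address_pow2(a, 4) == map_address_general(a, 16)

print(map_address_pow2(37, 4))  # (2, 5): location 2 in module 5
```

No such bit-selection shortcut applies when n is not a power of 2, which is why the general form needs a true divider.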
When N is a power of 2, any vector access with an odd value of stride (GCD(N,S)=1) maps N consecutive addresses of the vector to N distinct memory modules. However, when the stride is an even number, serious memory access conflicts can occur, which dramatically degrade the performance of the N-way interleaved memory system; see Hennessy.
Although individual stride problems in a given application can be avoided by judicious and laborious program restructuring, the only way to sustain peak memory bandwidth in an N-way interleaved memory system for all values of the stride (i.e., for all applications) is to employ a prime number of memory modules. With a prime N, the number of usable memory modules drops to one only when the stride is a multiple of N, a case which can be avoided by a single programming intervention.
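The contrast between a power-of-2 and a prime module count can be made concrete with a small sketch that lists the strides for which fewer than N modules are usable. The choice of N=16 versus N=17 and the stride range 1 to 34 are illustrative assumptions, and the function names are hypothetical.

```python
from math import gcd


def usable_modules(n, stride):
    """Number of usable modules for stride S in an N-way system: N / GCD(N, S)."""
    return n // gcd(n, stride)


def degraded_strides(n, max_stride):
    """Strides in 1..max_stride for which fewer than n modules are usable."""
    return [s for s in range(1, max_stride + 1) if usable_modules(n, s) < n]


# With 16 modules, every even stride loses modules.
print(degraded_strides(16, 34))
# With 17 modules, only multiples of 17 do.
print(degraded_strides(17, 34))  # [17, 34]
```

For the prime case the only problematic strides are multiples of 17, and for those the usable module count is exactly one.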
Significant problems arise when N is a prime number. When N is prime, the computation of equations (1) and (2), which involves both explicit and implicit division operations by a prime number, requires not only an excessive amount of hardware but also multiple clock cycles. These consequences completely offset the performance advantages gained by the use of a prime number of memory modules. The formidable problems associated with the use of a prime number of memory modules in interleaved memories in vector processor systems have remained unresolved, heretofore.
An early attempt to use a prime number of memory modules in a memory system for application in a parallel array processor architecture was disclosed in Lawrie et al., U.S. Pat. No. 4,051,551, hereinafter referred to as Lawrie and Vora. Lawrie and Vora disclose a parallel memory system comprised of N=17 single bank memory modules serving a fully parallel array of 16 processors performing the same instruction on different elements of a vector of length 16 (i.e., comprised of 16 words). The 17 memory modules of the memory system operate in lock-step to access a vector of 16 words fully in parallel. It must be emphasized that the architecture of this computer, the Burroughs Scientific Processor, is significantly different from pipelined vector computer architecture; see Kuck, D. J. and Stokes, R. A., "The Burroughs Scientific Processor (BSP)", IEEE Transactions on Computers, Vol. C-31, pp. 363-376 (May 1982).
Lawrie and Vora were unsuccessful in solving the computational problems associated with division by a prime number. In order to compute equation (1), Lawrie and Vora arbitrarily replace N (the prime number) by the nearest number less than N which is expressible as a power of 2. In the case of N=17, N was replaced in the division operation by 16. This approach results in under-utilization of memory. The Lawrie and Vora approach in some cases results in up to 50% wastage of memory space.
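The under-utilization for the N=17 case can be illustrated with a rough sketch. It assumes one reading of the scheme described above, namely that the module index is still A mod 17 while the location is computed as the integer part of A/16; the function names and the row count are hypothetical, and the sketch does not attempt to reproduce the 50% figure quoted for other cases.

```python
def substituted_map(a, n=17, divisor=16):
    """Map address a to (location, module) with the divisor replaced by a power of 2.

    Assumed reading: location = a // 16, module = a mod 17.
    """
    return a // divisor, a % n


def utilization(rows, n=17, divisor=16):
    """Fraction of the rows * n memory cells that some address maps to."""
    addresses = range(rows * divisor)
    used = {substituted_map(a, n, divisor) for a in addresses}
    # Every address must land in its own cell, otherwise the mapping is unusable.
    assert len(used) == len(addresses)
    return len(used) / (rows * n)


# With 1024 locations per module, 16 of every 17 cells are reachable.
print(utilization(1024))
```

Under this reading, each location index holds only 16 addresses spread over 17 modules, so one cell per location is never addressed.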
It is a commonly held notion in the art that prime-way interleaved memory systems, i.e., memory systems wherein the number of memory modules N is a prime number, cannot be implemented cost-effectively in commercial high-performance computing systems.