The present invention relates to computers, and more particularly, to high-speed, parallel-processing computers employing horizontal architectures and multi-module memory systems.
Horizontal processors have been proposed for a number of years. See, for example, "SOME SCHEDULING TECHNIQUES AND AN EASILY SCHEDULABLE HORIZONTAL ARCHITECTURE FOR HIGH PERFORMANCE SCIENTIFIC COMPUTING" by B. R. Rau and C. D. Glaeser, IEEE Proceedings of the 14th Annual Microprogramming Workshop, Oct. 1981, pp. 183-198, Advanced Processor Technology Group, ESL, Inc., San Jose, Calif., and "EFFICIENT CODE GENERATION FOR HORIZONTAL ARCHITECTURES: COMPILER TECHNIQUES AND ARCHITECTURAL SUPPORT" by B. Ramakrishna Rau, Christopher D. Glaeser and Raymond L. Picard, IEEE 9th Annual Symposium on Computer Architecture, 1982, pp. 131-139.
Horizontal architectures have been developed to perform high speed scientific computations at a relatively modest cost. As a consequence of their simplicity, horizontal architectures are inexpensive when considering the potential performance obtainable. This potential performance is realized when the multiple resources of a horizontal processor are scheduled effectively. An example of one horizontal computer is described in the above cross-referenced application and the applications referenced therein.
In computer systems, the processing units execute programs which require accesses to the memory system. Some of the accesses to the memory system are read (fetch) operations in which information from an address location in the memory system is accessed and returned to the processing unit for use in further execution of the program. In statically scheduled computer systems, the return of the accessed information in response to a request from the processing unit is in a predetermined order and at a predetermined time. Generally, information is returned to the processing unit from the memory system in the same order that the processing unit makes a request for the information.
It is often necessary or desirable in computer systems for one or more ports (from one or more processing units, I/O devices or other system units) to simultaneously initiate accesses (by generating memory addresses) to a shared memory system for fetching and storing information. The amount of time required to return requested information from a memory system to the processing unit after a request for the information by the processing unit is the actual latency time of the memory. The memory latency time affects the overall efficiency with which the processing unit can complete the execution of programs. In general, it is desirable to have the actual memory latency as short as possible so that the processing unit is not required to wait for the memory system in order to continue processing.
In order to increase system speed, memory systems have been constructed using interleaved memory modules. The use of multiple memory modules increases the bandwidth of the memory system by directing successive memory requests to different ones of the memory modules. Since a request directed to one module can be processed at the same time that a request is being processed at another module, the rate at which the memory system can return requested information is greater than the rate of any individual module. For this reason, the memory system has a higher bandwidth as a result of using multiple memory modules operating in parallel.
As speed requirements of computers have increased, memory systems employing greater numbers of parallel memory modules have been developed. However, merely increasing the number of memory modules does not guarantee higher memory speed or a higher number of memory accesses during a period of time. The number and speed of total memory accesses is limited by the conflicts that occur in accessing the individual memory modules.
Memory modules are usually constructed so that requests to access a memory module, in response to a sequence of input addresses, can only be accommodated one address at a time in sequential order. Multiple requests to a single memory module must have a conflict resolution mechanism that orders the requests. Theoretically, the number of memory modules can be increased in order to reduce such conflicts but, in conventional systems, the total achievable rate of accesses to a memory system does not increase in proportion to an increase in the number of memory modules forming the system.
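The effect of module conflicts on the access rate can be illustrated with a minimal sketch. The timing model here is an assumption made only for illustration and is not taken from the text: at most one request issues per cycle, a module remains busy for a fixed number of cycles after accepting a request, and a request to a busy module stalls the requests behind it. The function name `cycles_to_issue` and the busy time of 8 cycles are hypothetical.

```python
def cycles_to_issue(module_sequence, busy_cycles):
    """Return the number of cycles needed to issue every request in order.

    Assumed model (not from the text): at most one request issues per cycle,
    a module stays busy for `busy_cycles` after accepting a request, and a
    request to a busy module stalls all later requests until it is accepted.
    """
    free_at = {}  # module number -> first cycle at which it can accept again
    now = 0
    for module in module_sequence:
        start = max(now, free_at.get(module, 0))  # wait if the module is busy
        free_at[module] = start + busy_cycles
        now = start + 1  # the next request can issue no earlier than this
    return now


REQUESTS = 64
BUSY = 8  # hypothetical module busy time, in cycles

# Successive requests directed to different modules (8 modules, round-robin).
spread = cycles_to_issue([i % 8 for i in range(REQUESTS)], BUSY)

# Every request directed to the same module.
clustered = cycles_to_issue([0] * REQUESTS, BUSY)

print(spread, clustered)
```

Under these assumptions the round-robin sequence issues in 64 cycles, while the clustered sequence takes 505 cycles regardless of how many other modules exist.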
In a conventional multi-module memory system, part of the input address, Ai, to the memory system defines the particular one of the memory modules in which the physical address is actually located. Let M be the number of memory modules, where M=2^m and m is an integer. Typically, m contiguous bits of a given input address Ai specify which one of the M memory modules includes the physical address Ai. In one example where m equals 6 and M equals 64, sixty-four memory modules exist and six of the input address bits, for example Ai(7, . . . , 2), uniquely define one of the 64 memory modules.
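The contiguous-bit module selection in the example above can be sketched as follows. The sketch assumes that the address is a byte address whose two low-order bits select a byte within a word, which matches the bit positions 7 through 2 used in the example; the names `module_of`, `M_BITS` and `LOW_BIT` are hypothetical.

```python
M_BITS = 6                  # m: number of module-select bits
LOW_BIT = 2                 # lowest module-select bit, so the field is Ai(7, ..., 2)
NUM_MODULES = 1 << M_BITS   # M = 2^m = 64


def module_of(addr: int) -> int:
    """Return the module number selected by bits 7..2 of the input address."""
    return (addr >> LOW_BIT) & (NUM_MODULES - 1)


# Consecutive words (an assumed byte stride of 4) rotate through all 64 modules.
modules = [module_of(4 * i) for i in range(NUM_MODULES)]
print(modules[:4], modules[-1])
```

With this organization every M-th word falls in the same module, so the word at byte address 0x100 returns to module 0.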
When an input sequence, Ai{S}, of input addresses Ai, where S is the index from 0 to Smax of Ai, is presented to a memory system, it is desirable that the physical addresses in the memory modules be uniformly distributed and not clustered. If the distribution in response to that input sequence tends to be random, then generally the distribution will be uniform among the memory modules. Such a distribution is called pseudo random. When the distribution tends to be random, the probability that a memory module will be busy when an input address makes a request to that module is lower than when the distribution is not random. If the memory module is not busy when a request is made, then the memory system operates more quickly since no time is lost waiting for the memory module to become free. Therefore, in general, memory systems suffer a loss in speed from memory access conflicts when a higher frequency of accesses results for some of the memory modules relative to the frequency of accesses for other of the memory modules.
A pseudo random distribution of accesses among memory modules is important for both short address sequences and for long address sequences. For example, with 64 memory modules and for a short sequence with S from 0 to 63, representing 64 different input addresses Ai in the input sequence Ai{S}, it is desirable that the physical addresses (that is, the physical module actually having the address location) be distributed one each in each of 64 different memory modules. Similarly, for a long sequence (with S much larger than 64), it is desirable that each of the 64 memory modules tends to have an equal number of physical addresses in response to the input addresses irrespective of the nature of the input sequence of addresses.
While it is desirable to access all memory modules uniformly with equal frequency, certain types of programs generate input address sequences that address memory modules in a manner that tends to cause non-uniform accessing among the memory modules. Such non-uniform memory accessing frequently arises in performing matrix computations. For example, a two-dimensional matrix might have its matrix values stored with column values in a single memory module. With such storage, row matrix values and forward-diagonal and reverse-diagonal matrix values can be accessed from different memory modules. However, when the column matrix values are accessed serially out of the single memory module, the accessing rate is materially reduced because of the module access conflicts which arise.
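The matrix example above can be reproduced with a small sketch. It assumes a 64 by 64 matrix of words stored row by row in a word-interleaved memory of 64 modules, so that the row length equals the number of modules; this layout and the names `elem` and `module_of_word` are illustrative assumptions.

```python
M = 64  # number of memory modules, interleaved on word address
N = 64  # matrix dimension; assumed equal to M and stored row by row


def module_of_word(word_addr: int) -> int:
    """Conventional interleaving: every M-th word lies in the same module."""
    return word_addr % M


def elem(r: int, c: int) -> int:
    """Word address of matrix element (r, c) under the assumed layout."""
    return r * N + c


row = {module_of_word(elem(0, c)) for c in range(N)}
column = {module_of_word(elem(r, 0)) for r in range(N)}
forward_diagonal = {module_of_word(elem(i, i)) for i in range(N)}
reverse_diagonal = {module_of_word(elem(i, N - 1 - i)) for i in range(N)}

# Number of distinct modules touched by each access pattern.
print(len(row), len(column), len(forward_diagonal), len(reverse_diagonal))
```

Under this layout the row and both diagonals each touch all 64 modules, while every element of a column lies in a single module.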
In order for an interleaved memory system to be effective and have accesses uniformly distributed among the memory modules, the organization of addresses in the memory modules must be appropriately determined.
The typical organization of an interleaved memory system uses m contiguous bits of the input address to define the memory module. Such an organization has every M-th word assigned to a given memory module, where M is the number of memory modules. Usually, M is a power of 2. Less frequently, M is some other number such as a prime. Such a prime number memory system is described, for example, in U.S. Pat. No. 4,051,551 to Lawrie et al. With such organizations, however, input address sequences Ai{S} are found, in actual practice, that map non-uniformly and more frequently to the same module and, therefore, the full benefits expected from interleaving are not achieved.
Another memory system organization uses m non-contiguous bits from the input address (where M=2^m) and assigns all words with the same addresses in those m bits to the same memory module. This non-contiguous address bit organization is less susceptible to a long-term concentration of references to one module than the previous m contiguous bit approach. However, when m is much less than the number of address bits (which is almost always the case), there is still a susceptibility to a short-term concentration of references to a module.
In one example of a memory system where m=6 and the number of bits in the word address is 29, the number of contiguous address bits which do not enter into the determination of the selected memory module cannot be guaranteed to be less than 4. A contiguous set of not more than 4 bits is achieved if the 6 bits that are used are evenly distributed throughout the 29 address bits. Even with such a distribution, however, there can be at least 16 consecutive references to the same memory module when the appropriate stride exists in the input address sequence. This short-term (up to 16 in the example described) non-uniform concentration of references to the same memory module is as detrimental to performance (assuming realistic queueing buffer capacities) as is a long-term non-uniform concentration.
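The arithmetic of this example can be checked with a short sketch. The particular bit positions (4, 8, 12, 16, 20, 24) are a hypothetical even distribution of 6 select bits over a 29-bit word address, chosen only for illustration: the 23 unused bits fall into 7 gaps, so at least one gap must hold 4 bits.

```python
ADDRESS_BITS = 29
SELECT_BITS = (4, 8, 12, 16, 20, 24)  # hypothetical evenly spread select bits


def module_of(word_addr: int) -> int:
    """Gather the non-contiguous select bits into a 6-bit module number."""
    module = 0
    for k, bit in enumerate(SELECT_BITS):
        module |= ((word_addr >> bit) & 1) << k
    return module


def longest_unused_run() -> int:
    """Length of the longest run of contiguous address bits not used for selection."""
    used = set(SELECT_BITS)
    longest = run = 0
    for bit in range(ADDRESS_BITS):
        run = 0 if bit in used else run + 1
        longest = max(longest, run)
    return longest


# A unit-stride sequence varies only bits 0..3 for its first 16 references.
refs = [module_of(a) for a in range(32)]
print(longest_unused_run(), refs[:16], refs[16])
```

With these positions the longest unused run is 4 bits, and the first 2^4 = 16 unit-stride references all select the same module before the module number changes.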
Methods for avoiding non-uniform accesses among memory modules have been proposed which use address transforms. In connection with address transforms, the terms "real address" and "input address" are used to refer to the address presented to the memory system prior to an address transform and the terms "physical address" and "output address" are used to refer to the address after transform.
In connection with an address transform, each input address, Ai, is transformed by a transform, H, to form an output address, Ao. In this specification, the number of bits, I, for an input address is designated in parentheses, Ai(I-1, I-2, . . . , 0). For example, with 29 address bits (I=29) the designation is Ai(28, . . . , 0), and similarly, the number of bits in a transform output address is indicated for the same 29 bit example as Ao(28, . . . , 0).
In general, the expression for the transform of a single input address Ai to a single output address Ao is given as follows:

Ai[H] = Ao    Eq. (A)
A number of bits, usually m consecutive bits, of the output address Ao defines the particular one of the memory modules in which the output address is physically located. Usually 2^m memory modules are defined. In one example where 64 memory modules exist, the output address bits Ao(7, . . . , 2) uniquely define one of the 64 memory modules. The transform of the input address Ai to form the output address Ao frequently uses g of the I input address bits in determining the output address module bits Ao(7, . . . , 2). The number g of input address bits is usually greater than the number m of output address module bits. In one example, the memory modules are addressed on a word basis and the low-order b address bits, Ai(1,0) and Ao(1,0), define the byte address within a word.
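As an illustration of a transform in which g input address bits determine m output module bits, the following minimal sketch folds four 6-bit fields of the input address (g = 24) into the module bits Ao(7, . . . , 2) by exclusive-OR. This XOR fold is a hypothetical stand-in chosen for brevity; it is not the transform of any reference cited in this background, and the names `transform` and `module_of` are assumptions.

```python
M_BITS = 6
LOW_BIT = 2
FIELD_MASK = (1 << M_BITS) - 1


def transform(ai: int) -> int:
    """Hypothetical H: XOR bits 25..2 of Ai (g = 24) into module bits 7..2 of Ao."""
    folded = 0
    for shift in range(LOW_BIT, LOW_BIT + 4 * M_BITS, M_BITS):  # 2, 8, 14, 20
        folded ^= (ai >> shift) & FIELD_MASK
    # All bits outside the module field pass through unchanged.
    return (ai & ~(FIELD_MASK << LOW_BIT)) | (folded << LOW_BIT)


def module_of(ao: int) -> int:
    """Module number taken from output address bits Ao(7, ..., 2)."""
    return (ao >> LOW_BIT) & FIELD_MASK


stride = 64 * 4  # 64 words, expressed as a byte stride
inputs = [i * stride for i in range(64)]

untransformed = {module_of(a) for a in inputs}
transformed = {module_of(transform(a)) for a in inputs}
print(len(untransformed), len(transformed))
```

For this stride the untransformed addresses all select module 0, while the folded addresses select 64 different modules. A simple fold of this kind still has unfavorable strides of its own: a stride of 65 words, for example, maps the first 64 references back to a single module.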
A sequence of input addresses, each input address of the form Ai, is designated as Ai{S} and, with a transform, H, a sequence of output addresses Ao{S} is formed, each output address being of the form Ao, as follows:

[Ai{S}][H] = Ao{S}    Eq. (B)
where,

Ai = input address
Ai{S} = sequence of Smax input addresses indexed from 0 to (Smax-1)
H = address transform
Ao = output address of a memory module
Ao{S} = sequence of output addresses indexed from 0 to (Smax-1)

The following properties are desirable for the mapping:

An equal number of input addresses are mapped into each memory module.
The mapping is onto, that is, some input address will map into any given output address.
The concentration in any particular memory module of output addresses in response to any sequence of input addresses is not greater than would be the case with a truly random output address sequence irrespective of the nature of the input address sequence. The distribution of output addresses in memory modules tends to be uniform among the memory modules for both short and long sequences of input addresses.
The mapping is effective over the entire range of interleaving for which the memory system is designed including 2-way, 4-way, 8-way, 16-way or other interleaving.
In Eq. (B) the sequence Ai{S} of input addresses Ai presented to a memory system is indexed from 0 to (Smax-1) and the sequence Ao{S} of output addresses Ao similarly is indexed from 0 to (Smax-1). For a 29-bit example, the designation is Ai(28, . . . ,0){S} and Ao(28, . . . ,0){S}.
When sequences of input addresses Ai{S} are transformed to sequences of output addresses Ao{S}, it is desirable that the distribution of the output addresses into physical memory modules tends to be random, that is, pseudo-random. For example, for the index S from 0 to 63, representing 64 different input addresses in the input sequence Ai{S}, it is desirable that the output addresses be distributed one each in each of the 64 memory modules. Similarly, as Smax grows much larger than 64, it is desirable that each of the 64 memory modules tends to have an equal number of output addresses Ao resulting from the input addresses Ai irrespective of the nature of the input sequence Ai{S} of input addresses.
In general, the function of an address transform is to assign the location of physical addresses in storage to memory modules so that no non-artificial sequence of input addresses exhibits statistically more frequent long-term or short-term accesses to individual memory modules.
Also for an effective transform, the rate of accesses to memory modules will increase in proportion to an increase in the number of memory modules with substantial independence of any non-artificial memory referencing pattern established by the input sequence of memory input addresses.
One example of an address transform is described in U.S. Pat. No. 4,484,262 to Sullivan et al. That patent described a truly random address transform but did not disclose a mechanism that ensured the repeatability or one-to-one mapping properties that are required for practical systems.
Transform repeatability ensures that the same input (real) address always maps to the same output (physical) address. This property is useful in a computer system so that different requests to the memory system are assured of accessing the same memory location. In a practical computer system, it is desirable that the transform mapping not be truly random, but rather be deterministic, in which case it can be described as pseudo-random.
Transform one-to-one mapping ensures that no more than one input (real) address maps to the same output (physical) address. This property is useful in a computer system to ensure that information which is needed in the system is uniquely defined and not confused with or destroyed by other information. However, there may be output (physical) addresses with no corresponding input (real) addresses in some computer systems.
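The repeatability and one-to-one properties can be checked exhaustively on a small address space, as in the following sketch. The 16-bit address space and the three transforms shown (`xor_fold`, `truncating`, `truly_random`) are hypothetical examples constructed only to exercise the two checks; none of them is taken from the references cited above.

```python
import random

ADDRESS_BITS = 16  # small hypothetical address space so the check is exhaustive


def xor_fold(addr: int) -> int:
    """Hypothetical deterministic transform: XOR bits 13..8 into bits 7..2."""
    return addr ^ (((addr >> 8) & 0x3F) << 2)


def truncating(addr: int) -> int:
    """Hypothetical flawed transform: clears bit 2, so pairs of inputs collide."""
    return addr & ~0x4


_rng = random.Random(0)


def truly_random(addr: int) -> int:
    """Hypothetical transform that draws a fresh random output on every call."""
    return _rng.getrandbits(ADDRESS_BITS)


def is_repeatable(transform, addresses) -> bool:
    """True if two requests for the same input address give the same output."""
    return all(transform(a) == transform(a) for a in addresses)


def is_one_to_one(transform, addresses) -> bool:
    """True if no two input addresses map to the same output address."""
    outputs = [transform(a) for a in addresses]
    return len(set(outputs)) == len(outputs)


space = range(1 << ADDRESS_BITS)
print(is_repeatable(xor_fold, space), is_one_to_one(xor_fold, space))
print(is_one_to_one(truncating, space))
```

The deterministic fold passes both checks, the truncating transform fails the one-to-one check because addresses 0 and 4 both map to 0, and a transform that draws a new random value on each call cannot be expected to be repeatable.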
In accordance with the above background, it is an objective of the present invention to provide an improved computer system which provides address transforms that avoid consecutive references to the same memory module for many different types of input address sequences so that a uniform distribution of accesses occurs among memory modules.