This invention relates generally to a memory addressing system and method and, in particular, to a memory addressing system and method that provides high-performance access to a multi-bank memory having an arbitrary number of banks.
For a conventional memory design to achieve the highest performance, the memory space is composed of individual components, typically called banks, whose number is typically a power of 2. The memory space is "interleaved" among the banks, meaning that consecutive addresses are typically mapped to different banks. The bank number in this case may be determined by simply looking at the lowest bits in the address, A: i.e., Bank = A mod 2^b, where 2^b is the number, N, of banks. This approach has been used in high performance systems using as many as 512 banks of memory. Increasing the number of memory banks generally increases the throughput of memory and thus the bandwidth from the memory system to the processing unit. This throughput has traditionally been the weakest point in computer operations.
A known problem with this memory representation lies in the performance degradation it incurs when accessing arrays, or other data structures, with a stride which is even or divisible by a higher power of 2. For example, in a 16-bank system, accesses of stride 16 result in the worst performance, since only one of 16 banks is accessed. In many practical applications, array accesses have strides divisible by a high power of 2. For example, in matrices of sizes 2^m × 2^m, for m ≥ b, column accesses give only 1/N of the peak performance, since whole columns reside in the same memory bank. Similar performance degradation occurs for other types of explicit patterns of accesses (i.e., explicitly defined sequences of accesses, which are commonly referred to as regular access sequences, e.g., diagonal accesses in the above matrices).
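The stride problem described above can be reproduced with a minimal sketch. The function names and the choice of 16 accesses per test are illustrative assumptions, not taken from the source; only the rule Bank = A mod 2^b comes from the text.

```python
def bank_low_bits(address: int, b: int) -> int:
    """Conventional interleaving: bank = A mod 2^b, i.e. the low b bits of A."""
    return address & ((1 << b) - 1)


def banks_touched(stride: int, b: int, count: int) -> set:
    """Set of distinct banks hit by `count` accesses at addresses 0, stride, 2*stride, ..."""
    return {bank_low_bits(i * stride, b) for i in range(count)}


B = 4  # 2^4 = 16 banks, as in the 16-bank example in the text

stride1 = banks_touched(1, B, 16)    # consecutive addresses
stride8 = banks_touched(8, B, 16)    # stride divisible by 2^3
stride16 = banks_touched(16, B, 16)  # stride equal to the number of banks

print(len(stride1), len(stride8), len(stride16))  # 16 2 1
```

With stride 16 every access lands in bank 0, which is the 1/N worst case the text describes for column accesses.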
Another problem with this memory representation is its lack of fault tolerance, especially for memory devices having multiple memory banks inside a single chip. In such a single chip device, a single bad bank (i.e., a bank having at least one unusable memory location) typically results in either the whole memory device being discarded or the number of banks used being cut to the next lower power of two. This problem is particularly significant because the yield on new memory parts can be very low; thus, a part of the capacity of a plant that manufactures such devices is devoted to making unsellable product. The problem is further exacerbated in the new generation of multiprocessor chips having embedded memory units. Such a chip may, for example, comprise 2^b microprocessors and 2^b memory units (with, for example, 1 to 8 Mbits of DRAM per unit), communicating with each other over, ideally, a full 2^b × 2^b crossbar switching network. The memories in such chips may be treated in a shared memory model as a flat address space of 2^b · 2^m memory locations, where 2^m is the size of each individual memory unit. Embedded memory chips are much more complex than ordinary memory units; accordingly, the cost of discarding or downgrading such chips is correspondingly greater than the cost of doing so for ordinary memory units.
Attempts to solve these problems have not been entirely successful.
For example, RAMBUS, and other similar technologies, attempt to alleviate the processor-memory bottleneck by providing faster memory operations to a non-banked memory or a simply interleaved multi-bank memory. However, improvements are seen primarily for contiguous memory requests. In addition, as the speed of processing units increases dramatically, the bottleneck remains.
Another technique, addressed particularly to the bank conflict problem, is described in P. P. Budnick and D. J. Kuck, "The organization and use of parallel memories," IEEE Trans. Computers, 20, pp. 1566-1569 (1971). Budnick et al. suggest implementing a memory using p banks of memory, where p is a prime number. In this case, bank conflicts for linear arrays can only occur for strides divisible by p. This, arguably, makes bank conflicts less likely in practice. However, there is a significant increase in the decoding logic: in particular, a full integer division-by-p circuit is required. For a requested address A, the remainder, A mod p, gives the address's bank, while the quotient, A/p, gives the physical address within the bank. The early BSP (Burroughs Scientific Processor) had this type of memory system, with p=17. In addition to the increased decoding logic, this kind of solution is inadequate because limiting the number of banks to a prime number is too restrictive: for reasons of, e.g., placement, routing and interfaces, a non-prime number of banks, especially a power of 2, is a preferred choice.
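As a rough illustration of the prime-bank decoding just described, the sketch below splits an address into remainder (bank) and quotient (local address). The function name and the sample strides are assumptions made for the example; p=17 is the BSP value quoted in the text.

```python
def prime_bank_map(address: int, p: int = 17):
    """Budnick-Kuck style decoding: bank = A mod p, local address = A // p."""
    local, bank = divmod(address, p)  # divmod returns (quotient, remainder)
    return bank, local


P = 17

# Stride 16 is the worst case for 16 power-of-2 banks, but is harmless here
# because 16 is not divisible by the prime 17.
stride16_banks = {prime_bank_map(i * 16, P)[0] for i in range(P)}

# Conflicts only appear when the stride is divisible by p.
stride17_banks = {prime_bank_map(i * 17, P)[0] for i in range(P)}

print(len(stride16_banks), len(stride17_banks))  # 17 1
```

The `divmod` call stands in for the full integer division circuit whose cost the text identifies as the main drawback of this scheme.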
Another remedy to the bank conflict problem is to use a pseudo-random number generator to generate a mapping between a logical address A and a corresponding bank. One such system is described, for example, in R. Raghavan, J. P. Hayes, "On randomly interleaved memories," Proceedings of Supercomputing, pp. 49-58, 1990. A pseudo-random generator generates a random sequence of output values for an ordered sequence of input values, but will always produce the same output value for a given input value. One problem with this technique is that it produces bank conflicts for stride 1 accesses. Stride 1 accesses are the most common access patterns in most computer applications (occurring, for example, when reading an instruction stream) and any significant degradation in memory performance for such accesses is therefore unacceptable. The general problem is that a pseudo-random, or truly random, mapping produces, on average, bank conflicts in not less than a fraction 1/e (i.e., 36.78...%) of accesses (where e is the base of the natural logarithm), even for large N. This tends to substantially reduce peak performance. Additionally, certain known pseudo-random number generators may not uniformly map the address space across all banks (i.e., some banks may have more addresses mapped to them than others), which in turn increases bank conflicts and reduces performance.
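One way to see where the 1/e figure comes from is the following sketch, which rests on a modeling assumption that is not spelled out in the source: a group of N accesses is sent independently and uniformly at random to N banks. An access conflicts when an earlier access in the group already occupies its bank, so the number of conflicting accesses equals the number of banks left idle, whose expected share is (1 - 1/N)^N. This value tends to 1/e as N grows.

```python
import math


def expected_conflict_fraction(n_banks: int) -> float:
    """Expected share of conflicting accesses under the uniform random model:
    each bank is missed by all n_banks accesses with probability (1 - 1/N)^N."""
    return (1.0 - 1.0 / n_banks) ** n_banks


for n in (16, 64, 512):
    print(n, round(expected_conflict_fraction(n), 4))

print("1/e =", round(1.0 / math.e, 4))
```

Under this model, adding banks does not help: the conflict share settles near 36.8% instead of shrinking.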
Thus, while simple address translation schemes (the standard interleaving scheme or various schemes derived from Budnick-Kuck translation) create simple periodic sequences of bank numbers for sequences of fixed stride patterns (and thus suffer repeatable bank conflicts at many strides), general address scrambling mappings produce random sequences of bank numbers for arbitrary exact access sequences. These sequences of bank numbers, where the corresponding physical addresses reside, have, as explained above, statistically significant bank conflicts (within the sequence of N addresses), and, being sufficiently randomized, do not have a period less than the size of the address space to which the scrambling is applied.
Accordingly, a low complexity, fault tolerant scrambling technique that generally provides conflict-free accesses for stride 1 access patterns, and for other explicit access patterns of particular importance, such as even stride patterns, power of 2 stride patterns, or diagonal and other access patterns of interest, is extremely desirable.
It is therefore an object of the present invention to provide a multi-bank memory addressing system and method which generally provides no bank conflicts for stride 1 access patterns and infrequent bank conflicts for other access patterns of interest. In one embodiment, a memory device is provided having a plurality, N, of memory banks comprising a plurality of addressable memory locations. Each memory location has a logical address and a corresponding physical address, the physical address comprising a memory bank number and a local address within the memory bank. The memory device comprises an address mapping system, including an address translation unit, that derives, for each logical address, the corresponding physical address. In a preferred embodiment, the address translation unit operates such that, for at least one explicit access sequence of logical addresses (for example, a sequence in which each logical address in the sequence is separated from another address in the sequence by a stride value), the derived physical addresses in the sequence of corresponding physical addresses have memory bank numbers that do not form a repetitive pattern having a period less than N+1 (or even a period less than the size of the address space) and do not on average repeat a bank number within approximately N addresses in the sequence of corresponding physical addresses.
The mapping performed by the address translation unit is referred to herein as "finite quasi-crystal mapping." The term derives from the fact that a translation unit in accordance with a preferred embodiment of the present invention produces, for most strides, a bank access pattern that is almost periodic (i.e., quasi-crystal-like); for example, the banks selected may generally be separated by a fixed value but occasionally separated by a different value. For illustration purposes, an example of a quasi-crystal mapping for a given stride in a 16 bank memory system, where the banks are numbered 0 to 15, is 0, 2, 4, 6, 8, 10, 13, 15, 1, 3, 5, 7, 9, 12, 14, . . . In this example, bank numbers in the sequence are generally separated by 2, but occasionally separated by some other number (such as 3, from 10 to 13 and from 9 to 12). A preferred quasi-crystal mapping for a particular explicit access pattern is one in which each memory bank is accessed approximately the same number of times. In a preferred embodiment the discrepancy (here this term means the deviation of a given distribution of bank accesses from the uniform one) is minimal. This discrepancy per bank here is only O(1) (order 1).
The quasi-crystal mapping is, in one embodiment, performed by scrambling the addresses a using a modular transformation of the form:
a → A = Λ·a mod 2^K
where A is a scrambled address corresponding to a, 2^K is the size of the address space (where K depends on the memory manufacturing process, and is, in the examples below, typically around 21 for a word aligned memory), and Λ is an odd-valued constant. The bank number in this example is derived from the top bits of scrambled address A.
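The scrambling step and the top-bit bank selection can be sketched as follows. The multiplier 2531 and the reduced sizes (K=12, 16 banks) are hypothetical values picked only to keep the example small; the source does not give a concrete multiplier at this point. The only property the sketch relies on is that multiplication by an odd constant modulo 2^K is a one-to-one mapping.

```python
from collections import Counter


def scramble(a: int, lam: int, K: int) -> int:
    """Scrambled address A = lam * a mod 2^K; lam must be odd."""
    return (lam * a) & ((1 << K) - 1)


def split_top_bits(A: int, K: int, b: int):
    """Defect-free case N = 2^b: top b bits give the bank, the rest the local address."""
    return A >> (K - b), A & ((1 << (K - b)) - 1)


K, B = 12, 4   # illustrative sizes: 4096 addresses, 16 banks
LAM = 2531     # hypothetical odd multiplier, not a value from the source

mapping = [split_top_bits(scramble(a, LAM, K), K, B) for a in range(1 << K)]
per_bank = Counter(bank for bank, _ in mapping)

# Every logical address gets its own (bank, local address) pair, and every
# bank receives exactly 2^(K-b) addresses.
print(len(set(mapping)), set(per_bank.values()))
```

Because the mapping is a bijection, uniform coverage of the banks over the whole address space is automatic; what distinguishes a good multiplier is how evenly the banks are hit within short windows of a given access pattern.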
In order to get a finite quasi-crystal mapping in this scheme, Λ is selected so as to minimize the deviation from a uniform distribution of bank numbers occurring in explicit access patterns of interest (such as various fixed stride or linear sequences of accesses in a two- or multi-dimension table, including diagonal access patterns) over the 2^K address space.
The range of suitable Λs may be narrowed using a variety of techniques. For example, minimizing the deviation from a uniform distribution of bank numbers is similar to the problem of minimizing the deviation from a uniform distribution of fractional parts {n·θ}. Consequently, multipliers Λ that are similar to quadratic irrationalities give better uniform distribution properties. (See, e.g., H. Behnke, Zur Theorie der Diophantischen Approximationen, I, Abh. Math. Sem. Hamburg 3 (1924), pp. 261-318.) One recipe, inspired by the golden section τ = (√5 − 1)/2 (approximately 0.6180), is to set Λ to an integer close to τ·2^M for M ≤ K. This is not the preferred embodiment and suffers from performance deficiencies. A better embodiment is described below.
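The golden-section recipe can be sketched with exact integer arithmetic. The function name, the rounding (floor) and the final step of forcing the result odd are assumptions added for the example; the text only asks for an integer close to τ·2^M and, elsewhere, for an odd multiplier.

```python
import math


def golden_multiplier(M: int) -> int:
    """Odd integer close to tau * 2^M, where tau = (sqrt(5) - 1) / 2."""
    root5_scaled = math.isqrt(5 << (2 * M))   # floor(sqrt(5) * 2^M)
    lam = (root5_scaled - (1 << M)) // 2      # floor(tau * 2^M)
    return lam | 1                            # set the low bit so the multiplier is odd


print(golden_multiplier(16))  # 40503 (0x9E37)
```

Using `math.isqrt` avoids floating-point rounding for larger M. As the text notes, this recipe is only a starting point and is not the preferred way of choosing the multiplier.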
Alternatively, the range of potentially suitable Λs may be narrowed through the optimization of continued fraction expansion algorithms for rational numbers of the form Λ/2^K. See, e.g., Rockett and Szüsz, Continued Fractions, World Scientific Publishing Co. Pte. Ltd. (1994). The optimization algorithm tries to find potentially suitable integer multipliers Λ such that two conditions hold simultaneously: (a) the initial terms a_i in the continued fraction expansion (a_0, a_1, a_2, . . . ) of Λ/2^M for M ≤ K are all small (for example, 1 or 2); and (b) the number of non-zero bits in the binary (or Booth-encoded binary) expansion of Λ is minimal among multipliers satisfying condition (a). This non-linear optimization provides the best multiplier Λ needed both for scrambling and for the minimal circuit implementation of the scrambler. The final choice of Λ is based solely on the minimization of the deviation from the uniform distribution of bank access for various explicit access patterns over the address space. The deviation is computed through exhaustive simulation of bank access patterns for various strides, or other explicit access patterns, over the entire address space. Suitable Λs can be selected by exhaustive computation of deviations for all possible values of Λ (i.e., odd, and in the range 1 ≤ Λ ≤ 2^K).
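The two screening conditions can be checked with a few lines of code. This is a minimal sketch under stated assumptions: the helper names, the threshold `max_term=2` and the number of leading terms inspected (`prefix=4`) are illustrative choices, and plain binary weight is used in place of a Booth-encoded count.

```python
def continued_fraction(num: int, den: int) -> list:
    """Terms (a_0, a_1, a_2, ...) of the continued fraction expansion of num/den."""
    terms = []
    while den:
        q, r = divmod(num, den)
        terms.append(q)
        num, den = den, r
    return terms


def bit_weight(lam: int) -> int:
    """Number of non-zero bits in the plain binary expansion of lam (condition (b))."""
    return bin(lam).count("1")


def passes_condition_a(lam: int, M: int, max_term: int = 2, prefix: int = 4) -> bool:
    """Condition (a): the first `prefix` terms after a_0 of lam / 2^M are all small."""
    terms = continued_fraction(lam, 1 << M)
    return all(t <= max_term for t in terms[1:1 + prefix])


print(continued_fraction(5, 8))   # [0, 1, 1, 1, 2]
print(passes_condition_a(5, 3), passes_condition_a(1, 3))  # True False
print(bit_weight(5))              # 2
```

Among the candidates that pass condition (a), one would keep those with the smallest `bit_weight`, and then rank the survivors by the simulated deviation from uniform bank access, as the paragraph describes.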
One skilled in the art would appreciate that various combinatorial circuit, table-lookup, or even analog solutions, rather than modular multiplication, can be used to construct the finite quasi-crystal mappings with the same effect of achieving low discrepancy mappings.
If all banks are defect free (N = 2^b), the bank and local address can be derived from the scrambled address A as follows: the top b bits of A are the bank number and the rest of the bits of A are the local address in the bank.
If, however, one or more banks have defects, the address space shrinks to N·2^m memory locations, where N < 2^b. In this case, it is necessary to translate a logical address a with a valid range 0 . . . N·2^m − 1 to a unique bank number u in the range 0 . . . N − 1, and a local address la in the range 0 . . . 2^m − 1. The complexity of the hardware logic that performs the translation is crucial.
This is especially important in multibank memory parts with embedded logic, where multiprocessors communicate with multiple memory units inside the same chip. A general configuration of a multiprocessor chip with an embedded memory on the chip would comprise N = 2^b microprocessors and N = 2^b memory banks (units) of size 2^m each (for example, 1 to 8 Mbits of DRAM each), communicating with each other over a switching network. Such a switching network could be a full 2^b × 2^b crossbar switch. As above, the total memories in this chip are treated in a shared memory model as a flat address space of N·2^m memory locations. Since these translation units are needed for all multiprocessors inside the part, the ease of the hardware implementation of the address translation logic is crucial. As a practical example, we consider, here and below, the b = 6 case of 64 memory banks with 64 microprocessors, with each of the memory banks containing 2^13 cache lines (up to 32 bytes per cache line). In this case the address space in the "defect free" case is 2^19 addressable locations (cache lines, say). Because of a relatively large chip area, defects will be common, and the number N of good processors can go down to 32 or even lower. These parts can be salvaged only with memory translation units. Construction of on-the-fly address remapping units with the additional scrambling properties described above is a crucial application for high bandwidth fault tolerant large memory modules, and, particularly, for a large system-on-a-chip product with embedded memory as multi-bank blocks.
The present invention provides several low cost solutions to the memory translation (remapping) problem, which also use the scrambling technique to achieve better performance for fixed stride accesses (and other explicit patterns of accesses). These solutions are based on the general method of finite quasi-crystal mappings to achieve high performance. In the preferred embodiment such solutions use modular multiplication (with additional low discrepancy features).
One of the possible implementations is a novel way to subdivide the address space into N banks and to perform the scrambling at the same time. For example, for an address space of 2^K (as before), and N memory units (banks), where N is an arbitrary number, one first performs the scrambling mapping:
a → A = Λ·a mod 2^K
and then determines the unit number u = A·N/2^K (taking the integer part), where this memory location resides, with a local address la = A − u·2^K/N. Here N is a short constant, and 2^K/N is a (longer) constant. In addition to the standard scrambling, this approach requires only 2 multiplications by short (6-bit) numbers and addition/subtraction. One can merge various modular multiplications (scrambling and translation) into one block, to speed up the whole process, so that it is preferably completed in a cycle time T of approximately 2.5 ns.
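The subdivision into an arbitrary number N of units can be sketched as follows. The rounding choices are assumptions, since the text does not state them: u is taken as the integer part of A·N/2^K, and the per-unit base address u·2^K/N is rounded up so that every local address is non-negative. The sizes K=10 and N=13 are hypothetical and chosen only to keep the check small.

```python
def split_for_n_units(A: int, N: int, K: int):
    """Map a scrambled address A in 0 .. 2^K - 1 to (unit u, local address la)."""
    u = (A * N) >> K                      # u = floor(A * N / 2^K), in 0 .. N-1
    base = (u * (1 << K) + N - 1) // N    # ceil(u * 2^K / N): first address of unit u
    la = A - base                         # offset of A inside unit u
    return u, la


K, N = 10, 13   # hypothetical sizes: 1024 scrambled addresses, 13 usable units
pairs = [split_for_n_units(A, N, K) for A in range(1 << K)]

max_local = max(la for _, la in pairs)
print(len(set(pairs)), max_local)
```

Since A equals the base of its unit plus la, distinct scrambled addresses always yield distinct (u, la) pairs, and la never exceeds the rounded-up unit size ceil(2^K/N) minus one. In hardware the base values for each u could be precomputed, consistent with the text's remark that 2^K/N is a constant.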
In the above example, the number of local sub-banks per unit is 1 (S=1). This is the simplest case of the general method described in detail below. Other schemes, which operate for a variable number N of banks and a variable number S of sub-banks, are significantly better than this one, and are recommended for their minimal complexity and high performance. Such low complexity techniques for deriving bank number u and local address la from the scrambled address A are provided below in the detailed description of embodiments of the present invention.