This invention relates generally to computer systems and more particularly to memory arrangements within computer systems.
Modern microprocessors have a voracious appetite for memory. Unfortunately, memory speeds have not kept up with microprocessor speeds and the gap continues to widen.
Several techniques have been developed to address this performance gap. The first is called memory interleaving. An example of a memory interleave system is shown in FIG. 1. This system was initially developed for, and continues to predominate in, vector computers. In this memory organization, the CPU is coupled to a plurality of banks, in this case 16 (bank 0-bank 15). A data bus, typically 32 to 64 bits wide, is interconnected between the CPU and each of the banks so that each bank is connected in parallel. The banks themselves comprise either static random access memory (SRAM) or, more likely, dynamic random access memory (DRAM), given its higher density.
Interleaved memory improves the bandwidth of the memory system by accessing multiple banks simultaneously. Successive banks are assigned successive addresses, so that a group of consecutive words is accessed simultaneously. The CPU then reads from (or writes to) the banks in sequential order. Even though the CPU must wait the entire memory latency period for the first data element, successive data elements can be read out at a much more rapid pace, thereby significantly improving the bandwidth of the system over a strict linear memory organization.
This organization works particularly well for vector computers because they typically operate on large arrays that include sequential elements. Thus, for each memory access, the CPU can fetch an element from each of the banks. In the example shown in FIG. 1, the CPU can access sixteen elements of an array (e.g., 0-15) while incurring only a single latency for the first element; subsequent elements are provided at the maximum achievable bandwidth of the system. Subsequent groups of array elements can be provided by the interleaved memory system at the same throughput.
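The latency-hiding effect described above can be illustrated with a minimal timing sketch. The model is an assumption made for illustration only: one request issues per cycle, each bank stays busy for a hypothetical `busy_cycles` per access, and a request to a busy bank stalls the requests behind it. The function name and cycle counts are not taken from FIG. 1.

```python
def finish_cycle(word_addrs, n_banks=16, busy_cycles=16):
    """Cycle at which the last requested word is available.

    Assumed model: one request issues per cycle, each bank stays busy for
    busy_cycles per access, and a request to a busy bank stalls the
    requests behind it.
    """
    bank_free = [0] * n_banks      # first cycle at which each bank is idle
    cycle = 0                      # earliest cycle the next request may issue
    done = 0
    for addr in word_addrs:
        bank = addr % n_banks      # low-order interleaving
        start = max(cycle, bank_free[bank])
        bank_free[bank] = start + busy_cycles
        done = start + busy_cycles
        cycle = start + 1
    return done

interleaved = finish_cycle(range(16))             # elements 0-15, one per bank
single_bank = finish_cycle(range(16), n_banks=1)  # every access hits one bank
print(interleaved, single_bank)  # 31 256
```

Under these assumed numbers, the sixteen interleaved accesses overlap almost completely, whereas the single-bank organization pays the full bank busy time for every element.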
The performance of an interleaved memory system can be substantially degraded by certain access patterns. Consider the following Fortran code example:
______________________________________
      Program EXAMPLE
      Real *8 A (128, 128), B (128, 128)
      . . .
      Do 10 I = 1, 128
      Do 10 J = 1, 128
      A (I, J) = A (I, J) + B (I, J) * 3.14
      . . .
 10   continue
______________________________________
In a vector computer, a vector load of A (I, J) produces a sequence of memory requests such as A (1,1), A (1,2), A (1,3), . . . etc., to memory. These requests are all addressed to the same memory bank (e.g., bank 0) because Fortran is a column major language, which stores arrays column by column, as compared to row major order, which stores arrays row by row. Thus, successive requests in the above-listed sequence are separated by an entire column in physical memory, i.e., 128 double words. Since the column size is an integer multiple of the number of banks (128 = 8 × 16), the successive memory requests go to the same bank. Accordingly, the memory bandwidth is significantly degraded and the latency much increased.
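The bank arithmetic behind this conflict can be checked with a short sketch. It assumes low-order interleaving at double-word granularity and that the array starts at double-word address 0; the helper names `word_offset` and `bank_of` and the `base` parameter are hypothetical.

```python
N_BANKS = 16  # banks 0-15, as in FIG. 1

def word_offset(i, j, leading_dim=128):
    """Double-word offset of A(i, j) in column-major order (1-based indices)."""
    return (i - 1) + (j - 1) * leading_dim

def bank_of(i, j, leading_dim=128, base=0):
    """Bank under low-order interleaving; base is an assumed start address."""
    return (base + word_offset(i, j, leading_dim)) % N_BANKS

# Requests A(1,1), A(1,2), ..., A(1,8) are 128 double words apart.
stride = word_offset(1, 2) - word_offset(1, 1)
row_banks = [bank_of(1, j) for j in range(1, 9)]
print(stride, row_banks)  # 128 [0, 0, 0, 0, 0, 0, 0, 0]
```

Because 128 mod 16 is 0, every request in the row lands in the bank of the first one, whatever the assumed base address.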
There are two compiler techniques that can be used to effectively alleviate this contention for the same bank in traditional low order interleaving. The first is to interchange the loop variables so that successive memory requests are to successive physical memory locations. Using this technique, the above-listed Fortran code would be as follows:
______________________________________
      Program EXAMPLE
      Real *8 A (128, 128), B (128, 128)
      . . .
      Do 10 J = 1, 128
      Do 10 I = 1, 128
      A (I, J) = A (I, J) + B (I, J) * 3.14
      . . .
 10   continue
______________________________________
The optimized code now generates a sequence of memory requests such as A (1,1), A (2,1), A (3,1), . . . etc. These successive requests go to successive memory banks so that there is no contention for the same bank during a given vector load. Thus, the memory system can operate at its maximum bandwidth.
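The effect of the loop interchange on the request stream can be sketched as follows. This is an illustrative model only: it tracks references to A alone, assumes A starts at double-word address 0 with low-order interleaving over 16 banks, and uses hypothetical names (`requests`, `banks_touched`).

```python
from itertools import islice

N_BANKS = 16
N = 128  # array is N x N double words

def requests(order):
    """Yield double-word offsets of A(I, J) in loop-nest order.

    order="IJ": I is the outer loop, J the inner loop (original code).
    order="JI": J is the outer loop, I the inner loop (interchanged code).
    """
    for outer in range(1, N + 1):
        for inner in range(1, N + 1):
            i, j = (outer, inner) if order == "IJ" else (inner, outer)
            yield (i - 1) + (j - 1) * N  # column-major offset

def banks_touched(order, count=16):
    """Banks addressed by the first `count` requests of the loop nest."""
    return [off % N_BANKS for off in islice(requests(order), count)]

original = banks_touched("IJ")      # all requests fall in bank 0
interchanged = banks_touched("JI")  # requests walk banks 0, 1, ..., 15
print(len(set(original)), len(set(interchanged)))  # 1 16
```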
The second compiler technique to avoid bank conflict in traditional low order interleaving systems is to change the dimensions of the array so that successive memory requests do not go to the same bank. In the example above, the two arrays can be dimensioned as A (129, 128) and B (129, 128). In this case, consecutive elements in a row of the array are separated by 129 double words, so that each falls in the next consecutive bank (i.e., 129 mod 16 = 1).
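The padding arithmetic can be verified with a small sketch, again assuming low-order interleaving over 16 banks and an array that starts at double-word address 0. The function name `row_banks` is hypothetical.

```python
N_BANKS = 16

def row_banks(leading_dim, count=8):
    """Banks hit by A(1,1), A(1,2), ... when each column holds
    leading_dim double words (column-major storage)."""
    return [((j - 1) * leading_dim) % N_BANKS for j in range(1, count + 1)]

unpadded = row_banks(128)  # stride 128: 128 mod 16 = 0
padded = row_banks(129)    # stride 129: 129 mod 16 = 1
print(unpadded)  # [0, 0, 0, 0, 0, 0, 0, 0]
print(padded)    # [0, 1, 2, 3, 4, 5, 6, 7]
```

The cost of this choice is one unused double word per column, which is why it is applied by changing the declared dimension rather than the loop bounds.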
Similar problems exist in high performance desktop systems, commonly referred to as "workstations," which use a different interleaving scheme. Workstations typically use a cache line interleaving memory system wherein consecutive cache lines are stored in consecutive banks. An example of this cache line interleaving is shown in FIG. 2. The cache line interleaving memory system in FIG. 2 assumes a cache line size of four consecutive double words. Accordingly, the first four double words (0-3) are stored in bank 0, the next four double words (4-7) in bank 1, etc.
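The mapping from double-word address to bank under cache line interleaving can be written down directly. The sketch below takes the four-double-word line size from the text and assumes 16 banks, which the description of FIG. 2 does not state; `line_bank` is a hypothetical name.

```python
WORDS_PER_LINE = 4  # cache line size in double words, per the text
N_BANKS = 16        # assumed bank count for illustration

def line_bank(word_addr):
    """Bank holding a double word when consecutive cache lines are
    placed in consecutive banks."""
    return (word_addr // WORDS_PER_LINE) % N_BANKS

first_eight = [line_bank(w) for w in range(8)]
print(first_eight)    # [0, 0, 0, 0, 1, 1, 1, 1]
print(line_bank(64))  # 0: the mapping wraps after 16 lines
```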
The example loop shown above is often interchanged in the same way as for vector processors in order to exploit the spatial locality of cache lines. As long as there is a hit in the cache 11, consecutive memory requests are to consecutive elements in a cache line. When a miss occurs, however, this loop transformation may cause a performance degradation in the memory system. Assuming the loop variables are interchanged as shown above, the order of references is likely to be A (I, J), B (I, J), A (I+1, J), B (I+1, J), . . . etc. If A (I, J) incurs a cache miss, then B (I, J) may also miss in the cache. In that case, the cache makes a memory request for A (I, J), which may be followed by a request for B (I, J). These two data elements, however, are located in the same bank. Hence, a bank conflict occurs, which delays the return of B (I, J).
Modern optimizing compilers avoid this bank conflict by spacing the two arrays out by one cache line so that a miss of an A reference and a miss of a B reference go to different banks. This is accomplished by declaring a dummy array between the two arrays in the example above where the dummy array has a size equal to one cache line so that the last element of the A array and the first element of the B array are separated by one cache line. This scheme works well in most situations. As will be described further below, however, I have discovered that this scheme creates bank conflicts in the presence of dirty misses. A dirty miss is a miss in a cache in which one or more elements in the corresponding victim cache line have been changed so that the contents of the cache line must be written back to main memory. This write-back causes the memory bank to be busy for many cycles. Accordingly, a need remains for an optimizing technique that deals with bank conflicts due to dirty misses.
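The conflict between A (I, J) and B (I, J), and the effect of the one-cache-line spacing, can be checked with a short sketch. It rests on several assumptions that are not stated in the text: 16 banks, A starting at double-word address 0, and B allocated immediately after A (plus any padding). The names `line_bank` and `same_bank_pairs` are hypothetical, and the sketch does not model dirty misses.

```python
WORDS_PER_LINE = 4  # cache line size in double words, per the text
N_BANKS = 16        # assumed bank count
N = 128             # arrays are N x N double words
A_WORDS = N * N     # size of A in double words

def line_bank(word_addr):
    """Bank under cache line interleaving."""
    return (word_addr // WORDS_PER_LINE) % N_BANKS

def same_bank_pairs(pad_words):
    """Count index pairs (I, J) for which A(I, J) and B(I, J) share a bank,
    with B placed pad_words double words after the end of A."""
    b_base = A_WORDS + pad_words
    count = 0
    for j in range(1, N + 1):
        for i in range(1, N + 1):
            off = (i - 1) + (j - 1) * N  # column-major offset
            if line_bank(off) == line_bank(b_base + off):
                count += 1
    return count

print(same_bank_pairs(0))               # 16384: every pair collides
print(same_bank_pairs(WORDS_PER_LINE))  # 0: one line of padding separates them
```

Under these assumptions A occupies 4096 cache lines, a multiple of 16, so without padding B (I, J) always maps to the bank of A (I, J); the dummy array shifts every B reference to the next bank.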
A second source of conflict can be attributed to data access path sharing. In a practical memory subsystem, memory banks are usually organized as a hierarchy rather than linearly interleaved as shown in FIGS. 1 and 2. A typical 16-bank memory subsystem may be organized as shown in FIG. 3. In that case, a CPU 10 is coupled to a master memory controller 12 via bus 14, over which data, control and address signals are transmitted. The master memory controller 12 determines which of its two main branches 28, 30 serves the memory request and forwards the memory request onto that branch. Each branch, in this case, includes eight banks. The left hand branch 28 includes the even numbered banks and the right hand branch 30 includes the odd numbered banks. The even numbered banks are further subdivided so that half of the even numbered banks (0, 4, 8, and 12) form one grouping 16 while the remaining even numbered banks are organized into a separate grouping 18. Each group of banks 16, 18 is coupled to a multiplexer 20 via a respective bus, A, B. Multiplexer 20 then transmits data between branch 28 and either one grouping or the other depending on the position of the multiplexer, as controlled by the master memory controller 12. Groupings 16 and 18 are controlled by slave memory controllers 22 and 24, respectively. The slave memory controllers receive control and address information from the master controller 12 and provide the corresponding data elements to the multiplexer 20 in response thereto. The right hand branch 30 is organized in substantially the same way and is therefore not discussed further.
This organization leads to a path conflict between banks that share a common bus. For example, in FIG. 3, banks 0, 4, 8 and 12 share a common bus A. Thus, when bank 0 is accessed, bank 4 cannot respond until the bus is released by bank 0. Accordingly, a need remains for an optimizing technique to deal with these path conflicts in hierarchical memory subsystem organizations.
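The bus sharing in FIG. 3 can be captured in a small lookup sketch. The left hand branch follows the text (banks 0, 4, 8, 12 on bus A; the remaining even banks on bus B). The right hand branch is described only as "substantially the same," so its grouping and the bus labels "C" and "D" are assumptions made here for illustration, as are the function names.

```python
# Bus reached by each bank, keyed by bank number modulo 4.
# A and B are from the text; C and D are hypothetical labels for the
# right hand branch, assumed to mirror the left hand branch.
BUS_BY_RESIDUE = {0: "A", 2: "B", 1: "C", 3: "D"}

def branch_of(bank):
    """Left hand branch 28 holds even banks, right hand branch 30 odd banks."""
    return 28 if bank % 2 == 0 else 30

def bus_of(bank):
    """Shared bus that connects the bank's grouping to its multiplexer."""
    return BUS_BY_RESIDUE[bank % 4]

def path_conflict(bank_a, bank_b):
    """True if two distinct banks must take turns on the same bus."""
    return bank_a != bank_b and bus_of(bank_a) == bus_of(bank_b)

print([b for b in range(16) if bus_of(b) == "A"])  # [0, 4, 8, 12]
print(path_conflict(0, 4), path_conflict(0, 2))    # True False
```

In this model any two banks whose numbers differ by a multiple of four contend for a bus, even though they never contend for the same bank.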