1. Field of the Invention
This invention relates to optimizing compilers and run-time memory arrangements, particularly microprocessor cache memories for computer systems.
2. Description of the Related Art
Computer system designers typically must strike a balance between cost and performance considerations. Advances in microprocessor performance relative to cost have tended to outpace advances in memory access performance. Despite the general decline in computer component prices, it remains prohibitively expensive to include in a computer system enough memory capable of feeding data to the microprocessor at a rate that matches the microprocessor's ability to process it.
One common compromise to this design conundrum, called “caching”, is to use faster but less dense memory placed nearer the CPU, in one or more hierarchical levels, coupled to cheaper and slower main memory. Data is copied into the faster memory (i.e., “cache” memory) from main memory when required by the microprocessor and may be copied back out to main memory if its value is changed (i.e., made “dirty”) during processing. For example, a level 1 (L1) cache (or caches) is typically provisioned on-board with the microprocessor while a level 2 (L2) cache placed nearby supports the L1 cache. Additional cache levels may be provided.
Under typical cache replacement strategies, the data most frequently requested by the microprocessor are stored in the L1 cache, the next most frequently used data are stored in the L2 cache, and so on. The least frequently used data items are stored in main memory (RAM or ROM) distant from the microprocessor. If the main memory is insufficient, additional data may be stored in mass storage devices. When the microprocessor requires information, the memory hierarchy is usually examined level by level to satisfy the request. The L1 cache is examined first, followed by the L2 cache, main memory and, finally, mass storage. A “miss” occurs if the cache level being examined cannot satisfy the request for the information. A cache miss results in lost microprocessor cycles due to increased memory access time, as a more remote and slower hierarchical level must be examined, thus degrading system performance.
FIG. 1 illustrates an exemplary system architecture employing L1 and L2 caches reminiscent of the PowerPC® processor marketed by International Business Machines Corporation. Computer system 50 includes a central processing unit (CPU) 52, interconnected via system bus 62 to ROM 64, RAM 66 and various input/output (I/O) devices 66 such as mass storage devices, user interface devices or printing devices (none of which are shown). Though not shown, some of these peripheral I/O devices 66 may be connected to system bus 62 via a peripheral component interconnect (PCI) bus. The PowerPC architecture provides for more than one processing unit, as is indicated by processing unit 53.
CPU 52 comprises a microprocessor core 54 and separate on-board L1 caches 56 and 58 for storing program instructions and data, respectively. The L1 caches 56 and 58 are implemented using high speed memory devices. The L1 caches 56 and 58 are included with the microprocessor core 54 on a single integrated chip 72, as is well known in the art, for providing single processor cycle access times.
An intermediate L2 cache 60 is connected to the L1 caches 56 and 58 via a processor bus 74. L2 cache 60 is also connected to system bus 62, perhaps through a bus interface (not shown). L2 cache 60 may comprise a 256 KB or 512 KB high speed memory chip. As L2 cache 60 lies between chip 72 and system bus 62, all information from RAM 66 or ROM 64 destined for microprocessor core 54 typically passes through L2 cache 60. It may take several processor cycles to access L2 cache 60.
A cache memory is usually divided into cache lines having a fixed number of sequential memory locations (e.g., 128 bytes). The data in each cache line is associated with its main memory location by an address tag. When a processor requires information that cannot be located in cache memory, the processing unit looks to main memory for the information. A main memory access performs a cache linefill operation, replacing a line of cache memory with a cache line size amount of data from main memory that contains the required data. As is well known, memory interfaces are configured to provide multiple byte transfers per memory access request to more efficiently transfer information between main memory and CPU. The additional bytes transferred in addition to those immediately required by the processor are thus more readily available for subsequent use.
When data from main memory is transferred to the cache, a cache line must be chosen in which to store the data. Cache memory implementations vary by the manner in which they map main memory addresses to cache lines. For direct mapped caches, a main memory address can only be mapped to a single cache line. Fully associative caches permit the mapping of an address to any cache line while N-way set associative caches map a main memory address to one of N lines, where N is typically between 2 and 16. Numerous methods exist for choosing which specific cache line is to be used when replacing cache lines (casting-out) as is well known in the art.
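The mapping schemes described above can be sketched as follows. This is a minimal illustration only: it assumes simple modulo indexing on the line number (real caches may hash or use different address bits), and the function names `cache_set_index` and `address_tag` are hypothetical.

```python
def cache_set_index(address, cache_bytes, line_bytes, ways):
    """Return the index of the set that a byte address maps to.

    ways=1 models a direct-mapped cache (one candidate line per address).
    ways equal to the total number of lines models a fully associative
    cache (a single set, so any line may hold the address).
    Assumes modulo indexing; actual hardware may differ.
    """
    num_lines = cache_bytes // line_bytes
    num_sets = num_lines // ways
    line_number = address // line_bytes
    return line_number % num_sets


def address_tag(address, cache_bytes, line_bytes, ways):
    """Return the tag that distinguishes lines sharing the same set."""
    num_sets = (cache_bytes // line_bytes) // ways
    return (address // line_bytes) // num_sets


if __name__ == "__main__":
    # Hypothetical 64 KB cache with 128-byte lines.
    for ways in (1, 4, 512):
        print(ways, cache_set_index(0x12345, 64 * 1024, 128, ways))
```

With `ways=1` each address has exactly one candidate line, while larger values of `ways` reduce the number of sets and give the replacement policy N candidate lines to choose from within the selected set.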
Various cache configurations are known to reduce memory latency times, typically by hiding operations. For example, U.S. Pat. No. 6,138,208 to Dhong et al., which issued Oct. 24, 2000 to the assignee of the present invention, illustrates a multiple level cache memory with overlapped L1 and L2 memory access. When a request for information is issued by a processor, the request is forwarded to the lower level of the cache before determining whether a cache miss of the value has occurred at the higher level. Address decoders may be operated in parallel at the higher level to satisfy the maximum number of potential simultaneous memory requests. Thus, L1 and L2 cache examination may be overlapped. Dual ported non-blocking caches are also known that allow access to data in a cache while processing an outstanding cache miss. Store-back buffers may be used when a dirty cache line needs to be castout to make room for a new line. See, for example, D. J. Shippy, T. W. Griffith, and Geordie Braceras, “POWER2 Fixed-Point, Data Cache, and Storage Control Units”, PowerPC and POWER2: Technical Aspects of the New IBM RS/6000, IBM Corporation, SA23-2737, 1994, pp. 29-45 (reproduced at rs6000.ibm.com under resource/technology/p2ppc_tech.html).
As microprocessor clock rates continue to accelerate faster than memory access rates, the effects of cache misses play an increasingly important role in system performance. Shortly after microprocessor caches were invented in the 1960s, attempts were made to restructure programs by hand to exploit the caches. Today, compilers and preprocessors can automatically restructure programs to improve performance using a variety of algorithms.
Historically, academic studies and software inventions that aimed to reduce the effects of cache misses have focused on reducing the number of cache miss occurrences. For example, a long-used cache-based optimization is the re-ordering of a nested set of loops with a goal of having a “stride-1” inner loop. After this optimization, the “spatial locality” of the program is typically enhanced as adjacent iterations of the inner loop access data items that are adjacent in memory. As a result, when a cache miss occurs, the returning cache line provides several required data items, and the miss penalty is effectively amortized over several elements.
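The effect of loop re-ordering on miss counts can be illustrated with a toy model. The sketch below assumes a hypothetical 4 KB direct-mapped cache with 128-byte lines, 8-byte elements, and a 64 × 64 array stored row by row; the helper names `count_misses` and `trace` are illustrative and are not taken from any cited work.

```python
def count_misses(addresses, cache_bytes=4096, line_bytes=128):
    """Count misses for a byte-address trace in a direct-mapped cache."""
    num_lines = cache_bytes // line_bytes
    resident = [None] * num_lines  # line number currently held by each slot
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        slot = line % num_lines
        if resident[slot] != line:
            resident[slot] = line  # linefill replaces the previous occupant
            misses += 1
    return misses


def trace(n, inner_is_contiguous, elem_bytes=8):
    """Yield byte addresses touched by a doubly nested loop over an
    n x n array stored row by row, i.e. element (r, c) lives at
    offset (r * n + c) * elem_bytes."""
    for outer in range(n):
        for inner in range(n):
            if inner_is_contiguous:
                r, c = outer, inner   # stride-1 inner loop
            else:
                r, c = inner, outer   # stride-n inner loop
            yield (r * n + c) * elem_bytes


if __name__ == "__main__":
    print("stride-1 misses:", count_misses(trace(64, True)))
    print("stride-n misses:", count_misses(trace(64, False)))
```

Under these assumed sizes the stride-1 order misses once per 16-element line (256 misses), whereas the strided order misses on every one of its 4096 accesses, which is the amortization effect described above.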
Another well-known optimization, referred to as “cache-blocking”, aims to improve the “temporal locality” of a program's data accessing pattern. This optimization requires that a sub-section of the program make multiple passes over one or more data structures, and is often described using matrix multiplication as an example. When performing A×B=C, where A, B, and C are all N×N matrices, the number of uses of a particular element of the A or B matrices is roughly N. For large matrices, data re-use from memory may be relatively high; however, if a cache-blocking scheme is not employed, data re-use within the cache is often low (˜1) since the long interval between adjacent uses of a particular element results in the element being castout to make room for the flow of other elements accessed during the interval.
In the case of matrix multiplication, cache blocking can be implemented by logically partitioning the larger matrices into many smaller blocks, with the small block size being chosen such that the sub-blocks can reside in cache while each element is accessed several times. For current cache implementations, re-use factors of 10-100 are not uncommon.
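A minimal sketch of the loop structure that cache blocking produces for matrix multiplication is given below. It shows only the tiling of the iteration space; the function name `blocked_matmul` and the `block` parameter are hypothetical, and Python lists do not reproduce the memory layout a compiled program would have, so no performance behavior should be inferred from it.

```python
def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists), visiting the
    iteration space one block x block tile at a time so that each
    tile of a, b and c is re-used while it is still resident."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):            # tile row of c and a
        for kk in range(0, n, block):        # tile column of a, tile row of b
            for jj in range(0, n, block):    # tile column of c and b
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += a_ik * b[k][j]
    return c


if __name__ == "__main__":
    n = 4
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    print(blocked_matmul(a, identity, n, block=2) == a)
```

In practice the block size would be chosen so that the tiles in use fit in the target cache level together, which is what yields the re-use described above.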
A third classic approach to reducing cache miss counts is known as “padding”. For most caches (those which are not fully associative), a given cache line (and therefore a given data element) can reside in only a limited number of slots (i.e., particular cache lines) in the cache. Thrashing, or the otherwise unnecessary replacement of cache lines, occurs when the number of data items from main memory contending for a given slot (direct mapped) or set of slots (set-associative) exceeds the number of slots implemented by the cache. For example, for a 4 KB direct-mapped cache with 128-byte lines, data items whose main memory addresses differ by a multiple of 4 KB will typically contend for the same slot in the cache. If two or more of the items are frequently accessed, the data items will take turns displacing each other, resulting in additional cache misses. To reduce the cache miss count in this case, the appropriate data structures may be “padded”, often increasing the size of the program's data footprint, so that the two or more “hot” items no longer reside at main memory addresses that differ by a multiple of 4 KB. By moving at least one of the data items by 128 bytes (the line size), the competing items map to distinct slots in the cache and the contention between them is eliminated.
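The 4 KB example above can be checked with a few lines of arithmetic. The sketch assumes modulo slot selection, and the base addresses `base_x` and `base_y` are hypothetical values chosen only so that they differ by a multiple of 4 KB.

```python
CACHE_BYTES = 4 * 1024                   # 4 KB direct-mapped cache
LINE_BYTES = 128                         # 128-byte cache lines
NUM_SLOTS = CACHE_BYTES // LINE_BYTES    # 32 slots


def slot(address):
    """Slot (cache line) that a byte address maps to, assuming
    simple modulo indexing on the line number."""
    return (address // LINE_BYTES) % NUM_SLOTS


# Hypothetical addresses of two frequently accessed ("hot") items.
base_x = 0x10000
base_y = base_x + 3 * CACHE_BYTES        # differs by a multiple of 4 KB

# Padding: shift the second item by one line size.
padded_y = base_y + LINE_BYTES

if __name__ == "__main__":
    print("before padding:", slot(base_x), slot(base_y))
    print("after padding: ", slot(base_x), slot(padded_y))
```

Before padding both items select the same slot and displace each other; after the 128-byte shift they select adjacent slots and can be resident at the same time.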
One padding technique is disclosed in U.S. Pat. No. 5,943,691 of Wallace, et al., issued Aug. 24, 1999 to Sun Microsystems, Inc. and entitled “Determination of array padding using collision vectors”. Wallace, et al. disclose a method and apparatus for determining and resolving cache conflicts. According to the method of that invention, a cache-shaped vector that characterizes the size and dimension of the cache is determined under computer control. At least one cache conflict among the arrays stored in the main memory is then identified, together with the conflict region in the cache for the conflicting arrays. A padding value is then determined for the arrays stored in the main memory, and the memory locations of the arrays are adjusted in accordance with the padding value to prevent cache conflicts when the data from the conflicting arrays is transferred from the main memory into the cache.
A further padding technique is described in U.S. Pat. No. 6,041,393 to Hsu, issued Mar. 21, 2000 to Hewlett-Packard Company and entitled “Array padding for higher memory throughput in the presence of dirty misses”. The array padding technique described there increases memory throughput in the presence of dirty misses for computer systems having interleaved memory. The technique pads arrays so that the starting addresses of arrays within a target loop are separated by P memory banks, where P is a rounded integer equal to the number of memory banks divided by the number of arrays. The number of banks of separation can be incremented or decremented by 1 to also avoid path conflicts due to the sharing of buses in a typical hierarchical memory sub-system.
Since cache-blocking techniques aim to improve temporal locality, padding is often additionally employed to minimize unnecessary castouts.
These attempts to reduce cache miss effects typically assume that the penalty for a cache miss is relatively constant, which is not typically the case in current systems. The penalty incurred by a program due to cache misses is often difficult to compute, but it is a function of the number of cache misses, the penalty (memory latency) for each, and the amount of unrelated (independent) work which is available to overlap with each given miss. The latency for a given cache can be highly variable, and can depend on the degree of memory activity (increased contention often increases queuing and therefore latency) and leading and trailing edge effects of the memory and buses.
As an example of leading and trailing edge effects affecting memory responsiveness, when there are two misses from the same microprocessor in the Model 397™ Workstation from International Business Machines Corporation, the latency for a second miss can range from 25 cycles to 65 cycles, depending on the interval between miss requests and whether a dirty line needs to be castout to make room for the second line. The IBM Model 397 implements a store-back buffer so that if sufficient time elapses between the two cache misses, the processor will have emptied the store-back buffer and the store-back buffer will be available if the second miss causes a line to be castout—in this case, the castout generated during the second miss will have no effect on the latency for this miss. In the absence of castouts, the latency can vary between 25 and 40 cycles, with the latter case being back-to-back misses where little of the leading/trailing edge and set-up requirements are able to be hidden.
Cache misses that are bunched together in time (bursty), as is often seen in existing programs, suffer from longer average latencies than cache misses that are more uniformly spaced in time.
As an example, consider the following simple loop running on a microprocessor with a 64 KB cache having a cache line size of 128 bytes:
SUBROUTINE DOT_PRODUCT(SUM)
REAL*8 SUM, A(1000000), B(1000000)
COMMON A,B
SUM=0.
DO I=1,1000000
  SUM=SUM+A(I)*B(I)
ENDDO
RETURN
END
The two arrays total 16,000,000 bytes (8 bytes for each of the 2,000,000 elements). The three traditional methods for reducing cache misses do not provide performance gains for the above code. The access patterns are already stride-1 for adjacent elements, so there is no opportunity to improve spatial locality. There is no potential for re-use as every element is used only once; thus, cache-blocking will not provide a benefit. For the Ith iteration of the loop, A(I) and B(I) are referenced; the number of bytes between these references is 8,000,000, which does not divide evenly by 64K (8,000,000 mod 64K=4608). Therefore, no thrashing occurs even in a direct-mapped cache and traditional padding techniques will not provide a performance gain.
However, since the distance between A(I) and B(I) is an exact multiple of the 128-byte cache line size (8,000,000 = 128 × 62,500), when a miss for the A element occurs in a given iteration, a miss for the B element will occur in the very same iteration, perhaps a cycle apart.
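The arithmetic behind this observation can be reproduced with a short sketch. It assumes that A and B are laid out contiguously in the COMMON block, so that B begins 8,000,000 bytes after A; the starting address passed to the hypothetical helper `new_line_iterations` is likewise an assumption.

```python
CACHE_BYTES = 64 * 1024   # 64 KB cache
LINE_BYTES = 128          # 128-byte cache lines
ELEM_BYTES = 8            # REAL*8
N = 1_000_000             # elements per array

# Bytes between A(I) and B(I), assuming B immediately follows A.
distance = N * ELEM_BYTES


def new_line_iterations(base, count):
    """1-based loop iterations at which a stride-1 sweep starting at
    byte address `base` first touches a new cache line."""
    iterations = []
    previous_line = None
    for i in range(1, count + 1):
        line = (base + (i - 1) * ELEM_BYTES) // LINE_BYTES
        if line != previous_line:
            iterations.append(i)
            previous_line = line
    return iterations


if __name__ == "__main__":
    print(distance % CACHE_BYTES)   # offset within the cache: no thrashing
    print(distance % LINE_BYTES)    # offset within a line: misses coincide
    base_a = 0                      # hypothetical address of A(1)
    print(new_line_iterations(base_a, 64))
    print(new_line_iterations(base_a + distance, 64))
```

Because the distance leaves a remainder of 4608 modulo the cache size, the two streams occupy different slots; because it leaves no remainder modulo the line size, both streams cross into a new line on the same iterations, whatever the alignment of A(1).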
The Model 397, like all POWER2, P2SC, and POWER3 designs, permits the detection of at least a second miss while the first miss is outstanding, and provides partial or full overlap of the trailing edge of the first miss with some portion of the second miss.
What is needed is a method and system adapted to recognize program instructions that, when executed, generate one or more subsequent cache misses while a first cache miss is outstanding, and to automatically restructure the program to reduce its average cache miss penalty.