Software performance is directly linked to memory performance and processor speed. A processor is the part of a computer which controls the other parts. A processor typically includes units for fetching instructions and processing them to produce signals which control the other parts of the computer. The speed of a processor affects how quickly the processor performs operations. Computer memory, in a general sense, is a medium or device that can hold data in a format that the processor can use. The data in computer memory usually include computer instructions as well as data operated upon according to the computer instructions.
In a general sense, data stored in memory is often organized into data objects, which may in turn include attributes or fields. Data objects can correspond to a wide variety of data items or data structures, such as program variables and constants, arrays, records, or other aggregate data structures.
One way to improve the performance of software on computer systems is to install large amounts of fast memory. However, installing large amounts of fast memory is expensive—generally, the faster the memory, the more expensive it is. In addition, larger memory is generally slower to access than smaller memory due to the lookup circuitry. So, even if fast memory is cheap, smaller memory will still be faster than larger memory, and there will still be a memory hierarchy. An economical solution to this problem is to have a memory hierarchy within the computer system, which takes advantage of the cost/performance benefit of different memory technologies. The hierarchy is based on memories of different speeds and sizes, which are organized into several levels, each smaller, faster, and more expensive per byte than the next level. To work correctly, the levels of the hierarchy usually encompass one another. For example, all of the data at the fastest, topmost level in the hierarchy is also found in the level below it, and so on, until the bottom level of the hierarchy. Ideally, the memory system appears to work almost as fast as the fastest memory, yet does not cost much more than the cheapest memory.
The importance of the memory hierarchy has increased with advances in processor (CPU) performance. CPU performance has improved, and continues to improve, at a dramatic rate (performance nearly doubles every other year). Memory chip performance, on the other hand, has progressed much more slowly. As a result, there has been a widening processor-memory performance gap.
I. Cache Memory
In the memory hierarchy, cache memory is the name generally given to the first, topmost level of memory encountered by a processor. It is typically the fastest memory in terms of performance, as well as the most expensive per byte. Some processors have separate caches for instructions and other data, where both can be active at the same time. Caches can use various kinds of addressing (e.g., direct mapped, fully associative, set associative). Effective cache memory utilization is an important determinant of overall program performance and may help bridge the processor-memory performance gap. Thus, improving cache memory performance is vital to software performance.
More specifically, cache memory is typically small, fast memory holding recently accessed data, designed to speed up subsequent access to the same data. In usual operations, for example, when data is read from or written to main memory, a copy is also saved in the cache along with the associated main memory address. The system monitors addresses of subsequent reads to see if the data is already in the cache. If the data is already in the cache (called a cache hit), then the data is returned immediately and the main memory read is aborted (or not started). If the data is not cached (called a cache miss), then the data is fetched from main memory and also saved in the cache, which can be time-consuming and can cause processing to wait for the data to be loaded. Because the cache is built from faster memory chips than main memory, a cache hit takes much less time to complete than a normal memory access. Moreover, the cache may be located on the same integrated circuit as the CPU in order to further reduce the access time. In this case, the cache is often known as “primary” since there may be a larger, slower secondary cache (i.e., a lower level in the hierarchy) outside the CPU chip.
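By way of illustration only, the hit and miss behavior described above can be sketched with a small simulation of a direct-mapped cache. The function name `simulate_direct_mapped`, the cache geometry (four lines of 16 bytes), and the address trace are assumptions chosen for the example and do not describe any particular processor.

```python
def simulate_direct_mapped(addresses, num_lines=4, block_size=16):
    """Return (hits, misses) for a byte-address trace on a direct-mapped cache."""
    tags = [None] * num_lines           # tag currently stored in each cache line
    hits = misses = 0
    for addr in addresses:
        block = addr // block_size      # block number containing this address
        index = block % num_lines       # the one line this block may occupy
        tag = block // num_lines        # distinguishes blocks sharing a line
        if tags[index] == tag:
            hits += 1                   # cache hit: data returned from the cache
        else:
            misses += 1                 # cache miss: fetch from main memory
            tags[index] = tag           # and save the block in the cache
    return hits, misses


# Hypothetical trace: addresses 0, 4, 8 share a block; 64 maps to the same line.
trace = [0, 4, 8, 64, 0, 4]
print(simulate_direct_mapped(trace))    # (3, 3)
```

In this trace the access to address 64 displaces the block holding address 0, so the later access to address 0 misses again even though the data was cached earlier.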
When the cache is full and another block of data is to be cached, a cache entry is selected to be written back to main memory, or “flushed,” and the new block is put in its place. A replacement algorithm determines which entry is flushed.
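As one example of a replacement algorithm, the following sketch uses a least-recently-used (LRU) policy. LRU is an assumption made for illustration, since no particular policy is named above; the function name `simulate_lru` and the block trace are likewise hypothetical.

```python
from collections import OrderedDict


def simulate_lru(blocks, capacity):
    """Return (misses, evicted) for a trace of block numbers under LRU."""
    cache = OrderedDict()               # keys ordered from least to most recent
    misses = 0
    evicted = []
    for block in blocks:
        if block in cache:
            cache.move_to_end(block)    # hit: mark as most recently used
            continue
        misses += 1
        if len(cache) >= capacity:      # cache full: choose an entry to flush
            victim, _ = cache.popitem(last=False)
            evicted.append(victim)
        cache[block] = True             # new block takes the freed place
    return misses, evicted


print(simulate_lru([1, 2, 3, 1, 4, 2], capacity=3))   # (5, [2, 3])
```

Other policies (for example, random or first-in-first-out selection) would change only the choice of victim in the sketch.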
An important characteristic of a cache is its hit rate, which is the fraction of all memory accesses that are satisfied from the cache. The hit rate depends on the cache design as well as its size relative to the main memory. The size is limited by the cost of fast memory chips. The hit rate also depends on the access pattern of the particular program being run (the sequence of addresses being read, written, etc.). Conventionally, caches rely on two properties of the access patterns of most programs: (1) temporal locality, meaning that if something is accessed once, it is likely to be accessed again soon; and (2) spatial locality, meaning that if one memory location is accessed, then nearby memory locations are likely to be accessed. In order to exploit spatial locality, caches often operate on several words (units of storage in a computer) at a time, called a “cache line” or “cache block.” Reads from and writes to main memory then transfer whole cache blocks.
A cache block is the smallest unit of memory that can be transferred between the memory and the cache. Rather than reading a single word or byte from main memory at a time, each cache entry usually holds a certain number of words, and a whole block is read and cached at once. This takes advantage of the principle of spatial locality: if one location is read, then nearby locations (particularly, following locations) are likely to be read soon afterward.
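The effect of block-sized transfers on hit rate can be illustrated with the following sketch, which compares a sequential access pattern with a strided one. The function name `hit_rate`, the block size of eight words, and the simplifying assumption of an unbounded cache (so that only first-touch misses occur) are all choices made for the example.

```python
def hit_rate(word_addresses, words_per_block=8):
    """Fraction of accesses that find their block already cached."""
    cached_blocks = set()               # assumption: cache never fills up
    hits = 0
    for word in word_addresses:
        block = word // words_per_block
        if block in cached_blocks:
            hits += 1                   # a neighbor already brought the block in
        else:
            cached_blocks.add(block)    # miss: whole block is read and cached
    return hits / len(word_addresses)


sequential = list(range(64))            # 64 consecutive words
strided = [i * 8 for i in range(64)]    # each access falls in a new block

print(hit_rate(sequential))             # 0.875
print(hit_rate(strided))                # 0.0
```

In the sequential case each miss loads seven further words that are used soon afterward, whereas in the strided case the rest of every fetched block goes unused.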
When a computer program is running on a computer, instructions and data for the computer program are laid out in memory. The layout of instructions and data in memory can affect cache performance. In a good memory layout, instructions/data that are accessed around the same time during program execution are located close together in computer memory, which makes them more likely to be in the same cache block, increasing the cache hit rate. In a poor memory layout, instructions/data that are accessed around the same time during program execution are not located close together in computer memory, leading to more cache misses and more normal (slower) memory accesses.
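The difference between a good and a poor layout can be sketched by counting how many cache blocks are touched when a group of objects is accessed together. The function name `blocks_touched`, the object names and sizes, and the 64-byte block size are hypothetical values used only for illustration.

```python
def blocks_touched(layout, accessed, block_size=64):
    """Count distinct cache blocks covered by the accessed objects.

    layout is a list of (name, size_in_bytes) pairs in memory order,
    assumed to start at address 0 with no padding.
    """
    offset = 0
    touched = set()
    for name, size in layout:
        if name in accessed:
            first = offset // block_size
            last = (offset + size - 1) // block_size
            touched.update(range(first, last + 1))
        offset += size
    return len(touched)


together = {"a", "b", "c"}              # objects accessed around the same time

poor = [("a", 16), ("cold1", 64), ("b", 16), ("cold2", 64), ("c", 16)]
good = [("a", 16), ("b", 16), ("c", 16), ("cold1", 64), ("cold2", 64)]

print(blocks_touched(poor, together))   # 3
print(blocks_touched(good, together))   # 1
```

With the second layout, a single block fetch brings in all three objects, so accesses to the other two can hit in the cache.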
In addition to cache hits, misses, and hit rate (discussed above), certain other metrics are used to determine the effectiveness of a memory layout. These include miss rate, miss penalty, and miss reduction. Cache miss rate is the fraction of accesses that are not satisfied from cache memory. Miss penalty is the additional time required to service a cache miss (e.g., to perform a normal, non-cache memory access). Cache miss reduction indicates the number or proportion of misses that could be avoided based on a given layout.
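These metrics can be related to one another as in the following sketch. The helper names are hypothetical, and the average access time formula (hit time plus miss rate times miss penalty) is the commonly used textbook relation, included here as an assumption rather than something stated above. The numeric inputs are invented for the example.

```python
def miss_rate(misses, accesses):
    """Fraction of accesses not satisfied from the cache."""
    return misses / accesses


def miss_reduction(baseline_misses, new_misses):
    """Proportion of baseline misses avoided by a new layout."""
    return (baseline_misses - new_misses) / baseline_misses


def average_access_time(hit_time, rate, miss_penalty):
    """Average time per access: hit time plus the expected miss cost."""
    return hit_time + rate * miss_penalty


# Hypothetical measurements for two layouts of the same program.
accesses = 1000
baseline = miss_rate(50, accesses)      # original layout: 50 misses
improved = miss_rate(30, accesses)      # rearranged layout: 30 misses

print(miss_reduction(50, 30))
print(average_access_time(1, baseline, 100))
print(average_access_time(1, improved, 100))
```

Because the miss penalty is typically many times the hit time, even a modest miss reduction can lower the average access time noticeably.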
II. Exploring Different Memory Layouts
Various software tools have been developed to analyze and suggest particular memory layouts for computer programs. Memory and cache behavior studies of general-purpose programs indicate that a small fraction of data objects (around 10%) are responsible for most data references and cache misses (around 90%). For example, see the articles: (1) Chilimbi, “Efficient Representations and Abstractions for Quantifying and Exploiting Data Reference Locality,” SIGPLAN Conf. on Prog. Lang. Design and Impl., June 2001 (“Chilimbi 1”); and (2) Rubin et al., “An Efficient Profile-analysis Framework for Data Layout Optimizations,” Symp. on Princ. of Prog. Lang., January 2002 (“Rubin”). Data objects that are accessed relatively frequently are called “hot” data objects. These hot data objects are attractive targets for cache locality optimizations since rearranging these hot data objects in memory can reduce cache miss rates.
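As a simple illustration of identifying hot data objects, the following sketch selects the smallest set of most frequently referenced objects that accounts for a given share of all references. It is not the method of the cited articles; the function name `hot_objects`, the 90 percent threshold parameter, and the reference trace are assumptions made for the example.

```python
from collections import Counter


def hot_objects(trace, coverage_percent=90):
    """Return the most-referenced objects covering the given share of the trace."""
    counts = Counter(trace)
    total = len(trace)
    hot, covered = [], 0
    for obj, n in counts.most_common():
        # Integer comparison avoids floating-point rounding at the threshold.
        if covered * 100 >= coverage_percent * total:
            break
        hot.append(obj)
        covered += n
    return hot


# Hypothetical trace: two of five objects receive 90 of 100 references.
trace = ["a"] * 60 + ["b"] * 30 + ["c"] * 4 + ["d"] * 3 + ["e"] * 3
print(hot_objects(trace))               # ['a', 'b']
```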
While cache-conscious data placement techniques exist, they generally suffer from two primary drawbacks. First, their memory placement decisions are guided by object/field frequency or pairwise affinity profiles, which only provide rough approximations of a program's temporal data reference behavior (the way objects, variables, and other data are accessed by a program over time). Also, their layout decisions are determined by fairly ad-hoc heuristics. These drawbacks seriously limit performance because layouts guided by inexact profiles may be far from optimal. Moreover, prior layout heuristics generally are not both robust and effective (i.e., they do not work consistently well for a wide variety of programs).
Object frequency profiles, for example, typically rely on processor access history to make caching decisions and keep the most frequently used objects in cache. Towards the end of execution, object frequency profiles typically cannot find an optimal cache layout solution because some objects may accumulate large reference counts and never become candidates for replacement, even if the objects are no longer active. Aging techniques counterbalance the accumulation effect, but they add further complication to the profile.
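The accumulation effect, and one possible aging technique, can be sketched as follows. Periodically halving all counts is only one assumed form of aging, chosen for brevity; the function name `frequency_profile`, the `decay_every` parameter, and the trace are hypothetical.

```python
def frequency_profile(trace, decay_every=None):
    """Count references per object, optionally halving all counts periodically."""
    counts = {}
    for i, obj in enumerate(trace, start=1):
        counts[obj] = counts.get(obj, 0) + 1
        if decay_every and i % decay_every == 0:
            # Aging step: older references gradually lose their weight.
            counts = {k: v // 2 for k, v in counts.items()}
    return counts


# "old" is active early and then never referenced again.
trace = ["old"] * 16 + ["new"] * 8

print(frequency_profile(trace))                 # {'old': 16, 'new': 8}
print(frequency_profile(trace, decay_every=4))  # {'old': 0, 'new': 3}
```

Without aging, the inactive object still ranks highest at the end of the trace. With aging, the ranking follows recent activity, at the cost of an extra parameter that must be tuned.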
A pairwise affinity profile maintains information on how many cache misses would be caused if a pair of objects is mapped to the same cache block. One drawback is that the profile considers only two objects at a time. Moreover, to find an optimal solution, a pairwise affinity profile needs to compute every combination of object pairings and the effects of the object pairings on the overall cache layout. These computations require intensive and complicated heuristics that may ultimately result in failing to find a good solution for many kinds of caches.
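A simplified pairwise profile can be sketched by counting how often two objects are referenced close together in a trace. This is an approximation assumed for illustration, not the exact profile of any cited work; the function name `pairwise_affinity`, the window size, and the trace are hypothetical.

```python
from collections import Counter, deque


def pairwise_affinity(trace, window=2):
    """Count, for each pair of objects, references that fall within the window."""
    recent = deque(maxlen=window)       # the most recently referenced objects
    affinity = Counter()
    for obj in trace:
        for other in set(recent):
            if other != obj:
                pair = tuple(sorted((obj, other)))
                affinity[pair] += 1
        recent.append(obj)
    return affinity


profile = pairwise_affinity(["a", "b", "a", "b", "c"])
print(profile[("a", "b")])              # 3
print(profile[("a", "c")])              # 1
print(profile[("b", "c")])              # 1
```

The profile records nothing about groups of three or more objects, and for n objects it may hold up to n(n-1)/2 pairs, which suggests why evaluating every pairing becomes expensive.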
Some related works that attempt to solve memory performance issues include the articles: (1) Seidl and Zorn, “Segregating Heap Objects by Reference Behavior and Lifetime,” Eighth Intl. Conf. on Arch. Support for Prog. Lang. and Operating Sys., pages 12-23, October 1998 (“Seidl”); (2) the Rubin article; (3) Chilimbi et al., “Cache-conscious Structure Layout,” SIGPLAN Conf. on Prog. Lang. Design and Impl., May 1999 (“Chilimbi 2”); and (4) Calder et al., “Cache-conscious Data Placement,” Eighth Intl. Conf. on Arch. Support for Prog. Lang. and Operating Sys., pages 139-149, October 1998 (“Calder”).
The Seidl article describes allocating heap objects in four pre-defined memory “arenas” based on a predicted hit-miss ratio. The four arenas are four different areas of memory, labeled highly referenced, not highly referenced, short lived, and other. In a similar fashion, the Rubin article describes using a search-based learning technique to classify heap objects according to runtime characteristics such as allocation calling context, object size, and other like characteristics. The Rubin article also describes, based on this classification, allocating objects in separate heap arenas. The techniques in the Seidl and Rubin articles improve virtual memory performance by increasing page utilization. The problem with these prior art references, however, is that they limit the number of memory arenas used for allocations, they emphasize individual objects over streams of data, and, ultimately, they have little, if any, impact on cache performance because their coallocation analysis and/or enforcement techniques are too coarse (i.e., they address chunks of data too large) to efficiently coallocate objects for benefits at the cache level.
The Chilimbi 2 article describes ccmalloc, a cache-conscious heap allocator that uses programmer annotations to allocate contemporaneously accessed data objects to be in the same cache block. One major drawback to this technique is that it requires programmer intervention—the programmer manually places the annotations in the software, which can be time-consuming. Another drawback is that the technique requires that the programmer accurately assess the run-time behavior of the software, which may be difficult to discern for some scenarios and may change in different scenarios.
Calder describes applying placement techniques developed for instruction caches to data. Specifically, Calder describes a compiler-directed approach that creates an address placement scheme for stack variables, global variables, and heap objects in order to reduce data cache misses. Calder describes calculating a temporal relationship graph (“TRG”). The TRG improves performance for stack objects and global variables but does little to improve the caching performance of heap objects. Further, drawbacks to the TRG include the fact that the TRG does not use hot data stream profiles and that it depends on an arbitrary temporal reference window size, which may not improve cache miss rates for some programs. For additional details, see the respective papers.
Thus, in contrast to prior cache-conscious data placement techniques and tools, techniques and tools are needed that produce a good memory layout in terms of cache performance, that effectively use global temporal access information for programs, that effectively cluster objects together for different kinds of caches, that lead to significant cache miss reductions in practice, and that do not rely on ad-hoc heuristics.