Caching is a common technique in computer systems to improve performance by enabling retrieval of frequently accessed data from a higher-speed cache instead of having to retrieve it from slower memory and storage devices. Caching occurs not only at the level of the CPU itself, but also in larger systems, up to and including caching in enterprise-sized storage systems or even potentially globally distributed “cloud storage” systems.
For example, caches are commonly included in central processing units (CPUs) to increase processing speed by reducing the time it takes to retrieve information from memory or other storage device locations. As is well known, a CPU cache is a type of memory fabricated as part of the CPU itself. In some architectures such as x86, caches may be configured, hierarchically, with multiple levels (L1, L2, etc.), and separate caches may have different purposes, such as an instruction cache for executable instruction fetches, a data cache for data fetches, and a Translation Lookaside Buffer (TLB) that aids virtual-to-physical address translation. Access to cached information is therefore faster—usually much faster—than access to the same information stored in the main memory of the computer, to say nothing of access to information stored in non-solid-state storage devices such as a hard disk.
On a larger scale, dedicated cache management systems may be used to allocate cache space among many different client systems communicating over a network with one or more servers, all sharing access to a peripheral bank of solid-state mass-storage devices. This arrangement may also be found in remote “cloud” computing environments.
Data is typically transferred between memory (or another storage device or system) and cache as cache “lines”, “blocks”, “pages”, etc., whose size may vary from architecture to architecture. In systems with an x86 architecture, for example, the transfer size between CPU caches and main memory is commonly 64 bytes. In systems that have a caching hierarchy, relatively slow memory (such as RAM, which is slow compared to processor cache) may be used to cache even-slower memory (such as storage devices). Note also that, in such systems, the transfer size between levels of the cache generally increases, e.g., typically 64 bytes from DRAM to processor cache, but typically 512 B to 64 KB between disk and DRAM-based cache. Just for the sake of succinctness, all the different types of information that are cached in a given system are referred to commonly here as “data”, even if the “data” comprises instructions, addresses, etc. Transferring blocks of data at a time may mean that some of the cached data will not be accessed often enough to provide a benefit from caching, but this is typically more than made up for by the relative efficiency of transferring blocks as opposed to data at many individual memory locations; moreover, because data in adjacent or close-by addresses is very often needed (“spatial locality”), the inefficiency is not as great as randomly distributed addressing would cause.
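By way of illustration only, the mapping from a byte address to the cache line that contains it may be sketched as follows. The 64-byte line size is the x86 figure mentioned above; the function names `line_tag` and `line_offset` are hypothetical and chosen only for this example.

```python
LINE_SIZE = 64  # bytes per cache line; the common x86 figure, assumed here

def line_tag(address, line_size=LINE_SIZE):
    """Return the index of the line-sized block that contains the address."""
    return address // line_size

def line_offset(address, line_size=LINE_SIZE):
    """Return the position of the address within its line."""
    return address % line_size

# Addresses 0x1000 through 0x103F share one line, so fetching any one of
# them brings in its neighbors as well (spatial locality).
same_line = line_tag(0x1000) == line_tag(0x103F)
next_line = line_tag(0x1040)
```

In this sketch, a single transfer covers all 64 addresses of a line, which is why accesses to nearby addresses tend to hit after the first miss.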
A common structure for each entry in the cache is to have at least three elements: a “tag” that indicates where (generally an address) the data came from in memory; the data itself; and one or more flag bits, which may indicate, for example, if the cache entry is currently valid, or has been modified.
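Purely as an illustrative sketch, such an entry might be represented as shown below. The class name `CacheEntry` and the particular flags `valid` and `dirty` are assumptions made for this example; actual hardware or software caches may use other fields and encodings.

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    tag: int             # where the data came from, e.g. a block-aligned address
    data: bytes          # the cached copy of the data itself
    valid: bool = True   # flag bit: the entry currently holds usable data
    dirty: bool = False  # flag bit: the entry has been modified since it was loaded

# A 64-byte entry that is then marked as modified.
entry = CacheEntry(tag=0x1F40, data=bytes(64))
entry.dirty = True
```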
Regardless of the number, type or structure of the cache(s), however, the standard operation is essentially the same: When a system hardware or software component needs to read from a location L in storage (main or other memory, a peripheral storage bank, etc.), it first checks to see if a copy of that data is in any cache line(s) that includes an entry that is tagged with the corresponding location identifier, such as a memory address. If it is (a cache hit), then there is no need to expend relatively large numbers of processing cycles to fetch the information from storage; rather, the processor may read the identical data faster—typically much faster—from the cache. If the requested read location's data is not currently cached (a cache miss), or the corresponding cached entry is marked as invalid, however, then the data must be fetched from storage, whereupon it may also be cached as a new entry for subsequent retrieval from the cache.
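The read path just described may be sketched, under simplifying assumptions, as follows. The class `SimpleCache`, its methods, and the use of a Python dictionary as a stand-in for slow storage are all hypothetical; the sketch ignores capacity limits and line granularity.

```python
class SimpleCache:
    def __init__(self, backing_store):
        self.backing = backing_store  # dict: location -> data (stand-in for slow storage)
        self.entries = {}             # dict: tag -> (data, valid flag)
        self.hits = 0
        self.misses = 0

    def read(self, location):
        entry = self.entries.get(location)
        if entry is not None and entry[1]:
            # Cache hit: a valid entry tagged with this location exists.
            self.hits += 1
            return entry[0]
        # Cache miss (or invalid entry): fetch from storage, then cache it.
        self.misses += 1
        data = self.backing[location]
        self.entries[location] = (data, True)
        return data

    def invalidate(self, location):
        # Keep the entry but clear its valid flag.
        if location in self.entries:
            data, _ = self.entries[location]
            self.entries[location] = (data, False)

storage = {0x10: b"alpha", 0x20: b"beta"}
cache = SimpleCache(storage)
first = cache.read(0x10)   # miss: fetched from storage
second = cache.read(0x10)  # hit: served from the cache
cache.invalidate(0x10)
third = cache.read(0x10)   # miss again: the entry was marked invalid
```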
In most systems, the cache will populate quickly. Whenever a new entry must be created in a cache that is already full (for example, because the cache has a fixed or current maximum size), some other entry must be evicted to make room for it. There are, accordingly, many known cache “replacement policies” that attempt to minimize the performance loss that each replacement causes. Many of these policies rely on a “least-recently used” (LRU) heuristic or variants of it, which predict that the entries that have gone longest without being accessed are the least likely to be used again and are therefore the most suitable for eviction.
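As a minimal sketch of a plain LRU replacement policy, and not of any particular product's policy, eviction may be illustrated as follows. The class name `LRUCache`, the `access` method, and the `load` callback are hypothetical; the capacity is assumed to be at least one entry.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity      # maximum number of entries (assumed >= 1)
        self.entries = OrderedDict()  # ordered from least to most recently used

    def access(self, tag, load):
        """Return (value, hit). On a miss, call load(tag) and cache the result."""
        if tag in self.entries:
            self.entries.move_to_end(tag)  # mark as most recently used
            return self.entries[tag], True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        value = load(tag)
        self.entries[tag] = value
        return value, False

lru = LRUCache(capacity=2)
results = [lru.access(t, load=str.upper)[1] for t in ["a", "b", "a", "c", "b"]]
remaining = list(lru.entries)
```

In the usage above, the access to "c" evicts "b" (the least recently used entry at that point), so the final access to "b" misses and in turn evicts "a".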
In some schemes, for various known reasons, including reducing demand on the cache, some memory locations may be marked as non-cacheable, in which case, of course, the soft- or firmware that controls the cache will not create an entry for them on misses. Furthermore, the cache may also be used analogously for data writes. Two common write policies include “write back,” in which modified data is held in the cache until evicted or flushed to a backing store, and “write through,” in which modified data is concurrently stored in the cache and written to the backing store.
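The difference between the two write policies may be sketched as follows. The class `WriteCache` and its `write_through` parameter are hypothetical names used only for illustration, and a dictionary again stands in for the backing store.

```python
class WriteCache:
    def __init__(self, backing, write_through):
        self.backing = backing            # dict standing in for the backing store
        self.write_through = write_through
        self.data = {}                    # cached copies
        self.dirty = set()                # locations modified but not yet written back

    def write(self, location, value):
        self.data[location] = value
        if self.write_through:
            # Write through: update the backing store together with the cache.
            self.backing[location] = value
        else:
            # Write back: hold the modification in the cache until flushed.
            self.dirty.add(location)

    def flush(self):
        # Write every modified location to the backing store.
        for location in self.dirty:
            self.backing[location] = self.data[location]
        self.dirty.clear()

wt_store, wb_store = {}, {}
wt = WriteCache(wt_store, write_through=True)
wb = WriteCache(wb_store, write_through=False)
wt.write(1, "x")
wb.write(1, "x")
wb_before_flush = dict(wb_store)  # still empty: the write is held in the cache
wb.flush()
```

This sketch flushes only on request; in a write-back cache, eviction of a modified entry would trigger the same write to the backing store.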
The greatest performance advantage, at least in terms of speed, would of course occur if the cache (to include, depending on the system, any hierarchical levels) were large enough to hold the entire contents of memory (and/or disk, etc.), or at least the portion one wants to use the cache for, since then cache misses would rarely if ever occur. In systems where the contents of the hard disk are cached as well, to be able to cache everything would require a generally unrealistic cache size. Moreover, since far from all memory locations are accessed often enough that caching them gives a performance advantage, to implement such a large cache would be inefficient. Such theoretical possibilities aside, in most modern systems the CPU cache will be much smaller than memory, and smaller still than a hard disk; other caches such as server-side flash storage or RAM-backed caches may be larger, but they will also be slower.
On the other hand, if the cache is too small to contain the frequently accessed memory or other storage locations, then performance will suffer from the increase in cache misses. In extreme cases, having a cache that is far too small may cause more overhead than whatever performance advantage it provides, for a net loss of performance.
The cache is therefore a limited resource that should be managed properly to maximize the performance advantage it can provide. This becomes increasingly important as the number of software entities that a CPU (regardless of the number of cores) or multiprocessor system must support increases. One common example would be many applications loaded and running at the same time: the more that are running, the more pressure there is likely to be on the cache. Of course, some software entities can be much more complicated than others, such as a group of virtual machines running on a system-level hypervisor, all sharing the same cache. As with other hardware resources, either a human or an automatic administrator should therefore preferably apply some policy that allocates the cache resource as efficiently as possible, enforces some preference among entities, etc. This task becomes even more complicated in hosted or “cloud computing” environments, where many physically and/or logically isolated client systems share the memory and storage subsystems of one or a cluster of servers (such as network attached storage servers), storage area networks, etc., with each client system expecting or needing at least some minimum quality of service level. In many cases, client systems may be virtual machines that must be instantiated or loaded and managed, and that can change in number and workload dynamically.
There are, accordingly, many existing and proposed systems that attempt to optimize, in some sense, the allocation of cache space among several entities that might benefit from it. Note the word “might”: Even if an entity were exclusively allocated the entire cache, this does not ensure a great improvement in performance even for that entity, since the performance improvement is a function of how often there are cache hits, not of available cache space alone. In other words, generous cache allocation to an entity that addresses memory in such a way that there is a high proportion of misses and therefore underutilizes the cache may be far from efficient and cause other entities to lose out on performance improvements unnecessarily. Key to optimizing cache allocation—especially in a dynamic computing environment—is the ability to determine the relative frequencies of cache hits and misses.
FIG. 1 illustrates qualitatively a typical “miss ratio curve” (MRC) which is often used to represent cache performance. By convention, an MRC is plotted with the cache size on the X-axis, and the cache miss ratio (i.e., misses/(hits+misses)) on the Y-axis. In the region marked “A” in FIG. 1, the cache is so small that it has a high rate of misses; in this region, the performance loss of handling cache misses could even outweigh any gains achieved for the relatively few cache hits. In the region marked “C”, however, the cache is so large that even an increase in its size will bring negligible reduction in cache misses—the cache effectively includes the entire memory region that is ever accessed. In most implementations, at any given moment of execution, the preferred choice in the trade-off between performance and cache size will normally lie somewhere in the region marked “B”. In some cache partitioning and allocation schemes (see, for example, U.S. Pat. No. 7,107,403, Modha, et al., “System and method for dynamically allocating cache space among different workload classes that can have different quality of service (QoS) requirements where the system and method may maintain a history of recently evicted pages for each class and may determine a future cache size for the class based on the history and the QoS requirements”), even the slope of the MRC is used to help determine the optimal partitioning and allocation.
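To make the miss-ratio definition concrete, the following sketch simulates an LRU cache over a recorded reference trace at several cache sizes and reports misses/(hits+misses) for each. It is an offline illustration under stated assumptions (a non-empty trace, a cache size of at least one, LRU replacement); the function name `lru_miss_ratio` and the toy trace are hypothetical and are not taken from FIG. 1.

```python
from collections import OrderedDict

def lru_miss_ratio(trace, cache_size):
    """Miss ratio, misses / (hits + misses), of an LRU cache over the trace."""
    cache = OrderedDict()  # ordered from least to most recently used
    misses = 0
    for ref in trace:
        if ref in cache:
            cache.move_to_end(ref)         # hit: refresh recency
        else:
            misses += 1                    # miss: insert, evicting if full
            if len(cache) >= cache_size:
                cache.popitem(last=False)
            cache[ref] = None
    return misses / len(trace)

# A toy trace that cycles over three locations and then touches a fourth.
trace = [0, 1, 2, 0, 1, 2, 0, 1, 2, 3]
mrc = {size: lru_miss_ratio(trace, size) for size in (1, 2, 3, 4)}
```

For this toy trace, sizes 1 and 2 miss on every reference, comparable to region “A”, while growing the cache from 3 to 4 entries brings no further reduction, comparable to region “C”.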
A miss ratio curve (MRC) thus summarizes the effectiveness of caching for a given workload. A human administrator or an automated program can then use MRC data to optimize the allocation of cache space to workloads in order to achieve aggregate performance goals, or to perform cost-benefit tradeoffs related to the performance of competing workloads of varying importance. Note that in some cases, a workload will not be a good caching candidate, such that it may be more efficient simply to bypass the caching operations for its memory/storage accesses. The issue then becomes how to construct the MRC.
It would be far too costly in terms of processing cycles to check every memory access request to test if it leads to a cache hit or a cache miss and to construct the MRC based on the results. Especially in a highly dynamic computing environment with many different entities vying for maximum performance, exhaustive testing could consume more time than the cache itself saves. Different forms of sampling or other heuristics are therefore usually implemented. For example, using temporal sampling, one could check for a hit or miss every n microseconds or processing cycles, or at random times. Using spatial sampling, some deterministically or randomly determined subset of the addressable memory space is traced and checked for cache hits and misses.
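The two sampling styles may be contrasted with the following sketch. It operates on a recorded list of references rather than on live accesses, and it approximates "every n microseconds or cycles" by "every n-th reference"; both simplifications, along with the function names and the `seed` parameter, are assumptions made for this example.

```python
import random

def temporal_sample(trace, n):
    """Keep every n-th reference, a count-based stand-in for time-based sampling."""
    return trace[::n]

def spatial_sample(trace, address_space, fraction, seed=0):
    """Keep only references to a randomly chosen subset of locations."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    locations = list(address_space)
    k = max(1, int(len(locations) * fraction))
    sampled = set(rng.sample(locations, k))
    return [ref for ref in trace if ref in sampled], sampled

trace = [i % 10 for i in range(100)]  # toy trace over 10 locations
by_time = temporal_sample(trace, 5)
by_space, chosen = spatial_sample(trace, range(10), fraction=0.3)
```

Note that the spatial variant keeps every reference to a chosen location, so reuse of those locations remains observable, whereas the temporal variant drops intervening references regardless of location.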
Many existing MRC construction techniques are based on Mattson's Stack Algorithm, described, for example, in R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. “Evaluation Techniques for Storage Hierarchies”, IBM Systems Journal, Volume 9, Issue 2, 1970. The Mattson Stack Algorithm maintains an LRU-ordered stack of references and yields a histogram of stack distances (also known as reuse distances) from which an MRC can be generated directly. Unfortunately, maintaining and updating the associated data structures is expensive in terms of both time and memory space, even when efficient data structures (such as hash tables and balanced trees) are employed.
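A deliberately naive sketch of the stack-distance idea is given below. It uses a plain Python list as the LRU stack, which makes each reference cost time proportional to the stack depth; it is meant only to show how one pass yields a histogram from which miss ratios for all cache sizes follow, not to reproduce the cited paper's implementation. The function names are hypothetical.

```python
from collections import Counter

def stack_distances(trace):
    """One pass over the trace; returns (histogram of stack distances, cold misses)."""
    stack = []             # LRU-ordered: most recently used reference at the end
    histogram = Counter()
    cold_misses = 0
    for ref in trace:
        if ref in stack:
            # Distance 1 means the reference was already on top of the stack.
            distance = len(stack) - stack.index(ref)
            histogram[distance] += 1
            stack.remove(ref)
        else:
            cold_misses += 1  # first reference: misses at every cache size
        stack.append(ref)
    return histogram, cold_misses

def mrc_from_histogram(histogram, cold_misses, max_size):
    """An LRU cache of a given size hits exactly the references whose distance <= size."""
    total = sum(histogram.values()) + cold_misses
    mrc, hits = {}, 0
    for size in range(1, max_size + 1):
        hits += histogram.get(size, 0)
        mrc[size] = (total - hits) / total
    return mrc

trace = [0, 1, 2, 0, 1, 2, 0, 1, 2, 3]
hist, cold = stack_distances(trace)
curve = mrc_from_histogram(hist, cold, max_size=4)
```

The benefit over simulating each cache size separately is that a single pass produces the whole curve; the cost of the `stack.index` and `stack.remove` calls in this sketch is the kind of overhead that more efficient data structures are intended to reduce.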
Spatial sampling has been proposed in the prior art to reduce the cost of MRC construction, essentially running Mattson's Stack Algorithm over the subset of references that access sampled locations. For example, according to the method disclosed in U.S. Pat. No. 8,694,728 (Waldspurger et al., “Efficient Online Construction of Miss Rate Curves”), a set of pages is selected randomly within a fixed-size main-memory region to generate MRCs for guest-physical memory associated with virtual machines. Earlier computer architecture research by Qureshi and Patt on utility-based cache partitioning (Moinuddin K. Qureshi and Yale N. Patt. “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches”, in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 39), December 2006) proposed adding novel hardware to processor caches, in order to sample memory accesses to a subset of cache indices.
For some applications of MRCs, however, randomly selecting a subset of locations to sample is challenging. In many cases, for example, such as those involving accesses to I/O devices, the entire set of locations from which the sample must be drawn may not be known until after the workload has completed. In other cases, even if the complete set of locations is known up-front, it may span an extremely large range, of which only a small fraction may be accessed by the workload, so that storing even the reduced set of sampled locations may still prove very inefficient. Furthermore, the skewed nature of I/O access patterns can cause pre-selection of random samples from a large storage address space to yield inaccurate results. In some cases, a stratified sampling approach can help characterize the space by first dividing it into subgroups. For example, Kodakara et al., in Sreekumar V. Kodakara, Jinpyo Kim, David J. Lilja, Wei-Chung Hsu and Pen-Chung Yew. “Analysis of Statistical Sampling in Microarchitecture Simulation: Metric, Methodology and Program Characterization”, in Proceedings of the 10th IEEE International Symposium on Workload Characterization (IISWC '07), September 2007, proposed a stratified sampling approach for processor microarchitecture simulation with a set of benchmarks, using a time-based division of program execution into distinct phases, which are each sampled.
While such techniques can be effective in some cases, they do not work well when access patterns are irregular or non-stationary, resulting in large sampling errors and inaccurate simulation results. An approach that requires neither prior information about workloads nor the ability to analyze or classify program phases is therefore desirable. Moreover, the cost of stratified sampling would be prohibitively high for any inline processing involving a large storage address space.
Unless exhaustive testing is implemented, evaluating cache performance using miss-ratio or (equivalently) hit-ratio statistics requires an administrator or automatic software module to choose which memory (or disk or other storage) accesses to test for cache hits (or misses); in other words, the universe of memory/disk accesses must be sampled.