1. Field of the Invention
The present invention generally relates to caching in a storage hierarchy, and more particularly to a method and system for dynamic allocation of disjoint storage spaces in a shared cache.
2. Description of the Related Art
Caches are used within storage hierarchies to reduce access latencies. Each level of a storage hierarchy has a size and a performance penalty associated with accessing it. Typically, levels with more storage have a higher performance penalty. For example, the traditional memory storage hierarchy of a computer (excluding caches) consists of a small number of processor registers, solid state memory, magnetic disks, and magnetic tapes.
Caches can be inserted into a storage hierarchy to improve performance by taking advantage of locality of reference. A cache has a lower performance penalty than the storage layers below it, and it stores the most recently referenced items from the hierarchy below.
FIG. 1 shows how a set-associative cache 10 maps repeatedly onto a large store 11 lower in the storage hierarchy. The large store 11 is divided into fixed-size lines 12. As is known, the xe2x80x9clinesxe2x80x9d 12 are the smallest unit stored into the cache. Each cache set stores a small, finite number of lines from the large store at any given time, and each line has a fixed set 13 within which it can be stored in the cache. It is noted that a direct-mapped cache is simply a one-way (i.e., one line per set) set-associative cache.
FIG. 2 shows how an address is interpreted by a conventional set-associative cache that contains 2n sets organized as a linear array. An address request has a tag 21 and an index 22.
The xe2x80x9cindexxe2x80x9d 22 indicates which set 13 (e.g., as shown in FIG. 1) will field the cache reference, and the xe2x80x9ctagxe2x80x9d 21 is used to determine whether the addressed line is stored within one of the lines of the cache set. This address breakdown causes the 2n cache sets (e.g., where xe2x80x9cnxe2x80x9d is an integer) to map repeatedly onto 2n+m lines of storage lower in the storage hierarchy.
Because caches are often comprised of a small amount of fast storage, they are often expensive and it can be advantageous to share them. For example, disk caches and file caches are typically shared by multiple clients.
Shared-memory multiprocessors consist of a plurality of processors sharing a single memory hierarchy. As part of the memory hierarchy, each processor typically has one or more levels of cache that are not shared with the other processors. Additionally, it can be advantageous to insert a shared cache into the memory hierarchy between the processor caches and the main memory. A large shared cache is especially advantageous for alleviating access penalties in non-uniform memory access (NUMA) systems. For purposes of this application, NUMA systems are defined as scalable multiprocessors with distributed memory, where each processor""s memory access latency varies depending upon where the accessed memory is located within the system.
Sometimes it is necessary to partition a shared-memory multiprocessors so that the single system behaves as if it were multiple, independent, and isolated systems. Each partition consists of a subset of the processors and a disjoint portion of the main memory. It is noted that for purposes of the present application, xe2x80x9cdisjointxe2x80x9d is defined as separate address spaces having no overlap therebetween.
Because the processor partitions operate independently, they can generate the same addresses that actually refer to different memory locations. This is problematic for a shared cache when the actual partitioning of main memory occurs below the level of the cache. That is, since processor partitions use the same addresses to refer to different memory locations, destructive collisions (e.g., logical interference) can occur in the shared cache.
To avoid such collisions, the shared cache can be partitioned. However, a mechanism is required for uniquely identifying items cached by each of the processor partitions sharing the cache. The cache must prevent items from one partition from overwriting those of another, and it must return the correct items from each partition when it is read. Hitherto the invention, the conventional systems have not provided such a mechanism.
The performance of caches is critical to the overall performance of a storage hierarchy because a cache that is effectively capturing references to the hierarchy below it is accessed often. Thus, there is a need for a cache partitioning mechanism that adds little or no latency to cache accesses. Once again, hitherto the invention such a mechanism has not been provided.
Because the memory needs of multiprocessor partitions vary depending on their workload, the amount of cache space they need also varies dynamically. Therefore, a mechanism is needed for creating cache partitions of unequal sizes. Such has not been possible in the conventional systems and methods.
Furthermore, sometimes it is necessary to guarantee a processor partition some amount of cache storage space to achieve a desired level of system performance. Therefore, a mechanism may be needed to specify the amount of cache space allocated to a partition. When requirements change, it can be desirable to re-partition the shared cache. Hitherto the invention, the conventional systems have not provided a mechanism for dynamically adjusting the number of cache partitions or the amount of space allocated to each partition.
In general, each processor partition should be able to utilize some portion of the shared cache in order to obtain the benefits of caching. Thus, there may be a need for a mechanism that maintains fairness of cache space usage between partitions. Once again, the conventional systems have not provided such a structure and method.
Thus, in the conventional multiprocessor system using a shared-cache memory, partitioning the multiprocessor system with the processors having separate address spaces, a problem occurs in caching the same address to different address spaces. That is, the shared cache is unable to distinguish between multiple address spaces.
Hitherto the invention, neither have such problems been recognized by those of ordinary skill in the art nor have such mechanisms been provided for solving such problems.
In view of the foregoing and other problems of the conventional methods and structures, an object of the present invention is to provide a method and structure for dynamically partitioning a shared-cache.
Another object of the present invention is to provide a system for partitioning the storage of a shared cache into disjoint portions when cache sharing is not desirable.
In a first aspect of the present invention, a cache memory is provided which is shared among a plurality of separate, disjoint entities having disjoint address spaces. The cache memory includes a segregator for dynamically segregating the storage space allocated to each entity of the entities, based on a contents of an access request, such that no interference occurs with respective ones of the entities.
In a first embodiment of the above-mentioned first aspect of the present invention, the tags of a conventional set-associative cache are extended to include a xe2x80x9cpartition identifierxe2x80x9d (PID) that is unique for each partition within the cache. Whenever the cache is accessed, a PID is provided and included as part of the tag, both for the purpose of tag storage and tag comparison.
The PID tag extension does not guarantee fairness of cache storage space to the partitions, nor does it allow a partition to reserve a portion of the cache for its own use. Rather, partitions compete for storage, and the portion of the cache occupied by any given partition varies dynamically. Therefore, there is no need for a mechanism to allocate cache space to partitions.
This cache partitioning mechanism has several advantages. First, the competition for cache lines between partitions is expected to result in a steady state where each partition has a portion of the cache proportional to its access frequency, and hence, need. This mechanism adjusts dynamically to the cache resource needs of the partitions.
Second, this cache partitioning mechanism has a negligible impact on cache performance because the only significant addition to the conventional cache implementation is a minor extension in the width of the tag comparators.
Since the individual bit comparisons are performed in parallel, the only increase in comparison latency could occur as the results of the bit comparisons are gathered. Since this is commonly implemented using wired-OR logic, any latency increase would be caused by the minor increase in the capacitive load of the wired-OR connection caused by the additional bit comparators. This is likely to be negligible compared with the overall cache access latency.
Third, this cache partitioning mechanism requires very little hardware to be added to a conventional set-associative cache implementation. The tag store must be expanded to include storage for a PID with each tag, but the PID is expected to be only a small number of bits since the number of partitions is expected to be small. In addition, the cache tag comparators must be extended to include a small number of bit comparators for the PID bits.
A second embodiment of the present invention allows the cache sets to be partitioned into disjoint groups so that each of the groups functions like an independent set-associative cache.
This embodiment provides such a function by manipulating the index portion of the cache access address so that a partition is restricted to use a subset of the cache sets. Whenever the cache is accessed, a partition selector and a partition selector mask are provided and used to fix some of the index bits in order to restrict the addressability of the index. This has the effect of basically dividing the set-associative cache into smaller, independent set-associative caches. Any index bits that are replaced by fixed bits are treated as part of the tag.
The cache partitioning mechanism of the second embodiment has several advantages. First, dividing by cache sets allows each partition to enjoy the full associativity of the cache since partitions do not compete for individual cache lines. Many studies have shown the performance advantages of set associativity.
Second, this cache partitioning mechanism has a negligible impact on cache performance when compared with a conventional set-associative cache implementation. There is a small logic delay required to manipulate a few bits of the cache index, and the width of the tag comparators must be extended to include any bits of the index that are modified. The logic delay is a result of replacing specified index bits with bits from the partition selector, and is accomplished with parallel 1-bit multiplexors that incur at most two gate delays. There are two gate delays because a 1-bit multiplexor can always be constructed by performing a logical OR of two logical AND functions, as known. The extension of the tag comparators is accomplished by the same wired-OR mechanism described above.
Third, this cache partitioning mechanism requires very little hardware to be added to a conventional set-associative cache implementation. The tag store must be expanded to include storage for the displaced index bits with each tag, but only a small number of bits is expected to be displaced since the number of partitions is expected to be small. In addition, the cache tag comparators must be extended to include a small number of bit comparators for the displaced index bits. Finally, a small number of 1-bit multiplexors are required to fix specified index bits.
Fourth, this mechanism guarantees cache storage to every partition because the storage is allocated by the partition selectors and partition selector masks. There is no competition for shared space between partitions. Because the mechanism allows the cache partition sizes to vary, appropriately sized partitions can be allocated to provide fairness of storage to all partitions.
Fifth, this mechanism allows the amount of cache storage allocated to partitions to be varied dynamically because the allocation is entirely determined by the partition selector and partition selector mask provided with each cache access. Therefore, the partitioning can easily be adjusted to meet resource needs.
Finally, this mechanism allows cache partitions to be shared because access to a partition is entirely determined by the partition selector and partition selector mask provided with each access.
The present invention also includes a multiprocessor system incorporating the above-described cache memory of the first or second embodiments, a method of dynamically partitioning the shared-cache memory, and a signal bearing medium storing a program embodying the method according to the present invention.