Technical Field
Methods and example implementations described herein are generally directed to hardware systems, and more specifically, to management of resources in a hardware system.
Related Art
In related art computer systems, instructions and data were stored in and fetched from a main storage, requiring a memory management system for execution or use by a central processing unit, or possibly by some special function unit, such as a floating-point processor. In some systems, instructions and data may be retained after their use in a cache memory, which can be accessed more quickly than the main storage. As a result, such instructions and data can be reused later in the execution of the same program. This related art scheme improves execution performance of computer systems by reducing the time taken to fetch the instructions and data for processing by the central processing unit.
In related art computer systems that have cache memories, the number of cycles required to retrieve an instruction or a data item depends on whether the item is already in the cache, and on how many instructions are required to address or retrieve it. If the item is not in the cache (i.e., a “cache miss”), the instruction or data item must be fetched from main memory, which consumes some number of instruction cycles. If the item is in the cache, some instruction cycles will also be consumed, although fewer than in the case of a cache miss. Nevertheless, any improvement that can be made in processing of cached data and instructions is useful. In certain circumstances, such an improvement may make a considerable difference to the processing performance of the system.
FIG. 1(a) and FIG. 1(b) illustrate cache memory architectures 100 and 110 respectively, showing placement of cache memory in the hardware layout. As illustrated in FIG. 1(a), cache memory 104 is positioned between CPU 102 and main memory 106. Access to a data block from the cache 104 is much faster than access to the same data block from the main memory 106. Similarly, FIG. 1(b) illustrates multiple caches 114, 116, and 118 configured between the CPU 112 and main memory 120.
In most related art, caching techniques have a fundamental tradeoff between cache latency and hit rate, wherein larger caches have better hit rates but longer latency. To address this tradeoff, many computers use multiple levels of cache, with small fast caches backed up by larger, slower caches. Multi-level caches generally operate by checking the smallest level 1 (L1) cache 114 first. If a hit occurs in L1, the processor proceeds at high speed. If the smaller cache misses, the next larger cache 116 (L2) is checked, and so on to L3 caches such as 118, before external/main memory 120 is checked.
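The multi-level lookup described above may be sketched as follows. This is a minimal illustration only: the per-level latencies, the main memory latency, and the names `LEVELS` and `lookup` are assumptions chosen for the example and do not reflect any particular hardware.

```python
# Hypothetical access latencies in cycles; real values depend on the hardware.
LEVELS = [("L1", 4), ("L2", 12), ("L3", 40)]
MAIN_MEMORY_LATENCY = 200


def lookup(address, caches):
    """Check each cache level in order, smallest first.

    `caches` maps a level name to the set of addresses that level holds.
    Returns (where the data was found, total cycles spent searching).
    """
    cycles = 0
    for name, latency in LEVELS:
        cycles += latency  # every level that is checked adds its latency
        if address in caches[name]:
            return name, cycles
    # Missed in every cache level, so main memory is accessed last.
    return "main memory", cycles + MAIN_MEMORY_LATENCY


caches = {"L1": {0x10}, "L2": {0x10, 0x20}, "L3": {0x10, 0x20, 0x30}}
for addr in (0x10, 0x20, 0x30, 0x40):
    print(hex(addr), lookup(addr, caches))
```

Under these assumed latencies, a hit in L1 costs only the L1 latency, while an access that misses every level pays for each check in addition to the main memory access.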
FIG. 2(a) illustrates structural layout of cache memory 200. As is illustrated, the cache memory 200 comprises multiple blocks, each having a length of K words. Each block is also associated with a tag that identifies the block being stored. The tag is usually the upper portion of the memory address. As illustrated, the cache memory 200 comprises C blocks, which is far fewer than the number of blocks, say M, of the main memory. FIG. 2(b) illustrates architectural layout of interactions 250 between cache memory 254, processor 252, and system bus 260 through address buffer 256 and data buffer 258. As represented, processor 252 sends address level instructions to the cache to identify the location of the data block that is to be fetched, along with issuing data requests to the cache 254. Address information paths are provided between the CPU 252, cache 254, and address buffer 256, whereas data information paths are provided between CPU 252, cache 254, and data buffer 258. The cache 254, address buffer 256, and the data buffer 258 all interact with the system bus 260 to receive data blocks and interact with the main memory (not shown).
Typically, a cache is divided into a number of sets of lines, wherein each set comprises a fixed number of lines. A data block from main memory can be configured to map to any line in a given set determined by the respective block address. For instance, if the cache has “m” lines organized as “v” sets of “k” lines each, then k=m/v. In such a case, main memory block number “j” can be placed in set “i” based on the equation i=j modulo v.
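The two relations above, k=m/v and i=j modulo v, may be worked through with a short sketch. The function names `lines_per_set` and `set_index` are hypothetical and are used only for illustration.

```python
def lines_per_set(m, v):
    """Return k, the number of lines per set, for m lines and v sets."""
    if v <= 0 or m % v != 0:
        raise ValueError("m must be a positive multiple of v")
    return m // v


def set_index(j, v):
    """Return i, the set that main memory block j maps to (i = j modulo v)."""
    return j % v


# Example: a cache of m = 8 lines organized as v = 4 sets.
print(lines_per_set(8, 4))  # k = 2 lines per set
print(set_index(13, 4))     # block 13 maps to set 1
```

In this example, blocks 1, 5, 9, and 13 all map to set 1 and therefore compete for its two lines.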
Improvements in cache memory performance have been sought using various methods of linking and associating groups of cache lines so as to form a policy that is configured to decide where in the cache a copy of a particular entry of main memory will go. If the policy is free to choose any entry in the cache to hold the copy, the cache is called “fully associative”. At the other extreme, if each entry in main memory can go in just one place in the cache, the cache is “direct mapped”. Many caches implement a compromise in which each entry in main memory can go to any one of N places in the cache, and are described as “N-way set associative”. For instance, in a 2-way set associative cache, any particular location in main memory can be cached in either of 2 locations in a data cache. Similarly, in a 4-way set associative cache, any particular location in main memory can be cached in any of 4 locations in a data cache. Multiple algorithms can be used for determining the location in which the data block can be stored.
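The range from direct mapped to fully associative may be illustrated by listing the lines in which a given block is allowed to reside. The sketch below assumes that the set is selected by taking the block number modulo the number of sets and that the lines of a set are numbered contiguously. Both are illustrative assumptions, and `candidate_lines` is a hypothetical name.

```python
def candidate_lines(block, num_lines, ways):
    """Return the cache line numbers in which `block` may be placed.

    ways == 1         -> direct mapped (exactly one candidate line)
    ways == num_lines -> fully associative (every line is a candidate)
    otherwise         -> N-way set associative with N == ways
    """
    if num_lines % ways != 0:
        raise ValueError("num_lines must be a multiple of ways")
    num_sets = num_lines // ways
    selected_set = block % num_sets
    first_line = selected_set * ways
    return list(range(first_line, first_line + ways))


for ways in (1, 2, 4, 8):
    print(ways, candidate_lines(13, num_lines=8, ways=ways))
```

For an 8-line cache, block 13 has one candidate line when direct mapped, two when 2-way set associative, four when 4-way set associative, and all eight when fully associative.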
Indexing in a cache design refers to a method of storing each address in a subset of the cache structure. A common related art mechanism involves using low-order address bits to determine the entry, or the set of entries, in which the data block can be stored. By restricting addresses to a very small set of entries, there is a possibility that the most useful data (usually the most recently used data) may all map to the same set of entries. Such a mapping would limit the effectiveness of the cache by utilizing only a subset of the entire structure. For indexed caches to work effectively, the addresses needed by a program at any particular time need to be spread across all of the sets of the cache. Addresses spread across the cache allow full use of the lines in the cache. Most programs naturally have a good distribution of addresses to sets, which is one reason caches work well in general.
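The effect of low-order-bit indexing on different access patterns may be illustrated as follows. The block size (64 bytes) and the number of sets (16) are assumed values, and `index_of` and `sets_used` are hypothetical names.

```python
from collections import Counter

BLOCK_BITS = 6  # assumed 64-byte blocks: the low 6 bits select a byte in the block
INDEX_BITS = 4  # assumed 16 sets: the next 4 bits select the set


def index_of(address):
    """Select the set using the low-order bits of the block address."""
    return (address >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)


def sets_used(addresses):
    """Count how many of the given addresses fall in each set."""
    return Counter(index_of(a) for a in addresses)


sequential = [i * 64 for i in range(16)]  # 16 consecutive blocks
strided = [i * 1024 for i in range(16)]   # stride of exactly 16 blocks

print(len(sets_used(sequential)))  # number of distinct sets touched
print(len(sets_used(strided)))
```

With these assumed parameters, the sequential pattern touches all 16 sets, whereas the strided pattern places every address in set 0 and leaves the rest of the cache unused.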
A cache miss refers to a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency. A cache read miss from an instruction cache generally causes the most delay, because the processor, or at least the thread of execution, has to wait (e.g., stall) until the instruction is fetched from main memory. A cache read miss from a data cache, on the other hand, usually causes less delay, because instructions not dependent on the cache read can be issued and continue execution until the data is returned from main memory, at which point the dependent instructions can resume execution. A cache write miss to a data cache generally causes the least delay, because the write can be queued and there are few limitations on the execution of subsequent instructions. The processor can continue until the queue is full.
Lowering the cache miss rate is a major area of focus. Therefore, a great deal of analysis has been done on cache behavior in an attempt to find the best combination of size, associativity, block size, and so on. There can be multiple kinds of cache misses, which can impact the cache and processing performance in different ways. For instance, compulsory misses are those misses that are caused by the first reference to a location in memory. Cache size and associativity make no difference to the number of compulsory misses, but prefetching data can help here, as can larger cache block sizes. Capacity misses are those misses that occur regardless of associativity or block size of the cache memory, solely due to the finite size of the cache. Conflict misses, on the other hand, are misses that could have been avoided had the cache not evicted an entry earlier. Conflict misses can be further broken down into mapping misses, which are unavoidable given a particular amount of associativity, and replacement misses, which are due to the particular victim choice of the replacement policy.
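The three miss categories above may be separated in a simulation. The sketch below follows one common convention, adopted here as an assumption rather than as the method of any particular system: a miss to a never-referenced block is compulsory, a miss that would also occur in a fully associative LRU cache of the same capacity is a capacity miss, and any remaining miss is a conflict miss. The further split of conflict misses into mapping and replacement misses is not modeled, and all names are hypothetical.

```python
from collections import OrderedDict


def classify_misses(trace, num_lines, ways):
    """Count hits and miss types for a trace of block numbers."""
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]  # cache under study, LRU per set
    full = OrderedDict()  # reference model: fully associative LRU, same capacity
    seen = set()          # every block referenced so far
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for block in trace:
        # Update the fully associative reference model first.
        full_hit = block in full
        if full_hit:
            full.move_to_end(block)
        else:
            if len(full) == num_lines:
                full.popitem(last=False)  # evict least recently used
            full[block] = True

        # Then access the cache under study.
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)
            counts["hit"] += 1
        else:
            if len(lines) == ways:
                lines.popitem(last=False)
            lines[block] = True
            if block not in seen:
                counts["compulsory"] += 1
            elif full_hit:
                counts["conflict"] += 1  # only the set mapping caused this miss
            else:
                counts["capacity"] += 1
        seen.add(block)
    return counts


print(classify_misses([0, 4, 0, 4], num_lines=4, ways=1))
print(classify_misses([0, 4, 0, 4], num_lines=4, ways=4))
```

In the first call, blocks 0 and 4 share a set in the direct-mapped cache, so the repeated references are counted as conflict misses. In the second call the same trace hits in the fully associative cache.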
While the natural address distribution in programs is generally acceptable, cache performance is often limited by inadequate distribution. Some critical code sequences may concentrate activity in a particular set, which results in new lines replacing other lines that are still useful. If a program tries to access the replaced lines, the access will result in a cache miss and performance will be reduced while the processor waits for the cache to be refilled. As explained above, these cache misses are referred to as conflict misses. The cache itself may be large enough to store all of the useful lines, but limitations due to indexing force useful lines out of the cache even though there are less useful lines elsewhere in the cache.
There are a few methods of reducing the problem of conflict misses. One way is to allow each address to go to multiple locations (set-associative). This method allows hardware to choose among several possible lines in the cache to evict. Performance can be improved by carefully selecting which line to replace, making sure the least useful address is replaced. A different approach to reducing conflict misses is to improve upon the natural distribution of addresses across sets. Using low-order bits provides a good distribution, but some patterns may exist that lead to a poorer distribution and more conflicts. These patterns can happen because programs are written by people and compiled in a non-random manner.
To improve distribution, an index hash can be used. Hashing involves manipulating the address in such a way that any natural pattern is less likely. Hashing can be implemented by means of a hash table that uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found. Hash functions typically introduce randomness in the placement of data blocks, for example by calculating the index by XORing high-order bits with low-order bits. Usage of hash tables is therefore one way to “randomize” the placement of data blocks, which can lead to a more even distribution.
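The XOR-based index hash mentioned above may be sketched as follows. The block size, the number of sets, and the choice of which high-order bits are folded into the index are illustrative assumptions, and the function names are hypothetical.

```python
BLOCK_BITS = 6  # assumed 64-byte blocks
INDEX_BITS = 4  # assumed 16 sets
MASK = (1 << INDEX_BITS) - 1


def plain_index(address):
    """Index using only the low-order bits of the block address."""
    return (address >> BLOCK_BITS) & MASK


def hashed_index(address):
    """Index formed by XORing higher-order bits with the low-order bits."""
    block = address >> BLOCK_BITS
    low = block & MASK
    high = (block >> INDEX_BITS) & MASK
    return low ^ high


# A stride of 1024 bytes (16 blocks) keeps the low-order index bits constant.
strided = [i * 1024 for i in range(16)]

print(len({plain_index(a) for a in strided}))   # distinct sets without hashing
print(len({hashed_index(a) for a in strided}))  # distinct sets with hashing
```

For this particular stride, the plain index sends every address to a single set, while the hashed index spreads the same addresses over all 16 sets. Other access patterns may behave differently, so the result should not be read as a general guarantee.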
In order to make room for storing additional blocks (e.g., data or instructions copied from the storage device or the memory device), each cache may have a replacement policy that enables the cache to determine when to evict (e.g., remove) particular blocks from the cache. Multiple replacement policies exist for deciding the position into which the new data block is loaded. A random replacement policy, for instance, places the new data block in any set/block of the cache memory, but may increase the miss rate, as high priority data blocks may be made to leave the cache in such a process. Other policies can include first in, first out (FIFO), which makes the oldest block exit from the cache. Least recently used (LRU) is yet another technique used for block replacement.
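The difference between the FIFO and LRU policies may be illustrated with a small fully associative cache model. This is a sketch under assumed parameters: the trace and capacity are arbitrary, `count_misses` is a hypothetical name, and random replacement is omitted so that the result is deterministic.

```python
from collections import OrderedDict


def count_misses(trace, capacity, policy):
    """Count misses for a trace of block numbers under "FIFO" or "LRU"."""
    if policy not in ("FIFO", "LRU"):
        raise ValueError("policy must be 'FIFO' or 'LRU'")
    cache = OrderedDict()  # ordered from next victim (front) to safest (back)
    misses = 0
    for block in trace:
        if block in cache:
            if policy == "LRU":
                cache.move_to_end(block)  # a hit refreshes recency under LRU only
            continue
        misses += 1
        if len(cache) == capacity:
            cache.popitem(last=False)  # evict the block at the front
        cache[block] = True
    return misses


trace = [1, 2, 3, 1, 4, 1, 2]
print(count_misses(trace, capacity=3, policy="FIFO"))
print(count_misses(trace, capacity=3, policy="LRU"))
```

On this trace, FIFO evicts block 1 even though it was just reused, which costs one more miss than LRU. Neither policy is better on every trace.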
Shared-memory multiprocessors have been widely applied in high performance computing and continue to become more relevant in the age of large multicore systems on chip (SoC). Address space is typically shared among multiprocessors so that they can communicate with each other through that single address space. In such architectures, sharing of data may result in copies of the same cache block residing in multiple caches. This does not affect read operations. However, during a write operation, when one processor writes to one location, the change has to be propagated to all caches. Most cache coherency protocols have a shared state in which data can be shared between any number of system components (e.g., processors). Such a shared (S) state arises when a system component requests a read-only copy of the data and the data was already in an Exclusive (E) state in another system component.
Each of the requesting system component and the system component that had a copy of the data can mark the data in shared state. When data is in the shared state, it can be freely copied by the system components by requesting a read-only copy of the data. In a system, cache coherency protocols can either permit a system component to provide the shared data to a requesting system component or the data can be retrieved from the coherency maintenance data structure directly.
In directory-based cache coherency systems, cache line addresses being shared by agents in the system are tracked in a common directory that maintains coherence information between agent caches. Such a directory acts as a filter through which a processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed, the directory either updates or invalidates the other caches with that entry. A cache coherence protocol uses data structures and messaging to track and co-ordinate locations of all cached copies of every block of shared data. These data structures can be centralized or distributed and are called directories. For each block of data there is a directory entry that contains a number of pointers, which are configured to indicate system agent(s) where block copies are located and, as a result, keep track of the cached copies of the data block.
When the number of sharer agents in a system is large, maintaining a bit vector for the sharers is more efficient than binary pointers for each sharing agent. Each directory entry also contains a dirty bit to specify whether a unique cache has permission to write the associated block of data. In implementation, a cache miss results in communication between the node where the cache miss occurs and the directory so that the information in the affected caches is updated. A coherency protocol is a set of mechanisms to maintain coherence between the caches in a system. It defines the states of the cache lines in the system, the conditions for and transitions between those states, and the operations and communications performed during coherent read and write requests. MSI is an example of a coherence protocol employed to maintain coherence in a multi-processor system. The letters M (Modified), S (Shared), and I (Invalid) in the protocol name identify the possible states of a cache line as specified by the protocol.
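The MSI states named above may be illustrated with a simplified transition table for a single cache line. The event names (`local_read`, `local_write`, `remote_read`, `remote_write`) and the function `access` are hypothetical labels introduced for this sketch. Write-backs and the messages exchanged between agents are not modeled.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"


# (current state, observed event) -> next state, for one cache's copy of a line.
TRANSITIONS = {
    (State.INVALID, "local_read"): State.SHARED,
    (State.INVALID, "local_write"): State.MODIFIED,
    (State.INVALID, "remote_read"): State.INVALID,
    (State.INVALID, "remote_write"): State.INVALID,
    (State.SHARED, "local_read"): State.SHARED,
    (State.SHARED, "local_write"): State.MODIFIED,
    (State.SHARED, "remote_read"): State.SHARED,
    (State.SHARED, "remote_write"): State.INVALID,
    (State.MODIFIED, "local_read"): State.MODIFIED,
    (State.MODIFIED, "local_write"): State.MODIFIED,
    (State.MODIFIED, "remote_read"): State.SHARED,    # dirty data would be written back
    (State.MODIFIED, "remote_write"): State.INVALID,  # dirty data would be written back
}


def access(states, agent, op):
    """Apply a "read" or "write" by `agent` to the line tracked in `states`."""
    for other in states:
        event = ("local_" if other == agent else "remote_") + op
        states[other] = TRANSITIONS[(states[other], event)]
    return states


states = {"A": State.INVALID, "B": State.INVALID}
access(states, "A", "read")   # A holds the line in S
access(states, "B", "read")   # both hold the line in S
access(states, "A", "write")  # A moves to M and B's copy becomes I
print({name: s.value for name, s in states.items()})
```

After the final write, agent A holds the only valid copy in the Modified state, so at most one cache holds a writable copy of the line at any time.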
Each directory entry typically contains a tag corresponding to the address of a given memory block, identifying information for locating all processors that are caching the block, and a status field indicating whether the cached copies are valid. Directory information of a node is used to evaluate read and write requests pertaining to the memory blocks of the node, and to send out coherency messages to all caches that maintain copies. When a processor in the system updates a shared memory block, the directory having jurisdiction over the memory block is consulted to determine the caches that hold copies of the block. Before the write operation can proceed, invalidation messages are sent to the identified caches and invalidation acknowledgements must be returned to verify that all cached copies have been invalidated. In similar fashion, when a processor requests read access to a shared memory block, the directory having jurisdiction over the block is consulted to identify the location and status of all cached copies. Based on the information in the directory, the requested block can be provided to the requestor from one of the caches holding a valid copy, or from the main memory of the node that stores the block.
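The directory lookups described above may be sketched with a directory that keeps one presence bit per agent for each block. This is a deliberately simplified model: the class and method names are hypothetical, messages and acknowledgements are represented only by the returned list of agents to invalidate, and write-back of dirty data is not modeled.

```python
class Directory:
    """Tracks, for each block address, which agents hold a cached copy."""

    def __init__(self, num_agents):
        self.num_agents = num_agents
        self.entries = {}  # block address -> {"sharers": bit vector, "dirty": bool}

    def _entry(self, block):
        return self.entries.setdefault(block, {"sharers": 0, "dirty": False})

    def sharers(self, block):
        """Return the agent numbers whose presence bit is set for `block`."""
        bits = self._entry(block)["sharers"]
        return [a for a in range(self.num_agents) if (bits >> a) & 1]

    def read(self, block, agent):
        """Record `agent` as a sharer; report where the data can come from."""
        entry = self._entry(block)
        source = "cache" if entry["sharers"] else "memory"
        entry["sharers"] |= 1 << agent
        entry["dirty"] = False  # simplification: any dirty copy is assumed written back
        return source

    def write(self, block, agent):
        """Make `agent` the sole owner; return the agents that must be invalidated."""
        entry = self._entry(block)
        to_invalidate = [a for a in self.sharers(block) if a != agent]
        entry["sharers"] = 1 << agent
        entry["dirty"] = True
        return to_invalidate


directory = Directory(num_agents=4)
print(directory.read(0x40, agent=0))   # no cached copy yet
print(directory.read(0x40, agent=2))   # a cached copy exists
print(directory.write(0x40, agent=1))  # agents holding copies to invalidate
```

In a real system the write would proceed only after every agent in the returned list had acknowledged its invalidation. The sketch returns the list and leaves that handshake to the caller.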
An efficient data structure is needed to implement directory tables in which coherent cache line addresses, their sharers, and their states are tracked. The architecture of such a table has implications for the total amount of memory needed for tracking all coherent cache line addresses in the system, the mode/manner of utilization of such memory, and the performance of the system.
Snooping is a process where individual caches monitor address lines for access to memory locations that they have cached, instead of a centralized directory-like structure doing it. When a write operation is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location. In the snooping solution, a snoopy bus is incorporated to send all requests for data to all processors, wherein the processors snoop to see if they have a copy and respond accordingly. This mechanism therefore involves a broadcast, since caching information is stored in the processors. A multiple snoop filter reduces the snooping traffic by maintaining a plurality of entries, each representing a cache line that may be owned by one or more nodes. When replacement of one of the entries is required, the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each entry. A temporal or other type of algorithm is used to refine the selection when more than one cache line is owned by the fewest number of nodes.
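The snoop filter replacement choice described above may be sketched as follows. The entry layout (a `presence` bit vector with one bit per node and a `last_used` timestamp) and the use of least-recent use as the temporal tie-breaker are assumptions made for illustration, and `choose_victim` is a hypothetical name.

```python
def choose_victim(entries):
    """Pick the entry owned by the fewest nodes; break ties by oldest use.

    Each entry is a dict with a `presence` bit vector (one bit per node)
    and a `last_used` timestamp. Both field names are hypothetical.
    """
    def owner_count(entry):
        return bin(entry["presence"]).count("1")  # number of set presence bits

    return min(entries, key=lambda e: (owner_count(e), e["last_used"]))


entries = [
    {"tag": 0xA, "presence": 0b0111, "last_used": 5},  # owned by three nodes
    {"tag": 0xB, "presence": 0b0001, "last_used": 9},  # owned by one node, recent
    {"tag": 0xC, "presence": 0b0100, "last_used": 3},  # owned by one node, older
]
print(hex(choose_victim(entries)["tag"]))
```

Entries 0xB and 0xC are each owned by a single node, so the temporal tie-breaker selects 0xC, which was used less recently.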
In related art, structures for directory entries are static and consistent. Directory entries reference an address in a cache for a single agent in a one-to-one manner. However, as the agents associated with the hardware system increase in number, problems in scalability may begin to occur with such rigid directory structures. For example, in an implementation involving a Network on Chip (NoC), directories can be utilized to maintain cache coherency among the agents associated with the NoC as explained above. As the number of agents increases, maintaining cache coherency for the agents associated with the NoC may become more difficult.