Computing systems include one or more memories for storing data and other information in the form of bits. Many computing systems include both a main or primary memory and a cache memory. The main memory, which may include one or more memory structures or storage mediums, stores data and instructions that are executed by a processor (e.g., CPU) or other control unit of the computing system. Some main memories include volatile random access memories (RAM), although other suitable memory types may be provided. The cache memory, which also may include one or more memory structures or storage mediums, is often used by a processor of the computing system to temporarily store copies of data from the main memory to reduce the time or latency required for the processor to access and manipulate requested data. A memory controller, internal or external to the processor, typically controls the indexing of and access to data stored in the cache and in the main memory.
Based on memory requests from the processor, the memory controller populates the cache with data from the main memory after startup of the computing system and on demand throughout the operation of the computing system. Data is transferred between the main memory and the cache in the form of cache lines. In particular, a “cache line” (or “cache block”) as used herein refers to the unit or block of data from the main memory that is transferred between the main memory and the cache. The cache line is typically fixed in size as set by the processor or memory controller of the computing system. Cache lines may have any suitable size, typically a power of two (i.e., cache line size = 2^n bytes for an integer n). A common cache line size is 64 bytes. Other suitable cache line sizes may be provided, such as, for example, 16, 32, 128, and 256 bytes. As such, cache lines are used to transfer data from the main memory for temporary storage in the cache. A “data block” as used herein refers to a portion or a data subset of the cache line. For example, a 128-byte cache line may include two 64-byte data blocks.
Row-based memories may be used as the cache for a computing system. A row-based memory includes multiple memory locations organized into rows or “sets,” and each row is operative to store multiple cache lines from the main memory. The number of cache lines storable in each row of the cache is the set associativity of the cache. For example, a 2 kilobyte row of a cache with 64-byte cache lines has a 32-way set associativity (2048/64=32).
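The associativity arithmetic above can be checked with a short sketch. The function name `set_associativity` and its parameters are illustrative assumptions and are not part of the described system.

```python
def set_associativity(row_size_bytes: int, cache_line_bytes: int) -> int:
    """Return how many cache lines fit in one row of a row-based cache."""
    if cache_line_bytes <= 0 or row_size_bytes % cache_line_bytes != 0:
        raise ValueError("row size must be a whole multiple of the cache line size")
    return row_size_bytes // cache_line_bytes


# A 2-kilobyte row holding 64-byte cache lines, as in the example above.
ways = set_associativity(2048, 64)
```

Here `ways` evaluates to 32, matching the 32-way figure given for a 2-kilobyte row.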
In particular, row-based memories typically store data in a bit cell array that includes multiple rows of bit cells. Each bit cell is operative to store a data bit in some physical format. For example, a dynamic random access memory (DRAM) stores charge to encode a bit value (i.e., logical 0 or 1), and resistive memories (e.g., phase-change memory, memristors, etc.) encode the bit value using the resistance of the material in the bit cell. Reading the bit cells typically involves sensing the physical properties (e.g., the presence or absence of charge in DRAM, whether the resistance is high or low in resistive memories, etc.) of an entire row of bit cells in the bit cell array, and then recording or loading all detected values in the row into a row buffer of the memory. To access data in the row-based memory, the memory controller loads a row of the array into the row buffer and then accesses the loaded row buffer such that data in the row buffer can be read from and/or written to. As such, read and write operations performed on the cache are performed at the row buffer. On the other hand, in a memory that is not row-based, such as a static random access memory (SRAM), for example, data is read directly from and written directly to the bit cell array of the memory. As such, data is not required to be first loaded into a row buffer before the read/write operation is performed.
In row-based memories, copying the data from the requested row of the bit cell array into the row buffer is referred to as “activating” or “opening” the row. In some row-based memories, such as DRAM, for example, the data in the row buffer is written back to the bit cell array after the read/write operation or access is complete because the original activation operation often destroys the charges (i.e., data) stored in the activated row. Restoring or writing back the data from the row buffer to a row of the bit cell array is referred to as “precharging” or “closing” the row. Each activation and precharge of the bit cell array consumes energy, increases observed memory access latencies, and reduces memory bank availability. In non-row-based memories, because data is not required to be first loaded into a row buffer before the read/write operation, separate activate and precharge operations are not required for each row access.
FIG. 1 illustrates an exemplary known memory control system 10 including a control unit 12 operatively coupled to a cache memory 14 and to a main memory 18. Control unit 12, such as a processor, includes a memory controller 16 that controls access to memories 14, 18 for read/write operations. Memory controller 16, while illustrated as a single block, includes logic for controlling main memory 18 and logic for controlling cache memory 14. Memory 14 is illustratively a row-based memory 14 serving as a cache memory for control unit 12. Exemplary memories 14 include a dynamic random access memory (DRAM), phase-change memory (PCM), spin-torque transfer magnetoresistive random-access memory (STT-MRAM), or other suitable volatile and non-volatile row-based memories.
Row-based memory 14 includes a bit cell array 20 comprised of a plurality of rows, and each row is comprised of a plurality of bit cells (i.e., storage cells or memory cells) operative to store data, as described herein. Each bit cell of bit cell array 20 represents a “bit” of stored data and has two stable states—an off state (e.g., logical “0”) and an on state (e.g., logical “1”). Some row-based memories, such as some flash memories and phase-change memories (PCMs), for example, allow for non-binary encodings and encode multiple bits of information per bit cell. For example, PCMs may use different levels of resistance to encode multiple bits, e.g., logical “00” is very low resistance, logical “01” is medium-low resistance, logical “10” is medium-high resistance, and logical “11” is very high resistance. An activated row of bit cell array 20 is loaded into the row buffer 22 during the read and/or write access, as described above. Memory 14 may further include a buffer cache 24 that provides additional caching, for example, to improve memory speed (such as in a flash memory, for example).
In the illustrated embodiment, memory 14 is in communication with control unit 12 and memory controller 16 via communication paths 26, 28. Communication path 26 includes one or more electrical lines or conductors for communicating various commands and controls from memory controller 16 to memory 14. Such commands include activate and precharge commands (described herein), read command, write command, and other suitable memory commands, such as power mode control, wake up and sleep mode control, etc. Communication path 28 includes a data bus for communicating data during the read and write operations.
Memory controller 16 includes logic that communicates with main memory 18 via one or more communication links 30. Communication link 30 includes a data bus or data paths for communicating read/write data as well as one or more control paths for communicating controls, commands, and feedback between memory controller 16 and memory 18.
To initiate a memory access and thus a read/write operation, control unit 12 provides a memory access request to memory controller 16 that requests a read or write operation. For example, an application, operating system, or other program or logic executed by control unit 12 provides the memory access requests to memory controller 16. Upon receipt of the memory access request, the memory controller 16 accesses the requested location in cache 14 (loads the corresponding row of array 20 into the row buffer 22) and returns the data to control unit 12 for a read operation or modifies the data in the row buffer 22 for a write operation. If the requested data is not stored in cache 14, memory controller 16 retrieves the data from main memory 18 and stores it in the cache 14.
The access latencies depend on whether the cache access requires closing (i.e., precharging) an already opened (i.e., activated) row of the cache before opening the requested row. If a requested row has already been opened by an earlier memory access request, a read or write can be completed in less time than if the activate and precharge commands also need to be issued.
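The dependence of latency on the state of the row buffer can be illustrated with a minimal sketch. It assumes an open-page policy, in which a row stays open after an access until a different row is requested; the class name `RowBufferModel` and the command labels are hypothetical and chosen only for illustration.

```python
from typing import List, Optional


class RowBufferModel:
    """Tracks which row of a single bank is currently open (assumed open-page policy)."""

    def __init__(self) -> None:
        self.open_row: Optional[int] = None  # no row is open initially

    def access(self, row: int) -> List[str]:
        """Return the command sequence needed to read or write the given row."""
        if self.open_row == row:
            # Requested row is already in the row buffer: no ACT or PRE needed.
            return ["RD/WR"]
        commands: List[str] = []
        if self.open_row is not None:
            commands.append("PRE")  # close the previously opened row
        commands.append("ACT")      # open the requested row
        commands.append("RD/WR")
        self.open_row = row
        return commands


bank = RowBufferModel()
first = bank.access(7)   # nothing open yet: activate, then access
second = bank.access(7)  # same row again: access only
third = bank.access(8)   # different row: precharge, activate, access
```

The second access issues a single command, while the third issues three, which is the latency difference described above.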
Conventional memory control systems 10 map requested cache lines to the row-based memory 14 such that sequentially (i.e., consecutively) addressed cache lines of the main memory 18 are mapped to consecutive rows in the cache 14. For example, referring to FIG. 2, three consecutive physical rows (i, i+1, i+2) of bit cell array 20 are illustrated. Addresses A0, A1, and A2 represent consecutive main memory addresses of three cache lines that are stored in bit cell array 20. In other words, address A0 is the main memory address of a first cache line, address A1 is the main memory address of a second cache line that is stored adjacent the first cache line in main memory 18, and address A2 is the main memory address of a third cache line that is stored adjacent the second cache line in main memory 18. Consecutively addressed cache lines of main memory 18 having addresses A0, A1, and A2 are stored in consecutive rows (i, i+1, i+2) of bit cell array 20. As illustrated, the cache line with main memory address A0 is stored in row i, the cache line with address A1 is stored in row i+1, and the cache line with address A2 is stored in row i+2. For a cache line size of 64 bytes, for example, the actual bit values of addresses A0, A1, and A2 are separated by 64 bytes. Bit cell array 20 illustratively stores another cache line having main memory address B0. In the illustrated embodiment, main memory address B0 is at a separate location of main memory 18 that is nonconsecutive with addresses A0, A1, A2. While only four cache lines are illustratively stored in bit cell array 20 of FIG. 2, additional cache lines may be stored in array 20.
As such, to populate bit cell array 20 as illustrated in FIG. 2, memory controller 16 (FIG. 1) maps the cache lines to the row-based memory 14 such that sequentially addressed cache lines of the main memory 18 are mapped to consecutive rows in the cache 14. If the last row of the bit cell array 20 is reached during the mapping of consecutively addressed cache lines, the next consecutively addressed cache line of main memory 18 is mapped to the next available memory location of the first row i, thereby providing a “round-robin” mapping sequence. Various replacement strategies or policies may be implemented to manage the replacement of cache entries in an accessed row of bit cell array 20 with other cache lines from the array 20.
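The conventional mapping can be sketched as a simple modulo over the cache line number. The exact index function is not given above, so the modulo form, the function name `conventional_row_index`, the four-row array, and the numeric address are all assumptions made for illustration. Selection of a location within the row is not modeled.

```python
def conventional_row_index(line_address: int, line_size: int = 64, num_rows: int = 4) -> int:
    """Map a main memory cache line address to a cache row.

    Consecutive cache lines land in consecutive rows and wrap around to the
    first row after the last row (assumed round-robin behavior).
    """
    line_number = line_address // line_size
    return line_number % num_rows


# Hypothetical address for A0; A1 and A2 follow at 64-byte steps.
A0 = 0x1000
A1 = A0 + 64
A2 = A0 + 128
rows = [conventional_row_index(a) for a in (A0, A1, A2)]
wrapped = conventional_row_index(A0 + 4 * 64)
```

With these assumed values, the three consecutive lines map to rows 0, 1, and 2, and the fifth consecutive line wraps back to row 0.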
With the cache line organization of FIG. 2, long sequences of sequential cache line accesses cause repeated activations and precharges, thereby increasing the total access latencies. For example, timeline 40 of FIG. 2 illustrates the sequence for accessing the sequential cache lines with main memory addresses A0 and A1. First, accessing the cache line with main memory address A0 requires an activation (ACT) of row i followed by the read (or write) of the cache line. To then access the cache line with main memory address A1, a precharge is required to close row i before another activation (ACT) is implemented to open row i+1 where the cache line with address A1 is stored. Further, memory 14 (FIG. 1) often has a built-in electrical delay between the activation and precharge of a row based on the specifications of the memory chip. This built-in delay (e.g., the Row Active Time (tRAS) for DRAM, etc.) is often longer than the time required to perform the read on the row buffer, as illustrated with the DELAY of FIG. 2, thereby further increasing the overall access latencies of the row-based memory 14.
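The cost of sequential accesses under the conventional organization can be estimated by counting commands. The sketch below assumes a single bank, an open-page policy, and a modulo row mapping; the function name `count_row_commands` and the default sizes are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def count_row_commands(addresses: Iterable[int],
                       line_size: int = 64,
                       num_rows: int = 1024) -> Tuple[int, int]:
    """Count (activations, precharges) for a sequence of cache line accesses."""
    open_row: Optional[int] = None
    activations = 0
    precharges = 0
    for address in addresses:
        row = (address // line_size) % num_rows  # assumed conventional mapping
        if row != open_row:
            if open_row is not None:
                precharges += 1  # close the currently open row first
            activations += 1     # then open the requested row
            open_row = row
    return activations, precharges


# Accessing the lines at A0 and A1 (hypothetically at addresses 0 and 64).
two_lines = count_row_commands([0, 64])
# Eight consecutive 64-byte cache lines.
eight_lines = count_row_commands([i * 64 for i in range(8)])
```

Under these assumptions every sequential cache line costs its own activation, so eight consecutive lines require eight activations and seven precharges.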
Some conventional memory control systems utilize larger cache lines while attempting to capture spatial locality benefits of main memory data. With larger cache lines and thus a larger block of data transferred from main memory 18 to cache 14, it may be possible to execute memory requests for spatially local data, i.e., data in physically nearby memory locations of the main memory 18, with fewer row accesses. For example, referring to FIG. 3, the cache lines of bit cell array 20 are doubled in size compared with the cache lines of FIG. 2. In particular, address A0 of FIG. 3 references a cache line that spans two cache lines of FIG. 2 (spans both data blocks at addresses A0 and A1). With cache line addresses A0 and A1 of FIG. 2 referencing two separate but consecutive cache lines each with a size of 64 bytes, for example, cache line address A0 of FIG. 3 points to a single cache line having a size of 128 bytes. As such, while FIG. 3 illustrates two separate data blocks associated with main memory addresses A0 and A1, the two data blocks cooperate to form a single larger cache line having main memory address A0.
Three additional, nonconsecutive cache lines are illustrated in row i with main memory addresses B0, C0, and D0, with each cache line spanning two 64-byte data blocks (e.g., B0 and B1; C0 and C1; and D0 and D1) for a total size of 128 bytes. As such, the four cache lines at addresses A0, B0, C0, D0 are at nonconsecutive addresses. Similar to the cache organization of FIG. 2, the cache line at address A2, which is the next consecutive cache line address after address A0, is stored in the next consecutive row (row i+1).
With the larger cache line size of FIG. 3, the set associativity of each cache row is reduced. For example, with 64-byte cache lines and 2-kilobyte rows, bit cell array 20 of FIG. 2 has 32-way set associativity. However, with 128-byte cache lines and 2-kilobyte rows, bit cell array 20 of FIG. 3 has only 16-way set associativity. Further, larger cache line sizes may lead to an increase in false sharing for cache coherent systems. For example, in a multi-core processor system, it is possible that two different processor cores (e.g., core X and core Y) each have a copy of a 128-byte cache line spanning main memory addresses A0 and A1 in their respective caches. If core X writes to the data block with address A0, then core Y must discard the entire cache line, including both data blocks at A0 and A1, because the stored data is no longer valid or up-to-date. If core Y is only accessing data at A1 rather than at A0, then invalidation of the entire cache line wastes power and increases latency because core Y only needed to invalidate the data block at address A0, i.e., because the data at A1 was still up-to-date. As such, because the cache only handles data on cache line granularity (illustratively 128 bytes in FIG. 3), then even if a single byte within an entire cache line is modified, all other copies of the cache line in other cores are invalidated. Accordingly, rather than sharing data at A0 and A1, cores X and Y are actually using disjoint data. Further, memory bandwidth consumption is increased, as the data at both A0 and A1 is transferred even if only one of A0 and A1 is used.
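The false sharing effect follows from the granularity at which coherence is tracked, which can be sketched as follows. The helper names and the numeric value chosen for A0 are hypothetical; the sketch only checks whether two addresses fall within the same cache line and does not model an actual coherence protocol.

```python
def line_base(address: int, line_size: int) -> int:
    """Return the starting address of the cache line containing the given address."""
    return address - (address % line_size)


def write_invalidates_copy(written_address: int, cached_address: int, line_size: int) -> bool:
    """True if a write by one core forces another core to discard its cached copy.

    Invalidation happens at cache line granularity, so it applies whenever
    both addresses fall within the same cache line.
    """
    return line_base(written_address, line_size) == line_base(cached_address, line_size)


# Hypothetical addresses: A1 is the 64-byte data block following A0.
A0 = 0x2000
A1 = A0 + 64

# Core X writes A0 while core Y only uses A1.
with_128_byte_lines = write_invalidates_copy(A0, A1, 128)
with_64_byte_lines = write_invalidates_copy(A0, A1, 64)
```

With 128-byte lines the write to A0 invalidates core Y's copy of A1, whereas with 64-byte lines the two blocks sit in separate cache lines and core Y is unaffected.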
Further still, some portions of the cache line in the accessed row may not be needed but still take up memory space, leading to fragmentation in which unused data blocks occupy cache memory. In particular, memory bandwidth may be wasted when, for example, only a single 64-byte block of data is requested during the row access but the cache line is larger, such as 128 bytes or 256 bytes. In FIG. 3, the data blocks with addresses C1 and D1 are illustratively not utilized in a memory access of row i but are still stored in the row due to the large cache line size described above. As such, a request for the data blocks at address E0 or F0 will result in a “cache miss” because these data blocks are not stored in the cache 14. For example, upon a request for address E0, one of the cache lines at addresses A0, B0, C0, or D0 must be evicted from row i based on the replacement policy of memory controller 16, and then the cache line at address E0 is retrieved from main memory 18 and is installed in row i in the newly vacant location. Accordingly, accessing the data block with address E0 or F0 results in increased access latencies and power consumption. Larger cache lines may thus increase “cache miss” rates due to the fragmentation.
Sub-sectoring may be used by the memory controller 16 to reduce the bandwidth consumption and false sharing impacts of larger cache lines. Sub-sectoring reads from or writes to only needed data blocks or “sectors” (i.e., a portion or data subset of the cache line) of the row buffer during the access. For example, rather than reading the entire cache line spanning addresses C0 and C1, sub-sectoring allows only the needed data block at address C0 to be read. However, sub-sectoring does not solve the problem of reduced cache efficiency and underutilization of the cache due to fragmentation, as the unrequested data blocks with addresses C1 and D1 still occupy row space. Further, sub-sectoring does not solve the problem of reduced set associativity of the cache.
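The trade-off can be made concrete with a small accounting sketch. The function name `row_access_cost` and the returned field names are hypothetical; the sketch only tallies bytes and assumes the row space reserved for a cache line equals the full line size regardless of which data blocks are used.

```python
from typing import Dict


def row_access_cost(line_size: int, block_size: int,
                    blocks_needed: int, sub_sectoring: bool) -> Dict[str, int]:
    """Tally bytes transferred and row space occupied for one cache line access."""
    blocks_per_line = line_size // block_size
    if not 1 <= blocks_needed <= blocks_per_line:
        raise ValueError("blocks_needed must be between 1 and the blocks per line")
    # Sub-sectoring moves only the needed blocks; otherwise the whole line moves.
    blocks_moved = blocks_needed if sub_sectoring else blocks_per_line
    return {
        "bytes_transferred": blocks_moved * block_size,
        # The full cache line still occupies the row in both cases.
        "bytes_occupied": line_size,
    }


# Only the data block at C0 is needed from the 128-byte line spanning C0 and C1.
without_sectors = row_access_cost(128, 64, 1, sub_sectoring=False)
with_sectors = row_access_cost(128, 64, 1, sub_sectoring=True)
```

Sub-sectoring halves the bytes transferred in this case, but the occupied row space is unchanged, which is the fragmentation that remains.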
Referring to FIG. 4, a “pool of subsectors” approach is utilized in conjunction with the larger cache lines of FIG. 3. FIG. 4 illustrates row i of bit cell array 20 of FIG. 3, which includes cache lines that are double the size of the cache lines of FIG. 2, along with a tag array 62 comprised of multiple tag entries. Each tag or address entry of tag array 62 identifies the location of a single cache line in array 20. In particular, each tag entry includes a main memory address of a cache line and two pointers that identify the location of data blocks of the cache line in bit cell array 20. Each pointer points to a data block or “subsector” located in a different pool of the row. In particular, bit cell array 20 is divided into two pools such that half of each cache line is allocated to the first pool (pool ‘0’) and the other half of the cache line is allocated to the second pool (pool ‘1’). For example, tag entry A01 stores a main memory address A0 of a cache line, a pointer that identifies the location in pool ‘0’ of the data block of the cache line having address A0, and a pointer that identifies the location in pool ‘1’ of the data block of the cache line having main memory address A1. Tag entry B01 similarly stores a cache line address B0 and two pointers each pointing to the data block or subsector of the cache line in each pool of array 20. Tag entries C01 and D01 each include a single pointer pointing to the respective data block of the corresponding cache line that is stored in pool ‘0’. Since the data blocks having main memory addresses C1 and D1 are not requested in FIG. 4, as in FIG. 3, tag entries C01 and D01 do not include pointers that identify the data blocks at respective addresses C1 and D1. However, due to the larger cache lines, the memory space allocated for the data blocks with addresses C1 and D1 is still occupied and is unavailable for other data, despite the data at addresses C1 and D1 not being requested.
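The tag entries of FIG. 4 can be pictured as a small data structure with one optional pointer per pool. The sketch below is illustrative only: the class and field names, the slot numbers, and the numeric addresses assigned to A0, B0, C0, and D0 are assumptions and do not come from the figure.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BLOCK_SIZE = 64              # one data block ("subsector")
LINE_SIZE = 2 * BLOCK_SIZE   # 128-byte cache line, one block per pool


@dataclass
class TagEntry:
    line_address: int                 # main memory address of the cache line
    pool0_slot: Optional[int] = None  # pointer to the first data block in pool '0'
    pool1_slot: Optional[int] = None  # pointer to the second data block in pool '1'


def lookup(tags: List[TagEntry], block_address: int) -> Optional[Tuple[int, int]]:
    """Return (pool, slot) for a data block, or None if it is not stored."""
    base = block_address - (block_address % LINE_SIZE)
    pool = (block_address % LINE_SIZE) // BLOCK_SIZE
    for tag in tags:
        if tag.line_address == base:
            slot = tag.pool0_slot if pool == 0 else tag.pool1_slot
            return None if slot is None else (pool, slot)
    return None


# Hypothetical addresses: A0=0x000, B0=0x400, C0=0x800, D0=0xC00.
tag_array = [
    TagEntry(0x000, pool0_slot=0, pool1_slot=0),  # entry A01: both blocks present
    TagEntry(0x400, pool0_slot=1, pool1_slot=1),  # entry B01: both blocks present
    TagEntry(0x800, pool0_slot=2),                # entry C01: only the C0 block
    TagEntry(0xC00, pool0_slot=3),                # entry D01: only the D0 block
]

a1_location = lookup(tag_array, 0x040)   # data block A1
c1_location = lookup(tag_array, 0x840)   # data block C1, never requested
e0_location = lookup(tag_array, 0x1000)  # data block E0, no tag entry
```

In this sketch the lookup for A1 resolves to pool 1, while C1 and E0 both return `None`: C1 has a tag entry but no pointer, and E0 has no tag entry at all.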
As such, similar to the cache organization of FIG. 3, other cache lines requested during the row access, such as data blocks with addresses E0 and F0, will result in a “cache miss” and an increase in access latencies and power consumption. As a result, the cache organization of FIG. 4 may also increase “cache miss” rates due to fragmentation issues.
Therefore, a need exists for methods and systems to reduce the access latencies involved with a row-based memory. Further, a need exists for such methods and systems to avoid fragmentation and bandwidth consumption issues associated with large cache lines and sub-sectoring and to improve set associativity and cache utilization.