Memory devices are typically provided as internal, semiconductor, integrated circuits in computing systems. There are many different types of memory including volatile and non-volatile memory. Volatile memory can require power to maintain its data (e.g., host data, error data, etc.) and includes random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), synchronous dynamic random access memory (SDRAM), and thyristor random access memory (TRAM), among others. Non-volatile memory can provide persistent data by retaining stored data when not powered and can include NAND flash memory, NOR flash memory, and resistance variable memory such as phase change random access memory (PCRAM), resistive random access memory (RRAM), and magnetoresistive random access memory (MRAM), such as spin torque transfer random access memory (STT RAM), among others.
Computing systems often include a number of processing resources (e.g., one or more processors), which may retrieve and execute instructions and store the results of the executed instructions to a suitable location. A processing resource (e.g., CPU) can comprise a number of functional units such as arithmetic logic unit (ALU) circuitry, floating point unit (FPU) circuitry, and/or a combinatorial logic block, for example, which can be used to execute instructions by performing logical operations such as AND, OR, NOT, NAND, NOR, XOR, and invert (e.g., inversion) operations on data (e.g., one or more operands). For example, functional unit circuitry may be used to perform arithmetic operations such as addition, subtraction, multiplication, and/or division on operands via a number of logical operations.
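As an illustration of how an arithmetic operation can be composed from a number of logical operations, the following is a minimal sketch of a ripple-carry adder that uses only XOR, AND, and OR. The function names (full_adder, ripple_add) and the 8-bit default width are assumptions chosen for the example and do not describe any particular functional unit.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder built only from XOR, AND, and OR."""
    partial = a ^ b                      # sum of the two input bits, ignoring carry
    sum_bit = partial ^ carry_in         # fold in the incoming carry
    carry_out = (a & b) | (partial & carry_in)
    return sum_bit, carry_out


def ripple_add(x, y, width=8):
    """Add two unsigned integers bit by bit.

    Returns (result, carry_out); the result wraps at `width` bits.
    """
    result = 0
    carry = 0
    for i in range(width):
        a = (x >> i) & 1                 # i-th bit of x
        b = (y >> i) & 1                 # i-th bit of y
        sum_bit, carry = full_adder(a, b, carry)
        result |= sum_bit << i
    return result, carry


if __name__ == "__main__":
    print(ripple_add(5, 3))
    print(ripple_add(200, 100))
```

Each loop iteration corresponds to one bit position, so the number of logical operations grows with the operand width.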
A number of components in a computing system may be involved in providing instructions to the functional unit circuitry for execution. The instructions may be executed, for instance, by a processing resource such as a controller and/or host processor. Data (e.g., the operands on which the instructions will be executed) may be stored in a memory array that is accessible by the functional unit circuitry. The instructions and/or data may be retrieved from the memory array and sequenced and/or buffered before the functional unit circuitry begins to execute instructions on the data. Furthermore, as different types of operations may be executed in one or multiple clock cycles through the functional unit circuitry, intermediate results of the instructions and/or data may also be sequenced and/or buffered. A sequence to complete an operation in one or more clock cycles may be referred to as an operation cycle. The time consumed to complete an operation cycle carries a cost in terms of processing resources, computing performance, and power consumption.
In many instances, the processing resources (e.g., processor and/or associated functional unit circuitry) may be external to the memory array, and data is accessed via a bus between the processing resources and the memory array to execute a set of instructions. Processing performance may be improved in a processing-in-memory (PIM) device, in which a processor may be implemented internal and/or near to a memory (e.g., directly on a same chip as the memory array). As used herein, a PIM device is intended to mean a device in which a processing capability is implemented internal and/or near to a memory. A PIM device may save time by reducing and/or eliminating external communications and may also conserve power. PIM operations can involve bit vector based operations. Bit vector based operations are performed on contiguous bits (also referred to as “chunks”) in a virtual address space. For example, a chunk of virtual address space may have a contiguous bit length of 256 bits. The contiguous chunks of virtual address space may or may not be contiguous physically.
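To make the notion of a chunk concrete, the following is a minimal software sketch, not a model of PIM hardware, that splits two bit vectors into 256-bit chunks and applies a bitwise AND chunk by chunk. The 256-bit length comes from the example above; the function names and the use of byte strings to stand in for virtual address ranges are assumptions made for illustration.

```python
CHUNK_BITS = 256                  # chunk length taken from the example in the text
CHUNK_BYTES = CHUNK_BITS // 8     # 32 bytes per chunk


def split_chunks(data):
    """Split a byte string into consecutive 256-bit chunks."""
    if len(data) % CHUNK_BYTES != 0:
        raise ValueError("length must be a multiple of the chunk size")
    return [data[i:i + CHUNK_BYTES] for i in range(0, len(data), CHUNK_BYTES)]


def chunk_and(a, b):
    """Bitwise AND of two chunks, each treated as one 256-bit value."""
    value = int.from_bytes(a, "big") & int.from_bytes(b, "big")
    return value.to_bytes(CHUNK_BYTES, "big")


def bitvector_and(a, b):
    """Apply AND chunk by chunk across two equal-length bit vectors."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    pairs = zip(split_chunks(a), split_chunks(b))
    return b"".join(chunk_and(ca, cb) for ca, cb in pairs)


if __name__ == "__main__":
    x = bytes([0xFF]) * 64        # two chunks of all ones
    y = bytes([0x0F]) * 64        # two chunks with the low nibble set
    print(len(split_chunks(x)), bitvector_and(x, y) == y)
```

Because each chunk is processed independently, nothing in this sketch requires consecutive chunks to sit next to each other physically, which mirrors the point that virtually contiguous chunks may or may not be physically contiguous.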
A typical cache architecture (fully associative, set associative, or direct mapped) uses part of an address generated by a processor to locate the placement of a block of data in the cache (also referred to herein as a “cache block”) and may have some metadata (e.g., valid and dirty bits) describing the state of the cache block. A cache tag is a unique identifier for a group of data in the cache. A last level cache architecture may be based on 3D integrated memory, with tags and metadata being stored on-chip in SRAM and the blocks of cache data in quickly accessed DRAM. In such an architecture, the matching occurs using the on-chip SRAM tags and the memory access is accelerated by the relatively fast on-package DRAM (as compared to an off-package solution).
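The way part of an address locates a cache block can be sketched as follows for a set associative cache. The geometry (64-byte blocks, 128 sets, two ways) and all names here are assumptions chosen for illustration; the text does not specify them.

```python
from dataclasses import dataclass

BLOCK_BYTES = 64                              # assumed cache block size
NUM_SETS = 128                                # assumed number of sets
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1    # 6 bits select a byte in the block
INDEX_BITS = NUM_SETS.bit_length() - 1        # 7 bits select the set


def split_address(addr):
    """Split an address into (tag, set index, byte offset)."""
    offset = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


@dataclass
class CacheBlock:
    tag: int = 0
    valid: bool = False       # block holds meaningful data
    dirty: bool = False       # block differs from the copy in memory


def lookup(cache_sets, addr):
    """Return True on a hit: a valid block in the indexed set has a matching tag."""
    tag, index, _ = split_address(addr)
    return any(block.valid and block.tag == tag for block in cache_sets[index])


if __name__ == "__main__":
    sets = [[CacheBlock() for _ in range(2)] for _ in range(NUM_SETS)]
    tag, index, offset = split_address(0x12345)
    sets[index][0] = CacheBlock(tag=tag, valid=True)
    print(tag, index, offset, lookup(sets, 0x12345))
```

In this sketch the index bits choose the set, the tag comparison identifies the block within that set, and the valid bit guards against matching stale entries.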
A cache architecture may have multiple levels of cache operating with multiple processing resources (processor cores). For example, a laptop may have two processing cores and two levels of cache, with the first level (L1) split into one cache for instructions and one for data. The second level cache (L2) may be referred to as the last level cache (LLC) and be able to store 256 kilobytes of data. A server may have three or more levels of cache. In a three level cache, the third level cache (L3) may serve as the last level cache (LLC). All of the processing cores should have the same view of memory. Accordingly, a cache based memory system will use some form of cache coherency protocol, e.g., a MESI (modified, exclusive, shared, invalid) or directory based cache coherency protocol, in order to maintain access to accurate data in the cache memory system between the processing cores.
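A heavily simplified sketch of the MESI states for a single cache line shared by several cores is shown below. It tracks only the per-core state and omits bus transactions, write-backs, and data movement; the class and method names are assumptions made for illustration rather than a description of any real protocol implementation.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class CoherentLine:
    """Per-core MESI state for one cache line (simplified)."""

    def __init__(self, num_cores):
        self.states = [State.INVALID] * num_cores

    def read(self, core):
        if self.states[core] is not State.INVALID:
            return                       # read hit: state is unchanged
        holders = [i for i, s in enumerate(self.states)
                   if i != core and s is not State.INVALID]
        for i in holders:
            # A Modified holder would also write its data back here.
            self.states[i] = State.SHARED
        self.states[core] = State.SHARED if holders else State.EXCLUSIVE

    def write(self, core):
        # Every other copy is invalidated before the writer modifies the line.
        for i in range(len(self.states)):
            if i != core:
                self.states[i] = State.INVALID
        self.states[core] = State.MODIFIED


if __name__ == "__main__":
    line = CoherentLine(num_cores=2)
    line.read(0)
    line.read(1)
    line.write(1)
    print([s.value for s in line.states])
```

The write path shows why all cores keep the same view of memory: a core may only modify a line after every other cached copy has been marked invalid.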
Code running on a processing core may need to access a bit vector operation device, e.g., a PIM device, to perform a bit vector based operation. A processing resource in a host is generally aware of its own cache line bit length (a cache line can also be referred to herein as a “cache block”) to maintain its cache coherency. However, a bit vector based operation in a PIM device may operate on bit vectors of a much different bit length. A typical use pattern for performing a bit vector based operation while maintaining cache coherency in software may involve expensive flushing of an entire cache or marking particular pages as uncacheable (not available to use in the cache). Flushing cache memory involves writing an entire block of cache entries back to memory and deleting the cache entries to free up space for use in the cache memory. Flushing an entire cache memory may unnecessarily remove usable cache entries from the cache memory and consume a significant amount of power and time in performing the operation.
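The mismatch between a host cache line length and a bit vector length can be illustrated by computing which cache lines a bit vector overlaps. This is a minimal sketch; the 64-byte cache line size and the function name are assumptions, since the text does not give a specific line length.

```python
def cache_lines_covering(start_addr, length_bytes, line_bytes=64):
    """Return the base address of every cache line that overlaps
    the byte range [start_addr, start_addr + length_bytes)."""
    if length_bytes <= 0:
        return []
    first = start_addr // line_bytes
    last = (start_addr + length_bytes - 1) // line_bytes
    return [n * line_bytes for n in range(first, last + 1)]


if __name__ == "__main__":
    # A 256-bit (32-byte) bit vector fits inside a single 64-byte line.
    print([hex(a) for a in cache_lines_covering(0x1000, 32)])
    # A 64-byte bit vector that is not line-aligned straddles two lines.
    print([hex(a) for a in cache_lines_covering(0x1020, 64)])
    # A 16,384-bit (2,048-byte) bit vector touches many lines.
    print(len(cache_lines_covering(0, 2048)))
```

Under these assumptions, a single bit vector may cover a fraction of one cache line or dozens of them, which is why a host that reasons only in terms of its own cache line length cannot directly track what a bit vector operation touches.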
By contrast, marking cache entries as invalid (also referred to as “invalidating” cache entries or “cache invalidate”) involves marking specific cache entries, e.g., specific cache lines, and deleting just those cache entries to free up space for use for another purpose in the cache memory. Hence, a cache invalidate command, which writes a specific cache entry, e.g., a cache line, back to memory and deletes that entry from the cache memory, consumes less power and time than a flushing operation. A cache invalidate operation is one technique for ensuring that data is consistent between a host device and a memory device. However, making a PIM device fully cache coherency protocol aware would be very costly and complex.
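The difference in cost between flushing an entire cache and invalidating specific cache lines can be sketched with a toy model. The model follows the description above, in which both operations write entries back to memory before deleting them, and counts write-backs as a rough stand-in for power and time. The ToyCache class and its methods are hypothetical and do not correspond to any real cache controller interface.

```python
class ToyCache:
    """Toy cache that maps a line address to its data."""

    def __init__(self):
        self.lines = {}        # cached entries: line address -> data
        self.memory = {}       # backing memory: line address -> data
        self.writebacks = 0    # rough proxy for the cost of an operation

    def store(self, addr, data):
        self.lines[addr] = data

    def _write_back_and_delete(self, addr):
        self.memory[addr] = self.lines.pop(addr)
        self.writebacks += 1

    def flush_all(self):
        """Write back and delete every cache entry."""
        for addr in list(self.lines):
            self._write_back_and_delete(addr)

    def invalidate(self, addrs):
        """Write back and delete only the named cache lines."""
        for addr in addrs:
            if addr in self.lines:
                self._write_back_and_delete(addr)


if __name__ == "__main__":
    cache = ToyCache()
    for n in range(8):
        cache.store(n * 64, f"data{n}")
    cache.invalidate([0, 64])
    print(cache.writebacks, len(cache.lines))
    cache.flush_all()
    print(cache.writebacks, len(cache.lines))
```

In this model, invalidating two lines leaves the other six entries usable in the cache, whereas a full flush removes every entry regardless of whether the bit vector operation needed it.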