A CPU cache is a hardware mechanism that the central processing unit (CPU) of a computer uses to reduce the average time to access memory. A cache is a small, fast memory that retains copies of the data from recently used main memory locations. If a subsequent memory access is to an address whose data has been retained in the cache, the access is satisfied from the cache. Thus, the more accesses that are satisfied from the cache, the lower the average memory access time, and the faster the application program runs.
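The relationship between the fraction of accesses satisfied from the cache and the average access time can be illustrated with a small calculation. The following is a minimal sketch only: the function name `average_access_time` and the latencies of 1 ns for the cache and 100 ns for main memory are illustrative assumptions, not figures for any particular hardware.

```python
def average_access_time(hit_rate, cache_time_ns, memory_time_ns):
    """Average time per access when a fraction hit_rate is satisfied from the cache.

    Simplifying assumption: a hit costs cache_time_ns and a miss costs
    memory_time_ns; the cost of checking the cache on a miss is ignored.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_time_ns + (1.0 - hit_rate) * memory_time_ns


if __name__ == "__main__":
    # Assumed latencies: 1 ns for the cache, 100 ns for main memory.
    for hit_rate in (0.0, 0.5, 0.9, 0.99):
        average = average_access_time(hit_rate, 1.0, 100.0)
        print(f"hit rate {hit_rate:.2f}: {average:.2f} ns per access")
```

Under these assumed latencies, each increase in the hit rate lowers the average, which is the effect described above.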
Cache memory is subdivided into cache lines. Each cache line holds a copy of some fixed-size, contiguous range of bytes of main memory. Each cache line also has an address tag and other state that identify whether the cache line is presently valid and, if so, which address range of data is retained in the cache line. Cache lines are of a fixed size, typically 32 to 256 bytes, that depends upon the hardware design. When a CPU performs a read or write memory access to data at a given address in main memory, it also checks whether that address is contained in its cache, in particular, whether the cache contains a cache line that is valid and whose address tag matches the memory address of the access. If so, a cache hit occurs, and the CPU accesses the data in the cache line. Otherwise, a cache miss occurs, and the CPU proceeds down the slower path of accessing the data elsewhere and recording a copy of the data in a cache line in the CPU cache. Since a cache is of fixed size, retaining new data in a cache line may require evicting (invalidating) data that was previously held in that cache line.
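The hit, miss, and eviction behavior described above can be sketched in software. This is a simplified model under stated assumptions: the cache is direct-mapped (each address maps to exactly one cache line, an organization the text does not specify), the line size of 64 bytes and the count of 4 lines are arbitrary, only reads are modeled, and the names `CacheLine`, `DirectMappedCache`, and `read_byte` are hypothetical.

```python
LINE_SIZE = 64   # bytes per cache line (assumed; within the 32 to 256 byte range)
NUM_LINES = 4    # deliberately tiny so that evictions are easy to provoke


class CacheLine:
    def __init__(self):
        self.valid = False   # whether the line currently holds data
        self.tag = None      # identifies which memory range is held
        self.data = None     # copy of LINE_SIZE contiguous bytes


class DirectMappedCache:
    """Read-only model of a direct-mapped cache in front of a bytearray 'memory'."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = [CacheLine() for _ in range(NUM_LINES)]
        self.hits = 0
        self.misses = 0

    def read_byte(self, address):
        block = address // LINE_SIZE    # which line-sized block of memory
        index = block % NUM_LINES       # the one cache line this block may occupy
        tag = block // NUM_LINES        # distinguishes blocks that share an index
        offset = address % LINE_SIZE
        line = self.lines[index]
        if line.valid and line.tag == tag:
            self.hits += 1              # cache hit: serve from the cache line
        else:
            self.misses += 1            # cache miss: fetch from main memory
            base = block * LINE_SIZE
            # Overwriting the line evicts whatever it previously held.
            line.data = bytes(self.memory[base:base + LINE_SIZE])
            line.tag = tag
            line.valid = True
        return line.data[offset]


if __name__ == "__main__":
    memory = bytearray(i % 256 for i in range(1024))
    cache = DirectMappedCache(memory)
    for address in (0, 1, 256, 0):
        cache.read_byte(address)
    print("hits:", cache.hits, "misses:", cache.misses)
```

In the usage above, addresses 0 and 256 map to the same cache line, so reading 256 evicts the line holding address 0, and the final read of address 0 misses again.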
Software application programs execute a sequence of hardware instructions to effect a computation. Such instructions may perform arithmetic operations, may alter the program control flow sequence of subsequent instructions, may read or write (collectively, access) data at specific memory addresses, or may perform other operations. When a CPU cache is used with a CPU, its presence, any information about which addresses are currently cached, and any hardware processes of checking, validating, and invalidating cache lines are typically invisible and inaccessible to software programs; the only observable effect is that programs usually execute faster.
Modern computers may have a plurality of layers of caches. For example, a small, fast level one cache (L1$) may quickly service most memory accesses; but on an L1$ miss, a larger, slower level two cache (L2$) may be accessed. Only a memory access that misses both the L1$ and the L2$ would be performed to the much slower main memory.
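The order in which the layers are consulted can be sketched as follows. This is an illustration under stated assumptions: the latencies of 1, 10, and 100 cycles are invented for the example, each cache is modeled as an unbounded set of cached line addresses (so no evictions occur), and the function name `access` is hypothetical.

```python
# Assumed latencies in cycles; real values depend on the hardware design.
L1_LATENCY = 1
L2_LATENCY = 10
MEMORY_LATENCY = 100


def access(line_address, l1, l2):
    """Return (level that serviced the access, total cycles spent).

    l1 and l2 are sets of line addresses currently cached. On a miss the
    line is filled into the caches that missed, so a repeat access hits.
    """
    cycles = L1_LATENCY
    if line_address in l1:
        return "L1", cycles
    cycles += L2_LATENCY              # L1$ miss: try the larger, slower L2$
    if line_address in l2:
        l1.add(line_address)
        return "L2", cycles
    cycles += MEMORY_LATENCY          # missed both levels: go to main memory
    l2.add(line_address)
    l1.add(line_address)
    return "memory", cycles


if __name__ == "__main__":
    l1, l2 = set(), {7}
    for line_address in (7, 7, 9, 9):
        print(line_address, access(line_address, l1, l2))
```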
Modern computers may also be multiprocessors, which have a plurality of CPUs. In a shared memory multiprocessor, each CPU may access the same shared memory, so one CPU may write to shared memory and later another CPU may read the data written by the first. Each CPU may have one or more layers of cache for its exclusive use (private cache) as well as one or more layers of caches shared with other CPUs (shared cache). In the presence of multiple CPUs with caches, multiprocessors implement cache coherence to transparently provide the multiple threads of execution in the software program with the illusion that all memory accesses are to a single common shared main memory. Here the simple notion that a given cache line is valid is replaced with more elaborate cache line validity states, state machines, and signaling protocols called cache coherence protocols. Sometimes an access in one CPU (such as a write) must invalidate a cache line in other CPUs.
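As one illustration of the more elaborate validity states, the sketch below tracks a single cache line across the private caches of several CPUs using three states: invalid, shared, and modified. This is a textbook-style simplification chosen for the example, not a protocol named in the text; the class name `CoherentLine` is hypothetical, and data transfer and write-back are not modeled.

```python
INVALID, SHARED, MODIFIED = "I", "S", "M"


class CoherentLine:
    """State of one cache line in each CPU's private cache (simplified)."""

    def __init__(self, num_cpus):
        self.state = [INVALID] * num_cpus

    def read(self, cpu):
        if self.state[cpu] == INVALID:
            # If another CPU holds a modified copy, it is downgraded to
            # shared so that both CPUs observe the same data.
            for other, other_state in enumerate(self.state):
                if other_state == MODIFIED:
                    self.state[other] = SHARED
            self.state[cpu] = SHARED
        # A read of a line already shared or modified locally changes nothing.

    def write(self, cpu):
        # A write must invalidate the copies held by all other CPUs.
        for other in range(len(self.state)):
            if other != cpu:
                self.state[other] = INVALID
        self.state[cpu] = MODIFIED


if __name__ == "__main__":
    line = CoherentLine(num_cpus=2)
    line.read(0)
    line.read(1)
    line.write(0)
    print(line.state)
```

After both CPUs read the line and CPU 0 then writes it, CPU 1's copy is invalid, so its next read must obtain the updated data rather than a stale copy.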
It is also possible to factor the hardware resources of a multiprocessor so that some, or nearly all, of the otherwise duplicated hardware resources are shared among multiple CPUs. In an extreme case, a logical plurality of CPUs can be implemented in hardware in a time multiplexed fashion upon a single CPU core, by providing a plurality of copies of all the processor state and registers, called hardware thread contexts, in a single CPU. This is known as a multithreaded CPU core. For example, a single CPU core with four distinct thread contexts, e.g. four copies of its program counter, general purpose registers, and special purpose registers, nonetheless appears to application software and operating system software as four logical processors (LPs), indistinguishable in behavior, if not in performance, from a multiprocessor comprising four separate CPU cores.
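The time multiplexing of one core among several thread contexts can be sketched as follows. This is an illustration under stated assumptions: the core issues one instruction per cycle and rotates among contexts in strict round-robin order (real designs may select threads differently), the only instruction is a hypothetical "add an immediate to a register" encoded as a `(register, immediate)` pair, and the names `ThreadContext` and `run_core` are invented for the example.

```python
class ThreadContext:
    """Per-thread copy of processor state: a program counter and registers."""

    def __init__(self, program, num_registers=4):
        self.pc = 0
        self.registers = [0] * num_registers
        self.program = program   # list of (register, immediate) add instructions


def run_core(contexts, cycles):
    """Run a single core for a number of cycles, one instruction per cycle.

    Returns the list of logical processor numbers in the order they issued.
    """
    issued = []
    for cycle in range(cycles):
        lp = cycle % len(contexts)          # round-robin selection (assumed)
        context = contexts[lp]
        if context.pc < len(context.program):
            register, immediate = context.program[context.pc]
            context.registers[register] += immediate
            context.pc += 1
            issued.append(lp)
    return issued


if __name__ == "__main__":
    # Four thread contexts on one core; context n repeatedly adds n + 1 to register 0.
    contexts = [ThreadContext([(0, lp + 1)] * 3) for lp in range(4)]
    print(run_core(contexts, cycles=8))
    print([context.registers[0] for context in contexts])
```

Each context keeps its own program counter and registers, so each thread computes the same result it would on a separate core; only the rate of progress differs, since the four contexts share the single core's cycles.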
Over time, computer hardware has become faster and more powerful. Today's multiprocessors provide multiple CPU cores that can operate in parallel. Programmers would like different pieces of a program to execute in parallel on these multiple cores, to take advantage of the performance improvements that can be achieved. However, parallel programming is quite difficult for the average programmer using the software development techniques of today, and thus system implementers are developing new programming models that can be used to better write parallel programs. Some of these new programming models are following a transactional memory approach, which uses a transaction abstraction to help coordinate parallel threads' access to shared memory. Transactions do not automatically provide parallelism per se, but they do shift some of the burden of coordinating parallel tasks to other parts of the system, such as the compiler or runtime.