Concurrent programs for shared-memory multiprocessors typically allow multiple threads to access the same data. The shared-memory model is the most commonly deployed method of communication between threads. The threads execute on multiple processors, multiple processor cores, or other parallel execution units that are attached to a memory shared among them. The processors rarely access the shared memory directly. More commonly, at least one and often two levels of cache are associated with each processor: the caches access the shared memory, and each processor accesses its own cache or a cache shared between two or more processors.
Data from memory is loaded into a cache in cache lines. A cache line is an entry in the cache that holds a fixed-size block of data. Thus, data is not read from memory a single byte or word at a time. Instead, an entire cache line of data is read and cached at once. This takes advantage of the principle of locality of reference, which states that if one memory location is read, then nearby locations are likely to be read soon afterward. Reading memory one cache line at a time therefore avoids many expensive trips to main memory for the typical access patterns of sequential code.
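The effect of line-sized reads on sequential access can be illustrated with a little address arithmetic. The sketch below is a simplified model, not a description of any particular hardware: the 64-byte line size, the 4-byte word size, and the helper names `line_index` and `lines_fetched` are all assumptions chosen for illustration, and the model ignores evictions.

```python
LINE_SIZE = 64  # assumed cache line size in bytes; real hardware varies


def line_index(address: int, line_size: int = LINE_SIZE) -> int:
    """Return the index of the cache line that holds the given byte address."""
    return address // line_size


def lines_fetched(addresses, line_size: int = LINE_SIZE) -> int:
    """Count the distinct cache lines touched by a sequence of byte addresses.

    Idealized model: each line is fetched from memory once and is never
    evicted afterward.
    """
    return len({line_index(a, line_size) for a in addresses})


# 16 consecutive 4-byte words starting at address 0 all fall on one line.
sequential = [4 * i for i in range(16)]

# 16 words spaced one full line apart each fall on a different line.
strided = [LINE_SIZE * i for i in range(16)]

print(lines_fetched(sequential))  # 1
print(lines_fetched(strided))     # 16
```

Under these assumptions, the sequential pattern reads sixteen words for the cost of a single trip to main memory, while the strided pattern pays for one trip per word.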
When a multiprocessing system includes multiple caches, a cache coherency protocol is used to keep the copies of data held in separate caches consistent. Unfortunately, such protocols can cause scalability problems in concurrent programs. Multiple threads running on distinct processors with distinct caches may be accessing distinct data, but that data may be close enough in memory to reside on the same cache line. In this case, even though the processors are accessing distinct data and the code needs no locks to prevent race conditions, the multiprocessing system may need to transfer the cache line back and forth between caches to ensure that the processors do not modify the same cache line simultaneously. The result is significantly worse performance than if the processors were able to work independently on their respective data sets.
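The back-and-forth movement of a cache line described above, often called false sharing, can be sketched with a toy ownership model. This is an illustration under stated assumptions rather than a model of any real coherency protocol: it assumes a 64-byte line, assumes that a core must hold a line exclusively before writing to it, and does not count the initial fetch from memory. The names `count_line_transfers` and `interleaved` are hypothetical helpers.

```python
LINE_SIZE = 64  # assumed cache line size in bytes


def count_line_transfers(writes, line_size=LINE_SIZE):
    """Count how often a cache line moves from one core's cache to another.

    `writes` is a sequence of (core_id, address) pairs in execution order.
    Toy model: a core must own a line exclusively before writing to it, so
    a write by a core other than the current owner moves the line.
    """
    owner = {}  # line index -> core that currently owns the line
    transfers = 0
    for core, address in writes:
        line = address // line_size
        if line in owner and owner[line] != core:
            transfers += 1  # the line migrates out of another core's cache
        owner[line] = core
    return transfers


def interleaved(addr0, addr1, rounds):
    """Build a trace in which core 0 writes addr0 and core 1 writes addr1,
    alternating for the given number of rounds."""
    writes = []
    for _ in range(rounds):
        writes.append((0, addr0))
        writes.append((1, addr1))
    return writes


# Distinct variables 8 bytes apart: both fall on line 0.
same_line = count_line_transfers(interleaved(0, 8, 1000))

# Distinct variables 64 bytes apart: they fall on lines 0 and 1.
separate_lines = count_line_transfers(interleaved(0, 64, 1000))

print(same_line, separate_lines)  # 1999 0
```

In this model the two cores never touch the same variable, yet placing the variables on one line forces a transfer on every write after the first, whereas spacing them a full line apart removes the transfers entirely.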