Computers read, store, and manipulate data in memory. Ideally, a computer would have a single, indefinitely large, and very fast memory in which any particular data would be immediately available to the computer. This is not practical, because memory that is very fast is also very expensive.
Thus, computers typically have a hierarchy (or levels) of memory, in which each level has greater capacity than the preceding level but is also slower and less expensive per unit of storage. Keeping frequently needed data in a small but fast level of memory and infrequently needed data in a larger but slower level of memory can substantially increase the performance of a computer.
Another way to increase performance is to use multiple processors executing simultaneously, each with its own cache (a fast level of memory) but sharing data. The caching of shared data among multiple processors introduces a new problem, cache coherence: if multiple processors each have a cached copy of data from a shared memory location, all of those cached copies need to be the same.
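The coherence problem can be illustrated with a minimal sketch. The `Cache` class, the `demo_incoherence` function, and the address `"x"` below are hypothetical names introduced only for illustration, and the caches are assumed to be write-back caches with no coherence protocol at all, so the sketch shows the failure rather than any particular remedy.

```python
class Cache:
    """A private cache in front of a shared memory, with no coherence protocol."""

    def __init__(self, memory):
        self.memory = memory  # the shared memory (a plain dict here)
        self.lines = {}       # this processor's private cached copies

    def read(self, addr):
        # On a miss, fetch from shared memory; on a hit, use the cached copy.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Assumed write-back behavior: only the local copy changes, and
        # no other cache is told about the write.
        self.lines[addr] = value


def demo_incoherence():
    shared_memory = {"x": 1}
    cache_a = Cache(shared_memory)  # cache of a first processor
    cache_b = Cache(shared_memory)  # cache of a second processor

    cache_a.read("x")      # both processors cache the same location
    cache_b.read("x")
    cache_a.write("x", 2)  # the first processor updates its own copy

    # The two cached copies of the same location now disagree.
    return cache_a.read("x"), cache_b.read("x")


if __name__ == "__main__":
    print(demo_incoherence())
```

After the write, the first processor reads the new value while the second processor still reads the old one from its own cache, which is the situation a coherence protocol must prevent.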
To ensure cache coherence, multiprocessor systems use a technique called a cache coherence protocol. In a conventional coherence protocol, a write by a first processor to a second processor's memory would go through the following steps. The first processor performs a write, which results in a miss in a cache local to the first processor. A request is sent to a node of the second processor, which consults a directory. A controller for the second processor reads the target line either from memory or from a cache local to the second processor and sends the line to the first processor, where the line is saved in the first processor's cache, modified, and marked as dirty. Later, the second processor reads the memory location written by the first processor, misses in the second processor's local cache, and consults the second processor's directory, which forwards the request to the first processor. The first processor reads the dirty line from its cache and sends the line to the second processor, where it is written into the second processor's cache and, optionally, into the second processor's memory.
Thus, in this scenario, four network traversals are performed, and the entire line is copied first from the second processor to the first processor, and then from the first processor back to the second processor. This is very inefficient, especially if the first processor merely wanted to send the second processor a single word.
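The write-then-read sequence described above can be traced with a small simulation. This is only a sketch under stated assumptions: the names `Node`, `Network`, `remote_write`, `home_read`, and `run_scenario`, the line size of eight words, and the address `0x40` are all hypothetical, and the state left behind after the final read (write-back to memory, directory entry cleared) is one plausible choice that the description leaves open.

```python
from dataclasses import dataclass, field

LINE_WORDS = 8  # assumed line size in words; the description gives none


@dataclass
class Node:
    name: str
    cache: dict = field(default_factory=dict)   # line address -> [words, dirty flag]
    memory: dict = field(default_factory=dict)  # lines whose home is this node
    owner: dict = field(default_factory=dict)   # directory: line address -> owning node


class Network:
    """Records every message that crosses between nodes."""

    def __init__(self):
        self.log = []

    def send(self, src, dst, what, words):
        self.log.append((src.name, dst.name, what, words))


def remote_write(net, writer, home, addr, offset, value):
    """The writer stores one word into a line whose home is another node."""
    if addr not in writer.cache:                    # write miss in the local cache
        net.send(writer, home, "write request", 0)  # traversal 1
        # The home controller takes the line from its cache if present,
        # otherwise from its memory; pop() invalidates the home's cached copy.
        if addr in home.cache:
            words = home.cache.pop(addr)[0]
        else:
            words = home.memory[addr]
        home.owner[addr] = writer                   # directory records the new owner
        net.send(home, writer, "line", LINE_WORDS)  # traversal 2: whole line copied
        writer.cache[addr] = [list(words), False]
    line = writer.cache[addr]
    line[0][offset] = value                         # modify a single word
    line[1] = True                                  # mark the line dirty


def home_read(net, reader, addr, offset):
    """The home node reads a word of a line that another node may own."""
    if addr not in reader.cache:                    # read miss in the local cache
        owner = reader.owner.get(addr)              # consult the directory
        if owner is not None and owner is not reader:
            net.send(reader, owner, "forwarded read", 0)  # traversal 3
            words = owner.cache[addr][0]                  # owner reads its dirty line
            net.send(owner, reader, "line", LINE_WORDS)   # traversal 4: line copied back
            reader.memory[addr] = list(words)  # optional write to memory (assumed here)
            owner.cache[addr][1] = False       # assumed: owner's copy is clean afterwards
            del reader.owner[addr]             # assumed: no exclusive owner remains
        else:
            words = reader.memory[addr]
        reader.cache[addr] = [list(words), False]
    return reader.cache[addr][0][offset]


def run_scenario():
    net = Network()
    p1, p2 = Node("P1"), Node("P2")
    p2.memory[0x40] = [0] * LINE_WORDS      # the line's home is the second processor
    remote_write(net, p1, p2, 0x40, 3, 99)  # first processor writes one word
    value = home_read(net, p2, 0x40, 3)     # second processor later reads it
    return net.log, value


if __name__ == "__main__":
    log, value = run_scenario()
    for message in log:
        print(message)
    print("value read by P2:", value)
```

In this model the log holds four messages, two of which carry a full line of `LINE_WORDS` words, even though only one word was actually changed by the first processor.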