The ever-increasing gap between processor performance and memory bandwidth is reflected in the growing timing penalty incurred when a processor must fetch data from operating memory. While processor stalls (awaiting data retrieval) and architectural remedies (e.g., cache memories) are costly enough in single-processor systems, such costs tend to be multiplied in multi-processor systems (including multi-core processors), particularly where multiple processors or processor cores share storage locations (e.g., memory). In that case, modification of the shared data by one of the processors generally requires coherency control: interprocessor communication or other high-level coordination, such as “locks” or “semaphores,” that excludes the other processors from accessing the potentially stale shared data while the data-modifying processor carries out the multiple steps required to fetch the shared data from the operating memory, modify the data, and then write the modified data back to the operating memory. In general, any excluded processor that requires access to the shared data must await notification that the exclusive access is complete.
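
By way of illustration only, the lock-based exclusion described above may be sketched in software as follows. The sketch assumes threads standing in for processors and a dictionary entry standing in for a shared storage location; the names `shared`, `lock`, `increment`, and `run` are hypothetical and are not drawn from any particular system.

```python
import threading

# Hypothetical stand-in for a storage location shared by several processors.
shared = {"value": 0}

# Hypothetical lock providing the exclusive access described above.
lock = threading.Lock()


def increment(times):
    """Perform `times` lock-protected fetch/modify/write-back sequences."""
    for _ in range(times):
        with lock:                    # other workers block here until release
            value = shared["value"]   # step 1: fetch the shared data
            value += 1                # step 2: modify the data
            shared["value"] = value   # step 3: write the modified data back


def run(workers=4, times=10000):
    """Start `workers` threads and return the final shared value."""
    shared["value"] = 0
    threads = [
        threading.Thread(target=increment, args=(times,))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["value"]


if __name__ == "__main__":
    print(run())
```

In this sketch, each worker that reaches the `with lock:` statement while another worker holds the lock simply waits, mirroring the excluded processors that must await notification that the exclusive access is complete. The time spent waiting grows with the number of contending workers, which is the multiplied cost noted above.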