Work stealing is a widely used algorithm for balancing load in parallel programs that run on multi-core processors and multi-socket systems. For example, OpenMP (Open Multi-Processing) 3.0, Cilk, Intel® TBB (Threading Building Blocks), and Microsoft® ParallelFX all use work-stealing algorithms. However, in programs that repeatedly sweep the same arrays (as in relaxation or time-stepping numerical methods), work stealing can cause a given element to be processed on a different processor from one sweep to the next. This hurts performance because the data must then be moved between processor caches instead of being reused where it already resides.
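The loss of affinity described above can be illustrated with a toy model. The sketch below is not taken from any of the runtimes named here. It assumes a simplified scheduler in which each worker owns a queue of array chunks, takes work from one end of its own queue, and steals from the opposite end of the fullest queue when it runs dry. The function names (`static_sweep`, `stealing_sweep`, `count_migrations`) and the `slow_worker` parameter are hypothetical, and the "slow worker" stands in for whatever load imbalance triggers stealing in practice.

```python
from collections import deque


def static_sweep(n_chunks, n_workers):
    """Owner of each chunk under a fixed block partition (no stealing)."""
    return [c * n_workers // n_chunks for c in range(n_chunks)]


def stealing_sweep(n_chunks, n_workers, slow_worker):
    """Toy work-stealing sweep; returns the worker that processed each chunk.

    Assumption: `slow_worker` only makes progress on even time steps,
    standing in for OS noise or uneven per-chunk cost.
    """
    # Every sweep starts from the same block partition.
    queues = [deque() for _ in range(n_workers)]
    for chunk, worker in enumerate(static_sweep(n_chunks, n_workers)):
        queues[worker].append(chunk)

    owner = [None] * n_chunks
    remaining = n_chunks
    step = 0
    while remaining:
        for w in range(n_workers):
            if w == slow_worker and step % 2 == 1:
                continue  # the slow worker idles on odd steps
            if queues[w]:
                chunk = queues[w].popleft()  # owner takes from its own end
            else:
                victim = max(range(n_workers), key=lambda v: len(queues[v]))
                if not queues[victim]:
                    continue  # nothing left to steal
                chunk = queues[victim].pop()  # thief takes from the other end
            owner[chunk] = w
            remaining -= 1
        step += 1
    return owner


def count_migrations(sweeps):
    """Number of chunks processed by a different worker than in the previous sweep."""
    return sum(
        a != b
        for prev, cur in zip(sweeps, sweeps[1:])
        for a, b in zip(prev, cur)
    )


# Two sweeps over 8 chunks on 2 workers; a different worker is slow each time.
sweeps = [stealing_sweep(8, 2, slow_worker=s % 2) for s in range(2)]
print(count_migrations(sweeps))  # 2 of the 8 chunks change worker
print(count_migrations([static_sweep(8, 2)] * 2))  # 0 with a fixed partition
```

In this model every chunk that changes worker between sweeps corresponds to data that would have to be fetched into a different cache. A fixed partition never migrates chunks but cannot react to imbalance, which is the tension the paragraph above describes.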