Stencil computations are used in a wide range of scientific applications, from image processing and geometric modeling to the solution of partial differential equations by finite-difference methods and seismic modeling, which is common in the oil industry. Stencil computations also appear in the Game of Life and in the pricing of American put stock options. In a stencil computation, the values of a set of domain points are iteratively updated with a function of the recent values of neighboring points. The outermost loop in a stencil kernel is always an iteration through the time domain, sometimes for tens of thousands of time steps, and the values of the last iteration are the desired results of the stencil computation. Most often, the domain is represented by a rectilinear grid or an equivalent Cartesian description. A stencil computation can be regarded as a matrix multiplication that updates the value of each element, usually with a sum of products or, if symmetry allows, with a sum of products of sums. Another property of stencil operators, commonly found in the literature and in practice, is that the stencil operator, which comprises the structure of the neighborhood and the coefficients applied to each neighbor, is the same for all domain points. In this regard, stencil computations differ from, for example, multi-grid methods.
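The structure described above can be illustrated with a minimal sketch. It assumes a one-dimensional grid, a symmetric three-point neighborhood, and fixed boundary values; the function names and the coefficients `c0` and `c1` are hypothetical and chosen only for illustration. The outermost loop runs over time steps, the same operator is applied to every interior point, and the symmetry of the neighborhood lets the update be written as a sum of products of sums.

```python
def stencil_step(src, dst, c0, c1):
    """Apply one time step of a symmetric 3-point stencil from src into dst.

    The two end points are treated as fixed boundary values (an assumption
    of this sketch, not something the text prescribes).
    """
    dst[0] = src[0]
    dst[-1] = src[-1]
    for i in range(1, len(src) - 1):
        # Both neighbors share the coefficient c1, so they are added first
        # and multiplied once: a sum of products of sums.
        dst[i] = c0 * src[i] + c1 * (src[i - 1] + src[i + 1])


def run_stencil(initial, steps, c0=0.5, c1=0.25):
    """Iterate the stencil for `steps` time steps and return the final grid."""
    cur = list(initial)
    nxt = [0.0] * len(cur)
    for _ in range(steps):  # outermost loop: iteration through the time domain
        stencil_step(cur, nxt, c0, c1)
        cur, nxt = nxt, cur  # swap buffers; the newest values are now in cur
    return cur


if __name__ == "__main__":
    print(run_stencil([0.0, 0.0, 4.0, 0.0, 0.0], steps=2))
```

Only the values left in the buffer after the last time step are returned, which matches the statement that the values of the last iteration are the desired results.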
The importance and ubiquity of stencil computations have resulted in the development of special-purpose compilers, a trend that has its counterpart in current efforts to develop auto-tuners and compilers. Stencil codes are known to perform poorly relative to peak floating-point performance, and stencils must be handled efficiently to extract high performance from distributed machines. All of these activities confirm the need for stencil computation optimization. Conventional techniques for optimizing stencil computations have focused on applying new technologies, such as SIMD (single instruction, multiple data) and cache bypass; improving data alignment in memory; improving task division in and among processor cores; and using time-skewing and cache-oblivious algorithms.
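To give a feel for how blocking in time can improve data reuse, the following sketch uses overlapped tiling, a simpler relative of time-skewing that is swapped in here for brevity and is not one of the techniques named above. It assumes the same one-dimensional, symmetric three-point stencil with fixed boundary values; all function names, the tile size, and the coefficients are hypothetical. Each tile is copied together with a halo as wide as the number of time steps, advanced through all time steps locally, and then written back. Time-skewing avoids the redundant halo work that this simpler scheme performs.

```python
def step_range(src, dst, lo, hi, c0, c1):
    """Update points lo..hi-1 of dst from src with a symmetric 3-point stencil."""
    for i in range(lo, hi):
        dst[i] = c0 * src[i] + c1 * (src[i - 1] + src[i + 1])


def naive_sweeps(initial, steps, c0=0.5, c1=0.25):
    """Reference version: sweep the whole grid once per time step."""
    cur, nxt = list(initial), list(initial)  # end points stay fixed in both
    for _ in range(steps):
        step_range(cur, nxt, 1, len(cur) - 1, c0, c1)
        cur, nxt = nxt, cur
    return cur


def overlapped_tiles(initial, steps, tile, c0=0.5, c1=0.25):
    """Advance each tile through all time steps before moving to the next tile."""
    n = len(initial)
    out = list(initial)
    for start in range(1, n - 1, tile):
        stop = min(start + tile, n - 1)  # this tile owns points [start, stop)
        lo = max(start - steps, 0)       # halo of width `steps` on each side,
        hi = min(stop + steps, n)        # clamped to the domain
        a = initial[lo:hi]               # small working set for this tile
        b = list(a)
        for t in range(1, steps + 1):
            # The valid region shrinks by one point per step on each side,
            # except where the halo reaches a fixed domain boundary.
            left = 1 if lo == 0 else t
            right = len(a) - 1 if hi == n else len(a) - t
            step_range(a, b, left, right, c0, c1)
            a, b = b, a
        out[start:stop] = a[start - lo:stop - lo]
    return out


if __name__ == "__main__":
    grid = [float((i * 7) % 11) for i in range(40)]
    print(overlapped_tiles(grid, steps=3, tile=8) == naive_sweeps(grid, steps=3))
```

Both functions evaluate the same expression on the same inputs for every point, so their results should agree exactly; the tiled version simply keeps a small working set live across several time steps at the cost of recomputing halo points.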