Three-dimensional (3D) graphics applications generate image frames by combining input textures, a lighting model, and a physics model. Image generation is intensive both in the number of computational operations required to perform lighting and physics and in the number of memory operations required to fetch the input data and produce processed pixel data for output. The data sizes used for rendering in complex computer visualization operations tend to grow rapidly as scene complexity increases, which may require the use of multiple interconnected computers (commonly called “nodes” or “ranks”) to handle such data. Using multiple nodes, rendering may be performed in a parallel or distributed manner by dividing rendering operations across the nodes. In this context, parallel rendering refers to the use of multiple nodes to collectively work on a rendering task, while data parallel refers to a group of algorithms or applications in which the entirety of the data is distributed across multiple nodes, such that each rank or node may hold only a subset of the data.
A typical implementation of parallel rendering utilizes a single frame buffer at a “master” node, which may also drive an attached display device. The master node may assign different regions of the output image (or frame buffer) to different worker nodes for rendering. Once the worker nodes generate pixel output for their assigned regions, the worker nodes may send the pixel output to the master node for storage in the master frame buffer. The master frame buffer approach leads to several problems when performing parallel rendering. For example, when rendering for high-resolution, multi-display arrangements, gathering all of the pixel data at a single node before sending the pixel data to a display may result in severe memory and/or input/output bus bandwidth issues and can create a scalability bottleneck at the master node. Additional complications can arise when attempting to schedule rendering tasks across the multiple worker nodes, particularly when progressive refinement algorithms are in use, which may require at least a portion of the data from one or more previous frames to create pixel output for the current frame. Generally, the master frame buffer creates a bottleneck whenever said frame buffer contains many pixels, and/or many data items per pixel, and/or requires many operations (e.g., per-pixel filtering, post-processing, tone mapping, compression, blending, compositing, etc.) to be performed for each pixel.
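The master frame buffer arrangement described above can be illustrated with a minimal single-process sketch. It is not taken from any particular renderer: the names `Tile`, `split_into_tiles`, `assign_round_robin`, `render_tile`, and `gather_at_master`, the round-robin assignment policy, and the 4-bytes-per-pixel figure are all assumptions chosen for illustration, and worker nodes are simulated by plain function calls rather than by real network transfers. The sketch shows that every pixel of the output image must pass through the master node, regardless of how many workers share the rendering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    """Rectangular region of the output image (x1 and y1 are exclusive)."""
    x0: int
    y0: int
    x1: int
    y1: int


def split_into_tiles(width, height, tile_size):
    """Divide the output image into tiles, clipping tiles at the image edge."""
    tiles = []
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            tiles.append(Tile(x, y,
                              min(x + tile_size, width),
                              min(y + tile_size, height)))
    return tiles


def assign_round_robin(tiles, num_workers):
    """Hypothetical master-side policy: deal tiles out to workers in turn."""
    assignment = {worker: [] for worker in range(num_workers)}
    for index, tile in enumerate(tiles):
        assignment[index % num_workers].append(tile)
    return assignment


def render_tile(tile, worker_id):
    """Stand-in for real rendering: every pixel is set to the worker's id."""
    return [[worker_id] * (tile.x1 - tile.x0)
            for _ in range(tile.y1 - tile.y0)]


def gather_at_master(width, height, assignment, bytes_per_pixel=4):
    """Copy every worker's pixel output into one master frame buffer.

    Returns the assembled frame and the number of bytes the master
    would have to receive (bytes_per_pixel is an assumed value).
    """
    frame = [[None] * width for _ in range(height)]
    bytes_received = 0
    for worker_id, tiles in assignment.items():
        for tile in tiles:
            pixels = render_tile(tile, worker_id)
            for row_index, row in enumerate(pixels):
                frame[tile.y0 + row_index][tile.x0:tile.x1] = row
            tile_pixels = (tile.x1 - tile.x0) * (tile.y1 - tile.y0)
            bytes_received += tile_pixels * bytes_per_pixel
    return frame, bytes_received
```

In this sketch the byte count returned by `gather_at_master` depends only on the image size and the per-pixel data size, not on the number of workers, which is one way to see why adding worker nodes does not relieve the load on the master node.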