Recent advances in architecture and programming interfaces have added substantial programmability to graphics pipelined systems. These new features allow graphics programmers to write user-specified programs that run on each vertex and each fragment that passes through the graphics pipeline. Based on these vertex programs and fragment programs, people have developed shading languages that are used to create real-time programmable shading systems that run on modern graphics hardware.
The ideal interface for these shading languages is one that allows its users to write arbitrary programs for each vertex and each fragment. Unfortunately, the underlying graphics hardware has significant restrictions that make such a task difficult. For example, the fragment and vertex shaders in modern graphics processors have restrictions on the length of programs, on the number of resources (e.g., temporary registers) that can be accessed in such programs, and on the control flow constructs that may be used.
Each new generation of graphics hardware has raised these limits. The rapid increase in possible program size, coupled with parallel advances in the capability and flexibility of vertex and fragment instruction sets, has led to corresponding advances in the complexity and quality of programmable shaders. For many users, the limits specified by the latest standards already exceed their needs. However, at least two major classes of users require substantially more resources for their applications of interest.
The first class of users are those who require shaders with more complexity than the current hardware can support. Many shaders in use in the fields of photorealistic rendering or film production, for instance, exceed the capabilities of current graphics hardware by at least an order of magnitude. The popular RenderMan shading language, for example, is often used to specify these shaders, and RenderMan shaders of tens or even hundreds of thousands of instructions are not uncommon. Implementing these complex RenderMan shaders is not possible in a single vertex or fragment program.
The second class of users uses graphics hardware to implement general-purpose (often scientific) programs. This “GPGPU” (general-purpose computation on graphics processing units) community targets the programmable features of the graphics hardware in their applications, using the inherent parallelism of the graphics processor to achieve performance superior to that of microprocessor-based solutions. Like complex RenderMan shaders, GPGPU programs are often substantially larger than can be implemented in a single vertex or fragment program. They may also have more complex outputs. For example, instead of a single color, they may need to output a compound data type.
To implement larger shaders than the hardware allows, programmers have turned to multipass methods in which the shader is divided into multiple smaller shaders, each of which respects the hardware's resource constraints. These smaller shaders are then mapped to multiple passes through the graphics pipeline. Each pass outputs results that are saved for use in future passes.
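The multipass idea can be illustrated with a minimal sketch. It is not any particular conventional partitioner; it is a naive greedy split written only to show what "dividing a shader into passes" and "saving results for future passes" mean. The `Instr` type, the `partition_greedy` function, the single-assignment instruction list, and the use of an instruction-count limit as the only resource constraint are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical shader instruction: writes `dest`, reads `srcs`."""
    dest: str
    srcs: tuple


def partition_greedy(instrs, max_instrs, final_outputs):
    """Naively split a straight-line shader into consecutive passes.

    Assumes each value is assigned exactly once and that the only
    hardware limit is `max_instrs` instructions per pass.
    Returns a list of (pass_instructions, values_saved_by_that_pass).
    """
    chunks = [instrs[i:i + max_instrs]
              for i in range(0, len(instrs), max_instrs)]
    passes = []
    for k, chunk in enumerate(chunks):
        produced = {ins.dest for ins in chunk}
        # A value must be saved if a later pass reads it or if it is
        # one of the shader's final outputs.
        needed_later = set(final_outputs)
        for later_chunk in chunks[k + 1:]:
            for ins in later_chunk:
                needed_later.update(ins.srcs)
        passes.append((chunk, sorted(produced & needed_later)))
    return passes


# Hypothetical five-instruction shader; "n" and "l" are external inputs.
shader = [
    Instr("a", ("n", "l")),
    Instr("b", ("a",)),
    Instr("c", ("a", "b")),
    Instr("d", ("c",)),
    Instr("color", ("d", "b")),
]

passes = partition_greedy(shader, max_instrs=2, final_outputs=["color"])
for index, (chunk, saved) in enumerate(passes):
    print(index, [ins.dest for ins in chunk], "saves", saved)
```

With a limit of two instructions per pass, this toy shader needs three passes, and the first pass has to save two intermediate values (`a` and `b`) because later passes read both. A practical partitioner would choose split points far more carefully than this fixed-size chunking does.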
A key step in this process is the efficient partitioning of the program into several smaller programs. For example, a shader program may be partitioned into several smaller shader programs. Conventional systems often use the RDS (Recursive Dominator Split) method. This method has two major deficiencies. First, shader compilation in modern systems is performed dynamically at the time the shader is run. Consequently, graphics vendors require algorithms that run as quickly as possible. Given n instructions, the runtime of RDS scales as O(n^3). (Even a specialized, heuristic version of RDS, RDSh, scales as O(n^2).) This high runtime cost makes conventional methods such as RDS undesirable for implementation in run-time compilers. Second, many conventional partitioning systems assume a hardware target that can output at most one value per shader per pass. Modern graphics hardware generally allows multiple outputs per pass.
There is a need for a partitioning method and system that operates as quickly as possible. There is also a need for a partitioning method and system that allows the output of more than one value from the resulting partitions.