In single instruction, multiple data (SIMD) parallel processing models, systems are designed to perform the same computation on many sets of data in parallel. Because SIMD processors have impressive cost-to-performance ratios, they are typically well suited to graphics processing. A typical SIMD processor consists of a single control unit and a set of processing elements, where each element is a fully functional arithmetic logic unit capable of executing instructions. Each processing element holds local data, stored either in local memory or in local registers, and the control unit determines the instructions for all processing elements. Each processing element therefore applies an identical computation to a different set of data.
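The arrangement of one control unit and many processing elements can be illustrated with a minimal sketch. The class name `ControlUnit`, the method `broadcast`, and the choice to model an instruction as a Python callable are assumptions made only for illustration; real hardware executes the elements in parallel rather than in a loop.

```python
class ControlUnit:
    """Hypothetical model: one control unit driving many processing elements."""

    def __init__(self, local_data):
        # One slot of local data per processing element.
        self.local_data = list(local_data)

    def broadcast(self, instruction):
        # The same instruction is applied to every element's own data.
        self.local_data = [instruction(x) for x in self.local_data]


cu = ControlUnit([1, 2, 3, 4])
cu.broadcast(lambda x: x * x)   # every element squares its value
cu.broadcast(lambda x: x + 1)   # every element adds one
squares_plus_one = cu.local_data
```

Every element ends up with a different result only because it started with different data; the instruction stream is identical for all of them.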
While many graphics problems can be formulated as identical computations over large sets of data, some computations require different operations for different data and therefore need to support various levels of control flow. Because every processing element of a SIMD processor executes the same instruction, nested control flow may be problematic. One solution for supporting a single level of conditional control flow is to add a predicate condition, also referred to as a context bit, for each processing element. When the processing element attempts to write a value, it first checks the context bit and does not write the value when the context bit is in an off state. A single predicate bit per processing element is inexpensive to implement in hardware but provides only a single level of conditional nesting.
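A single context bit can be sketched as follows. This is an illustrative model under stated assumptions, not a description of any particular hardware: the names `ProcessingElement`, `context_bit`, `write`, and `simd_abs` are hypothetical, and the sequential loops stand in for lockstep parallel execution.

```python
from dataclasses import dataclass


@dataclass
class ProcessingElement:
    """Hypothetical processing element with one predicate (context) bit."""
    value: int                 # local data held by this element
    context_bit: bool = True   # True means writes are enabled

    def write(self, new_value: int) -> None:
        # The write is suppressed when the context bit is off.
        if self.context_bit:
            self.value = new_value


def simd_abs(elements):
    """Run `if x < 0: x = -x` on every element with the same instructions."""
    # Every element evaluates the condition and sets its own context bit.
    for pe in elements:
        pe.context_bit = pe.value < 0
    # The same write is issued to all elements; only enabled ones commit it.
    for pe in elements:
        pe.write(-pe.value)
    # End of the conditional: all elements are re-enabled.
    for pe in elements:
        pe.context_bit = True


pes = [ProcessingElement(v) for v in (3, -4, 0, -7)]
simd_abs(pes)
result = [pe.value for pe in pes]
```

Because the bit is simply overwritten on entry and reset on exit, a second conditional inside the first would lose the outer condition, which is the single-level limitation described above.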
Another option for handling nested conditional flow is to utilize a separate control processor to modify the context bit. However, a separate control processor is expensive, requiring extra processing elements, and can slow down processing. Although this approach may be used in a supercomputer, it is not a feasible solution in a standard processing system.
Another option is to utilize a stack of bits per processing element in lieu of the single context bit. However, a specialized stack per processing element may add significant cost to the device, and the stack requires additional instructions to manipulate it. Among other things, a push command, a pop command, and possibly other instructions that modify the stack are required.
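The per-element stack can be sketched in the same spirit. The class `StackedElement`, its methods, and the sample program are hypothetical; the sketch assumes that a push records whether the element is still on after combining the new condition with the enclosing state.

```python
class StackedElement:
    """Hypothetical processing element with a stack of predicate bits."""

    def __init__(self, value):
        self.value = value
        self.stack = []  # one bit per open conditional; True means on

    def active(self):
        # With no open conditional the element is on; otherwise the top bit decides.
        return self.stack[-1] if self.stack else True

    def push(self, condition):
        # Entering a conditional: an element that is already off stays off.
        self.stack.append(self.active() and condition)

    def pop(self):
        # Leaving a conditional restores the enclosing state.
        self.stack.pop()

    def write(self, new_value):
        if self.active():
            self.value = new_value


def run_nested(elements):
    """if x > 0: { if x is even: x = x * 10 }; x = x + 1 (inside the outer if)."""
    for pe in elements:
        pe.push(pe.value > 0)
    for pe in elements:
        pe.push(pe.value % 2 == 0)
    for pe in elements:
        pe.write(pe.value * 10)
    for pe in elements:
        pe.pop()
    for pe in elements:
        pe.write(pe.value + 1)
    for pe in elements:
        pe.pop()


stacked = [StackedElement(v) for v in (4, -2, 3, 0)]
run_nested(stacked)
nested_result = [pe.value for pe in stacked]
```

The extra `push` and `pop` steps in `run_nested` correspond to the additional stack-manipulation instructions that this approach requires.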
Because the values on the stack correspond to the processing element being on or off, the values on the stack are not independent: once an element is off in an enclosing construct, it remains off in every construct nested inside it. Consequently, either the entire stack contains on values, or the bottom of the stack contains on values and the top of the stack contains an arbitrary number of off values. Another approach is therefore to replace each stack (one stack per processing element) with a counter (one counter per processing element). The value in each counter indicates the number of off values on the stack, counted from the transition from on values to off values. This approach is beneficial because a set of counters requires less hardware: an N-bit counter can hold the same information as a 2^N-bit stack. Although the amount of hardware required decreases, this approach still has several limitations. Among other things, certain constructs require a compiler to compute additional information; for example, breaking out of a nested loop requires knowing the exact number of control flow constructs that the break must exit. Furthermore, even with the reduced hardware, a counter is still needed for each processing element, and if the element is pipelined, a counter is required for each pipeline stage in the processing element. Because many graphics programs do not require control flow, this additional hardware adds significant cost without always improving performance.
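The counter-based alternative can be sketched as follows. The names `CounterElement`, `off_count`, `enter`, and `leave` are hypothetical, and the sketch ignores counter width and pipelining; it only illustrates that counting off values carries the same information as a stack whose on values all sit below its off values.

```python
class CounterElement:
    """Hypothetical processing element that counts off levels instead of stacking bits."""

    def __init__(self, value):
        self.value = value
        self.off_count = 0  # number of open constructs in which this element is off

    def active(self):
        return self.off_count == 0

    def enter(self, condition):
        # Plays the role of a push: an element that is already off goes one
        # level deeper; an active element turns off only if its condition fails.
        if self.off_count > 0 or not condition:
            self.off_count += 1

    def leave(self):
        # Plays the role of a pop: only an off element has an off value to remove.
        if self.off_count > 0:
            self.off_count -= 1

    def write(self, new_value):
        if self.active():
            self.value = new_value


# Program: if x > 0: { if x is even: x = x * 10 }; x = x + 1 (inside the outer if)
counted = [CounterElement(v) for v in (4, -2, 3, 0)]
for pe in counted:
    pe.enter(pe.value > 0)
for pe in counted:
    pe.enter(pe.value % 2 == 0)
for pe in counted:
    pe.write(pe.value * 10)
for pe in counted:
    pe.leave()
for pe in counted:
    pe.write(pe.value + 1)
for pe in counted:
    pe.leave()
counter_result = [pe.value for pe in counted]
```

In this model, leaving several constructs at once, as a break from a nested loop would, means calling `leave` a known number of times, which reflects why the compiler must compute how many constructs the break exits.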
As such, there exists a need for SIMD parallel processing that, in a graphics application, can perform data computations over large sets of data with nested control flow.