The design of processors for graphics operations and general computing has evolved toward increased parallel computation. Typically, this has been achieved by increasing the number of parallel computational units at every natural stage of processing. For example, in a graphics rendering pipeline having a vertex shader unit, followed by a geometry shader unit, followed by a pixel shader unit, and so on, each shader unit is made wider by adding more parallel execution hardware. The result is a wider vertex shader unit, followed by a wider geometry shader unit, followed by a wider pixel shader unit, and so on. This approach has yielded appreciable performance gains in the past. However, it has failed to scale efficiently as parallelism continues to increase, and significant limitations are becoming clear. First, each massively parallel stage in a stage-by-stage pipeline tends to offer little fine-grained control over portions of that stage. Second, each massively parallel stage becomes unwieldy and prohibitively time-consuming to design. Third, utilization may decrease, because the massively parallel stage struggles during operation to find units of work wide enough to fully occupy the data path. These mounting drawbacks indicate that simply increasing parallelism at each stage of a stage-by-stage graphics pipeline is not a sustainable technique for continued improvement. Designers face similar challenges when developing processors for parallel computing. Accordingly, there is a compelling need for a new methodology in the design of high-performance graphics processing and general computing equipment.