This relates generally to data parallel programming environments with multiple data processing machines running multiple operators in parallel, such as those running on single instruction multiple data (SIMD) machines.
In data parallel processing, separate processing elements perform the same operation on multiple data points simultaneously. SIMD execution is particularly well suited to graphics processors that process images. However, SIMD instructions are also executed on central processing units.
Computational work in various types of processors may be divided into a large number of compute instances. A compute instance may be known as a work item in some technologies. Work items are organized into workgroups whose members can collaborate and communicate. Programming environments may require that the number of work items in a workgroup, called the local size, divide evenly into the number of work items in the entire kernel launch, called the global size.
With image applications, and particularly video applications, the most natural global size is often tightly bound to the actual size of the image or video being processed. In order to satisfy the requirement that the local size divide evenly into the global size, an application may be forced to launch more work items than are necessary and to work around the existence of the extra work items using conditionals in the kernel code.
This may increase code complexity. In addition, the extra conditionals may sometimes reduce performance.
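The padding workaround described above can be sketched as follows. This is a minimal illustration in Python, not code from any particular programming environment: the names `padded_global_size`, `kernel`, and `launch` are hypothetical, the doubling operation is an arbitrary stand-in for real image processing, and a serial loop stands in for a parallel kernel launch.

```python
def padded_global_size(needed, local_size):
    """Round `needed` up to the next multiple of `local_size`."""
    return ((needed + local_size - 1) // local_size) * local_size


def kernel(gid, width, src, dst):
    """Hypothetical kernel body executed once per work item."""
    # Guard conditional: work items beyond the real data do nothing.
    if gid < width:
        dst[gid] = src[gid] * 2


def launch(width, local_size, src):
    """Serial stand-in for a kernel launch over a padded global size."""
    global_size = padded_global_size(width, local_size)
    dst = [0] * width
    for gid in range(global_size):  # each iteration models one work item
        kernel(gid, width, src, dst)
    return global_size, dst
```

For example, a row of 1000 pixels with a local size of 64 would need a global size of 1024 under this scheme, so 24 work items exist only to fail the guard.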
Since the work items are frequently mapped to SIMD lanes in a vector machine, the extra conditionals can lower SIMD utilization. Further, the complexity of the programming problem, which can itself impact performance, increases significantly in the presence of workgroup barriers. Since barriers must be met unconditionally by every work item in a workgroup, they interact poorly with the need to execute work items conditionally.
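The effect on SIMD utilization can be estimated with a small model. The sketch below assumes, purely for illustration, that work items map to SIMD lanes in index order and that the local size is a multiple of the SIMD width; real hardware mappings may differ. The function names are hypothetical.

```python
def active_lanes_per_vector(needed, local_size, simd_width):
    """Count active lanes in each SIMD vector of a padded launch.

    Assumes work items map to lanes in index order and that
    local_size is a multiple of simd_width (illustrative only).
    """
    if local_size % simd_width != 0:
        raise ValueError("local_size must be a multiple of simd_width")
    groups = -(-needed // local_size)  # ceiling division
    global_size = groups * local_size
    counts = []
    for start in range(0, global_size, simd_width):
        # Lanes whose work item index is >= needed fail the guard.
        counts.append(max(0, min(needed - start, simd_width)))
    return counts


def simd_utilization(needed, local_size, simd_width):
    """Fraction of launched lanes that perform useful work."""
    counts = active_lanes_per_vector(needed, local_size, simd_width)
    return sum(counts) / (len(counts) * simd_width)
```

Under these assumptions, 10 needed work items with a local size of 8 and a SIMD width of 4 yield one partially filled vector and one entirely idle vector. The model counts only lanes idled by the guard and does not capture the additional cost of barriers.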