1. Field of the Invention
The present invention generally relates to processing data using single instruction multiple data (SIMD) cores.
2. Background Art
In many applications, such as graphics processing in a Graphics Processing Unit (“GPU”), a sequence of work-items, which can also be referred to as threads, is processed in order to output a final result. In many modern parallel processors, for example, processors within a SIMD core synchronously execute a set of work-items. Typically, the synchronously executing work-items are identical (i.e., have the identical code base). A plurality of identical synchronous work-items that are processed by separate processors is known as a wavefront or warp.
During processing, one or more SIMD cores concurrently execute multiple wavefronts. Execution of a wavefront terminates when all work-items within the wavefront complete processing. Each wavefront includes multiple work-items that are processed in parallel, using the same set of instructions. Generally, the time required for each work-item to complete processing depends on a criterion determined by data within the work-item. As such, the work-items within the wavefront can complete processing at different times. When the processing of all work-items has been completed, the SIMD core finishes processing the wavefront.
However, since different work-items require different amounts of processing to complete a required task, a parallel processing compute unit can start processing a particular task while effectively utilizing all of the processors, but after a certain number of cycles, processor efficiency decreases as some of the work-items complete. This decrease in efficiency is due to the fact that each block of data is an individual work-item, but all of the individual work-items are scheduled and processed as a single workgroup.
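By way of illustration only, the decrease in efficiency described above can be sketched with a small model. The function name `lane_utilization` and the per-work-item cycle counts are assumptions made for this example and do not describe any particular hardware. The sketch assumes that a wavefront holds all of its processors until its slowest work-item completes.

```python
def lane_utilization(cycles_per_item):
    """Model one wavefront; return (wavefront_cycles, active_per_cycle, utilization).

    cycles_per_item: hypothetical number of cycles each work-item needs (>= 1).
    """
    if not cycles_per_item or min(cycles_per_item) < 1:
        raise ValueError("need at least one work-item, each requiring >= 1 cycle")
    # The wavefront occupies every processor until the slowest work-item finishes.
    wavefront_cycles = max(cycles_per_item)
    # Number of processors still doing useful work in each cycle.
    active_per_cycle = [
        sum(1 for c in cycles_per_item if c > cycle)
        for cycle in range(wavefront_cycles)
    ]
    # Useful processor-cycles divided by processor-cycles reserved.
    utilization = sum(cycles_per_item) / (len(cycles_per_item) * wavefront_cycles)
    return wavefront_cycles, active_per_cycle, utilization


# Illustrative 8-wide wavefront; the cycle counts are made up.
cycles, active, util = lane_utilization([2, 2, 3, 3, 4, 4, 10, 12])
print(cycles)  # 12
print(active)  # [8, 8, 6, 4, 2, 2, 2, 2, 2, 2, 1, 1]
```

In this made-up case, all eight processors are busy for only the first two cycles, and the wavefront as a whole performs useful work in 40 of the 96 processor-cycles it reserves.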
The severity of the decrease in efficiency is dependent upon the type of application being processed by the parallel processors. For example, a facial recognition algorithm may attempt to determine if an area in an image is a face by processing different spatial areas of the image in parallel. Once the algorithm determines that the analyzed area is not a face, the work-item for that spatial area terminates and has no additional work to perform on any subsequent processing cycles. Such facial recognition algorithms may consist of dozens of analysis passes on an identified area to determine if that area of the image includes a face. After as few as four or five passes, only a small portion of the work-items may remain as possible candidate faces, because the other areas have already been determined either to be a face or not to be a face. However, because some portions of the image have not yet been determined to contain or not contain a face, the process continues, even though only a small portion of the processors are actually performing any valid processing.
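The multi-pass behavior in the facial recognition example can be sketched in the same manner. The following is a minimal model, not an actual recognition algorithm: the function name `active_per_pass`, the number of areas, the number of passes, and the pass at which each area is decided are all hypothetical values chosen for illustration.

```python
def active_per_pass(decided_at_pass, total_passes):
    """Count the work-items that still execute in each analysis pass.

    decided_at_pass[i] is the hypothetical index of the last pass that area i
    needs before it is decided, or None if the area remains a candidate
    through every pass.
    """
    counts = []
    for p in range(total_passes):
        # An area decided in pass r executes passes 0..r and is idle afterwards.
        alive = sum(1 for r in decided_at_pass if r is None or r >= p)
        counts.append(alive)
    return counts


# 16 image areas, 8 passes; the decision points are made up.
decided = [0] * 6 + [1] * 4 + [2] * 2 + [3] * 2 + [None] * 2
counts = active_per_pass(decided, total_passes=8)
print(counts)  # [16, 10, 6, 4, 2, 2, 2, 2]

# Fraction of reserved processor-passes that perform useful work.
busy_fraction = sum(counts) / (len(decided) * len(counts))
print(busy_fraction)  # 0.34375
```

Under these assumed numbers, only two of the sixteen processors perform valid processing from the fifth pass onward, while the remaining fourteen stay reserved until the last candidate area is resolved.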