Single instruction multiple data (SIMD) processing units are arranged to perform the same operation on multiple data items simultaneously. This allows SIMD processing units to process data items in parallel, which may be more efficient than processing each data item in series. SIMD processing units are particularly useful when the same instruction is to be executed on a large number of data items, which is common in multimedia applications. For example, a graphics processing unit (GPU) may use a SIMD processing unit in order to perform operations on each of a large number of pixels of a computer-generated image. Similarly, an image processing unit for processing image data (e.g. captured by a camera), which may for example be part of a camera processing pipeline, may use a SIMD processing unit in order to perform operations on each of a large number of pixels of an image.
A task may be formed of a plurality of “work items”, wherein the work items of a task execute a common sequence of instructions on respective data items. That is, a work item may comprise a sequence of instructions to be performed on a data item, and work items which comprise the same sequence of instructions to be performed on respective data items are grouped together into a task. Each task may include up to a predetermined maximum number of work items. The maximum number of work items that can be included in a task may vary in different systems, but FIG. 1 represents a task 100 which can include up to thirty-two work items 102. For clarity, only some of the work items 102 are labelled in FIG. 1. FIG. 1 also indicates some of the thirty-two different item positions within the task (from position 0 to position 31) at which a work item may be included. The different work items within a task may be executed in parallel since they are respective instances of the same sequence of instructions to be implemented on respective data items. The task 100 is not full of work items, and is therefore considered to have “partial residency”. That is, the task 100 includes fewer than thirty-two work items although it has capacity for thirty-two work items. Positions in the task 100 which have shading in FIG. 1 include a work item, whereas positions in the task 100 which are not shaded in FIG. 1 do not include a work item. Therefore, the task 100 includes seventeen work items, at positions 0 to 16, and does not include work items at positions 17 to 31. Of these seventeen work items, the work items 102 at positions 0 to 6, 9, 11, 15 and 16 are valid work items for execution by a SIMD processing unit. However, as explained in more detail below, some work items may be invalid, in which case they will not be executed by the SIMD processing unit. The work items 102 at positions 7, 8, 10 and 12 to 14 are invalid work items in the example shown in FIG. 1 and are shown as cross-hatched.
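The residency and validity of the task 100 described above can be modelled with simple sets of positions. The following Python sketch is illustrative only: the names `TASK_CAPACITY`, `present`, `invalid`, `valid` and `has_partial_residency` are assumptions introduced here and do not come from any particular hardware or API.

```python
TASK_CAPACITY = 32  # maximum number of work items per task in the FIG. 1 example

# Positions that include a work item in FIG. 1: positions 0 to 16.
present = set(range(0, 17))

# Positions whose work item is invalid (cross-hatched in FIG. 1).
invalid = {7, 8, 10, 12, 13, 14}

# A work item is executed only if it is present and not invalid.
valid = present - invalid


def has_partial_residency(present_positions, capacity=TASK_CAPACITY):
    """Return True if the task holds fewer work items than it has capacity for."""
    return len(present_positions) < capacity


print(len(present))                    # number of work items in the task
print(sorted(valid))                   # positions of the valid work items
print(has_partial_residency(present))  # whether the task is partially resident
```

Under these assumptions the sketch reports seventeen work items, of which eleven are valid, in a task with capacity for thirty-two.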
A SIMD processing unit may comprise a plurality of processing lanes which are each configured to execute an instruction of a work item in each of a plurality of processing cycles. FIG. 2 represents the processing of tasks using a SIMD processing unit which comprises sixteen processing lanes, denoted 200 in FIG. 2. The combination of a processing lane and a processing cycle comprises a processing “slot” in which an instruction of a work item may be processed. In this case, the processing cycles are clock cycles, and FIG. 2 shows four clock cycles labelled clk 0, clk 1, clk 2 and clk 3. Instructions of the work items from the first sixteen positions of a first task (task 100) are scheduled to execute across the sixteen processing lanes in the first processing cycle (clk 0); and instructions of the work items from the next sixteen positions of the task 100 are scheduled to execute across the sixteen processing lanes in the second processing cycle (clk 1). In the next clock cycles (clk 2 and clk 3) the processing lanes are scheduled to execute work items from the next task. Where a task has partial residency then some processing slots will be wasted, i.e. work items will not be executed in those processing slots. This is apparent from FIG. 2, in that task 100 does not include work items at positions 17 to 31, and as such in the second clock cycle (clk 1) an instruction from only one work item (102₁₆) will be executed. Therefore, fifteen processing lanes are idle during clock cycle clk 1 in the example shown in FIG. 2. Furthermore, if an invalid work item is scheduled for execution in a processing slot then that processing slot is also wasted because invalid work items are not processed. Therefore, in the example shown in FIG. 2 the processing lanes 7, 8, 10, 12, 13 and 14 are idle during the first processing cycle (clk 0) because work items 102₇, 102₈, 102₁₀, 102₁₂, 102₁₃ and 102₁₄ are invalid work items in task 100. The system shown in FIG. 2 therefore results in wasted processing slots for the reasons given above.
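The slot accounting described for FIG. 2 can be reproduced with a short Python sketch. It assumes, as in the example, that task position p is scheduled in cycle p // 16 on lane p % 16 with no repacking; the function name `schedule` and its parameters are hypothetical and used only for illustration.

```python
NUM_LANES = 16      # processing lanes in the FIG. 2 example
TASK_CAPACITY = 32  # item positions per task in the FIG. 1 example


def schedule(valid_positions, capacity=TASK_CAPACITY, num_lanes=NUM_LANES):
    """Map each valid task position to a (cycle, lane) slot without repacking.

    Returns one list per processing cycle, holding the lanes that execute a
    valid work item in that cycle.
    """
    num_cycles = capacity // num_lanes
    busy = [[] for _ in range(num_cycles)]
    for pos in sorted(valid_positions):
        busy[pos // num_lanes].append(pos % num_lanes)
    return busy


# Valid work items of task 100 (positions 0 to 6, 9, 11, 15 and 16).
valid = {0, 1, 2, 3, 4, 5, 6, 9, 11, 15, 16}

busy = schedule(valid)
idle_per_cycle = [NUM_LANES - len(lanes) for lanes in busy]

print(idle_per_cycle)       # idle lanes in clk 0 and clk 1
print(sum(idle_per_cycle))  # total wasted slots for task 100
```

With these inputs the sketch gives six idle lanes in clk 0 and fifteen in clk 1, so twenty-one of the thirty-two processing slots allocated to task 100 are wasted.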
Modern graphics application programming interfaces (APIs) such as OpenGL and Microsoft's DirectX define instructions that operate across pixels within a 2×2 pixel quad. For example, it is often necessary to determine the rate of change of a varying quantity between different pixels by way of a “gradient” operation. Because such instructions rely on the work items of a quad being kept together in a task, their presence prevents the removal of “empty” pixel slots (which correspond to invalid work items) when packing work items into tasks.
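A gradient operation over a 2×2 pixel quad can be sketched as a pair of differences between neighbouring pixels. The Python example below is a simplified illustration, not the definition used by any particular API: the quad layout `[top_left, top_right, bottom_left, bottom_right]`, the function name `quad_gradients` and the choice of differencing along the top row and left column are all assumptions, and real implementations are free to compute derivatives differently.

```python
def quad_gradients(quad):
    """Return (d/dx, d/dy) of a quantity sampled at the four pixels of a quad.

    `quad` is assumed to be [top_left, top_right, bottom_left, bottom_right].
    """
    top_left, top_right, bottom_left, _bottom_right = quad
    ddx = top_right - top_left    # horizontal rate of change along the top row
    ddy = bottom_left - top_left  # vertical rate of change along the left column
    return ddx, ddy


# Values of a varying quantity at the four pixels of one quad.
quad = [1.0, 1.5, 3.0, 3.5]
print(quad_gradients(quad))
```

In this sketch the result for the top-left pixel depends on the values held at the top-right and bottom-left positions, which illustrates why a slot cannot simply be dropped from a quad even when its own work item is invalid.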