The technology described herein relates to a multidimensional mechanism for use in microprocessor systems, and in particular to a multidimensional kernel invocation indexing mechanism for a system that executes large numbers of compute kernels, for example where such kernels are executed under compute-shader APIs, such as OpenCL and DirectCompute.
As is known in the art, OpenCL (Open Computing Language) is a standardised framework for writing programs that execute across heterogeneous platforms consisting of CPUs (Central Processing Units), GPUs (Graphics Processing Units) and/or other processors. It includes a language for writing “kernels” (programs for executing given functions), and APIs (Application Programming Interfaces) that are used to define and then control the platforms that the kernels are to be executed on. A typical use of OpenCL is to use a graphics processing unit (a graphics processor) for highly parallel data processing operations, such as image processing (e.g. filtering), texture or other data array compression, iteratively solving differential equations, etc.
Under the OpenCL API, a “kernel” is a (usually small) program that is invoked a large number of times. A typical use of kernels is to perform a large computation in parallel, where each invocation of a kernel performs only a small part of the computation (for example for a given region of an image to be processed or of a texture to be compressed). The kernel invocations are taken to run in parallel, and except for certain restrictions (which will be discussed further below), there are no execution order constraints between them. With such an arrangement, each kernel invocation must be identifiable, in order to allow different invocations of the same kernel to work on different parts of the overall computation. For example, each kernel invocation must know which particular part of the data set (e.g. image) in question it is to process. This is usually done by indexing the kernel invocations, using an appropriate kernel invocation ID or index (which index can then be used to, e.g., identify the data set for the invocation in question, etc.). The index may, e.g., be interpreted by the kernel to identify one specific image pixel location to process, or a row of a video buffer to de-interlace, etc.
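By way of illustration only, the use of an invocation index to select one pixel can be sketched as follows. This is a minimal sketch rather than OpenCL code: the names invert_pixel and run_kernel are hypothetical, the image is assumed to be a list of rows of 8-bit values, and a sequential loop stands in for the parallel execution of the invocations.

```python
def invert_pixel(image, index):
    """Hypothetical kernel body: process the single pixel named by `index`."""
    x, y = index  # the invocation index is interpreted as a pixel location
    return 255 - image[y][x]


def run_kernel(image):
    """Run one invocation per pixel.

    A real system would run the invocations in parallel; the nested loops
    here are only a sequential stand-in.
    """
    height, width = len(image), len(image[0])
    output = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            output[y][x] = invert_pixel(image, (x, y))
    return output


result = run_kernel([[0, 255], [10, 20]])
```

Each invocation reads only its own index, so the invocations do not depend on the order in which they run.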
In the OpenCL API, each kernel invocation has an invocation ID (identifier), which is composed of two major parts: a so-called “work-group” ID and a “work-item” ID. The “work-group” is a mechanism used in OpenCL to organise kernel invocations into defined groups. The distinguishing feature of a work-group is that all the kernel invocations within a given work-group are able to participate in a “barrier”. (As is known in the art, a “barrier” is a parallel-computing construct where all the kernel invocations to which the barrier applies are required to reach a particular step in their execution before any of them are permitted to continue from that step.) All the kernel invocations within a single work-group have the same work-group ID. Each kernel invocation within a given work-group has a unique work-item ID within the work-group (such that the work-group ID and work-item ID together uniquely identify the kernel invocation in question).
The work-group ID and work-item ID in OpenCL each consist of a tuple of 3 integers. These tuples thus define a 6-dimensional index space. When a kernel is specified for execution, it will also have defined for it the dimensions of its index space, i.e. the maximum value each dimension of the kernel invocation IDs can have for the kernel in question. The processor executing the kernel then executes one kernel invocation for each point within the index space specified for the kernel.
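The relationship between the 6-dimensional index space and the set of kernel invocations can be sketched as follows. This is an illustrative sketch only: the function name enumerate_invocations is hypothetical, IDs are assumed to be zero-based, and each dimension is given as a count of values (i.e. the maximum value plus one).

```python
from itertools import product


def enumerate_invocations(group_dims, item_dims):
    """Yield (work_group_id, work_item_id) for every point in the index space.

    group_dims and item_dims are 3-tuples giving the number of values
    along each dimension (an assumption made for this sketch).
    """
    ranges = [range(n) for n in (*group_dims, *item_dims)]
    for point in product(*ranges):
        yield point[:3], point[3:]


# 2 work-groups, each containing 2 x 2 work-items.
invocations = list(enumerate_invocations((2, 1, 1), (2, 2, 1)))
```

In this example the index space contains 2 × 1 × 1 × 2 × 2 × 1 = 8 points, so eight kernel invocations would be executed, each with a distinct pair of IDs.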
The size of each integer defining the work-group ID is limited in OpenCL to 32 bits.
The size of the integers defining the work-item ID is determined by the maximum permitted work-group size, which may, e.g., be 1024 work-items in any one work-group. Each integer defining the work-item ID must be able to represent the total number of possible work-items in a work-group, as it will not be known how the work-items are distributed across the three dimensions of the work-item ID (they could, e.g., all lie along one dimension only). Thus, where a work-group can contain up to 1024 work-items, each work-item integer will require 10 bits (since 2^10 = 1024).
Thus, representing a full OpenCL kernel invocation ID in this manner would require 3 × 32 = 96 bits for the work-group ID and 3 × 10 = 30 bits for the work-item ID (where each work-group can contain up to 1024 work-items), which sums to 126 bits. FIG. 1 illustrates this and shows the fields required to denote an OpenCL kernel invocation ID in this manner.
When using kernel invocations, this 126-bit ID needs to be associated with, and tracked for, each kernel invocation. It can therefore represent a fairly significant overhead for the process.
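The bit count given above can be reproduced with the following short sketch. The function names bits_for and invocation_id_bits are hypothetical and are used for illustration only; the 32-bit work-group field width and the 1024 work-item limit are the figures discussed above.

```python
def bits_for(count):
    """Bits needed to represent `count` distinct values (0 .. count - 1)."""
    return max(1, (count - 1).bit_length())


def invocation_id_bits(max_work_group_size, group_bits_per_dim=32, dims=3):
    """Total bits for a full invocation ID stored as two tuples of `dims` integers."""
    item_bits_per_dim = bits_for(max_work_group_size)
    return dims * group_bits_per_dim + dims * item_bits_per_dim


total = invocation_id_bits(1024)
```

With a maximum work-group size of 1024 this gives 96 + 30 = 126 bits per kernel invocation.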
It is known to try to reduce the cost of kernel invocation ID use in graphics processors by using an execution-unit approach that groups kernel invocations together into groups of 16 to 64 kernel invocations that then run in strict lockstep. This amortizes the cost of the invocation indexing across the group of “lockstepped” invocations. However, this technique cannot be used for kernels that feature divergent branching (as the invocations in the lockstepped group must be locked together).
It is also known in CPU systems to have each execution thread sequentially execute one kernel invocation after another. In this case, only one full kernel invocation index usually needs to be maintained in each thread, as this index can usually be updated with simple additions when the execution thread proceeds from one invocation to the next. However, while this is possible in CPU systems that may typically have only 4 threads per processing core, the cost of this technique becomes very high when the possible number of execution threads in the processing core becomes large (which is typically the case for graphics processors, where, for example, there may be 1024 threads per processing core).
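The kind of incremental index update referred to above can be sketched as follows. This is an assumed illustration of the general idea, not a description of any particular CPU implementation: the names next_index and walk are hypothetical, and every dimension is assumed to have a size of at least one.

```python
def next_index(index, dims):
    """Advance a multidimensional index to the next point using only additions.

    The first dimension varies fastest. Returns None once the index space
    has been exhausted.
    """
    index = list(index)
    for d in range(len(dims)):
        index[d] += 1
        if index[d] < dims[d]:
            return tuple(index)
        index[d] = 0  # this dimension wrapped; carry into the next one
    return None


def walk(dims):
    """Visit every point of the index space in turn, as a single thread would."""
    index = tuple(0 for _ in dims)
    visited = []
    while index is not None:
        visited.append(index)
        index = next_index(index, dims)
    return visited


order = walk((2, 3))
```

Only one full index is held at a time, but that one index must be held per execution thread, which is why the cost grows with the number of threads per processing core.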
The Applicants believe therefore that there remains scope for improved techniques for handling multidimensional indices, such as kernel invocation indexes, particularly where such kernels are being used with graphics processors and compute engines.