Parallel computing has been widely implemented in modern computing systems. To support efficient parallel computing, a number of parallel programming models have been developed for programmers to write code that executes across heterogeneous platforms. Such a platform may include, for example, a combination of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs), and other hardware accelerators. Commonly used parallel programming models include Open Computing Language (OpenCL™) and its variants and extensions.
Generally, a parallel programming model is built upon a parallel computing platform model. One parallel computing platform model, such as the one adopted by OpenCL, includes a host coupled to a set of computational resources. The computational resources further include a set of computing devices, and each computing device includes a set of compute units. Each compute unit further includes a set of processing elements. Typically, the host executes serial code and issues commands to the computing devices. The computing devices execute parallel code, also referred to as a kernel, in response to the commands. A kernel is a function declared in a program, and can be executed in multiple threads by multiple processing elements. An instance of a kernel can be executed as a work item by one or more processing elements. The same kernel can be executed as multiple work items of the same workgroup. Work items of the same workgroup share data in a local memory, and synchronize with each other via workgroup barriers.
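For illustration only, the relationship among host, kernel, work items, workgroups, local memory, and workgroup barriers can be sketched in software. The following is a minimal simulation and not OpenCL code: Python threads stand in for processing elements, and the names `run_workgroup`, `sum_kernel`, and `host` are hypothetical. Workgroups are run one after another here for simplicity, whereas an actual platform may distribute them across compute units.

```python
import threading


def run_workgroup(kernel, local_size, group_id, global_out):
    """Run one workgroup: local_size work items sharing local memory and a barrier."""
    local_mem = [0] * local_size               # visible only to this workgroup
    barrier = threading.Barrier(local_size)    # stands in for a workgroup barrier
    work_items = [
        threading.Thread(
            target=kernel,
            args=(group_id, local_id, local_size, local_mem, barrier, global_out),
        )
        for local_id in range(local_size)
    ]
    for item in work_items:
        item.start()
    for item in work_items:
        item.join()


def sum_kernel(group_id, local_id, local_size, local_mem, barrier, global_out):
    """Kernel body: every work item executes this same function."""
    global_id = group_id * local_size + local_id
    local_mem[local_id] = global_id * global_id   # each work item writes its own slot
    barrier.wait()                                # wait until all slots are written
    if local_id == 0:                             # one work item publishes the result
        global_out[group_id] = sum(local_mem)


def host(num_groups, local_size):
    """Host side: serial code that issues the kernel once per workgroup."""
    global_out = [0] * num_groups
    for group_id in range(num_groups):            # sequential here, for simplicity
        run_workgroup(sum_kernel, local_size, group_id, global_out)
    return global_out


if __name__ == "__main__":
    print(host(num_groups=2, local_size=4))
```

In this sketch the barrier ensures that the work item with local index 0 reads the local memory only after every work item of the same workgroup has written its value, which mirrors the synchronization role of a workgroup barrier described above.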
The parallel programming model described above is a powerful tool for programmers to write efficient parallel code. However, conventional hardware platforms are not tailored to certain kernels that have a specific data access pattern, and are not flexible with respect to the assignment of workgroups. Thus, there is a need for improving the efficiency of parallel computing systems.