Consumer electronics platforms such as smart televisions (TVs), laptops, tablets, cell phones, and gaming consoles may include hardware to render graphics and/or perform parallel computation tasks. Known frameworks that provide a top-level abstraction for the hardware, together with memory and execution models for parallel code execution, may be scalar or single instruction multiple thread (SIMT) frameworks. For example, SIMT shader programs that break a problem into work performed in parallel by independent work items (or threads) may require access to shared memory (e.g., shared local memory, thread group shared memory, etc.) when the work items need to cooperate to compute a result. Hardware computing power and/or hardware performance may suffer as a result since, for example, load operations and/or store operations involving shared local memory may be relatively inefficient.
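By way of illustration only, the cooperation described above may be sketched as a plain Python simulation of a single work group that computes a sum through shared local memory. The sketch is not GPU code: the names `WORK_GROUP_SIZE`, `shared_local`, and `work_group_sum` are hypothetical, the work-group size of eight is an assumption, and barriers are marked only as comments because the simulation runs each step sequentially.

```python
# Hypothetical, sequential simulation of one SIMT work group; not real GPU code.
WORK_GROUP_SIZE = 8  # assumed work-group size (a power of two)


def work_group_sum(global_data):
    """Sum WORK_GROUP_SIZE values the way a SIMT tree reduction might."""
    assert len(global_data) == WORK_GROUP_SIZE

    # Each work item stores its own element to shared local memory.
    shared_local = [0] * WORK_GROUP_SIZE
    for local_id in range(WORK_GROUP_SIZE):
        shared_local[local_id] = global_data[local_id]
    # (barrier: every store must complete before any work item loads)

    stride = WORK_GROUP_SIZE // 2
    while stride > 0:
        for local_id in range(stride):
            # Load a partial sum written by another work item, then store
            # the combined value back to shared local memory.
            shared_local[local_id] += shared_local[local_id + stride]
        # (barrier: all partial sums must be written before the next step)
        stride //= 2

    return shared_local[0]


print(work_group_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # prints 36
```

In this sketch every reduction step involves loads from and stores to shared local memory, followed by a synchronization point, which is the kind of traffic the paragraph above identifies as potentially inefficient.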
Additionally, shared local memory may add complexity to programming frameworks. For example, Open Computing Language (or OpenCL, a trademark of Khronos Group) SIMT shader programs may include an async work group built-in function (e.g., async_work_group_copy) that requires a source and a destination to be explicitly defined, synchronization events to ensure that the shared local memory can be safely read or overwritten, and so on. In addition, SIMT programs may require that each work item in a sub-group (or each thread in a warp) have its own memory address to access data for a load operation, a store operation, and so on. Moreover, SIMT application programming interfaces (APIs) that allow SIMT work items to share data using a shuffle built-in function apply only to rearranging data in registers and not to operations involving, for example, data transfer.
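By way of illustration only, the difference between a per-work-item load and a shuffle may be sketched as a plain Python simulation of one sub-group. The sketch does not reproduce any real SIMT API: `SUB_GROUP_SIZE`, `per_lane_load`, and `shuffle` are hypothetical names, the sub-group width of four is an assumption, and a Python list stands in for the registers of the lanes.

```python
# Hypothetical simulation of one sub-group (warp); not a real SIMT API.
SUB_GROUP_SIZE = 4  # assumed sub-group width


def per_lane_load(memory, addresses):
    """Load one value per lane; every lane supplies its own address."""
    assert len(addresses) == SUB_GROUP_SIZE
    return [memory[address] for address in addresses]


def shuffle(registers, source_lanes):
    """Give each lane the register value currently held by another lane.

    No memory is read or written; values are only rearranged among lanes.
    """
    assert len(registers) == SUB_GROUP_SIZE
    assert len(source_lanes) == SUB_GROUP_SIZE
    return [registers[source] for source in source_lanes]


memory = [10, 11, 12, 13, 14, 15, 16, 17]

# A single load still needs four explicit addresses, one per lane.
registers = per_lane_load(memory, [4, 5, 6, 7])

# The shuffle rotates the loaded values by one lane without touching memory.
rotated = shuffle(registers, [1, 2, 3, 0])

print(registers)  # prints [14, 15, 16, 17]
print(rotated)    # prints [15, 16, 17, 14]
```

The sketch reflects the two limitations noted above: the load is expressed as one address per lane, and the shuffle can only permute values that are already held in registers.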