Conventionally, an amount of on-chip scratchpad memory (e.g., 16 KB, 32 KB, or 64 KB in size) is allocated from a pool of scratchpad memory and assigned to a set of one or more execution contexts (“SEC”) (e.g., where an execution context is a thread or a process) that execute a kernel serially and/or in parallel, where a kernel is a set of instructions such as a program. In one example, a set of one or more execution contexts is a Cooperative Thread Array (CTA). The scratchpad memory allocated to a SEC is private to that SEC, and the data stored in the scratchpad memory does not persist once the SEC finishes executing the kernel. In addition, there is no automatic backing memory for the scratchpad memory. Therefore, in a conventional system, data stored in scratchpad memory is exchanged between different kernels by having each SEC executing one kernel explicitly copy the data from its allocated scratchpad memory to global memory and each SEC executing another kernel explicitly retrieve the data from global memory.
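The conventional exchange described above can be sketched in CUDA terms, where a CTA's scratchpad corresponds to `__shared__` memory. This is a minimal illustration, not from the source; the kernel names, buffer names, and computation are hypothetical:

```cuda
#include <cuda_runtime.h>

#define TILE 256

// Producer kernel: each CTA stages results in its private scratchpad
// (__shared__ memory), then must explicitly copy them out to global
// memory, because the scratchpad contents do not persist after the
// kernel finishes and have no automatic backing store.
__global__ void producer(float* global_buf, int n) {
    __shared__ float scratch[TILE];            // CTA-private scratchpad
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        scratch[threadIdx.x] = i * 2.0f;       // compute in fast scratchpad
        __syncthreads();
        global_buf[i] = scratch[threadIdx.x];  // explicit copy to global memory
    }
}

// Consumer kernel: a later kernel cannot see the producer's scratchpad,
// so it must explicitly reload the data from global memory.
__global__ void consumer(const float* global_buf, float* out, int n) {
    __shared__ float scratch[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        scratch[threadIdx.x] = global_buf[i];  // explicit retrieval from global
        __syncthreads();
        out[i] = scratch[threadIdx.x] + 1.0f;
    }
}
```

Each such round trip through `global_buf` traverses the cache hierarchy down to global memory, which is the latency cost discussed below.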
To reach the global memory (e.g., DRAM or cache), a memory hierarchy of caches (e.g., L1, L2, etc.) is typically traversed, so that the data is transferred through one or more levels of caching. Compared with accessing the scratchpad memory, exchanging data through global memory may be at least an order of magnitude slower. Furthermore, because the scratchpad memory is private to each SEC, different SECs executing the same kernel must also exchange data through the global memory. Hence, a need exists to allow SECs and/or kernels to exchange data more quickly than through conventional approaches.