1. Field of the Invention
Embodiments of the present invention relate generally to parallel processing and, more specifically, to systems and methods for coalescing memory accesses of parallel threads.
2. Description of the Related Art
A typical parallel processing subsystem is capable of very high performance using a relatively large number of small, parallel execution threads on dedicated programmable hardware processing units. The specialized design of such parallel processing subsystems usually allows these subsystems to efficiently perform certain tasks, such as rendering 3-D scenes or computing the product of two matrices, using a high volume of concurrent computational and memory operations. Each thread typically performs a portion of the overall task that the parallel processing subsystem is configured to perform and accesses specific memory addresses.
As is well known, to perform highly efficient coalesced memory transfers, parallel processing subsystems often require memory read or write operations to access large, contiguous blocks of memory on aligned block boundaries. More specifically, if a particular thread group (i.e., a group of parallel threads) accesses a block of memory that is aligned to a multiple of the memory fetch size, and each of the threads in the thread group accesses a sequential portion of the block, then the parallel processing subsystem may perform a single coalesced memory transfer. However, if the accesses of a particular thread group do not conform to these requirements, then the parallel processing subsystem will perform a non-coalesced memory transfer for each individual thread in the thread group. For example, suppose that a thread group of sixteen parallel threads accesses a contiguous block of memory, but the block is shifted by one byte from a multiple of the memory fetch size. The parallel processing subsystem will typically perform sixteen non-coalesced memory transfers, each of which satisfies the access of a single thread. Performing sixteen non-coalesced memory transfers is far less efficient than performing a single coalesced memory transfer and, therefore, substantially reduces the overall performance of the executing application program.
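By way of illustration only, the coalescing rule described above may be modeled as a small host-side Python sketch. The function name `count_transfers`, the 4-byte access width, and the 64-byte memory fetch size are assumptions chosen for the example and do not appear in the description; the model also ignores the size of the block relative to the fetch size. It is not a description of any particular hardware implementation.

```python
def count_transfers(addresses, bytes_per_thread, fetch_size):
    """Return the number of memory transfers modeled for one thread group.

    addresses[i] is the byte address accessed by thread i (hypothetical input).
    Simplified rule: one coalesced transfer if the block is aligned to a
    multiple of the fetch size and the threads access sequential portions;
    otherwise one non-coalesced transfer per thread.
    """
    if not addresses:
        return 0
    # The block must start on a multiple of the memory fetch size.
    aligned = addresses[0] % fetch_size == 0
    # Each thread must access the portion immediately after its predecessor's.
    sequential = all(
        nxt - cur == bytes_per_thread
        for cur, nxt in zip(addresses, addresses[1:])
    )
    return 1 if aligned and sequential else len(addresses)


# Sixteen threads, 4 bytes each, assumed 64-byte fetch size.
aligned_group = [i * 4 for i in range(16)]      # block starts at byte 0
shifted_group = [1 + i * 4 for i in range(16)]  # same block shifted by one byte

print(count_transfers(aligned_group, 4, 64))  # 1
print(count_transfers(shifted_group, 4, 64))  # 16
```

In this model, the one-byte shift alone changes the count from one transfer to sixteen, which corresponds to the sixteen-thread example given above.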
In one approach to addressing this problem, an application developer may take the memory fetch size into consideration and write an optimized application program that enables coalesced memory transfers wherever possible. One drawback to this approach is that developing such an optimized application program requires the application developer to work within a complicated programming environment and is time-consuming.
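As a hedged illustration of the kind of manual bookkeeping this approach may place on the application developer, the following Python sketch rounds a buffer offset up to the next multiple of the memory fetch size so that a thread group's block begins on an aligned boundary. The helper name `align_up` and the 64-byte fetch size are hypothetical and are used only for the example.

```python
def align_up(offset, fetch_size):
    """Round offset up to the nearest multiple of fetch_size (hypothetical helper)."""
    return (offset + fetch_size - 1) // fetch_size * fetch_size


FETCH_SIZE = 64  # assumed memory fetch size in bytes

# A block that would start one byte past a boundary is moved to the next boundary.
print(align_up(1, FETCH_SIZE))   # 64
# A block that is already aligned is left where it is.
print(align_up(64, FETCH_SIZE))  # 64
```

Under this approach, such adjustments would need to be applied by hand to each buffer that the thread groups access, which is one source of the development effort noted above.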
As the foregoing illustrates, what is needed in the art is a technique for more efficiently and flexibly performing memory accesses for thread groups.