1. Field of the Invention
Embodiments of the invention relate generally to address mapping and, more specifically, to thread address mapping for a parallel thread processor.
2. Description of the Related Art
Performance requirements for data processing systems are constantly increasing. Multiple processing units may be configured to operate in parallel by executing multiple parallel threads. For some applications, the parallel threads execute independently. For other applications, the parallel threads share some data. For example, a first thread may compute a value that is used as an input by one or more other threads. The threads may be organized in groups called cooperative thread arrays (CTAs), where data is shared among the threads of each CTA, but not between CTAs. In addition, a parallel thread processor may group multiple parallel threads together in thread groups called warps, which are executed using single-instruction multiple-thread (SIMT) or single-instruction multiple-data (SIMD) techniques.
Multithreaded parallel programs written using a programming model such as CUDA™ C (a general purpose parallel computing architecture) and PTX™ (a parallel thread execution instruction set architecture), both provided by NVIDIA®, access two or more distinct memory address spaces, each having a different parallel scope, e.g., per-thread private local memory, per-CTA shared memory, and per-application global memory. The programmer specifies the memory address space in each variable declaration and typically uses load and store instructions specific to that memory address space when accessing the variable. For example, three different sets of load/store memory access instructions may be used to access three distinct memory spaces that have different parallel sharing scopes. A first set of load/store memory access instructions may be used to access thread-local memory that is private to each thread. A second set of load/store memory access instructions may be used to access shared memory that is shared among all threads in the same CTA. A third set of load/store memory access instructions may be used to access global memory that is shared by all threads in all CTAs. However, requiring programs to provide separate instruction sequences that depend on the type of memory being accessed is inefficient, because a routine that may operate on data in more than one memory space must be provided in a separate version for each memory space.
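By way of illustration only, the three sets of space-specific load/store accesses described above may be modeled on a host processor as in the following C++ sketch. The sketch is a simplified model under stated assumptions and is not CUDA C or PTX code: the names `Machine`, `ThreadId`, `load_local`, `load_shared`, `load_global`, and the `sum_*` routines, as well as the sizes `kCtas`, `kThreadsPerCta`, and `kWords`, are hypothetical and chosen solely for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical sizes, chosen only for illustration.
constexpr std::size_t kCtas = 2;
constexpr std::size_t kThreadsPerCta = 4;
constexpr std::size_t kWords = 8;

using Region = std::array<std::uint32_t, kWords>;

// Identifies one thread by its CTA and its index within that CTA.
struct ThreadId {
    std::size_t cta;
    std::size_t thread;
};

// Hypothetical model of three memory spaces with different sharing scope.
struct Machine {
    std::array<std::array<Region, kThreadsPerCta>, kCtas> local{};  // one region per thread
    std::array<Region, kCtas> shared{};                             // one region per CTA
    Region global{};                                                // one region for all threads
};

// First set: thread-local accesses, private to the issuing thread.
std::uint32_t load_local(const Machine& m, ThreadId t, std::size_t addr) {
    return m.local[t.cta][t.thread][addr];
}
void store_local(Machine& m, ThreadId t, std::size_t addr, std::uint32_t value) {
    m.local[t.cta][t.thread][addr] = value;
}

// Second set: shared accesses, visible to every thread of the same CTA.
std::uint32_t load_shared(const Machine& m, ThreadId t, std::size_t addr) {
    return m.shared[t.cta][addr];
}
void store_shared(Machine& m, ThreadId t, std::size_t addr, std::uint32_t value) {
    m.shared[t.cta][addr] = value;
}

// Third set: global accesses, visible to every thread of every CTA.
// The thread identifier is accepted but unused, since the scope is global.
std::uint32_t load_global(const Machine& m, ThreadId, std::size_t addr) {
    return m.global[addr];
}
void store_global(Machine& m, ThreadId, std::size_t addr, std::uint32_t value) {
    m.global[addr] = value;
}

// The same routine must be written once per memory space, because each
// version is tied to one set of access functions. Callers must keep
// n <= kWords; no bounds checking is performed in this sketch.
std::uint32_t sum_local(const Machine& m, ThreadId t, std::size_t n) {
    std::uint32_t total = 0;
    for (std::size_t i = 0; i < n; ++i) total += load_local(m, t, i);
    return total;
}
std::uint32_t sum_shared(const Machine& m, ThreadId t, std::size_t n) {
    std::uint32_t total = 0;
    for (std::size_t i = 0; i < n; ++i) total += load_shared(m, t, i);
    return total;
}
std::uint32_t sum_global(const Machine& m, ThreadId t, std::size_t n) {
    std::uint32_t total = 0;
    for (std::size_t i = 0; i < n; ++i) total += load_global(m, t, i);
    return total;
}
```

In this model, a value written with `store_local` is visible only to the writing thread, a value written with `store_shared` is visible to the other threads of the same CTA, and a value written with `store_global` is visible to all threads. The three `sum_*` routines differ only in the access function that each one calls.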
Accordingly, what is needed in the art is a technique that enables a program to use a common load or store instruction to access memory spaces that each have a different scope.