1. Field of the Invention
The present invention generally relates to data storage and more specifically to cache miss processing.
2. Description of the Related Art
Performance requirements are constantly increasing in data processing systems. Multiple processing units may be configured to operate in parallel by executing multiple parallel threads. For some applications, the multiple parallel threads execute independently. For other applications, the multiple parallel threads share some data. For example, a first thread may compute an input that is used by one or more other threads. In still other applications, the threads may be organized in groups, where data is shared within each group, but not between groups.
Multithreaded parallel programs written using a programming model such as CUDA™ C (general purpose parallel computing instruction set architecture) and PTX™ (a low-level parallel thread execution virtual machine and virtual instruction set architecture), both provided by NVIDIA®, access two or more distinct memory address spaces, each having a different parallel scope, e.g., per-thread private local memory, per-group shared memory, and per-application global memory. The private local memory is implemented as dedicated local storage, and the per-group shared memory is implemented as an SRAM that may be accessed by all of the threads in a group. The global memory includes off-chip memory that may be cached.
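By way of illustration only, the three scopes described above may be sketched in CUDA C as follows. This is a minimal sketch, not a description of any particular implementation: the kernel name groupSumOfSquares, the host helper runGroupSums, and the group size kThreadsPerGroup are hypothetical names chosen for this example, and the input length is assumed to be a multiple of the group size. Error checking of the CUDA runtime calls is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical group (thread block) size, chosen only for illustration.
constexpr int kThreadsPerGroup = 4;

// Each thread squares one input element; thread 0 of each group then sums
// the squares computed by the threads of its own group.
__global__ void groupSumOfSquares(const int* in, int* out) {
    // Per-group scope: one copy per thread block, visible to all of its threads.
    __shared__ int groupScratch[kThreadsPerGroup];

    // Per-thread scope: automatic variables are private to each thread.
    // (The compiler typically keeps them in registers and may spill to local memory.)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int mine = in[idx];  // read from per-application global memory

    groupScratch[threadIdx.x] = mine * mine;
    __syncthreads();  // make every thread's write visible within the group

    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < kThreadsPerGroup; ++i) {
            sum += groupScratch[i];
        }
        out[blockIdx.x] = sum;  // write to global memory
    }
}

// Host helper (hypothetical): copies the input to global memory, launches one
// group per kThreadsPerGroup elements, and returns one sum per group.
// Assumes input.size() is a multiple of kThreadsPerGroup.
std::vector<int> runGroupSums(const std::vector<int>& input) {
    const int numGroups = static_cast<int>(input.size()) / kThreadsPerGroup;
    std::vector<int> result(numGroups, 0);
    if (numGroups == 0) {
        return result;  // nothing to launch
    }

    int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, input.size() * sizeof(int));
    cudaMalloc(&dOut, numGroups * sizeof(int));
    cudaMemcpy(dIn, input.data(), input.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    groupSumOfSquares<<<numGroups, kThreadsPerGroup>>>(dIn, dOut);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(result.data(), dOut, numGroups * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

In this sketch, calling runGroupSums on the eight values 1 through 8 would launch two groups of four threads. Data written to groupScratch by one group is not visible to the other group, whereas the in and out arrays in global memory are accessible to every thread of the application.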
Accordingly, what is needed in the art is a technique that reduces the dedicated storage used to provide the memory spaces that each have a different scope.