1. Field of the Invention
The present invention generally relates to computer architectures and, more specifically, to efficiently implementing arrays of structures on SIMT and SIMD architectures.
2. Description of the Related Art
A class of parallel processing subsystems (PPSs), such as single-instruction multiple-thread (SIMT) or single-instruction multiple-data (SIMD) processors, executes each instruction on a group of parallel threads or parallel data lanes. Such parallel processors benefit because the same instructions are performed on various data sets in a highly parallel manner. Alternatively, parallel execution of a large number of generally synchronized threads is performed using a common instruction unit, where different threads may follow divergent execution paths through a given thread program.
Traditionally, data sets for SIMT and SIMD processors are stored using a C-style construct called an array of structures (AoS). In an array of structures, the various fields within a first structure are stored sequentially. The second structure immediately follows the first structure, the third structure follows the second structure, and so on. With this storage pattern, the threads operating within the PPS efficiently access adjacent fields from a particular structure because those fields are located in sequential locations in memory. However, where threads access the same field in sequential structures for various operations, an array of structures results in a sparse memory access pattern in which the address accessed by each thread is separated from the addresses accessed by neighboring lanes by a distance equal to the size of a structure. Such accesses are generally much less efficient than memory accesses that are sequential for adjacent SIMT threads and SIMD data lanes. The inefficiency arises because the addresses accessed by parallel threads or data lanes are spread apart with a stride length equal to the size of each structure within the array of structures.
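By way of illustration only, the AoS stride described above may be sketched in C as follows. The structure Particle, its fields x, y, and z, and the function names are hypothetical and are assumed here solely to make the address arithmetic concrete.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-field structure, assumed for illustration only. */
typedef struct {
    float x;
    float y;
    float z;
} Particle;

/* Byte offset of field x of element `lane` within an AoS array of Particle. */
size_t aos_offset_x(size_t lane) {
    return lane * sizeof(Particle) + offsetof(Particle, x);
}

/* Distance in bytes between the x fields read by lanes `lane` and `lane + 1`.
   In an AoS layout this equals the size of one whole structure. */
size_t aos_lane_stride(size_t lane) {
    return aos_offset_x(lane + 1) - aos_offset_x(lane);
}

/* Serial stand-in for "every lane reads field x of its own structure":
   each iteration skips over the y and z fields of the previous element. */
float aos_sum_x(const Particle *p, size_t n) {
    assert(p != NULL || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += p[i].x;
    }
    return sum;
}
```

In this sketch, aos_lane_stride returns sizeof(Particle) for every lane, so the useful data fetched per lane is one field out of each structure-sized span of memory.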
Alternatively, the data sets may be organized as a structure of arrays (SoA). In such an organization, the first fields of each structure are allocated adjacent storage locations in memory. The second fields are then allocated sequential locations, and so on, until all fields of all structures are allocated a storage location. The threads operating within the PPS efficiently access specific fields from adjacent structures because those fields are located sequentially in memory. However, with this approach, accesses to adjacent fields in a particular structure are separated by a distance proportional to the number of structures. This distance can be significant in the case of large data sets. Therefore, accesses to fields across a given structure result in a sparse memory access pattern, reducing efficiency.
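By way of illustration only, the SoA organization described above may be sketched in C as follows. The type ParticlesSoA, the constant NUM_ELEMENTS, and the function names are hypothetical and are assumed solely for this example.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_ELEMENTS 1024 /* hypothetical data-set size, assumed for illustration */

/* Structure of arrays: all x fields first, then all y fields, then all z fields. */
typedef struct {
    float x[NUM_ELEMENTS];
    float y[NUM_ELEMENTS];
    float z[NUM_ELEMENTS];
} ParticlesSoA;

/* Byte offset of field x of logical element i. */
size_t soa_offset_x(size_t i) {
    assert(i < NUM_ELEMENTS);
    return offsetof(ParticlesSoA, x) + i * sizeof(float);
}

/* Byte offset of field y of logical element i. */
size_t soa_offset_y(size_t i) {
    assert(i < NUM_ELEMENTS);
    return offsetof(ParticlesSoA, y) + i * sizeof(float);
}

/* Distance between the x fields of neighboring elements: one field wide. */
size_t soa_lane_stride(size_t i) {
    return soa_offset_x(i + 1) - soa_offset_x(i);
}

/* Distance between fields x and y of the SAME element: grows with NUM_ELEMENTS. */
size_t soa_field_distance(size_t i) {
    return soa_offset_y(i) - soa_offset_x(i);
}
```

In this sketch, soa_lane_stride returns sizeof(float), whereas soa_field_distance spans an entire array of NUM_ELEMENTS values, which reflects the trade-off described above.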
One solution to this problem is for programmers to write program source code that allocates fields within a particular structure and particular fields in adjacent structures for more efficient access. Programmers may use a hybrid between AoS and SoA that uses the number of available thread or data lanes to optimize access both to adjacent fields in a structure and to a particular field in adjacent structures. However, this programming approach has certain drawbacks. First, the programmer typically must know and utilize the architectural details of the PPS in order to choose a memory allocation approach that is optimized for the particular PPS. Second, a PPS on one system may have different architectural details than the PPS on another system. Program source code optimized for the PPS architecture on one system may not run efficiently on another system with a different PPS architecture. Third, PPS architectures may change over time. Program source code optimized for a current PPS architecture may not run efficiently on a future version of that PPS with different architectural details.
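By way of illustration only, one possible form of such a hand-written hybrid layout may be sketched in C as follows. The lane count LANES, the type ParticleTile, and the function names are hypothetical; the value 32 is an assumption chosen for the example and does not describe any particular PPS.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 32 /* assumed lane count, hard-coded by the programmer for one PPS */

/* Hybrid layout: a small structure of arrays, one tile per group of LANES elements. */
typedef struct {
    float x[LANES];
    float y[LANES];
    float z[LANES];
} ParticleTile;

/* Byte offset of field x of logical element i within an array of ParticleTile. */
size_t hybrid_offset_x(size_t i) {
    size_t tile = i / LANES; /* which tile holds element i */
    size_t lane = i % LANES; /* position of element i inside that tile */
    return tile * sizeof(ParticleTile) + offsetof(ParticleTile, x) + lane * sizeof(float);
}

/* Byte offset of field y of logical element i within an array of ParticleTile. */
size_t hybrid_offset_y(size_t i) {
    size_t tile = i / LANES;
    size_t lane = i % LANES;
    return tile * sizeof(ParticleTile) + offsetof(ParticleTile, y) + lane * sizeof(float);
}

/* Read field x of logical element i from an array of tiles. */
float hybrid_get_x(const ParticleTile *tiles, size_t i) {
    assert(tiles != NULL);
    return tiles[i / LANES].x[i % LANES];
}
```

In this sketch, neighboring elements within a tile have x fields that are sizeof(float) apart, and the x and y fields of one element are separated by a distance that depends on LANES rather than on the size of the data set. Because LANES is fixed in the source code, the layout would have to be revised by hand for a PPS with a different number of lanes, which is the drawback noted above.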
As the foregoing illustrates, what is needed in the art is a more optimized way to store data being accessed by parallel processing subsystems.