Many programming models (for example, in graphics processing units (GPUs), heterogeneous computing systems, or embedded architectures) must control access to multiple levels of a memory hierarchy. In a memory hierarchy, certain memories are close to where operations happen (e.g., an arithmetic logic unit), while other memories are located farther away. These different memories have different properties, including latency and coherency. Regarding latency, the farther a memory is located from where the operation happens, the longer the latency. Regarding coherency, a memory located closer to the chip may not be able to observe some reads and writes occurring in other parts of the chip. This has led to complicated programming situations involving addresses in multiple memory spaces.
Container partitions in models such as STAPL (the Standard Template Adaptive Parallel Library for C++), or split methods in memory-hierarchy languages such as Sequoia, provide data splitting that allows data movement, but they do not abstract temporary data in the same way. Memory spaces in programming models such as OpenCL expose these hardware structures, but in a free-form way, with no clear method for passing substructures into and out of executing computational kernels or for providing flows of dependent blocks from one kernel instantiation to another.
A distributed array is an opaque memory type that defines a global object containing a set of local arrays, where one local array is mapped to each executing group of work instances (programmatic units of execution such as OpenCL work items, CUDA™ threads, or the instances of the body of a parallelFor execution). Each group of work instances has a greater capacity for intra-group communication than for inter-group communication, and hence all work instances in the group may share access to the local part of the distributed array.