Massively parallel processing uses hundreds or thousands of processing elements (PEs) linked together by high speed interconnect networks. Typically, each PE includes a processor, local memory and an interface circuit connecting the PE to the interconnect network. A distributed memory massively parallel processing (MPP) system is one in which each processor has a favored low latency, high bandwidth path to one or more local memory banks, and a longer latency, lower bandwidth path over the interconnect network to memory banks associated with other processing elements (remote or global memory). In globally addressed distributed memory systems, all memory is directly addressable by any processor in the system. Because a processor can access data residing in its own local memory much faster than data residing in memory local to another processor, there is an incentive to place data so as to enhance locality. Typically, however, data distribution within such MPP systems is limited to a particular stride or to the placement of contiguous blocks of data. There is therefore a need in the art for a flexible addressing scheme which enhances locality for a variety of different processing tasks by efficiently and flexibly distributing data among a group of PEs.
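The two conventional placements mentioned above, contiguous blocks and a fixed stride, can be illustrated with a minimal sketch. The function names (`block_owner`, `cyclic_owner`, `remote_fraction`) and the 16-element, 4-PE configuration are hypothetical and chosen only for illustration; they do not describe any particular MPP system. The sketch maps a global element index to a PE number and a local offset, then measures how many references from one PE would have to cross the interconnect network under each placement.

```python
def block_owner(i, n_elems, n_pes):
    """Contiguous-block placement: element i -> (pe, local_offset)."""
    block = -(-n_elems // n_pes)  # ceiling division: elements per PE
    return i // block, i % block


def cyclic_owner(i, n_pes):
    """Stride (round-robin) placement: element i -> (pe, local_offset)."""
    return i % n_pes, i // n_pes


def remote_fraction(indices, my_pe, owner):
    """Fraction of the accessed elements that live on another PE."""
    remote = sum(1 for i in indices if owner(i)[0] != my_pe)
    return remote / len(indices)


# Hypothetical configuration: a 16-element array spread over 4 PEs.
N_ELEMS, N_PES = 16, 4

# PE 1 works on the contiguous slice 4..7 of the array.
work = range(4, 8)
block_cost = remote_fraction(work, 1, lambda i: block_owner(i, N_ELEMS, N_PES))
cyclic_cost = remote_fraction(work, 1, lambda i: cyclic_owner(i, N_PES))
print(block_cost, cyclic_cost)  # prints "0.0 0.75"
```

For this access pattern the block placement keeps every reference local, while the round-robin placement sends three of the four references to other PEs. A different access pattern can reverse the result, which is why a single fixed distribution cannot suit every processing task.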
The need to efficiently move blocks of data between local and global memory becomes even more apparent when attempting to optimize performance through the use of cache memory. Spatial coherence, the tendency for successive references to access data in adjacent memory locations, plays a major role in determining cache performance. Poor spatial coherence may exist if a given data structure is accessed with a large stride (e.g., when accessing a two dimensional Fortran array, which is stored in column-major order, by rows) or in a random or sparse fashion (e.g., indirect accesses, irregular grids). To achieve good performance, data often must be rearranged from any of a multitude of large stride or sparse organizations to a unit stride organization. Furthermore, such a reorganization may require the shifting of data between remote and local memory.
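The stride effect described above can be made concrete with a minimal sketch. It models a column-major array as a flat Python list; the helper names (`column_major_offset`, `gather`) and the 4 x 3 array size are hypothetical, chosen for illustration only. Walking along a row touches offsets that are one column length apart, walking down a column touches adjacent offsets, and a gather step copies the scattered elements into a unit stride buffer.

```python
def column_major_offset(row, col, n_rows):
    """Linear offset of A(row, col) in a column-major array, 0-based indices."""
    return row + col * n_rows


def gather(memory, offsets):
    """Copy the elements at the given offsets into a new unit-stride buffer."""
    return [memory[k] for k in offsets]


N_ROWS, N_COLS = 4, 3
# Flat storage holding a 4x3 array in column-major order; each value encodes
# its (row, col) as 10*row + col so the result is easy to check.
memory = [10 * r + c for c in range(N_COLS) for r in range(N_ROWS)]

# Walking along row 1 touches offsets that are N_ROWS apart (large stride).
row_offsets = [column_major_offset(1, c, N_ROWS) for c in range(N_COLS)]
# Walking down column 1 touches adjacent offsets (unit stride).
col_offsets = [column_major_offset(r, 1, N_ROWS) for r in range(N_ROWS)]

# Gathering the row once yields a contiguous copy for later reuse.
row_copy = gather(memory, row_offsets)
# The same helper handles an irregular index list (indirect access).
sparse_copy = gather(memory, [7, 0, 10])

print(row_offsets, col_offsets)  # prints "[1, 5, 9] [4, 5, 6, 7]"
print(row_copy, sparse_copy)     # prints "[10, 11, 12] [31, 0, 22]"
```

In a distributed memory system the offsets in such a gather may refer to memory on other PEs, so the rearrangement also moves data between remote and local memory.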
Therefore, in addition to a flexible addressing approach to distributed data in an MPP system, there is a need for an efficient mapping of a global address to a local memory address within a PE. In particular, there is a need in the art for a mechanism which supports the above flexible addressing scheme and, at the same time, facilitates the reading and storing of data between local and global memory blocks in a massively parallel distributed memory processing system. The support mechanism should provide scatter-gather capabilities in addition to constant strides, to facilitate the reorganization of sparse or randomly organized data. The mechanism should also be easily directed by the user so that it can be adapted to different types of processing tasks.