In the past, one typical approach to speeding up a computer division-by-a-constant operation has been to replace the division operation with a reciprocal multiply operation for floating-point numbers. In such an environment, one can use a number of approaches to achieve good accuracy (e.g., one can use a couple of "Newton-Raphson iterations" to correct the result to a precision of within ±1 unit-in-the-last-place (ULP)). Floating-point precision is often measured in terms of ULP of the fraction, since the significance of that bit depends on the exponent. Floating-point results are also typically scaled and/or truncated automatically such that the maximum number of bits and accuracy are maintained at all times.
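The floating-point reciprocal-multiply approach described above can be illustrated with a minimal sketch. It is not taken from any particular prior-art system: the function names `reciprocal_newton` and `divide_by_constant`, the linear first guess, and the iteration count are all assumptions chosen for illustration, and the sketch handles positive divisors only.

```python
import math

def reciprocal_newton(d, iterations=4):
    """Approximate 1/d for d > 0 with Newton-Raphson: x <- x * (2 - m * x).

    Hypothetical helper for illustration; not from the source text.
    """
    if d <= 0:
        raise ValueError("this sketch handles positive divisors only")
    m, e = math.frexp(d)                   # d == m * 2**e with 0.5 <= m < 1
    x = 48.0 / 17.0 - (32.0 / 17.0) * m    # linear first guess for 1/m
    for _ in range(iterations):
        x = x * (2.0 - m * x)              # each step roughly squares the relative error
    return math.ldexp(x, -e)               # undo the scaling: 1/d == (1/m) * 2**-e

def divide_by_constant(n, recip):
    """Replace n / d by one multiply with a precomputed reciprocal of d."""
    return n * recip

# The reciprocal is computed once and reused for every division by the same constant.
recip_7 = reciprocal_newton(7.0)
quotient = divide_by_constant(100.0, recip_7)
```

Because the reciprocal is itself rounded, the product can differ from the correctly rounded quotient in the last place, which is the ±1 ULP behavior noted above.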
In integer division, however, situations often require a precision which does not allow the result to be off by one (results accurate to ±1 ULP are not sufficiently accurate). Also, scaling operations are not performed automatically. In integer division, there is no convenient representation of the reciprocal of an arbitrarily-sized integer; therefore, the computer designer must take scaling into account and must provide a method and apparatus which provide an exact result, since certain algorithms cannot afford a result which is off by one.
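The off-by-one problem and the role of explicit scaling can be seen in a short sketch. The code below is an illustration only, not the method of the present invention: it uses a widely known scaled-reciprocal construction (round the reciprocal up and use a scale of nbits + ceil(log2 d) bits), and the names `magic_for`, `div_by_const`, and `div_truncated_reciprocal` are hypothetical.

```python
def magic_for(d, nbits):
    """Return (m, shift) so that (n * m) >> shift == n // d for 0 <= n < 2**nbits.

    Assumes d >= 1. The reciprocal is scaled by 2**shift and rounded UP.
    """
    extra = (d - 1).bit_length()       # ceil(log2(d)) for d >= 1
    shift = nbits + extra
    m = -(-(1 << shift) // d)          # ceiling of 2**shift / d
    return m, shift

def div_by_const(n, m, shift):
    """Exact unsigned division by the constant that produced (m, shift)."""
    return (n * m) >> shift

def div_truncated_reciprocal(n, d, shift):
    """Naive variant: the scaled reciprocal is rounded DOWN, so results can be one too small."""
    return (n * ((1 << shift) // d)) >> shift

m3, s3 = magic_for(3, 16)
exact = div_by_const(3, m3, s3)                 # 1, matching 3 // 3
naive = div_truncated_reciprocal(3, 3, 16)      # 0, off by one
```

Python integers do not overflow, so the product `n * m` is always exact here; a fixed-width implementation would additionally have to ensure that the intermediate product fits in the available register width.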
One area which requires exact results from an integer divide is in the placement and addressing of elements in an array which is being processed by a massively parallel processor.
Massively parallel processing involves the utilization of many thousands of processing elements (PEs) linked together by high-speed interconnect networks. A distributed memory processing system is one wherein each processor has a favored low-latency, high-bandwidth path to a group of local memory banks, and a longer-latency, lower-bandwidth access path to the memory banks associated with other processors (remote or global memory) over the interconnect network. Even in shared-memory systems in which all memory is directly addressable by any processor in the system, data residing in a processor's local memory can be accessed by that processor much faster than can data residing in the memory local to another processor. This significant difference in performance between access to local memory and access to remote memory prompts the performance-conscious programmer to strive to place any data to be accessed by a processor over the course of a program into local memory.
The need to efficiently move blocks of data between local and remote or global memory becomes even more apparent when attempting performance optimization using cache memory. Spatial coherence, i.e., the tendency for successive references to access data in adjacent memory locations, plays a major role in determining cache performance. Poor spatial coherence may exist if the access sequence to a data structure is accomplished via a large stride (e.g., when accessing a two-dimensional Fortran array by rows) or in a random or sparse fashion (e.g., indirect accesses or irregular grids). To achieve good performance, data often must be rearranged from a multitude of different large-stride or sparse organizations, each dependent on the task to be performed, into a unit-stride organization, in addition to being moved between remote and local memory.
There is a need in the art for a mechanism which supports a flexible addressing scheme and facilitates the redistribution of data between local- and global-memory blocks in a massively parallel, distributed-memory processing system. The addressing support mechanism should allow scatter-gather capabilities in addition to constant-stride capabilities in order to facilitate reorganization of sparse or randomly organized data. The mechanism should also be easily directed by the user for adaptation to different types of processing tasks.
In particular, there is a need in the art to remove power-of-two restrictions from the placement of data arrays across various PEs in an MPP system while retaining fast address calculation. For example, it is relatively easy and efficient to distribute the data of a 16-by-32-by-64 array across a three-dimensional torus MPP because each dimension is an integer power of two, but relatively difficult and/or inefficient to distribute the data of a 17-by-33-by-65 array across such an MPP (to do so, the computer scientist often resorts to an array at the next larger power-of-two in each dimension, i.e., a 32-by-64-by-128 array, which wastes memory space).
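The cost of the power-of-two restriction can be made concrete with a small sketch. It assumes a simple block distribution in which each PE holds a contiguous run of elements; that layout, and the names `locate_pow2`, `locate_general`, and `padding_overhead`, are assumptions made for illustration rather than details taken from any described system.

```python
import math

def locate_pow2(index, log2_block):
    """PE number and offset when each PE holds 2**log2_block elements (shift and mask only)."""
    pe = index >> log2_block
    offset = index & ((1 << log2_block) - 1)
    return pe, offset

def locate_general(index, block):
    """PE number and offset for an arbitrary block size (needs a true divide and remainder)."""
    return divmod(index, block)

def padding_overhead(shape):
    """Pad every dimension to the next power of two; return the padded shape and storage ratio."""
    padded = tuple(1 << (n - 1).bit_length() for n in shape)
    return padded, math.prod(padded) / math.prod(shape)

padded, ratio = padding_overhead((17, 33, 65))
# padded == (32, 64, 128); 262144 allocated elements for 36465 used, a ratio of about 7.19
```

In this illustration, padding the 17-by-33-by-65 array allocates roughly seven times the storage actually used, while avoiding the padding means every address calculation goes through `locate_general` and therefore through an exact integer divide.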
Patent application Ser. No. 08/165,118 filed Dec. 10, 1992, now U.S. Pat. No. 5,765,181, assigned to the assignee of the present invention and incorporated herein by reference, describes hardware and a process which provide a hardware address centrifuge to facilitate the reorganization and redistribution of data between remote and local memory blocks in a massively parallel distributed-memory processing system. In order to operate efficiently, however, data arrays must be placed on power-of-two boundaries. That allows one to calculate PE number and offset by simple bit manipulation. In one such embodiment of that invention, the bits comprising an index or address into a vector array are separated into two sets of bits, a first set comprising the PE number, and a second set comprising an offset into a portion of the memory of a PE. In order to spread the references, the bits of the first set and the bits of the second set are interspersed within the array index. The address centrifuge is used to separate the two sets of bits and to "squeeze" out the spaces between the separated bits, thus resulting in a PE number and an offset.
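The bit separation performed by an address centrifuge can be modeled in software as follows. This is a behavioral sketch only, under the assumption that a mask marks which index bits belong to the PE number; the function name `centrifuge`, its parameters, and the bit-serial loop are hypothetical and do not describe the referenced hardware.

```python
def centrifuge(index, pe_mask, width):
    """Split the low `width` bits of `index` into (pe_number, offset).

    Bits where pe_mask is 1 are packed, in order, into the PE number;
    the remaining bits are packed, in order, into the offset.
    """
    pe = offset = 0
    pe_pos = off_pos = 0
    for bit in range(width):
        value = (index >> bit) & 1
        if (pe_mask >> bit) & 1:
            pe |= value << pe_pos          # next free bit of the PE number
            pe_pos += 1
        else:
            offset |= value << off_pos     # next free bit of the offset
            off_pos += 1
    return pe, offset

# Index bits 1, 3 and 5 select the PE; bits 0, 2 and 4 form the offset.
pe, offset = centrifuge(0b101101, 0b101010, 6)   # pe == 0b110, offset == 0b011
```

The model makes the restriction visible: each of the two fields is a whole number of bits, so the number of PEs and the per-PE block size addressed this way are both powers of two.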
None of the prior art provides a convenient and fast way to provide a divide-by-a-constant. None of the prior art provides a convenient and fast way to eliminate the power-of-two restriction on array addresses being processed by a plurality of processors.