A conventional multiprocessor system is shown schematically, for example, in FIG. 1 of the accompanying drawings. In transfer).
We have appreciated that a problem with the above system is caused by the manner in which the system hardware supports software-organised data. Computer memory hardware is normally linearly addressed as a one-dimensional array of words. Computer software, on the other hand, requires a variety of data structures, of which the most common is the data array. The problem arises from the way in which a multi-dimensional data array maps onto the memory hardware. Items which are logically closely related in the software may be scattered throughout the linearly addressed memory hardware. FIG. 2 illustrates the problem in the case of a three-dimensional data array. A row of data in the X direction corresponds conveniently to a contiguous block in the memory hardware, whereas a column of data in the Z direction is physically scattered throughout the entire memory. This effect can cause major delays within a parallel computer system because, if a processor requires non-local data in any direction other than the X row direction, the whole data array must be fetched to the local memory of the processor, which results in far more data being transferred than necessary. Alternatively, each element of the desired subset may be accessed individually, creating many single-word transfers which the hardware will not execute efficiently.
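By way of illustration only, the mapping described above may be sketched as follows. The sketch assumes that X varies fastest in memory, as the text implies, and further assumes (this is not stated in the text) that Y varies next and Z slowest. The function names and the 4 x 4 x 4 array size are hypothetical.

```python
def linear_address(x, y, z, nx, ny):
    """Word offset of element (x, y, z), with X varying fastest, then Y, then Z."""
    return x + nx * (y + ny * z)


def addresses_along(axis, fixed, dims):
    """Offsets visited when walking one full line of the array along `axis`."""
    nx, ny, nz = dims
    x0, y0, z0 = fixed
    if axis == "x":
        return [linear_address(x, y0, z0, nx, ny) for x in range(nx)]
    if axis == "y":
        return [linear_address(x0, y, z0, nx, ny) for y in range(ny)]
    if axis == "z":
        return [linear_address(x0, y0, z, nx, ny) for z in range(nz)]
    raise ValueError(f"unknown axis: {axis!r}")


DIMS = (4, 4, 4)  # hypothetical 4 x 4 x 4 array of words
x_row = addresses_along("x", (0, 0, 0), DIMS)
z_column = addresses_along("z", (0, 0, 0), DIMS)
print("X row offsets:   ", x_row)     # consecutive words
print("Z column offsets:", z_column)  # one word every nx * ny words
```

Under these assumptions the X row occupies four consecutive words, whereas the four elements of the Z column lie sixteen words apart, so that fetching the column element by element requires four separate single-word transfers.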
We have appreciated that another effect which leads to inefficiency in conventional systems takes place when several processors each require different but partially overlapping blocks of data. FIG. 3 illustrates a simplified example where four processors each require approximately half of the available data. Some areas of the source data are required by all four processors, whereas other areas are required by two processors or by just one processor. In a conventional parallel processor system the whole of the data has to be transferred to each of the local memories of the processors, which wastes memory space. Alternatively, the desired data blocks have to be unpacked, i.e. the data is arranged as separate blocks for each processor, and sent individually to each of the processors in turn, which wastes processor time and communication bandwidth.
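The cost of the two conventional alternatives may be estimated with a short sketch. The layout below is hypothetical and is not taken from FIG. 3: a 6 x 6 source array is assumed, with each of four processors requiring a 4 x 4 window anchored at one corner, which is a little under half of the data. The names `WINDOWS` and `cells` are likewise illustrative.

```python
from collections import Counter

ROWS, COLS = 6, 6  # hypothetical source array size, in words

# Hypothetical requirement of each processor:
# (row_start, row_stop, col_start, col_stop), stop values exclusive.
WINDOWS = {
    "P0": (0, 4, 0, 4),
    "P1": (0, 4, 2, 6),
    "P2": (2, 6, 0, 4),
    "P3": (2, 6, 2, 6),
}


def cells(window):
    """Set of (row, col) positions covered by one processor's window."""
    r0, r1, c0, c1 = window
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


# demand[(r, c)] = number of processors that require that word.
demand = Counter()
for window in WINDOWS.values():
    demand.update(cells(window))

# sharing[k] = number of words required by exactly k processors.
sharing = Counter(demand.values())

whole_array_words = ROWS * COLS * len(WINDOWS)  # whole array sent to every processor
unpacked_words = sum(demand.values())           # separate block sent to each processor
unique_words = len(demand)                      # distinct words actually required

print(dict(sharing), whole_array_words, unpacked_words, unique_words)
```

In this assumed layout 4 words are required by all four processors, 16 by two and 16 by one. Sending the whole array to every processor moves 144 words, and sending separately unpacked blocks moves 64 words, although only 36 distinct words exist in the source.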
In the above examples it has been assumed that the data source is a single memory. The situation becomes more complex when the source data array is mapped across the local memories of a number of processors and has to be re-distributed around the processors. This situation frequently arises from one algorithm stage to the next, where the result data mapping from the first stage is not the same as the input data mapping required for the second. FIG. 4 illustrates this situation, where four processors hold data in their local memories in row format (FIG. 4a) and the next algorithm stage requires the data in column format (FIG. 4b). The resulting data movements are quite complex even in this example of just four processors (FIG. 4c). In the above case a conventional system has to unpack and re-pack the data between the full array format and the sixteen sub-arrays illustrated. This is inefficient and wastes processor time.
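The row-to-column re-distribution may be sketched as follows, purely as an illustration of the unpacking and re-packing a conventional system must perform. An 8 x 8 array shared equally between four processors is assumed; the array size, the dictionary-based model of the local memories and the function name `rows_to_columns` are all hypothetical.

```python
N = 8        # hypothetical array is N x N words
P = 4        # number of processors
B = N // P   # rows (or columns) held per processor

array = [[r * N + c for c in range(N)] for r in range(N)]

# Row format: processor p holds rows p*B .. p*B+B-1, each N words long.
row_layout = {p: array[p * B:(p + 1) * B] for p in range(P)}


def rows_to_columns(row_layout):
    """Re-pack the row format into column format, one sub-array at a time.

    Returns the column layout (processor q holds, for every row of the
    array, the B words in columns q*B .. q*B+B-1) and the list of
    (source, destination) sub-array moves that were needed.
    """
    col_layout = {q: [None] * N for q in range(P)}
    moves = []
    for src in range(P):
        for dst in range(P):
            for i in range(B):
                # Unpack the B words of this local row that fall in dst's columns.
                piece = row_layout[src][i][dst * B:(dst + 1) * B]
                col_layout[dst][src * B + i] = piece
            moves.append((src, dst))
    return col_layout, moves


col_layout, moves = rows_to_columns(row_layout)
remote_moves = [m for m in moves if m[0] != m[1]]
print(len(moves), "sub-arrays,", len(remote_moves), "cross between processors")
```

With four processors the sketch handles sixteen sub-arrays, of which twelve pass from one processor to another and four remain where they are; each one has to be sliced out of the row format and re-assembled in the column format.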
An example of a prior art system is known from Japanese patent application no. 57-163924 to Hitachi, laid open on 28 Mar. 1984 under the number 59-53964. In the system of that application a block of data can be sent from a control processor to several other processors simultaneously, under severe restrictions on the organisation of the processors. The system does not facilitate efficient re-organisation of data between the processors, and it does not overcome all of the problems described above.