Generally, in a parallel computer, a program is divided into a plurality of unit process programs (also called tasks), and each element processor is assigned a part of the unit process programs and executes them in parallel with the other element processors.
One of the fundamental subjects in constructing a parallel computer is how to distribute and share data. The distribution of data means that the data is divided and allocated among physically distributed memory modules. The sharing of data means that data which is physically distributed but logically the same is accessed by a plurality of unit process programs.
Owing to recent developments in semiconductor technology, a large-scale parallel computer (with several hundred to several thousand element processors), heretofore not considered attainable, is now technically possible. With the advent of large-scale parallel computers, a parallel computer having a distributed memory architecture has been desired. This is because a tightly coupled parallel computer with a centralized memory, in which element processors physically share a memory and read data from and write data to the memory via buses or switches, has the following problems: memory access speed becomes low because data is transferred by way of bulky buses and switches, and access contention occurs due to bank conflicts.
A first subject is how to assign data or programs to each element processor of a distributed parallel computer. According to a prior art (hereinafter called a first prior art) disclosed in "The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture, Proc. of the 1985 ICPP (1985), by G. F. Pfister, et al.", each element processor has a local memory to distributively store the data or programs to be processed. Distributed memories are generally addressed using either the local addressing scheme or the global addressing scheme. With the former, each element processor can read only its own local memory, while with the latter, each element processor can read the local memory of any desired element processor. The first prior art can adopt both schemes. Shared data, i.e., data accessible by a plurality of element processors, is located in a space (global space) accessible using the global addressing scheme. To access data in the global space, the global address must be translated into an address of a local memory. To this end, the first prior art uses an interleaving translation, by which consecutive data words in the global address space are distributed sequentially and cyclically over the spaces of the element processors by using the lower bits of the address code as the element processor number.
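The interleaving translation described above can be illustrated with a minimal sketch. The number of element processors (8, i.e., 3 address bits), the word-granular addressing, and the function names are assumptions made only for illustration; they are not taken from the first prior art.

```python
from typing import Tuple

# Hypothetical configuration: 2**3 = 8 element processors (an assumption).
NUM_PE_BITS = 3
NUM_PES = 1 << NUM_PE_BITS


def translate(global_addr: int) -> Tuple[int, int]:
    """Split a global word address into (element processor number, local address)."""
    pe = global_addr & (NUM_PES - 1)    # lower bits select the element processor
    local = global_addr >> NUM_PE_BITS  # remaining upper bits index its local memory
    return pe, local


def untranslate(pe: int, local: int) -> int:
    """Inverse mapping: rebuild the global word address."""
    return (local << NUM_PE_BITS) | pe


if __name__ == "__main__":
    # Consecutive global words are spread cyclically over the element processors.
    for addr in range(10):
        print(addr, translate(addr))
```

Under these assumptions, consecutive global words 0 to 7 land on element processors 0 to 7 at local address 0, and word 8 wraps around to element processor 0 at local address 1.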
According to the global addressing scheme, each address determines one physical location. Thus, sharing data means sharing a physical entity. With distributed memories, the distance (access time) from each element processor to a given physical location differs. Therefore, if the element processor to which data is assigned differs from the element processor to which the program accessing the data is assigned, the processing performance deteriorates considerably. This problem can be solved by providing a cache memory near the processor for temporarily storing data. However, this method is rarely adopted because of the so-called cache coherency problem, namely that parallel read/write control of shared data held in a plurality of cache memories is generally difficult. Specifically, if data is changed by one element processor while another element processor holds the same data in its cache memory, the contents of the data at the two element processors become different from and contradictory to each other. To avoid this problem, each time an element processor changes data, the change is broadcast to all the other element processors so as to invalidate any copies in their cache memories that hold the value before the change. Such broadcast and invalidation processing takes a considerably long time in a large-scale parallel computer. Consequently, the first prior art basically prohibits the storage of shared data in a cache memory.
There is known another prior art (hereinafter called a second prior art) for holding data near the element processor accessing the data, as described in "D. D. Gajski, et al: Cedar, COMPCON '84 Spring, pp. 306 to 309, 1984". According to the second prior art, there are provided a centralized shared memory accessible by all element processors and a local memory provided for each element processor. Before an element processor executes a unit process program, the data required by the program is transferred from the shared memory to the local memory of the element processor. The results obtained through the execution of the program are then stored back in the shared memory. A unit process program is assumed to consist of a series of instructions of about one iteration loop procedure.
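The copy-in, execute, copy-out flow of the second prior art can be sketched as follows. This is only an illustrative model under stated assumptions: the shared memory is represented as a dictionary, and the names run_unit_process, needed_keys and body are hypothetical and do not appear in the cited reference.

```python
from typing import Callable, Dict, Iterable


def run_unit_process(
    shared: Dict[str, int],
    needed_keys: Iterable[str],
    body: Callable[[Dict[str, int]], Dict[str, int]],
) -> int:
    """Run one unit process program and return the number of words transferred."""
    # Copy-in: transfer the required data from shared memory to local memory.
    local = {key: shared[key] for key in needed_keys}
    # Execute: the unit process program touches only its local memory.
    results = body(local)
    # Copy-out: store the results back in the shared memory.
    shared.update(results)
    return len(local) + len(results)


if __name__ == "__main__":
    shared_memory = {"a": 1, "b": 2}
    moved = run_unit_process(
        shared_memory, ["a", "b"], lambda loc: {"c": loc["a"] + loc["b"]}
    )
    print(shared_memory, moved)
```

In this model every unit process program pays for two transfer phases, one before and one after execution, in addition to the execution itself.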
The problems associated with the first prior art are as follows:
(1) The division and mapping of data (i.e., assignment of each part of the divided data to some element processor) is performed mechanically, so that the logic of the parallel processing is not necessarily reflected. Therefore, it often happens that the data required by a unit process program assigned to one element processor is assigned to another element processor. In such a case, the memory of the other element processor must be read through the network, resulting in a non-negligible delay.
(2) Address space becomes insufficient. Assuming that all element processors share a single address space, the real space can come to exceed the virtual space as the total number of element processors increases. For example, a system made up of 512 element processors, each having a 4 MB real memory, has a total capacity of 2 GB, which equals the entire space that can be designated by the 31-bit addressing scheme adopted by typical present-day general-purpose large computers. Therefore, the number of element processors cannot be increased further unless the addressing scheme is changed. In view of this, even the first prior art considers using local space. However, it does not disclose a method of sharing data while using local space as much as possible.
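The arithmetic in item (2) can be checked with a short sketch. It assumes byte addressing and that 1 MB means 2^20 bytes; the function name address_bits_needed is hypothetical.

```python
MB = 1 << 20  # assumption: 1 MB = 2**20 bytes


def address_bits_needed(num_pes: int, bytes_per_pe: int) -> int:
    """Number of address bits needed to cover all real memory in one flat space."""
    total_bytes = num_pes * bytes_per_pe
    return (total_bytes - 1).bit_length()


if __name__ == "__main__":
    print(address_bits_needed(512, 4 * MB))   # 512 x 4 MB = 2 GB
    print(address_bits_needed(1024, 4 * MB))  # doubling the processor count
```

Under these assumptions, 512 element processors with 4 MB each need exactly 31 address bits, so doubling the number of element processors to 1024 would already require a 32nd bit.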
The following four problems are found in the second prior art.
(1) It takes time to transfer data between a shared memory and each local memory.
(2) The data shared by a plurality of unit process programs executed in parallel by different element processors must be located in the shared memory; otherwise, a specific synchronization mechanism is required.
(3) Address space becomes insufficient, as in the first prior art.
(4) Since the execution order of the plurality of unit process programs executed by each element processor is controlled by a single control processor, the control load is concentrated on the control processor, raising the risk of performance deterioration.