The present invention relates to a data transfer method in a computer system having a data transfer network and plural nodes each having at least one processor, and to a computer system suitable therefor.
In a parallel computer system, speeding up interprocessor data transfer leads to increased speed of the entire system. There are two factors which determine the performance of the interprocessor data transfer: (1) the data transfer rate and (2) the transfer latency. The transfer latency is the overhead of the hardware and software processing needed to start the data transfer. When a large amount of data is to be transferred at a time, the data can be transferred at high speed by improving the data transfer rate. However, when repeated transfer of short data is required, the performance is not improved unless the transfer latency is reduced, even if the data transfer rate is high.
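The relationship between the two factors can be illustrated with a simple transfer-time model. The latency and bandwidth figures below are hypothetical, chosen only for illustration:

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Total transfer time = fixed startup latency + size / bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

# Hypothetical link: 10 microseconds of startup latency, 100 MB/s bandwidth.
LAT, BW = 10e-6, 100e6

# For a large 10 MB message, the bandwidth term dominates:
large = transfer_time(10e6, LAT, BW)   # ~0.1 s; latency is negligible
# For a short 100-byte message, the latency term dominates:
short = transfer_time(100, LAT, BW)    # ~11 microseconds, almost all of it latency
```

For the short message, improving the bandwidth barely changes the total time; only reducing the latency helps, which is the point made above.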
While the data transfer rate is determined by physical factors such as the data transfer bandwidth, the transfer latency mainly depends on the transfer method. Thus, for transfer of short data, the transfer method becomes important.
The prior art parallel computer system generally adopts a data transfer system called the SEND/RECEIVE type. In SEND/RECEIVE type data transfer, when a data send processor executes a data send request instruction (SEND), the send data is transferred to a destination node. A processor of the destination node accepts the send data by executing a data receive request instruction (RECEIVE). The SEND instruction designates the area (send area) from which the send data is read, and the RECEIVE instruction designates the area (receive area) into which the send data is stored. Generally, a starting logical address and the send data size are used to designate each of the areas. SEND/RECEIVE type transfer has the following two meritorious features, and the method is widely used from workstation clusters to massively parallel computer systems. (1) It is easy to describe the program (asynchronous transfer), because the send node can execute the SEND instruction without depending on the timing of execution of the RECEIVE instruction at the destination node. (2) Until the receive node designates the receive area by a RECEIVE instruction, data is not stored into the receive area of the receive node. So, the possibility is low that the data at the destination node is destroyed by a bug in a program at the send node.
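These semantics can be sketched with a minimal, single-process toy model, in which a queue stands in for the interprocessor network. The model is an illustration of the SEND/RECEIVE principle, not any particular system described herein:

```python
from collections import deque

class Node:
    """Toy node: a flat 'memory' plus an incoming message queue."""
    def __init__(self, size):
        self.memory = [0] * size
        self.inbox = deque()   # stands in for the interprocessor network

def send(src_node, dst_node, send_addr, size):
    """SEND: read `size` words from the send area and ship them to the
    destination. The sender does not wait for the receiver (asynchronous)."""
    data = src_node.memory[send_addr:send_addr + size]
    dst_node.inbox.append(data)

def receive(node, recv_addr):
    """RECEIVE: designate the receive area; only now is the queued send
    data stored into the receive node's memory."""
    data = node.inbox.popleft()
    node.memory[recv_addr:recv_addr + len(data)] = data

a, b = Node(16), Node(16)
a.memory[0:4] = [1, 2, 3, 4]
send(a, b, 0, 4)    # the sender proceeds without waiting
receive(b, 8)       # the receiver later designates where the data goes
```

Note that until `receive` runs, the data sits in the queue rather than in node B's memory, which reflects meritorious feature (2) above.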
However, there is a problem in SEND/RECEIVE type transfer in that data sent from the send node cannot be stored into the receive area until the destination node executes a RECEIVE instruction. Therefore, SEND/RECEIVE type transfer generally requires buffering the send data once in the destination node.
A simple control method of this buffering always buffers the send data at the destination node and copies, from the buffer onto the receive area, the data for which a RECEIVE instruction has been executed. However, the performance of this method is not so high, because a memory copy occurs for each data transfer.
Japanese Laid Open Patent Application No. HEI 6-324998 addresses this problem by buffering only the data which arrives at the destination node before the RECEIVE instruction is executed, and storing directly in the receive area the data which arrives at the destination node after the RECEIVE instruction is executed. As a result, memory copies are reduced and higher performance is expected. In the parallel computer system of Japanese Laid Open Patent Application No. HEI 6-324998, when the destination node receives the send data, it should check whether the RECEIVE instruction corresponding to the send data has already been issued, search for the information designated by the RECEIVE instruction when the instruction has already been issued, and store the send data in the receive area designated by that information. Thus, in SEND/RECEIVE type transfer, the hardware/software processing needed to achieve asynchronous transfer is abundant, and the transfer latency does not become small.
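The arrival-order handling just described can be sketched as follows. This is a hypothetical model: the tag names and the two dictionaries (`pending_receives`, `early_buffer`) are illustrative choices, not structures taken from the cited application:

```python
class Receiver:
    """Buffer a message only when it arrives before the matching RECEIVE;
    store it directly into the receive area when the RECEIVE came first."""
    def __init__(self, mem_size):
        self.memory = [0] * mem_size
        self.pending_receives = {}  # tag -> receive address (RECEIVE issued, data not yet here)
        self.early_buffer = {}      # tag -> buffered data (data here, RECEIVE not yet issued)

    def on_arrival(self, tag, data):
        if tag in self.pending_receives:        # RECEIVE already issued:
            addr = self.pending_receives.pop(tag)
            self.memory[addr:addr + len(data)] = data  # direct store, no later copy
        else:                                   # early arrival: buffer once
            self.early_buffer[tag] = data

    def on_receive(self, tag, addr):
        if tag in self.early_buffer:            # data arrived first: one memory copy
            data = self.early_buffer.pop(tag)
            self.memory[addr:addr + len(data)] = data
        else:                                   # RECEIVE came first: remember the area
            self.pending_receives[tag] = addr

r = Receiver(16)
r.on_arrival("m1", [9, 9])    # early arrival, so it is buffered
r.on_receive("m1", 0)         # then copied out of the buffer
r.on_receive("m2", 4)         # RECEIVE issued first
r.on_arrival("m2", [7, 7])    # stored directly into the receive area
```

The second message avoids the copy entirely, but every arrival still pays for the matching check and lookup, which is the latency cost noted above.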
PUT type data transfer is provided as a solution to this problem. In PUT type data transfer, the data send request instruction (PUT) executed by the send processor designates not only the send area but also the receive area. It generally uses a starting logical address and the send data size to designate each area. After the transfer starts, the send data is stored into the receive area unconditionally. So, there is no need for the destination node to issue a data receive request instruction, and no buffer for holding the send data is necessary either. In PUT type data transfer, there are problems in that (1) the user must assure that the send node executes the PUT instruction after the destination node has entered a state ready to accept the send data (synchronous transfer), and (2) the data of the destination node is easily destroyed when there is a bug in the program of the send node. However, it is possible to reduce the transfer latency, because the processing required of the destination node is only to store the data in the receive area according to the header information of the send data. Therefore, higher performance than from the SEND/RECEIVE type is expected for transfer of short data.
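PUT semantics can be sketched in the same toy style. In this hypothetical model, the header generated by the PUT instruction carries the receive address, so the destination's only work is to store the payload:

```python
class Node:
    """Toy node with a flat 'memory'; no inbox queue and no RECEIVE needed."""
    def __init__(self, size):
        self.memory = [0] * size

def put(src, dst, send_addr, recv_addr, size):
    """PUT: the sender designates both the send area and the receive area.
    The header carries the receive address to the destination."""
    payload = src.memory[send_addr:send_addr + size]
    header = {"recv_addr": recv_addr}
    deliver(dst, header, payload)

def deliver(node, header, payload):
    """All the destination must do: store unconditionally per the header."""
    addr = header["recv_addr"]
    node.memory[addr:addr + len(payload)] = payload

a, b = Node(16), Node(16)
a.memory[0:3] = [5, 6, 7]
put(a, b, 0, 4, 3)   # data lands directly at address 4 of node B
```

The unconditional store in `deliver` is both the source of the low latency and the reason a send-side bug can overwrite arbitrary destination data.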
"Architectural support in the PUT/GET Interface for Parallelizing Compiler and Parallel Programs," Proceedings of Parallel processing symposium JSPP'94, pp.233-240, May, 1994 discloses a method of executing PUT type data transfer. In this PUT instruction, both the send area and the receive area are designated by virtual addresses. Here, we consider a case where the computers using the virtual address spaces execute the interprocessor data transfer. It is likely to occur that either the send area or the receive area does not exist on the main storage but on the external storage device such as hard disks, when virtual address spaces which exceed the real size of the main storage are employed or when the memory capacity required by plural processes which concurrently run on the send node or the receive node exceeds the size of the main memory. It is easy to realize with software/hardware that the send node does not start a transfer operation when the send area exists on the external storage device even partially, and starts the transfer after all the data in the send area has been loaded into the main storage. As a result, it is possible to send out all the send data without interrupting the sending operation.
However, the destination node cannot execute the receive operation when the data arrives at the destination node if all or part of the receive area exists in the external storage device. As a result, the send data is held up in the interprocessor network, which causes congestion in the network and becomes a problem for system performance. Japanese Laid Open Patent Application No. HEI 6-110845 discloses means (real address fixation) by which an operating system (OS) guarantees that the receive area is sure to exist on the main storage. As a result, it is possible for the receive node to receive the send data without interrupting the transfer, which can prevent the congestion in the network.
However, in PUT type data transfer, the destination node cannot judge where the receive area is until the send data arrives at the node. Therefore, it is necessary to keep the whole of a data area in the real address fixation state if the area can possibly receive large data. For instance, if an area can possibly receive matrix data and it is difficult to judge beforehand to which row the send data to be received belongs, it is necessary to keep all of the data receive area in the real address fixation state, even if the send data includes only one row of the matrix. This remarkably reduces the flexibility of virtual memory management.
Technology which does not keep a data area in the real address fixation state, even if the area has the possibility of receiving data, is disclosed in United Kingdom Patent No. 2,271,006, which corresponds to Japanese Laid Open Patent Application No. HEI 4-291660 and to U.S. Ser. No. 08/126,088, or in Hamanaka et al., U.S. Ser. No. 07/853,427, now U.S. Pat. No. 5,386,566 corresponding thereto. According to this prior art, the send data is buffered in a receive buffer controlled by an OS when the data receive area which should store the send data is swapped out and does not exist on the main storage, and hardware generates an interruption after the buffering ends. The OS interrupts execution of the program being executed in response to the interruption, and executes the interruption processing prepared in the OS. The receive area for this data is paged in during this processing, and the send data held in the above-mentioned receive buffer is transferred to the data receive area as soon as the paging is over.
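The mechanism can be sketched with the following hypothetical model. A boolean `resident` flag stands in for whether the receive page is on the main storage, and `interrupt_handler` stands in for the OS interruption processing; none of these names are taken from the cited documents:

```python
class Page:
    def __init__(self):
        self.resident = False      # the page starts out swapped out
        self.data = [0] * 8

class ReceiveSide:
    def __init__(self):
        self.page = Page()
        self.os_buffer = []        # receive buffer controlled by the OS
        self.interrupts = 0

    def on_arrival(self, offset, data):
        if self.page.resident:     # page on main storage: store directly
            self.page.data[offset:offset + len(data)] = data
        else:                      # page swapped out: buffer, then interrupt
            self.os_buffer.append((offset, data))
            self.interrupt_handler()

    def interrupt_handler(self):
        """OS interruption processing: page in, then drain the buffer."""
        self.interrupts += 1
        self.page.resident = True  # stands in for paging in from external storage
        for offset, data in self.os_buffer:
            self.page.data[offset:offset + len(data)] = data
        self.os_buffer.clear()

r = ReceiveSide()
r.on_arrival(0, [1, 2])   # swapped out: buffered, interrupt raised, paged in, copied
r.on_arrival(2, [3, 4])   # now resident: stored directly, no interrupt
```

The model makes the cost visible: every arrival that hits a swapped-out page interrupts the running program, which is exactly the drawback discussed below.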
Besides the technology concerning swapping of the above-mentioned receive area, basic programming which uses PUT type transfer is described next. In a system which executes PUT type data transfer, in order to allow the processor of the destination node to read data from the area into which the data will be stored by the send processor, it is necessary to realize synchronization between the program of the send node and the program of the destination node and to guarantee the end of the storing of the data. Otherwise, a mismatch will occur, such as the send node processor storing data into the receive area while the receive node processor is reading the receive area.
For instance, when the content of data M in node A and the content of data N in node B are to be exchanged, the processors of nodes A and B execute the programs shown in FIG. 21. Each node A or B executes the instruction 3000A or 3000B to copy data M or N, respectively. The reason for this is that the PUT instructions 3020A and 3020B store data directly in the data M and N for the swapping operation. It is necessary first to secure the area into which data is to be stored, and thereafter to execute the instructions 3010A and 3010B which realize the barrier synchronization between nodes A and B. Otherwise, there is a possibility that the other node executes the PUT instruction 3020B (or 3020A) and stores the data before the area which should receive the data stored by the PUT instruction 3020A (or 3020B) is secured. It is often impossible to secure beforehand an area into which data is to be stored, especially when a large amount of data is to be transferred, so it is necessary to secure the area every time PUT transfer is executed, immediately before that transfer, and to realize synchronization. Next, each node executes the PUT instruction 3020A or 3020B. It is necessary for each node to execute the instruction 3030A or 3030B to realize the barrier synchronization again before each node reads the data stored by the other node. The reason for this is that there is no means for each node to know whether the storing of data by the PUT instruction issued by the other node has ended. Both nodes can read the stored data (exchanged data) after realization of the synchronization, because the completion of the PUT instructions by both nodes is then guaranteed.
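The exchange sequence above can be sketched with two threads. In this hypothetical model, `threading.Barrier` stands in for the barrier synchronization instructions, and a direct slice assignment into the other node's array stands in for the PUT instruction; it is an illustration of the ordering, not of FIG. 21 itself:

```python
import threading

M = [1, 2, 3]    # data M at node A
N = [4, 5, 6]    # data N at node B
barrier = threading.Barrier(2)

def node(my_data, other_data):
    work = list(my_data)   # 3000: copy own data, since PUT will overwrite it
    barrier.wait()         # 3010: both copies done, areas secured on both sides
    other_data[:] = work   # 3020: PUT the copy into the other node's area
    barrier.wait()         # 3030: both PUTs complete; now safe to read my_data

ta = threading.Thread(target=node, args=(M, N))
tb = threading.Thread(target=node, args=(N, M))
ta.start(); tb.start()
ta.join(); tb.join()
# M and N have exchanged contents
```

Removing either `barrier.wait()` reintroduces exactly the mismatches described above: without the first, one node's PUT may land before the other has copied its data away; without the second, a node may read before the incoming PUT has finished.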
In the method disclosed by the above-mentioned Japanese Laid Open Patent Application No. HEI 4-291660 and U.S. Ser. No. 07/853,427 corresponding thereto, the CPU is interrupted every time it receives data which should be stored in a swapped-out data area, and it must execute the interruption processing. There is a problem in that execution of the programs being executed at that time is interrupted. In particular, it is necessary to access the external storage device to swap in the swapped-out pages, and the interruption of the programs under execution grows longer due to the swapping in. It can also occur that, at the time of the interruption, the programs being executed are not yet in a state to use the received data.
Therefore, the interruption of programs under execution which are in such a state is not preferable with regard to the execution efficiency of the programs.
Furthermore, it can happen that the page which holds the received data will be swapped out again, if the program being executed does not access the page for a while after the page has been swapped in. When the swapping out occurs again, the swapping in executed immediately after the receipt of the data becomes useless.
Moreover, in prior art PUT type data transfer, it is necessary to frequently execute interprocessor synchronization which has large overhead, such as barrier synchronization. When the interprocessor synchronization is executed frequently for PUT type data transfer, it is difficult to make the best use of the merit of low overhead of the PUT type data transfer.