The present invention relates to an interprocessor communication system to realize a virtual memory for a parallel computer system, in which a plurality of processors is interconnected through a network.
Interprocessor communication is imperative for parallel processing, in which plural processors process a single job. Here, interprocessor communication refers to a data exchange between plural processors. Interprocessor communication arises in the course of parallel processing to ensure proper cooperation between the processors. For this reason, interprocessor communication is one of the overheads of parallel processing. Accordingly, high-speed interprocessor communication is required to realize the benefit of parallel processing.
When plural users handle a computer system made up of parallel processors, it has been common to employ a so-called space-sharing method. In this method, the computer system is spatially divided so that processors are allotted individually to each user, and each user's job runs exclusively on the processors allotted to that user.
In this method, however, there is a limitation that the total number of processors which execute each user's job cannot surpass the number of the physical processors.
As one solution to overcome the above-described limitation and thereby to allow plural users to efficiently handle the computer system, the time-sharing method has been employed to time-share each of the parallel processors in the computer system.
To carry out time-sharing, however, support for a virtual memory is essential. Here, the virtual memory refers to an external storage medium such as a hard disk to save the memory images of the jobs allocated to the plural users.
The reason a virtual memory is required is as follows:
Improvements in processor performance have brought about a striking increase in the size of the data that a processor can handle in scientific and engineering calculations. In many cases, the data necessary for parallel processing is distributed by first transmitting the data from a hard disk or a host machine to one of the processors (referred to as a first processor below), subsequently dividing the data into data sections in the first processor and distributing the data sections individually to the other processors. While the main memory of each processor other than the first processor stores only a divided data section, the main memory of the first processor is likely to overflow if the data loaded from the hard disk or the host machine is too large. To solve this problem, it is imperative to support (to realize in software) a virtual storage capable of handling data larger than the capacity of the real main memory, using an external storage medium such as a hard disk.
In supporting a virtual memory in a parallel computer system, a problem encountered has been that, as the interprocessor communication becomes faster, paging a paged-out page in from the disk to the main memory becomes relatively slow. This entails a temporary suspension of the interprocessor communication to wait for paging-in of the following page. The suspension of the interprocessor communication between two particular processors adversely affects the interprocessor communications between other processors.
FIG. 1 is an explanatory diagram of page-in in a conventional parallel computer system: (A) represents a real memory with the entire area set up for page-in; (B) represents a real memory with a communication buffer allocated; and (C) represents a real memory with a transmission area first paged-in on the transmission side and a reception area next paged-in on the receiver side through interprocessor communications. By the methods represented in FIG. 1, (A) to (C), even when the processor concerned breaks off an interprocessor communication to carry out page-in processing, such a break will not interfere with the communications between other processors, as described below.
The traditional method 1 shown in FIG. 1, (A) gives a solution in which no virtual memory is employed. In this solution, it is not permitted to use any area within the main memory exclusively for the interprocessor communication. Thus, all the programs including the interprocessor communication programs can be allocated to any area of the main memory, and the page-out of a program from the main memory is prohibited. The problem with this solution is that it does not meet the above-described requirement of supporting the virtual memory for the parallel computer system.
The traditional method 2 shown in FIG. 1, (B) teaches a transmission buffer and a reception buffer for the interprocessor communication fixedly allocated on the transmission side and on the receiver side, respectively. In this method, data for an interprocessor communication is always communicated exclusively through the transmission buffer and the reception buffer. In other words, all steps of the interprocessor communication are carried out in a real-memory-to-real-memory scheme, and no step for page management between the main memory and a virtual memory intervenes in the process of the interprocessor communication. In this way, the interprocessor communication can be performed without interruption by the page management. However, a problem encountered in this method has been that it requires the steps of copying data from a virtual memory to the transmission buffer before the interprocessor communication and also copying data from the reception buffer to a destination address after the interprocessor communication, with the two copy steps degrading the performance of the interprocessor communication.
In the traditional method 3 shown in FIG. 1, (C), a transmission area and a reception area are allocated beforehand every time interprocessor communication is intended. While data can be sent to the destination address without necessitating copying in this method, a problem encountered has been that additional steps of the interprocessor communication are required between the processors of concern to confirm the allocation of the data reception area in the real memory on the receiver side. These steps for confirmation substantially cause a degraded performance of the interprocessor communication, although the method requires no copy step.
The above-described traditional methods are summarized as follows: the traditional method 1 cannot support a virtual memory; in the traditional method 2, the copying of data to the transmission buffer and the copying of data from the reception buffer to a destination address affect the interprocessor communication as an overhead that degrades significantly the performance of the interprocessor communication; and in the traditional method 3, an additional interprocessor communication is needed to execute the confirmation procedures to confirm an allocation of the transmission area and the reception area to the real memories on the transmitter side and the receiver side, respectively. These confirmation procedures have to be carried out every time the interprocessor communication is intended even when the allocation of the transmission area and the reception area is maintained. The above-described additional interprocessor communication causes degradation in the performance of the interprocessor communication.
The present invention is intended to solve the above-described problems. In particular, the present invention is directed to minimizing the procedures that cause deterioration of the performance in the interprocessor communication when the memory areas associated with the interprocessor communication have been allocated to the real memory, thereby offering an interprocessor communication system having an improved performance.
The present invention is directed to the interprocessor communication system for a parallel-computer system with plural processors combined through a network, each of the processors having a CPU, a main memory, a transmitter device and a receiver device, and communicating with other processors according to an interprocessor communication program in which a read address and a write address of transferred data are designated in terms of a logical address.
In order to attain the above-described object, each of the processors includes a translation means and a buffer means:
The translation means is provided in each of the transmitter device and the receiver device for translating a logical address to a physical address; and the buffer means is provided in a specific position on the main memory to store specific information concerning a fault page.
The specific position is designated both by the communication ID assigned to each task and by the source logical processor number of the processor from which a packet is sent.
The buffer means comprises: a flag section for storing flag information indicating an occurrence of the page fault during the interprocessor communication from a source logical processor to the task to which the communication ID is assigned; a field for entering the logical page number information of a page in which the latest page-fault takes place; a field for entering the number of logical pages in which page faults take place during the interprocessor communication concerned; and a field for entering an address information that indicates an area to store logical page numbers of page-faulty logical pages.
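The buffer entry described above can be pictured as a small record located by the pair of communication ID and source logical processor number. The following is a minimal sketch under stated assumptions: the names `CheckBufferEntry`, `CheckBuffer`, and all field names are hypothetical, and a Python dictionary stands in for the fixed position on the main memory.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class CheckBufferEntry:
    """One buffer entry; field names are illustrative, not taken from the text."""
    fault_flag: bool = False                  # a page fault occurred in this communication
    latest_fault_page: Optional[int] = None   # logical page number of the latest page fault
    fault_count: int = 0                      # number of logical pages that faulted
    fault_list_addr: int = 0                  # address of the area holding faulted page numbers


class CheckBuffer:
    """Entries are located by (communication ID, source logical processor number)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, int], CheckBufferEntry] = {}

    def entry(self, comm_id: int, src_proc: int) -> CheckBufferEntry:
        # Create the entry on first use; a real buffer would be pre-allocated.
        return self._entries.setdefault((comm_id, src_proc), CheckBufferEntry())


buf = CheckBuffer()
e = buf.entry(comm_id=3, src_proc=7)
e.fault_flag = True
e.latest_fault_page = 42
e.fault_count += 1
```

Keying the entry on both values means that faults caused by different source processors, or directed at different tasks, are recorded separately.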
The interprocessor communication system described above enables programming without regard to the size of the area to be used for the interprocessor communication in the main memory. This is because, while the buffer means fixedly occupies the main memory, substantially any memory area can be used for the interprocessor communication programs and can be managed under the virtual memory concept. Thus there is no need to restrain the area in the main memory for use in the interprocessor communication, as is the case with the traditional method 3 described above.
Preferably, the main memory is further provided with a fault-page list for storing said logical page numbers of page-faulty logical pages. If the address translation made by the translation means provided in the receiver device results in the state of page-out, then the translation means stores the data concerning the result of the address translation in the section and fields of the buffer means, and enters the logical page number of the logical page that is subjected to the address translation into the fault-page list. In addition, it is preferred that, if the result of said address translation indicates the state of a page indicative of page-out, the logical page subjected to the address translation is assigned to a dummy physical page designated beforehand.
This constituent feature offers an advantage of preventing the network from being blocked even when a page fault takes place. This is because writing data to a page-faulty page results in writing the data to the dummy page through ordinary data-receiving processing. The page-faulty data can be thrown away by discarding the dummy page.
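The receiver-side handling of a paged-out destination described above can be sketched as follows. This is an illustration only: `receive_write`, `DUMMY_PAGE`, the dictionary-based page table, and the state strings are assumptions made for the sketch, not the hardware interface.

```python
DUMMY_PAGE = 0  # hypothetical physical page number reserved beforehand as the dummy page


def receive_write(page_table, entry, fault_list, memory, logical_page, data):
    """Write `data` for `logical_page`; redirect to the dummy page on page-out.

    `page_table` maps a logical page number to (physical page number, state).
    Returns the physical page number that was actually written.
    """
    phys, state = page_table[logical_page]
    if state == "page out":
        # Record the fault in the buffer entry and in the fault-page list.
        entry["fault_flag"] = True
        entry["latest_fault_page"] = logical_page
        entry["fault_count"] += 1
        fault_list.append(logical_page)
        # The packet is still consumed normally, so the network is not blocked.
        phys = DUMMY_PAGE
    memory[phys] = data
    return phys
```

Because the write always completes, the receiver never stalls the network; the data written to the dummy page is simply discarded and retransmitted later.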
In the interprocessor communication system, the translation means preferably has, as information of a logical page that constitutes a part of each logical address, a physical page number of the physical page assigned to the logical page and the state of the physical page, wherein the state of the physical page is one of the following states:
(i) the state of “invalid” in which the translated physical page is not assigned to the main memory;
(ii) the state of “page out” in which the translated physical page is currently absent in the main memory and is paged out to an external storing medium;
(iii) the state of “on paging-in” in which the translated physical page was previously in the state of “page out” and is currently in page-in processing;
(iv) the state of “page in” in which the translated physical page has been assigned to the main memory.
This constituent feature enables the treatment of a page fault without deteriorating the performance of the interprocessor communication. This is because the state of a logical page is classified into the states of “invalid”, “page out”, “on paging-in” and “page in”, thereby enabling the confirmation of the page fault to be performed smoothly.
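The four page states can be written down as an enumeration. The sketch below is illustrative: `PageState` and `is_resident` are hypothetical names, and the only rule encoded is the one stated above, namely that a physical page has been assigned to the main memory only in the “page in” state.

```python
from enum import Enum


class PageState(Enum):
    INVALID = "invalid"            # no physical page is assigned
    PAGE_OUT = "page out"          # paged out to an external storing medium
    ON_PAGING_IN = "on paging-in"  # was paged out, page-in is in progress
    PAGE_IN = "page in"            # assigned to the main memory


def is_resident(state: PageState) -> bool:
    """True only when received data can be written to the real page."""
    return state is PageState.PAGE_IN


resident_states = [s.value for s in PageState if is_resident(s)]
```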
In the interprocessor communication system of the present invention, it is preferred that the translation means has entries of communication IDs, logical page numbers of logical pages, the physical page numbers of the physical pages assigned to the logical page and the states of the physical pages. The translation means is also provided with comparison means for comparing the communication ID and logical page to be translated with the entry communication IDs and the entry logical page numbers, respectively. Based on the comparisons, the translation means provides outputs of the physical page number corresponding to the logical page number of the logical page to be translated and also the state of the physical page. During an interprocessor communication, the translation means on the receiver side translates a received communication ID that corresponds to a task and a received logical page number to the corresponding physical page number and the state of the physical page.
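The lookup performed by the translation means can be sketched as a search over entries that compares both the communication ID and the logical page number. This is a software illustration under assumptions: `TlbEntry` and `translate` are hypothetical names, a linear scan stands in for the parallel comparison means, and returning `None` on a miss is a choice made for the sketch, since the text above does not describe miss handling.

```python
from typing import List, NamedTuple, Optional, Tuple


class TlbEntry(NamedTuple):
    comm_id: int                   # communication ID assigned to a task
    logical_page: int              # logical page number
    physical_page: Optional[int]   # physical page number, if one is assigned
    state: str                     # state of the physical page


def translate(entries: List[TlbEntry], comm_id: int,
              logical_page: int) -> Optional[Tuple[Optional[int], str]]:
    """Return (physical page number, state) for a matching entry, else None."""
    for e in entries:
        # Both comparisons must succeed, as with the comparison means above.
        if e.comm_id == comm_id and e.logical_page == logical_page:
            return e.physical_page, e.state
    return None


tlb = [
    TlbEntry(comm_id=1, logical_page=10, physical_page=200, state="page in"),
    TlbEntry(comm_id=2, logical_page=10, physical_page=None, state="page out"),
]
```

Including the communication ID in the match lets two tasks use the same logical page number without their translations colliding.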
The receiver device is preferably provided with a second comparison means for comparing the logical page number to be translated with the logical page number entered in the buffer means, to judge whether or not the logical page number to be translated coincides with the logical page number of the latest page-faulty logical page. In the case that the translation means indicates the state of “on paging-in” as the state of the physical page assigned to the logical page to be translated, the receiver device performs no processing if the second comparison means judges the two numbers to be coincident. In the same case, if the second comparison means judges them not to be coincident, the receiver device raises an interrupt to ask the OS to add the logical page number to be translated to the fault-page list, to store that logical page number as the latest page-faulty logical page, and to increase by 1 the number of page-faulty pages entered in the field for the number of logical pages in the buffer means.
The constituent features described in the previous paragraph allow fewer interrupts to be raised by the receiver device even when page faults take place. This is because comparing the logical page number of the latest page-faulty page entered in the buffer means with the logical page number of concern obviates the necessity of plural interrupts to treat the page fault of the same page.
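The interrupt-saving comparison can be sketched as below for a page found in the “on paging-in” state. The function name `handle_on_paging_in` and the dictionary fields are hypothetical; the OS work requested by the interrupt is modelled inline for brevity.

```python
def handle_on_paging_in(entry, fault_list, logical_page):
    """Return True if an interrupt to the OS is raised for this packet."""
    if entry["latest_fault_page"] == logical_page:
        # Same page as the latest recorded fault: nothing to do, no interrupt.
        return False
    # Different page: the OS is asked to record it (modelled inline here).
    fault_list.append(logical_page)
    entry["latest_fault_page"] = logical_page
    entry["fault_count"] += 1
    return True


# Page 6 has already faulted once; further packets for it raise no interrupt.
entry = {"latest_fault_page": 6, "fault_count": 1}
fault_list = [6]
interrupts = [handle_on_paging_in(entry, fault_list, p) for p in (6, 6, 7, 7)]
```

In this run, four packets that hit pages under page-in processing cause a single interrupt, which is raised when the faulting page changes from 6 to 7.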
The source processor preferably transmits a check packet after data has been transmitted. The header of the check packet has a format including a field for swapping data and a field for a write address for a reply packet in reply to the check packet. The destination processor, when receiving the check packet, swaps the swapping data with the entry of said buffer means to produce a reply to the check packet, and returns the reply to the source processor, which stores the entry of the buffer means in the area of its main memory designated by the write address.
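The swap performed on receipt of a check packet can be sketched as follows. The packet field names (`swap_data`, `reply_write_addr`) and both function names are hypothetical, and dictionaries stand in for the buffer and for the main memory of the source processor.

```python
def handle_check_packet(check_buffer, key, packet):
    """Destination side: swap the packet's data with the buffer entry."""
    old_entry = check_buffer[key]
    check_buffer[key] = packet["swap_data"]   # typically a cleared entry
    # The old entry travels back in the reply, with the requested write address.
    return {"write_addr": packet["reply_write_addr"], "entry": old_entry}


def handle_reply(source_memory, reply):
    """Source side: store the returned entry at the designated address."""
    source_memory[reply["write_addr"]] = reply["entry"]


key = (3, 7)  # (communication ID, source logical processor number)
check_buffer = {key: {"fault_flag": True, "fault_count": 2}}
packet = {"swap_data": {"fault_flag": False, "fault_count": 0},
          "reply_write_addr": 0x4000}
source_memory = {}
handle_reply(source_memory, handle_check_packet(check_buffer, key, packet))
```

A single exchange both reports the fault status to the source and resets the entry at the destination for the next communication.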
Further it is preferred that the source processor has means for checking whether or not a page fault takes place on the reception side of the interprocessor communication in the latest data that has been transmitted from the source processor.
The features described in the last two paragraphs avoid the deterioration in the performance of the interprocessor communication caused by the procedure for confirming page faults. This is because the confirmation of the page faults is carried out only once after a great amount of data has been sent, and also because the receiver device carries out the packet processing to confirm the occurrence of a page fault at a high speed.
It is preferred that the fault-page list stores the logical page numbers of the page-faulty logical pages, entered successively as page faults occur during the interprocessor communication of concern. The source processor reads the fault-page list by means of a remote read protocol, and retransmits only the data of the logical pages whose numbers are entered in the fault-page list.
The feature described in the preceding paragraph offers the advantage of reducing the load on the OS when a page fault takes place. This is because using the remote read protocol allows the user task to read the fault-page list and to resend the data that has not been received due to a page fault, without any operation of the OS.
It is preferred that after the retransmission, the source processor confirms presence or absence of a page fault and repeats the retransmission until the processor confirms absence of the page fault.
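The send, check, and retransmit cycle on the source side can be sketched as a loop. Everything here is a simulation under assumptions: `FakeReceiver` is a hypothetical stand-in for the destination processor, it assumes a faulted page has been paged in by the time of the retry, and `max_rounds` is a safety bound added for the sketch.

```python
class FakeReceiver:
    """Simulated destination; a faulted page is assumed paged in before the retry."""

    def __init__(self, paged_out):
        self.paged_out = set(paged_out)
        self.received = set()
        self.fault_list = []

    def send(self, page):
        if page in self.paged_out:
            self.fault_list.append(page)   # data is lost, page number is recorded
            self.paged_out.discard(page)   # page-in completes before the next round
        else:
            self.received.add(page)

    def check(self):
        return bool(self.fault_list)       # stands in for the check packet reply

    def read_fault_list(self):
        faulted, self.fault_list = self.fault_list, []
        return faulted                     # stands in for the remote read


def send_with_retry(pages, rx, max_rounds=10):
    """Send all pages, then retransmit faulted pages until none remain."""
    for p in pages:
        rx.send(p)
    rounds = 0
    while rx.check():
        if rounds >= max_rounds:
            raise RuntimeError("page faults did not clear")
        for p in rx.read_fault_list():
            rx.send(p)                     # retransmit only the faulted pages
        rounds += 1
    return rounds


rx = FakeReceiver(paged_out={2})
rounds = send_with_retry([1, 2, 3], rx)
```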
The above and other objects, features and advantages of the present invention will become apparent from the following description referring to the accompanying drawings which illustrate an example of a preferred embodiment of the present invention.
FIG. 1 is an explanatory diagram of page-in in a conventional parallel computer system;
FIG. 2 is a block diagram to illustrate an embodiment of the present invention;
FIG. 3 represents the formats of a remote DMA parcel header for a communication in the parallel computer system;
FIG. 4 shows the format diagrams of a parcel of a check packet and the reply parcel in reply to the check packet;
FIG. 5 is a block diagram to illustrate details of TLB;
FIG. 6 is a diagram to illustrate the check buffer entry formats;
FIG. 7 is a flow chart illustrating a procedure described in the communication library according to the present invention;
FIG. 8 is a flow chart to illustrate the processing of the page fault on the transmitter side;
FIG. 9 is a flow chart to illustrate the procedure to deal with a page fault on the receiver side;
FIG. 10 represents the format diagrams of a remote read parcel and the reply parcel in reply to a check packet; and
FIG. 11 is a flow chart illustrating the procedures executed by the communication library.