1. Field of the Invention
The present invention relates to methods of communicating data between computers in a computer system having a plurality of computers or data processing equipment connected through a communication network. More particularly, the invention relates to a method of sending and receiving data between memories of computers on a network in which the hardware has the capability of transferring data between the memories of these computers.
2. Description of the Related Art
The TCP/IP protocol is used in the overwhelming majority of communications between computers, in particular in communications over the Internet or in intranets. Since TCP/IP processing is executed not by the application but by the operating system, an application that communicates using TCP/IP does so through an API (Application Programming Interface: the set of functions which an application calls in order to use a certain function of a computer or an operating system) called the “Sockets API” (refer to the book by W. Richard Stevens, “UNIX Network Programming”, Prentice Hall, U.S.A., 1990, ISBN 0-13-949876-1).
An example of the software structure of a host which performs communication using the TCP/IP protocol is shown in FIG. 1. The host 10 performs communication using the network 18. The kernel 120 of the operating system of the host 10 executes protocol processing 121 of TCP/IP and controls the communication hardware 11 in order to perform communication. The program 101 of the application 100 uses the Sockets API 90 to call the library 110. The library executes the system call 111 and calls the kernel 120. The kernel 120 sends and receives data 102 of the application 100 through the socket buffer 122.
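As an illustration of the structure of FIG. 1, the following is a minimal sketch, in Python, of an application using the Sockets API over a loopback TCP connection. The function name loopback_round_trip and the use of the loopback address are assumptions made for this illustration only. Each call shown enters the kernel through a system call, and the payload is copied between the application data and a socket buffer on both the sending and the receiving side.

```python
import socket


def loopback_round_trip(payload: bytes) -> bytes:
    """Send payload over a loopback TCP connection and return what arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
        listener.listen(1)
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            # The handshake completes in the kernel's backlog before accept().
            client.connect(listener.getsockname())
            server, _peer = listener.accept()
            with server:
                # sendall: system call plus copy into the sender's socket buffer.
                client.sendall(payload)
                chunks = []
                remaining = len(payload)
                while remaining > 0:
                    # recv: system call plus copy out of the receiver's socket buffer.
                    chunk = server.recv(remaining)
                    if not chunk:
                        break
                    chunks.append(chunk)
                    remaining -= len(chunk)
                return b"".join(chunks)
```

The system calls and the two copies visible in this sketch are the overheads that the high-speed communication described below is intended to avoid.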
Since protocol processing 121 in TCP/IP communication involves a large amount of processing, and the system call 111 and the copy between the data 102 and the socket buffer 122 result in overhead, these operations limit the communication performance in some cases. For this reason, computer systems requiring high communication performance, such as supercomputers or workstation clusters, employ networks which can transfer data between applications without performing protocol processing, system calls or data copies, and which also bypass the kernel. In the present specification, this communication method will henceforth be referred to as “high-speed communication” for short, when applicable. An example of high-speed communication is the VIA (refer to the specification by Compaq Computer Corp., Intel Corp., Microsoft Corp., “Virtual Interface Architecture Specification, Draft Revision 1.0”, Dec. 4, 1997, http://www.Viarch.org). Since the functionality of high-speed communication is different from that of TCP/IP, their respective APIs are also different.
An example of the software structure of a host employing high-speed communication is shown in FIG. 2. The program 104 of the application 103 calls the high-speed communication library 130 by using the high-speed communication API 91 to send and receive data 105. By executing the communication processing 131 of the high-speed communication library 130, the high-speed communication hardware 12 is activated, bypassing the kernel 120, to send and receive the data 105 through the high-speed communication network 19. When sending and receiving data by high-speed communication, two operations are required: inspecting whether or not the application 103 has permission to access the data 105 which it wants to send or receive, and converting the virtual addresses specified by the application 103 into the physical addresses which are used by the high-speed communication hardware 12. For this reason, the application 103, before sending and receiving data, calls the high-speed communication library 130 to register the data 105 to be sent and received (the registered data is shown in the form of a rectangle having rounded corners). The kernel performs the registration processing 123 in response to the call 132 of the high-speed communication library. In this processing, the kernel verifies whether the application 103 has access permission and, when it has, performs the address conversion and registers the result in the memory registration table 13. The high-speed communication hardware 12 performs both the verification of the access permissions and the address conversion by using this memory registration table 13.
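The role of the memory registration table 13 can be sketched as follows. This is a simplified model under stated assumptions, not the data structure of any actual hardware: the class and method names are hypothetical, each registered region is assumed to be physically contiguous, and the application is identified by a plain string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Registration:
    owner: str        # application that registered the region (hypothetical id)
    virt_start: int   # first virtual address of the region
    length: int       # region length in bytes
    phys_start: int   # first physical address (region assumed contiguous)


class MemoryRegistrationTable:
    """Toy model of the table filled by the kernel and read by the hardware."""

    def __init__(self) -> None:
        self._entries: list[Registration] = []

    def register(self, owner: str, virt_start: int, length: int,
                 phys_start: int) -> Registration:
        # Corresponds to the registration processing 123 done by the kernel.
        entry = Registration(owner, virt_start, length, phys_start)
        self._entries.append(entry)
        return entry

    def translate(self, owner: str, virt_addr: int, length: int) -> int:
        # Corresponds to the check and conversion done by the hardware 12:
        # the whole requested range must lie inside a region of this owner.
        for e in self._entries:
            inside = (e.virt_start <= virt_addr
                      and virt_addr + length <= e.virt_start + e.length)
            if e.owner == owner and inside:
                return e.phys_start + (virt_addr - e.virt_start)
        raise PermissionError("region is not registered for this application")
```

In this model, a transfer that touches unregistered memory, or memory registered by another application, is rejected before any data moves.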
Since the high-speed communication API 91 is different from the Sockets API 90, in order that an application 100 employing the Sockets API 90 may use high-speed communication, this application 100 must be rewritten to use the high-speed communication API 91. Since this rewriting is difficult to do, many applications will remain unchanged, still using the Sockets API, and thus will not be able to take advantage of the high performance of high-speed communication. In order to solve this problem, a communication method called “Fast Sockets”, shown in FIG. 3, is employed. The Fast Sockets library 140 receives the call made from the application 100 through the Sockets API 90 and executes the emulation processing 141 to communicate using high-speed communication. For this reason, it is possible to take advantage of the high performance of high-speed communication while keeping application compatibility. Examples of Fast Sockets include the method disclosed in JP-A-11-328134, the method by the University of California, Berkeley (refer to the paper by S. H. Rodrigues, T. E. Anderson, D. E. Culler, “High-Performance Local Area Communication With Fast Sockets”, Proceedings of the USENIX '97, 1997, pp. 257 to 274), the method by Shah et al. (refer to the paper by H. V. Shah, C. Pu, R. S. Madukkarumukumana, “High Performance Sockets and RPC over Virtual Interface (VI) Architecture”, Proceedings of CANPC '99, 1999), and Winsock Direct made by Microsoft Corp. (refer to the article “Winsock Direct Specifications”, in the Microsoft Windows Driver Development Kit (DDK)).
When data 102 of the application 100 is registered (800) to perform communication, a processing overhead (132, 123) of the buffer registration 800 occurs. When the data length is long, this overhead (132, 123) is shorter than the communication time, so high communication performance is obtained. On the other hand, when the data length is short, this overhead is longer than the communication time, so the communication performance is reduced. In order to solve this problem, the Fast Sockets library 140, on its initialization, allocates a pre-allocated buffer 142 and registers it (801). When communicating short data 102, this data is not registered, but is copied to the pre-allocated buffer 142 to perform the communication. In this case, although the copy incurs overhead, the data length is short and this overhead is small when compared to the registration processing, so high performance can be obtained. While the pre-allocated buffer 142 is usually separated into buffers for sending and buffers for receiving data, these buffers are collectively shown in the form of one buffer 142 in FIG. 3 and the following figures of the software structure.
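The prior-art selection on the sending side can be sketched as follows. The class name, the parameter copy_limit and the returned labels are hypothetical and chosen only for illustration; the actual registration and transfer by the hardware are not modeled, and only the decision and the copy into the pre-allocated buffer are shown.

```python
class FastSocketsSendPath:
    """Toy model of the sender-side choice between copying and registering."""

    def __init__(self, prealloc_size: int, copy_limit: int) -> None:
        # Allocated and registered once, at library initialization (801).
        self.prealloc = bytearray(prealloc_size)
        # Data of at most this many bytes is copied instead of registered.
        self.copy_limit = copy_limit
        # Number of per-transfer registrations (800) performed so far.
        self.registrations = 0

    def send(self, data: bytes) -> str:
        fits = len(data) <= len(self.prealloc)
        if fits and len(data) <= self.copy_limit:
            # Short data: pay for one copy, avoid the registration overhead.
            self.prealloc[:len(data)] = data
            return "copy"
        # Long data: pay the registration overhead (132, 123), avoid the copy.
        self.registrations += 1
        return "register"
```

Note that copy_limit here is chosen by the sending host alone, which is the limitation taken up by the first problem described below.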
Above, the TCP/IP communication and the Fast Sockets have been described. While applications generally use TCP/IP communication (and as a result, the Sockets API), scientific computing applications use APIs such as MPI (Message Passing Interface Forum: refer to the standard “MPI: A Message-Passing Interface Standard”, 1995). Since MPI is independent of the computer architecture, when implementing MPI over high-speed communication, the calls made to the MPI API are mapped onto the calls of the high-speed communication API 91. An example of a product implementing this mapping is MPI-Pro made by MPI Software Technology Inc. (refer to the paper by R. Dimitrov and A. Skjellum, “Efficient MPI for Virtual Interface (VI) Architecture”, Proceedings of the 1999 International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nev., U.S.A., Jun. 1999, Vol. 6, pp. 3094 to 3100). The method to implement MPI is shown in FIG. 4. In this figure, the program 107 of the application 106 using MPI transfers data 108 by using the MPI API 92. The MPI library 150 performs emulation 151 to do the mapping described above. Since the structure of the MPI implementation (refer to FIG. 4) is the same as that of the Fast Sockets implementation (refer to FIG. 3), the problems in the communication in both MPI and Fast Sockets are also similar. Unless otherwise indicated, the methods explained for Fast Sockets are also applicable to MPI. Conversely, the methods explained for MPI are also applicable to Fast Sockets.
The present invention has been made in order to solve the following five problems of communication libraries such as the Fast Sockets library and the MPI library (hereinafter, these libraries will be referred to as “the emulation libraries” for short, when applicable).
The first problem is as follows. In the prior art, the sending host selects, on the basis of the data length, whichever is optimal for the sending host between communication in which the data 102 or 108 is registered (800, 808) and communication in which the data is copied to the pre-allocated buffer 142, 152. However, the sending host does not take into consideration which of them is optimal at the receiving host. Consequently, the reception processing performance of the receiving host is reduced.
The second problem is as follows. In the case where the call of data reception is executed in the receiving host ahead of the arrival of the data, the receiving host can register (800) the memory region of the data to be received 102, 108 and notify the other endpoint of the data transfer of both the address and the data length of this region. However, if the sending host receives this notification after having started sending this data, this notification is of no use, and results only in processing overhead in the sending host and in the receiving host. Moreover, since network bandwidth is consumed by these notifications, the processing performance of the overall system is reduced.
The third problem is as follows. There are two data transfer methods, i.e., writing of the data by the sending host onto the receiving host and reading out of the data by the receiving host from the sending host. There are also four combinations of buffer usage, since each of the sending host and the receiving host either communicates by registering the data 102, 108 (800, 808) or by copying the data through the pre-allocated buffer 142, 152. In the prior art, the resulting grand total of eight combinations cannot be utilized simultaneously. For this reason, the performance of networks which realize high-speed communication cannot be fully used.
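The count of eight follows from two transfer methods multiplied by two buffer choices on each of the two hosts. A short enumeration, in which the label strings are chosen only for illustration, makes the combinations explicit:

```python
from itertools import product

# The two data transfer methods named in the third problem.
TRANSFER_METHODS = ("sender writes to receiver", "receiver reads from sender")
# The two ways each host can supply its memory for the transfer.
BUFFER_CHOICES = ("registered data", "pre-allocated buffer")


def transfer_combinations() -> list[tuple[str, str, str]]:
    """Return (transfer method, sender-side choice, receiver-side choice)."""
    return list(product(TRANSFER_METHODS, BUFFER_CHOICES, BUFFER_CHOICES))


if __name__ == "__main__":
    for method, sender_side, receiver_side in transfer_combinations():
        print(f"{method}; sender: {sender_side}; receiver: {receiver_side}")
```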
The fourth problem is as follows. In the prior art, the same communication method is used irrespective of the other party of the communication. However, in the future, it is conceivable that the other party of the communication is not a computer such as a server, but a storage apparatus which uses iSCSI (refer to the draft by J. Satran et al., “iSCSI (Internet SCSI)”, Internet Engineering Task Force Internet-Draft draft-satran-iscsi-01.txt, Jul. 10, 2000) (in the present invention, a communicating apparatus is referred to as a “host” irrespective of its kind). Since, for a storage apparatus, the amount of memory which can be used for the pre-allocated buffer 142 is limited when compared to a computer, and its processing performance may be poor in some cases, only some of the eight combinations described in the above-mentioned third problem are efficient. If the communication method is not limited in accordance with the characteristics of the other party of the communication, then, for example, in the case where the other party is a storage apparatus, the amount of memory required for this apparatus is increased, the sending and reception processing becomes complicated, the processing power required for the apparatus is increased, and thus the cost is increased.
The fifth problem is as follows. In the prior art, the pre-allocated buffer 142, 152 is allocated when establishing a TCP/IP connection, and in the communication after this, the characteristics of the buffer, such as its length, are not changed. For this reason, the buffer amount cannot be adapted to the characteristics of this TCP/IP connection. For example, not increasing the size of the buffer when necessary can reduce performance. Also, the memory regions which are registered (801, 809) as the pre-allocated buffer 142, 152 cannot be swapped out, since they are the source or destination from or onto which the data is sent or received, and hence they occupy the main memory. For this reason, not reducing the size of the buffer when possible can also reduce performance, since the buffer reduces the amount of memory which can be used by other applications.
In the light of the foregoing, the present invention has been made in order to solve the above-mentioned problems associated with the prior art, and it is therefore one object of the present invention to provide a data communication method by which the selection on the sending side of one of two kinds of data transfer operations, i.e., the operation of receiving data on the receiving side through a pre-allocated buffer and the operation of receiving data in memory region(s) registered for this data transfer, is suitably done, thereby lowering the reception processing overhead on the receiving side.
It is another object of the present invention to provide a data communication method which is capable of reducing the overhead of the processing required to register memory region(s) for a specific data reception operation and to notify the sending side of its (their) reception address(es).
Other objects of the present invention will be made clear from the description of the preferred embodiments.
In a typical aspect, the present invention provides a communication method wherein a host on the receiving side notifies in advance a host on the sending side of the data length threshold which is used to select between the data transfer with a pre-allocated buffer on the receiving side as the destination and the data transfer with the target memory region as the destination. When performing a specific data transfer, the host on the sending side performs the data transfer with the target memory region as the destination if the transfer data length exceeds the above-mentioned threshold. In this case, the host on the receiving side is required to register the target memory region for this specific data transfer in the communication hardware. On the other hand, if the transfer data length does not exceed the above-mentioned threshold, then the host on the sending side performs the data transfer with the pre-allocated buffer as the destination. In this case, since this buffer was allocated when starting the communication between the hosts (or at a similar timing), the registration in the communication hardware of the memory region that is the final destination of the transfer is not required; it is only required to copy the data from the pre-allocated buffer to this final destination.
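The selection rule of this aspect can be sketched as follows. The class names, method names and returned labels are hypothetical, and the notification itself is reduced to a direct call; only the rule stated above is modeled, namely that the target memory region is the destination when the transfer data length exceeds the threshold reported by the receiving side, and the pre-allocated buffer is the destination otherwise.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReceivingHost:
    threshold: int  # data length threshold chosen by the receiving side, in bytes

    def advertise_threshold(self) -> int:
        # Stands in for the advance notification to the sending side.
        return self.threshold


class SendingHost:
    def __init__(self) -> None:
        self.peer_threshold: Optional[int] = None

    def learn_threshold(self, threshold: int) -> None:
        self.peer_threshold = threshold

    def select_destination(self, data_length: int) -> str:
        if self.peer_threshold is None:
            raise RuntimeError("threshold has not been notified yet")
        if data_length > self.peer_threshold:
            # Receiver registers the target region for this specific transfer.
            return "target_memory_region"
        # Receiver copies from its pre-allocated buffer to the final destination.
        return "pre_allocated_buffer"
```

A usage example under these assumptions: after sender.learn_threshold(receiver.advertise_threshold()) with a threshold of 2048 bytes, a 2048-byte transfer goes to the pre-allocated buffer and a 2049-byte transfer goes to the target memory region.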
Which of the two overheads is larger, i.e., the overhead of registering the memory region that is the final destination of the transfer or the overhead of copying the data from the buffer, depends on the transfer data length and also on the characteristics of the receiving host, and hence cannot be judged by the sending host. In the above-mentioned aspect, since the threshold for this selection is reported from the receiving host to the sending host, the selection of the transfer protocol can be suitably done.
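One way in which a receiving host might derive such a threshold is sketched below. The cost model is an assumption introduced for this illustration and is not stated in the present specification: the registration overhead is taken as a fixed cost per transfer, and the copy overhead as proportional to the data length. Under that model, copying is no more expensive than registering as long as the data length does not exceed the registration cost divided by the per-byte copy cost.

```python
def copy_cost_ns(data_length: int, copy_ns_per_byte: int) -> int:
    """Assumed linear cost of copying data_length bytes out of the buffer."""
    return data_length * copy_ns_per_byte


def receiver_threshold(registration_ns: int, copy_ns_per_byte: int) -> int:
    """Largest data length for which copying costs no more than registering.

    Both arguments are hypothetical measurements taken on the receiving host,
    expressed in whole nanoseconds so that the arithmetic stays exact.
    """
    if registration_ns < 0 or copy_ns_per_byte <= 0:
        raise ValueError("costs must be non-negative, copy cost must be positive")
    return registration_ns // copy_ns_per_byte
```

Under this model, a receiving host that copies slowly reports a lower threshold than one that copies quickly, which is information the sending host cannot obtain by itself.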
Another aspect of the present invention provides a data communication method wherein a receiving host registers memory region(s) in which specific data is to be received, the efficiency of notifying the sending host of their address(es) is measured for a fixed period of time, and, when the measured efficiency is low, the processing of notifying the sending host of these address(es) is restrained.
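A minimal sketch of such a measurement is given below. The definition of efficiency as the fraction of notifications that the sending host was able to use, the parameter min_efficiency, the class name, and the policy of resuming notifications after one period without measurements are all assumptions made for this illustration. Timestamps are passed in by the caller so that the behavior does not depend on a real clock.

```python
class NotificationGovernor:
    """Measures notification efficiency per fixed period and restrains if low."""

    def __init__(self, period: float, min_efficiency: float) -> None:
        self.period = period                  # length of one measuring period
        self.min_efficiency = min_efficiency  # below this, notifications stop
        self.period_start = 0.0
        self.sent = 0        # notifications sent in the current period
        self.useful = 0      # of those, how many the sending host used
        self.restrained = False

    def _roll_period(self, now: float) -> None:
        if now - self.period_start < self.period:
            return
        if self.sent > 0:
            self.restrained = (self.useful / self.sent) < self.min_efficiency
        else:
            # Nothing was measured (e.g. while restrained): try notifying again.
            self.restrained = False
        self.period_start = now
        self.sent = 0
        self.useful = 0

    def should_notify(self, now: float) -> bool:
        self._roll_period(now)
        return not self.restrained

    def record(self, now: float, was_useful: bool) -> None:
        """Record one notification and whether it arrived in time to be used."""
        self._roll_period(now)
        self.sent += 1
        if was_useful:
            self.useful += 1
```

The receiving host would consult should_notify before registering a region and sending its address, and would call record once it learns whether the data arrived in the registered region or through the pre-allocated buffer.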