1. Field of the Invention
Embodiments of the present invention relate to methods and systems for efficiently sending data between the computers in a high performance computer network. More specifically, the embodiments of the present invention relate to methods and systems for linking distributed multi-processor applications and distributed shared memory subsystems.
2. Description of the Related Information
Communication between software entities (applications) on different host computers is frequently carried in packets over standard transmission protocols, such as TCP. Many application programs may be running concurrently on each computer, and methods have been developed to allow such programs to communicate independently. The operating system in each computer, specifically the part of the operating system referred to as the “operating system kernel” or “kernel,” has the task of managing the processes under which the application programs run. The kernel also provides the communications services for the entire computer: it mediates between the application programs and the hardware, such as Ethernet interfaces, that provides the circuitry for receiving and sending data packets. An example of an operating system so structured is LINUX, as discussed in Distributed Shared Memory Programming, by Tarek El-Ghazawi et al., John Wiley & Sons, 2005, ISBN 0-471-22048-5, which is hereby incorporated by reference in its entirety.
In a system such as a massively parallel multi-processor system, or “super computer,” a large number of communication paths may be required to carry data from the memory of one computer to the memory or CPU of another. A common example of a distributed application in which such data communication occurs is the computation of certain mathematical algorithms such as matrix multiplication. This may involve many computers with each computer having a data communication path established with many or all of the other computers.
A method of programming a super computer is based on the UPC (Unified Parallel C) programming language, which provides programmers with the capability to write a program that will run on the multiple CPUs of a super computer while using the memory units of the CPUs as a shared distributed memory. To effectively share the memory, the CPUs are connected through a data network that may be based on TCP or a proprietary protocol. TCP may be selected because it is a widely available and standard connection oriented protocol. Conventionally, each CPU includes an application environment (application space) and an operating system environment (kernel space). For one CPU to access the memory of another, a data communications path, e.g. a TCP connection, must first be set up.
FIG. 1 illustrates an exemplary and conventional multi-processor system 10 comprising a number of CPUs (only CPU1 12 and CPUn 14 are shown) and a network 16. The CPUs may contain many hardware and software components, but only a few are illustrated here to briefly describe the role of inter-processor communication. The CPU1 (12) includes a memory 20, an application 22, a socket 24, a kernel 26, and a packet interface 28. The CPUn (14) similarly includes a memory 30, an application 32, a socket 34, a kernel 36, and a packet interface 38.
For example, the application 22 in the CPU1 (12) may have set up a data connection 40 between the socket 24 and the socket 34 in the CPUn (14). The applications 22 and 32 may have been compiled with the UPC programming language and the applications 22 and 32 may be copies of the same program running independently in the two CPUs 12 and 14. Through the sockets 24 and 34, the applications 22 and 32 are then able to exchange data over the data connection 40.
The data connection 40 may be carried in a standard TCP connection established between the kernels 26 and 36 in the respective CPUs over the corresponding packet interfaces 28 and 38. The packet interfaces 28 and 38 may be Ethernet interfaces, and the network 16 provides the physical connection between the packet interfaces 28 and 38 in a known manner.
The sockets 24 and 34 provide the software interface between the application 22 and the kernel 26, and between the application 32 and the kernel 36, respectively. They further provide the application 22 and the application 32 with a virtual connection representation regardless of the underlying protocols and physical networking facilities used.
In this way, the application 22 is able to read data from the memory 30 that is associated with the application 32 in the CPUn (14), when required by the program. Note that such read operation may require protocol support at the CPUn (14). It may be recognized that this method for the application 22 to read data from the memory 30 may be cumbersome, especially when large amounts of data have to be shared by applications. The application program may have to wait frequently as a result of the delay in obtaining data from a memory on a different CPU, the delay being a combination of the transmission delay through the network and the processing delays in each CPU. Network and transmission delays are being improved by newer, higher speed technology. But the complexity of the existing kernel software that interfaces the packets to the applications is becoming a bottleneck in high performance computer systems.
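The remote read described above, including the protocol support needed at the remote CPU, may be illustrated by the following minimal Python sketch. It is not taken from any actual system: the request and reply formats (REQUEST, REPLY) and the function names serve_memory and remote_read are assumptions made for illustration, and a connected socket pair stands in for the data connection 40.

```python
import socket
import struct
import threading

REQUEST = struct.Struct("!II")  # hypothetical request: (offset, length)
REPLY = struct.Struct("!I")     # hypothetical reply prefix: byte count that follows

def recv_exact(sock, count):
    """Read exactly `count` bytes, since a stream may deliver data in pieces."""
    chunks = []
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)

def serve_memory(conn, memory):
    """Remote side (stand-in for CPUn): answer one read request from `memory`."""
    offset, length = REQUEST.unpack(recv_exact(conn, REQUEST.size))
    data = bytes(memory[offset:offset + length])  # slicing clamps to memory size
    conn.sendall(REPLY.pack(len(data)) + data)

def remote_read(sock, offset, length):
    """Local side (stand-in for CPU1): request bytes of the remote memory."""
    sock.sendall(REQUEST.pack(offset, length))
    (count,) = REPLY.unpack(recv_exact(sock, REPLY.size))
    return recv_exact(sock, count)

def demo():
    memory = bytearray(b"abcdefghijklmnopqrstuvwxyz")  # stand-in for memory 30
    local_end, remote_end = socket.socketpair()        # stand-in for connection 40
    server = threading.Thread(target=serve_memory, args=(remote_end, memory))
    server.start()
    try:
        return remote_read(local_end, 2, 5)
    finally:
        server.join()
        local_end.close()
        remote_end.close()
```

Even in this small sketch, every read costs a request, a reply, and several passes through the kernel on both sides, which is the overhead the surrounding text describes.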
For example, in order to deliver the payload of a received packet to the intended application, the kernel needs to determine, from the header of the received packet, the socket ID through which the application communicates with the kernel for each connection. The kernel can then determine the destination application from the information stored in the socket data structure. Where there are many processes, and potentially many open ports or sockets, the kernel may spend a large number of instruction cycles scanning or otherwise searching the lists of sockets in order to associate the correct destination (application) with each received packet before it can deliver the received packet data to the application.
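The cost of such a search may be illustrated by the following Python sketch, which is an assumption-laden model rather than actual kernel code. The names ConnKey, Socket, find_socket_by_scan and build_socket_table are hypothetical, and the connection key is assumed to be the address and port 4-tuple taken from the packet header.

```python
from collections import namedtuple

# Hypothetical connection key: addresses and ports carried in the packet header.
ConnKey = namedtuple("ConnKey", "src_ip src_port dst_ip dst_port")
# Hypothetical socket record: the key plus the owning process.
Socket = namedtuple("Socket", "socket_id process_id key")

def find_socket_by_scan(sockets, key):
    """Linear scan of the socket list; returns (socket, comparisons made)."""
    comparisons = 0
    for sock in sockets:
        comparisons += 1
        if sock.key == key:
            return sock, comparisons
    return None, comparisons

def build_socket_table(count):
    """Create `count` open sockets spread over eight hypothetical processes."""
    return [Socket(socket_id=i, process_id=1000 + i % 8,
                   key=ConnKey("10.0.0.1", 40000 + i, "10.0.0.2", 5000))
            for i in range(count)]
```

With 1000 open sockets, a packet for the last socket in the list costs 1000 comparisons in this model. A hash table keyed on the 4-tuple would shorten the search, but the lookup would still be performed for every received packet.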
FIG. 2 is a simplified flow chart 100 illustrating a typical method by which an application in a multi-process environment receives data from a data link using a data transport protocol such as TCP/IP. The flow chart 100 shows a kernel space 102 and an application space 104. Shown in the application space 104 are sequential steps 106 “Application Establishes Socket Connection” and 108 “Application Makes System Call (Receive).” A system call 110 links the step 108 “Application Makes System Call (Receive)” to a step 112 “Application Blocked, Waiting for Data” in the kernel space 102. A step 114 “Copy Data to Application Memory” in the kernel space 102 is linked by a “return” link 116 back to a step 118 “Application Processing Data” in the application space 104.
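From the point of view of the application, the steps 106, 108, 112 and 118 may be sketched in Python as follows. This is an illustrative sketch only: the function names application and demo are hypothetical, and a connected socket pair stands in for an established socket connection.

```python
import socket
import threading

def application(sock, ready, received):
    """Stand-in for the application: make a receive call, then process data."""
    ready.set()               # about to make the receive system call (step 108)
    data = sock.recv(4096)    # blocks in the kernel if no data has arrived yet
    received.append(data)     # back in application space (step 118)

def demo():
    app_end, net_end = socket.socketpair()  # stand-in for step 106
    ready = threading.Event()
    received = []
    worker = threading.Thread(target=application,
                              args=(app_end, ready, received))
    worker.start()
    ready.wait()                 # the application has reached its receive call
    net_end.sendall(b"payload")  # data arrives; the kernel wakes the application
    worker.join()
    app_end.close()
    net_end.close()
    return b"".join(received)
```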
Also shown in the kernel space 102 are sequential steps:
120: “Packet Arrives from Network”;
122: “Read Packet Header”;
124: “Process Protocol Elements”;
126: “Locate Destination Socket”;
128: “Unblock Application”; and
130: “Reschedule Application.”
Straddling the kernel space 102 and the application space 104 are a data structure 132 “Socket” and a data structure 134 “Application Data Memory.” The steps 106 “Application Establishes Socket Connection,” 108 “Application Makes System Call (Receive),” and 126 “Locate Destination Socket” all access the data structure 132 “Socket.” The data structure 134 “Application Data Memory” is accessed by the steps 114 “Copy Data to Application Memory” and 118 “Application Processing Data.” In operation, the application 104 communicates with the kernel 102 through the ID of the Socket 132. The Socket 132 is a data structure that is managed by the kernel 102 and is associated with the process (not shown) under which the application 104 runs. The Socket 132 is created by the kernel 102 when the application 104 first requests and establishes packet communication with the remote end, and is subsequently used by the kernel 102 to link received packets back to the application 104. In the multi-process environment, the kernel may serve many sockets and many processes (applications) which may simultaneously be in a state of waiting for data.
FIG. 3 illustrates the format of a typical packet 140, having a packet header 142 and a packet payload 144. Information in the packet header 142 is, in a general sense, used to route the packet to the intended destination. The packet payload 144 is destined for the Application Data Memory 134 (FIG. 2) of the receiving application 104. The packet header 142 may be comprised of a number of sub-headers (not shown) to facilitate routing over a network to the intended destination computer (not shown) in the well known manner. When the packet 140 arrives at the destination computer (step 120 “Packet Arrives from Network”) the information in the packet header 142 is then used by the kernel 102 to determine the final destination of the packet payload 144, i.e. the socket data structure for receiving the packet payload and eventually an application receive buffer in the Application Data Memory 134 of the application 104.
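The division of a packet into a header and a payload may be illustrated by the following Python sketch. The header layout used here (source port, destination port, payload length) is a deliberately simplified assumption and is not the TCP/IP header format; the names HEADER, build_packet and parse_packet are hypothetical.

```python
import struct

# Hypothetical simplified header: source port, destination port, payload length.
HEADER = struct.Struct("!HHI")

def build_packet(src_port, dst_port, payload):
    """Prepend the header to the payload, in network byte order."""
    return HEADER.pack(src_port, dst_port, len(payload)) + payload

def parse_packet(packet):
    """Split a packet into its header fields and its payload."""
    src_port, dst_port, length = HEADER.unpack_from(packet)
    payload = packet[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated packet")
    return {"src_port": src_port, "dst_port": dst_port}, payload
```

In this sketch the receiver would use the header fields to select the destination, and only the payload would be copied onward to the application.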
Continuing with the description of FIG. 2: when a packet arrives (the step 120 “Packet Arrives from Network”), the payload data of the packet will ultimately be copied into the Application Data Memory 134 by the kernel 102 (the step 114 “Copy Data to Application Memory”). The copy takes place only after the destination application 104 has been rescheduled to run, while the processor is still executing in the kernel and before it returns to user space.
The actions of the kernel 102 from the step 122 to the step 114 are as follows: In the steps 122 “Read Packet Header” and 124 “Process Protocol Elements” the header is parsed, i.e. relevant fields are extracted, and protocol specific data structures (not shown) are updated as defined by the protocol used. For example, the TCP protocol described in IETF RFC 793 (which is incorporated herein by reference in its entirety) requires numerous actions to be performed upon receipt of every packet. In the step 126 “Locate Destination Socket”, the socket data structure of the target application is determined which, in turn, provides process and memory address information of the target application 104 for use in later steps. Port numbers and other information in the packet header 142 are used in the step 126 “Locate Destination Socket” to find the memory location of the socket data associated with the received packet. The process ID identifies the application that should receive the packet payload, and is determined from the Socket Data in the step 126 “Locate Destination Socket.” The process ID leads to the process data structure, which may be located by a lookup or a scan of a table of active process IDs. The process context, in the form of the Process Data Structure, is retrieved in the step 128 “Unblock Application” (see the step 112 “Application Blocked, Waiting for Data” in FIG. 2) and activated in the step 130 “Reschedule Application.”
Restoring the process context of an application is commonly referred to as context switching. This happens when the process concerned is selected to run next. The major part of a context switch is the switching of the virtual address space (changing the page table) if the kernel is not currently running in that process's virtual address space. Finally, in the step 114 “Copy Data to Application Memory,” the kernel is ready to obtain the memory address for delivery of the packet payload into the Application Data Memory 134 (FIG. 2).
FIG. 4 is an expansion of the step 114 “Copy Data to Application Memory” from FIG. 2 into the following steps:
160 “Obtain Process ID from Socket Data Structure”;
162 “Load Process Context”;
164 “Get Destination Memory Address”; and
166 “Copy Data.”
Having determined the destination address (step 164) by way of the Process ID and the Process Context (steps 160 and 162), the data contained in the packet payload 144 (FIG. 3) is stored (copied from the system buffer) into the Application Data Memory 134 in the final step 166 “Copy Data.” Having delivered the data, the kernel 102 may immediately return (link 116) to the step 118 “Application Processing Data” in the application 104, i.e. giving up control to the application 104 running in user space (application space), unless it is preempted by another process or kernel thread of higher priority.
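The steps 160 to 166 may be modeled by the following Python sketch. It is a simplified illustration under stated assumptions, not kernel code: the classes ProcessContext and SocketData, the field receive_offset, and the function copy_data_to_application are hypothetical, and a dictionary stands in for the table of active processes.

```python
from dataclasses import dataclass

@dataclass
class ProcessContext:
    """Hypothetical process data structure holding the application's memory."""
    process_id: int
    memory: bytearray      # stand-in for the Application Data Memory
    receive_offset: int    # where the pending receive call wants its data

@dataclass
class SocketData:
    """Hypothetical socket data structure linking a socket to its process."""
    socket_id: int
    process_id: int

def copy_data_to_application(sock, process_table, system_buffer):
    """Deliver a payload held in a system buffer to the owning application."""
    process_id = sock.process_id                  # step 160: obtain process ID
    context = process_table[process_id]           # step 162: load process context
    start = context.receive_offset                # step 164: destination address
    end = start + len(system_buffer)
    if end > len(context.memory):
        raise ValueError("payload does not fit in the receive buffer")
    context.memory[start:end] = system_buffer     # step 166: copy data
    return len(system_buffer)
```

Each received payload passes through all four steps in this model, which mirrors the per-packet indirection from socket to process to memory address described above.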
To summarize briefly, computer-to-computer (application-to-application) communication is based conventionally on an interface between the application and the operating system kernel, based on concepts of process or thread and socket. Within the application there is a procedural interface to send (write) and receive (read) on a socket. These are system calls which transfer control to the kernel. Within the kernel, a communications stack, for example TCP/IP, implements a packet protocol that is required to exchange data over a network. The major repetitive actions, after a connection has been established, are:
Sending: the kernel determines the connection context represented by the socket data structure. However, only the socket ID, which has an ID space per process, is passed in the system call. The kernel first finds the process ID/process data structure of the current process on receiving the system call. From there it can further locate the socket data structure; in effect, the kernel locates the socket data structure from the socket ID plus the implicit process ID. The kernel then constructs a packet header, copies the application data into the packet payload, and queues the packet for sending. Hardware then serves the queue and transmits the packet to the network.
Receiving: the hardware delivers a packet to the kernel; the kernel, after satisfying protocol requirements such as sending an acknowledgement, locates the socket data structure from the packet header. The identity of the destination process is then determined from the socket data structure. The process context then leads to the actual destination memory address in the application space as previously described, and the packet payload is copied there.
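The sending side of this summary, in particular the per-process socket ID space, may be sketched in Python as follows. The class Kernel, its methods, and the simplified header layout HEADER are assumptions made for illustration and do not represent an actual kernel or the TCP/IP header format.

```python
import struct
from collections import deque

HEADER = struct.Struct("!HHI")  # hypothetical header: src port, dst port, length

class Kernel:
    """Toy model of the kernel send path described above."""

    def __init__(self):
        self.socket_tables = {}    # process ID -> {socket ID: (src_port, dst_port)}
        self.send_queue = deque()  # packets waiting to be served by the hardware

    def open_socket(self, process_id, src_port, dst_port):
        table = self.socket_tables.setdefault(process_id, {})
        socket_id = len(table)     # socket IDs are unique only within one process
        table[socket_id] = (src_port, dst_port)
        return socket_id

    def send(self, current_process_id, socket_id, data):
        # Locate the socket data from the socket ID plus the implicit process ID.
        src_port, dst_port = self.socket_tables[current_process_id][socket_id]
        # Construct the header and copy the application data into the payload.
        packet = HEADER.pack(src_port, dst_port, len(data)) + bytes(data)
        self.send_queue.append(packet)  # queue the packet for sending
        return len(data)
```

Two processes may hold the same socket ID in this model, which is why the process ID of the caller is needed on every send before the connection context can be found.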
Conventional protocols such as TCP, and kernel implementations of them, provide the desired reliability, both in terms of data communications integrity and by separating the individual applications from the common system facilities. It is clear, however, that the amount of work performed in the kernel to handle each packet at each end of a connection may lead to significant inefficiency in terms of processing overhead.
More information about operating system kernels and the implementation of multi-process communications such as TCP/IP may be found in, for example, TCP/IP Illustrated, Volume 1: The Protocols, by W. Richard Stevens, Addison-Wesley, 1994, ISBN 0-201-63346-9; Linux Kernel Development, Second Edition, by Robert Love, Novell Press, Jan. 12, 2005, Print ISBN-10: 0-672-32720-1, Print ISBN-13: 978-0-672-32720-9; and TCP/IP Illustrated, Volume 2: The Implementation, by Gary R. Wright and W. Richard Stevens, Addison Wesley Professional, Jan. 31, 1995, Print ISBN-10: 0-201-63354-X, Print ISBN-13: 978-0-201-63354-2, each of which is hereby incorporated by reference in its entirety. In the TCP/IP communications stack, TCP provides application level messaging in the form of a reliable connection oriented protocol with flow control, while IP provides connectionless routing of packets, node to node.
The kernel running the communications stack and the applications share the same processors, consuming processor cycles. Any cycles consumed by the kernel to run the standard communications protocols (TCP/IP) and to interface with the applications are cycles that are lost to the applications. In a distributed computing environment such as the high performance computing (HPC) environment, application cycles are at a premium. At the same time, due to the distributed processing nature of the application, a large amount of inter-processor communication with low latency is required. The existing TCP/IP protocol suite, for example, provides an elegant and standard method of routing many data streams concurrently. But even when implemented efficiently, it does not meet the super computer requirement of almost instantly placing data sent from an application on one processor into the memory space of an application on a different processor. There exists, therefore, a need for the development of an improved method and system to allow applications in a multi-computer environment to communicate more efficiently.