This invention relates generally to networked communications and, more particularly, relates to network communications between computer applications using different network transport providers.
Computer networking allows applications residing on separate computers or devices to communicate with each other by passing data across the network connecting the computers. Traditional network media, such as Ethernet and ATM, are not reliable for application-to-application communication and provide only machine-to-machine datagram delivery service. In order to provide reliable application-to-application communication, transport protocol software run on the host machine must provide the missing functionality.
Typically, the protocol software for network communication is implemented as a combination of a kernel-mode driver and a user-mode library. All application communication passes through these components. As a result, application communication consumes a significant amount of the host processor's resources and incurs additional latency. Both of these effects degrade application communication performance. This degradation significantly limits the overall performance of communication intensive applications, such as distributed databases.
Recently, a new class of communication interconnects called System Area Networks (SANs) has emerged to address the performance requirements of communication intensive distributed applications. SANs provide very high bandwidth communication (multiple gigabytes per second) with very low latency. SANs differ from existing media, such as Gigabit Ethernet and ATM, because they implement reliable transport functionality directly in hardware. Each SAN network interface controller (NIC) exposes individual transport endpoint contexts and demultiplexes incoming packets accordingly. Each endpoint is usually represented by a set of memory-based queues and registers that are shared by the host processor and the NIC. Many SAN NICs permit these endpoint resources to be mapped directly into the address space of a user-mode process. This allows application processes to post messaging requests directly to the hardware. This design consumes very little of the host processor's resources and adds little latency to communication. As a result, SANs can deliver extremely good communication performance to applications.
Most distributed applications are designed to communicate using a specific transport protocol and a specific application programming interface (API). A large number of existing distributed applications are designed to utilize the Transmission Control Protocol/Internet Protocol (TCP/IP) suite and some variant of the Berkeley Sockets API, such as Windows Sockets.
In general, each SAN implementation utilizes a custom transport protocol with unique addressing formats, semantics, and capabilities. Often, the unique capabilities of a SAN are only exposed through a new communication API as well. Since existing applications are usually designed to use one primary transport protocol and API (most often TCP/IP and Sockets), there have been relatively few applications that can take advantage of the performance offered by SANs. In order for existing applications to use a SAN, the TCP/IP protocol software must currently be run on top of it, eliminating the performance benefits of this medium.
In order to provide the performance benefit of SANs without requiring changes to application programs, a new component is inserted between the communication API used by the application, e.g. Windows Sockets, and a SAN transport provider. This new component (hereinafter network transport switch) emulates the behavior of the primary transport provider that the application was designed to utilize, e.g. TCP/IP, while actually utilizing a SAN transport provider to perform data transfer. In situations where the SAN transport provider is not suitable for carrying application communication, e.g. between sub-networks of an internetwork, the network transport switch continues to utilize the primary transport provider. A mechanism is provided within the switch for automatically determining whether to utilize the primary transport provider or an alternative transport provider.
One example of this approach is described in a paper titled "SCI for Local Area Networks" by Stein Jorgen Ryan and Haakon Bryhni, ISBN 82-7368-180-7 (hereinafter SCILAN). Another example is described in a paper titled "High Performance Local Area Communication with Fast Sockets", by Steven H. Rodrigues, Thomas E. Anderson, and David E. Culler, in Proceedings of Usenix Annual Technical Conference, 1997 (hereinafter Fast Sockets).
The SCILAN architecture provides for utilization of an alternative transport provider for communication between applications residing on computer systems connected to an SCI network. A known IP address range is assigned to the SCI network. If an application uses an address in this range to identify another application with which it would like to communicate, then the alternative transport provider is used. If an address is specified from a different range of the IP address space, then the standard TCP/IP provider is used. Note that in this architecture, the TCP/IP provider must use a separate physical network from the SCI network.
Fast Sockets also provides for utilization of an alternative transport provider for communication between applications residing on computer systems connected to a system area network. When an application tries to establish a connection, Fast Sockets applies a hash function to the destination TCP port address in order to obtain an alternative port address. Fast Sockets then tries to establish a connection to the alternative port address using TCP/IP. If this connection attempt succeeds, Fast Sockets uses the connection to negotiate a separate connection over the alternative transport provider. If the first connection attempt fails, Fast Sockets establishes a connection to the original port address supplied by the application using TCP/IP. When an application issues a request to listen for connections on a specific TCP port address, Fast Sockets applies the hash function to the address supplied by the application and then listens on both the requested port and the generated alternative port. This approach requires that two connection attempts be made regardless of whether TCP/IP is ultimately used to carry the application's data. This approach also overloads the TCP port address space and will fail if the alternative port address generated during a connection attempt is already in use by another application.
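The connection sequence described above for Fast Sockets can be illustrated with a minimal sketch. The hash function, the port range it maps into, and the names `alt_port`, `fast_connect`, and `tcp_connect` are hypothetical and chosen only for illustration; the Fast Sockets paper's actual hash is not reproduced here. The sketch records each connection attempt so that the cost of two attempts per connection is visible in both outcomes.

```python
from typing import Callable, List, Tuple

# Assumption: alternative ports are drawn from the dynamic port range.
DYNAMIC_BASE = 49152
DYNAMIC_SIZE = 16384


def alt_port(port: int) -> int:
    """Hypothetical hash from a requested TCP port to an alternative port."""
    return DYNAMIC_BASE + (port * 2654435761) % DYNAMIC_SIZE


def fast_connect(
    port: int, tcp_connect: Callable[[int], bool]
) -> Tuple[str, List[Tuple[str, int]]]:
    """Return the transport finally used and the list of attempts made.

    tcp_connect(p) stands in for a real TCP connect and reports whether
    the peer accepted a connection on port p.
    """
    attempts = [("tcp", alt_port(port))]
    if tcp_connect(alt_port(port)):
        # Peer listens on the alternative port: use that TCP connection
        # to negotiate a second connection over the alternative transport.
        attempts.append(("san", port))
        return "san", attempts
    # Peer does not listen on the alternative port: fall back to the
    # port the application originally asked for.
    attempts.append(("tcp", port))
    if tcp_connect(port):
        return "tcp", attempts
    return "failed", attempts


# A peer running Fast Sockets listens on both ports; a plain peer on one.
fast_peer = {80, alt_port(80)}
plain_peer = {80}
used_fast, attempts_fast = fast_connect(80, lambda p: p in fast_peer)
used_plain, attempts_plain = fast_connect(80, lambda p: p in plain_peer)
```

In both cases two attempts are recorded, which is the overhead noted above. The sketch also makes the port-collision weakness easy to see: if an unrelated application already occupies `alt_port(80)`, the first attempt succeeds against the wrong listener.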
In order to emulate the data transfer behavior of the primary transport provider when utilizing an alternative transport provider, a network transport switch must implement a protocol that controls the transfer of data from source memory buffers supplied by a first application into destination memory buffers supplied by a second application. This aspect of data transfer is known as flow control.
The TCP/IP protocol provides for data transfer in the form of an unstructured stream of bytes. It is the responsibility of the applications using the TCP/IP protocol to encode the data stream to mark the boundaries of messages, records, or other structures. The Berkeley Sockets and Windows Sockets communication APIs offer applications a great deal of flexibility for receiving data. Applications may request to receive data directly into a specified memory buffer, request to receive a copy of a prefix of the data directly into a specified buffer without removing the original data from the byte stream (peek), or request to be notified when data is available to be received and only then request to receive the data or peek at it. Since TCP/IP provides an unstructured byte stream, an application may request to receive data from the stream into a specified memory buffer in any size portion, e.g. a single byte or thousands of bytes. The flexibility of these communication APIs and the unstructured nature of the TCP/IP data stream make it difficult to implement a flow control protocol that works efficiently for all applications.
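The receive options described above can be illustrated with a short sketch using Python's standard `socket` and `select` modules, which wrap the Berkeley Sockets calls. It assumes a locally connected stream socket pair in place of a real TCP connection; the byte-stream semantics are the same for the purposes of this illustration.

```python
import select
import socket

# A connected pair of stream sockets stands in for a TCP connection.
sender, receiver = socket.socketpair()
sender.sendall(b"HELLOWORLD")

# Option 1: ask to be notified when data is available before receiving.
readable, _, _ = select.select([receiver], [], [], 1.0)

# Option 2: peek at a prefix without removing it from the byte stream.
peeked = receiver.recv(5, socket.MSG_PEEK)

# Option 3: receive in any size portion, here a single byte and then
# whatever remains, up to 4096 bytes.
first = receiver.recv(1)
rest = receiver.recv(4096)

sender.close()
receiver.close()
```

Because the receiver is free to mix these styles and to choose any buffer size on each call, the sender cannot know in advance how large the next receive buffer will be or when it will be posted.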
The present invention provides an improved network transport switch to enable applications designed for a primary transport provider to use one of a plurality of alternative transport providers that offer some benefit over the primary transport provider, such as higher performance. When an application or a device attempts to communicate with another application or device, the network transport switch determines whether to use the primary transport provider or one of the alternative transport providers to carry the communication. A table of network addresses supported by alternative providers is automatically constructed and maintained. The switch compares the network addresses of the sending and receiving application or device to the table of network addresses to determine whether they both are on one of the alternative network interconnect systems. If the applications or devices are attached to the same alternative network interconnect system, the switch establishes communication directly through the alternative transport provider for that alternative network interconnect system. If the two applications are attached to different networks, then the switch utilizes the primary transport provider. When the switch establishes communication through an alternative transport provider, it emulates the semantics of the primary transport provider such that the communicating applications are unaware that an alternative transport provider is in use.
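The provider selection described above can be sketched as a lookup against a table of network addresses. In this illustration the table is hard-coded rather than automatically constructed, and the subnet values, provider names, and function names (`lookup`, `choose_provider`) are hypothetical; the sketch shows only the comparison step, not the switch itself.

```python
import ipaddress
from typing import Dict, Optional, Tuple

Network = ipaddress.IPv4Network

# Hypothetical table: subnet served by an alternative provider -> provider.
# The switch described above would build and maintain this automatically.
ALT_TABLE: Dict[Network, str] = {
    ipaddress.ip_network("10.1.0.0/24"): "san_a",
    ipaddress.ip_network("10.2.0.0/24"): "san_b",
}


def lookup(addr: str, table: Dict[Network, str]) -> Optional[Tuple[Network, str]]:
    """Return the (subnet, provider) entry that contains addr, if any."""
    ip = ipaddress.ip_address(addr)
    for net, provider in table.items():
        if ip in net:
            return net, provider
    return None


def choose_provider(
    local: str, remote: str, table: Dict[Network, str], primary: str = "tcpip"
) -> str:
    """Use an alternative provider only if both ends share its network."""
    local_entry = lookup(local, table)
    remote_entry = lookup(remote, table)
    if local_entry is not None and local_entry == remote_entry:
        return local_entry[1]
    return primary
```

For example, `choose_provider("10.1.0.5", "10.1.0.9", ALT_TABLE)` selects `"san_a"`, while a remote address on a different subnet or outside the table falls back to the primary provider.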
When using an alternative transport provider, the network transport switch achieves improved data transfer performance by applying an adaptive flow control protocol that adjusts its data transfer strategy based on the behavior of the communicating applications. The switch monitors the receiving application to determine when it posts buffers to receive data and to detect the size of those buffers. The switch then changes the way it directs data to be transferred between the applications based on when the buffers were posted and on their size. Large data blocks are transferred using remote direct memory access transfers if the receiving application's receiving buffers are of sufficient size, or through messages if the receiving buffers are not large enough. Through this adaptive mechanism, the network transport switch attempts to maximize the communication bandwidth and minimize the communication latency observed by the communicating applications.
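The transfer decision described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the protocol itself: the 16 KB threshold for a "large" block, the interpretation of "sufficient size" as a posted buffer able to hold the whole block, and the names `ReceiverState` and `choose_transfer` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: blocks at or above this size count as "large".
LARGE_BLOCK_THRESHOLD = 16 * 1024


@dataclass
class ReceiverState:
    """What the switch has observed about the receiving application."""

    # Size of the receive buffer currently posted, or None if the
    # application has not yet posted a buffer.
    posted_buffer_size: Optional[int]


def choose_transfer(
    data_len: int,
    receiver: ReceiverState,
    threshold: int = LARGE_BLOCK_THRESHOLD,
) -> str:
    """Pick 'rdma' or 'message' for one block of data_len bytes."""
    if data_len < threshold:
        # Small blocks always travel as ordinary messages.
        return "message"
    posted = receiver.posted_buffer_size
    if posted is not None and posted >= data_len:
        # Assumption: "sufficient size" means the posted buffer can
        # hold the entire block, so it can be written directly.
        return "rdma"
    # No buffer posted yet, or the buffer is too small.
    return "message"
```

Under these assumptions, a 64 KB block sent to a receiver with a 64 KB buffer posted is transferred by remote direct memory access, while the same block sent to a receiver with a 4 KB buffer, or with no buffer posted, is carried in messages.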
Additional features and advantages of the invention will be made apparent from the following detailed description of illustrative embodiments that proceeds with reference to the accompanying figures.