Technical Field
Embodiments of the present invention relate generally to data processing system communications and more particularly to a method and system for transmitting an application message between nodes of a clustered data processing system.
Description of the Related Art
It is becoming increasingly common in data processing and computer systems to move from large, monolithic systems towards more modular distributed or clustered systems. This is because distributed systems can, for example, provide advantages in terms of management efficiency and greater performance. They can also offer lower entry cost and higher scalability, and allow the use of commodity PC (Personal Computer) servers.
Such distributed systems are typically arranged as a data communications network comprising a number of different processing devices (e.g., computers) and peripheral devices such as storage devices which form “nodes” of the network and are interconnected by appropriate communications channels over which they can communicate with each other and exchange messages. In such arrangements, each “node” of the system will also typically include a so-called port or interface adapter that will exchange data, etc., with the system devices making up the node via a local data bus and also control and carry out communication exchanges with other nodes and devices of the system via the communications network. An example of such a data processing system is a communications network-based distributed mass storage system.
In this type of arrangement, communication between nodes over the communications network normally takes place using a known and standardized communications protocol. One commonly used architecture and protocol for such data communications networks is the so-called “Fibre Channel” protocol (see, for example, ANSI X3.303:1998 which defines the Fibre Channel physical interface).
In a Fibre Channel system, each node of the communications network (e.g., processing or peripheral device) is linked to the network by a Fibre Channel Port which, inter alia, exchanges data with the processing or peripheral device or devices (often referred to as a “host”) of the node via a local data bus. The Fibre Channel Port also includes an interface controller that conducts lower level protocol exchanges between the Fibre Channel communications network and the host processing or peripheral device or devices with which the Fibre Channel port is associated.
Fibre Channel systems also support, and are able to transfer data according to, higher level communications protocols, such as IP (Internet Protocol) and the SCSI (Small Computer Systems Interface) protocol (see, for example, ANSI X3.270:1996 which is an architecture document and SPC2 NCITS.351:2001 which describes SCSI primary commands). The SCSI protocol is, as is known in the art, commonly used for communications such as read and write commands from a host processing device (e.g., computer) to a peripheral storage device. Indeed, the presently predominant communications protocol and network architecture for distributed storage systems is SCSI over Fibre Channel (referred to as “FCP” and defined in, inter alia, ANSI X3.269:1996). In such arrangements, the higher level communications protocol such as the SCSI protocol is implemented on top of the Fibre Channel protocol.
When using distributed systems involving communications networks, it is important to have an efficient communication mechanism for exchanging the messages that applications of the system might need to send between nodes and devices of the system to carry out the useful functions of the system. Examples of such application messages would include the request, grant, lock, invalidate, etc., messages that might be exchanged in a distributed system to access and manipulate metadata relating to a set of data (or the data itself), for example, for flash copy functions, such as to determine whether a set of data has been flash copied or locked, and, more generally, the messages that the system's control algorithms will use to control and carry out the useful functions of the system. It will be appreciated by those skilled in the art that such application messages should be distinguished from the lower level commands and protocol messages, such as an indication to the receiver to expect an application message, that may also be exchanged between nodes of the network to control the sending of the application message itself.
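By way of illustration only, the distinction between application messages and protocol messages might be sketched as follows. The type names, the `target` field and the `kind:target` byte encoding are hypothetical and chosen purely for the example; they are not taken from any standard or from any particular system.

```python
from dataclasses import dataclass
from enum import Enum


class AppMessageType(Enum):
    """Application-level messages used by the system's control algorithms."""
    REQUEST = "request"
    GRANT = "grant"
    LOCK = "lock"
    INVALIDATE = "invalidate"


class ProtocolMessageType(Enum):
    """Lower level messages that only control the sending of an application message."""
    WRITE_COMMAND = "write_command"
    TRANSFER_READY = "transfer_ready"
    COMPLETION = "completion"


@dataclass(frozen=True)
class AppMessage:
    kind: AppMessageType
    target: str  # hypothetical identifier of the data set the message refers to

    def encode(self) -> bytes:
        # Illustrative wire format only: "<kind>:<target>" in ASCII.
        return f"{self.kind.value}:{self.target}".encode("ascii")
```

In this sketch an `AppMessage` is the payload that the applications care about, whereas a `ProtocolMessageType` value never carries application meaning of its own.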
The issue of efficient application message exchange is exacerbated in distributed storage systems, because the messaging overhead budget is often measured in tens of microseconds.
One way to enhance the efficiency of such application message exchange in a distributed system would be to use an upper level communications protocol which is designed more for messaging, such as the Virtual Interface (VI) protocol, for such application messages. However, not all existing communications protocol ports and adapters, such as Fibre Channel adapters, will support such additional communications protocols. It would also be possible to use a separate network within the distributed system which is more optimized for messaging, such as an InfiniBand network, but while that may give better performance, it would carry the increased cost of an additional communications network needing to be added to the system.
It is also known to use the existing communications network and protocol of the distributed system to exchange application messages between nodes (and hence devices) linked by the communications network. For example, existing communications protocols such as Fibre Channel and SCSI over Fibre Channel support the “writing” of data from one network node to another. It is possible therefore to use this “write” process to “write” an application message to the intended receiver over the communications network.
In such an arrangement, the message originator would issue a write (or similar) command protocol message (i.e., a command that it wishes to transfer “data” to the intended receiver) to the intended receiver of the application message, with the system then operating subsequently as for any other write operation. Thus, for example, in a Fibre Channel based system operating in this manner, upon receipt of the write command, the receiver would be interrupted, inspect the write command, allocate memory space to receive the intended application message and then return a “transfer ready” protocol message to the application message originator. The application message originator would then transfer the application message, with the receiver again being interrupted to inform it of the successful application message transmission. The receiver would then complete the write command, and notify the application message originator of the application message completion, at which point the application message originator can release the resources associated with the application message.
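The exchange just described can be sketched, purely for illustration, as a small in-memory simulation. The class `Receiver`, the function `send_application_message` and the string labels for the protocol messages are hypothetical names introduced for this example; the sketch models only the ordering of the steps and the two receiver interrupts, not any real Fibre Channel or SCSI adapter interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Receiver:
    """Hypothetical receiving node; counts host processor interrupts."""
    interrupts: int = 0
    buffer: Optional[bytearray] = None
    received: List[bytes] = field(default_factory=list)

    def on_write_command(self, length: int) -> str:
        # First interrupt: inspect the write command and allocate memory
        # for the application message that is about to arrive.
        self.interrupts += 1
        self.buffer = bytearray(length)
        return "TRANSFER_READY"

    def on_data(self, payload: bytes) -> str:
        # Second interrupt: the application message has been transferred.
        self.interrupts += 1
        self.buffer[:] = payload
        self.received.append(bytes(self.buffer))
        self.buffer = None  # tidy up before completing the write command
        return "COMPLETION"


def send_application_message(receiver: Receiver, message: bytes) -> List[str]:
    """Originator side: returns the ordered protocol messages exchanged."""
    trace = ["WRITE_COMMAND"]
    reply = receiver.on_write_command(len(message))
    trace.append(reply)
    if reply != "TRANSFER_READY":
        raise RuntimeError("receiver did not accept the write command")
    trace.append("DATA")
    trace.append(receiver.on_data(message))
    # Only after the completion arrives may the originator release the
    # resources associated with the application message.
    return trace
```

Calling `send_application_message(Receiver(), b"lock:extent-7")` in this sketch yields four steps on the wire, of which only the third carries the application message itself.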
An example of such an arrangement is the use of a SCSI SEND command to send an application message between two SCSI ports that include processing devices. In such an arrangement, the lower levels of the SCSI implementation of the SEND command (for both software and hardware) are the same as for a SCSI WRITE command.
Such arrangements take advantage of the existing communications protocols and hardware used in the network and can work satisfactorily, since, for example, in the case of a SCSI system, much of the protocol message processing can be performed in custom hardware in the SCSI adapter, thereby freeing the main host system processor to do higher level tasks. Furthermore, in most readily available SCSI attachment adapters, the application message issuer will only process the initial write (send) request and the final completion protocol message. The receiver has a little more to do, including handling the initial receipt of the write command, the setup for the application message transfer (such as preparing a memory location for the application message), the notification that the message transfer is complete, and the transmission of the final completion protocol message (and any associated “tidying up”).
However, the Applicants have recognized that a drawback with this type of arrangement is that the application message transmission time is delayed by the initial “handshaking” that is required. The biggest delay will typically be in the processor handling at the receiver end to set the receiver up for the receipt of the application message, although there may also be some significant round trip delay in the network fabric itself. For many IO (Input/Output) operations, such as those associated with write caching, such processing delay or latency is a key determining factor in the performance perceived by a large class of applications (and as such is undesirable).
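The effect of the handshaking on delivery time can be illustrated with a simple model. The following is a sketch under stated assumptions only: it treats every fabric traversal as taking the same one-way time, ignores the time needed to serialise the payload, and uses hypothetical function and parameter names. Any numbers supplied to it are illustrative inputs, not measurements.

```python
def write_style_delivery_us(one_way_us: float, receiver_setup_us: float) -> float:
    """Time until the application message reaches the receiver.

    Assumed sequence: write command out, receiver setup processing,
    "transfer ready" back, then the application message out.
    """
    return one_way_us + receiver_setup_us + one_way_us + one_way_us


def handshake_overhead_us(one_way_us: float, receiver_setup_us: float) -> float:
    """Extra delay relative to sending the message in a single one-way trip."""
    return write_style_delivery_us(one_way_us, receiver_setup_us) - one_way_us
```

For example, with an assumed one-way time of 5 microseconds and an assumed receiver setup time of 20 microseconds, the model gives a delivery time of 35 microseconds, of which 30 microseconds is attributable to the handshake. Under these assumptions the receiver setup term dominates whenever it exceeds the fabric round trip, which is consistent with the observation above.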
Thus the Applicants believe that there remains a need for an application message transmission and receipt process in distributed data processing systems that can reduce latency and/or the total overhead in application message handling.