1. Field of the Invention
The present invention relates, in general, to efficient transmission of data including control information between devices across a network, and, more specifically, to a data transfer protocol useful in data replication that enables storage controllers linked via networks such as a storage area network (SAN) to work in a “tightly coupled” fashion with extremely high efficiency.
2. Relevant Background
Recent years have seen a proliferation of computers and storage subsystems. Demand for storage capacity grows by over seventy-five percent each year. Early computer systems relied heavily on direct-attached storage (DAS) consisting of one or more disk drives coupled to a system bus. More recently, network-attached storage (NAS) and storage area network (SAN) technology are used to provide storage with greater capacity, higher reliability, and higher availability. The present invention is directed primarily to SAN systems that are designed to provide shared data storage that is beyond the ability of a single host computer to efficiently manage.
Mass data storage systems are implemented in networks or fabrics that provide means for communicating data between systems that use data, and the storage systems that implement the physical storage. In many cases, host computers act as storage servers and are coupled to the network and configured with several disk drives that cumulatively provide more storage capacity or different storage functions (e.g., data protection) than could be implemented by a DAS system. For example, a server dedicated to data storage can provide various degrees of redundancy and mirroring to improve access performance, availability and reliability of stored data. Collecting storage sub-systems, where a separate server manages each sub-system, can form a large storage system. More recently, virtualized storage systems such as the StorageWorks® Enterprise Virtual Array announced by Compaq Computer Corporation in October, 2001 provide storage controllers within a fabric or network that present virtualized storage to hosts that require data storage in a manner that enables the host to be uninvolved in the physical configuration, allocation and management of the storage devices. StorageWorks is a registered trademark of Compaq Computer Corporation in the United States and is a trademark or registered trademark in other countries. In this system, hosts simply access logical units of storage that appear to the host as a range of logical address space.
SAN systems enable the possibility of storing multiple copies or “replicas” of data at various physical locations throughout the system. Data replication across multiple sites is desirable for a variety of reasons. To provide disaster tolerance, copies of data stored at different physical locations are desired. When one copy becomes unavailable due to equipment failure, a local network outage, natural disaster or the like, a replica located at an alternate site can allow access to the data. Replicated data can also theoretically improve access in normal operation in that replicas can be accessed in parallel, avoiding bottlenecks associated with accessing a single copy of data from multiple systems.
In SAN systems, the storage controllers at various sites communicate with each other using a common data transfer protocol to coordinate storage and management activities at various sites. The data transfer protocol is key to maintaining performance as well as proper ordering in a multi-volume, multi-target replication environment. Typically it is difficult for a protocol to provide both performance and guaranteed ordering, but in replication applications both are required.
The most popular protocol in SANs is the small computer systems interface (SCSI) protocol family. SCSI is well established as an efficient protocol for block data transfers between host computers and storage devices. To extend the range of SCSI, fibre channel is used in SANs to provide a high-speed, long-distance data communication mechanism. Because fibre channel standards accept SCSI as a transport layer protocol, SCSI is the protocol of choice in most SAN implementations. In data replication systems, however, SCSI contains several inefficiencies that impact performance in communication between storage controllers in a SAN. While fibre channel defines several other transport protocols that substitute for SCSI, in general these other protocols share the limitations of SCSI.
Older methods, for example SCSI, are slower due to the inherent attempt to have the initiator control all data flow. Data flow control involves implementing the data structures and processes that execute, for example, write and copy operations to remote locations while ensuring that the order in which operations are executed retains data integrity, as well as detecting and responding to error conditions. This centralization in a single controller creates a bottleneck in that the initiator storage controller performs the lion's share of data replication tasks while resources in a target controller are underutilized. For example, the target is not involved in the processes of ordering and command completion, forcing the initiator to send all the data to the target and manage the ordering and command completion operation.
SCSI over Fibre channel uses a command/response message protocol to send packets or frames of information between a device associated with a source identifier (S_ID) and a device associated with a destination identifier (D_ID). More specifically, a write operation in a SCSI protocol includes a command phase and a data phase. The command phase establishes an “exchange” which defines specific buffers in the transmitting and receiving devices for storing communicated data frames. A SCSI write cycle begins in the command phase with the originating device sending a command information unit (IU) to the responding device. The command IU identifies an originator exchange identification (OX_ID) pointing to the buffer on the originating device that is dedicated to the write operation. Various metadata about the write operation included in the command IU is used to set up corresponding buffers in the responding device. These corresponding buffers are assigned a responder exchange identification (RX_ID) which is transmitted back to the originating device in a response IU. Only after both devices know the OX_ID/RX_ID pair that defines the exchange can the devices send the actual data that is the subject of the operation in the data phase. Once all the data has been transmitted, the responding device sends a response message indicating status of the write operation.
This exchange, therefore, includes at least three overhead frames in addition to the actual data frames in order to complete the transaction. Because an exchange is set up and broken down frequently, these overhead costs are incurred frequently. Significantly, two of these overhead frames must be exchanged before any data can be transferred. In a network with high latency, this delay caused by the initial set-up of the exchange not only increases the latency required to perform each operation, but also increases the resources required (e.g., buffer memory) to hold data before transmission.
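The frame accounting described above can be illustrated with a short sketch. It is a simplified model for illustration only, not an implementation of any SCSI or fibre channel standard: the names Frame, scsi_write_exchange, overhead_frames, setup_frames_before_data and min_delay_before_first_data are hypothetical, and the delay estimate assumes a fixed one-way latency and ignores frame transmission time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    kind: str    # "command", "response", "data", or "status"
    sender: str  # "originator" or "responder"


def scsi_write_exchange(num_data_frames):
    """Ordered frames of one write exchange, as described in the text (simplified)."""
    frames = [
        Frame("command", "originator"),  # command IU: carries OX_ID and write metadata
        Frame("response", "responder"),  # response IU: returns RX_ID, completing the pair
    ]
    frames += [Frame("data", "originator")] * num_data_frames  # data phase
    frames.append(Frame("status", "responder"))  # final status of the write
    return frames


def overhead_frames(frames):
    """Count frames that carry no write data."""
    return sum(1 for f in frames if f.kind != "data")


def setup_frames_before_data(frames):
    """Count frames that must be exchanged before the first data frame."""
    for index, frame in enumerate(frames):
        if frame.kind == "data":
            return index
    return len(frames)


def min_delay_before_first_data(one_way_latency_s):
    """Assumed model: the command travels out and the response travels back."""
    return 2 * one_way_latency_s


exchange = scsi_write_exchange(num_data_frames=4)
print(overhead_frames(exchange), setup_frames_before_data(exchange))
print(min_delay_before_first_data(0.005))
```

Under this model the overhead count stays at three frames regardless of how much data is written, so the relative cost is highest for small writes, and the set-up delay grows in direct proportion to link latency.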
SCSI over Fibre channel standards include mechanisms for ensuring in-order delivery of packets, such that the recipient device generates an acknowledge message for each command that is successfully performed. For example, a command message may contain a write command, various header information indicating the source and destination of the write command, metadata and state information, and data to be written in a payload section. A storage node that is the designated destination will perform the write and send an acknowledge message addressed to the source ID of the command message.
The fibre channel protocol works well in applications where a single stream of data is involved such that each frame can contain a large amount of data. In such applications, the overhead associated with sending acknowledge packets for each frame is tolerable. However, in a data replication application a channel between two devices may carry multiple streams. In such cases, the overhead associated with the acknowledge packet sent for each transmitted frame is significant and the protocol becomes inefficient.
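The effect of frame size on acknowledge overhead can be shown with simple arithmetic. The sketch below is illustrative only: the function name ack_overhead_fraction is hypothetical, and the byte counts are assumed round numbers chosen for the example, not values taken from the fibre channel standards.

```python
def ack_overhead_fraction(payload_bytes, ack_bytes):
    """Fraction of bytes on the link spent on acknowledges, one acknowledge per frame."""
    return ack_bytes / (payload_bytes + ack_bytes)


# Assumed sizes for illustration: a 64-byte acknowledge per frame.
large_frames = ack_overhead_fraction(payload_bytes=2048, ack_bytes=64)
small_frames = ack_overhead_fraction(payload_bytes=64, ack_bytes=64)
print(f"large frames: {large_frames:.1%}, small frames: {small_frames:.1%}")
```

With these assumed sizes, acknowledges consume about three percent of the traffic when frames are full and half of the traffic when frames carry only small payloads, which is the situation that arises when many streams share one channel.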
In any data communication protocol, the ability to detect and react to unsuccessful transmission (e.g., lost frames) is important. SCSI is relatively slow to detect some kinds of lost information. SCSI was particularly designed to operate over a data bus, not a network, and so is better suited for small, predictable latency in the communication channel. In a SAN, however, the latency may be long and somewhat variable, making it difficult for SCSI to detect when a frame has not been delivered.
A data replication protocol must also respond to connection or link failures in a manner that preserves the integrity of the operations being communicated, and quickly adapts to the failure to perform the desired transaction. For example, when a link fails before a write or copy transaction has been committed to an alternate site, the SCSI protocol cannot readily transfer the transaction to another communication link. Instead, a failure condition is reported and the transaction must be repeated once a new connection is established. A need exists for a system that enables data transactions to be re-routed in-flight in response to link/connection failure conditions.
Link congestion is a condition similar to the link failure described above. An operable link that is carrying too much traffic increases the latency required to deliver frames and increases the likelihood that frames will be dropped. Increased latency also increases the demands on processor and memory resources required to perform the operations as, for example, more and larger buffers are required to hold the larger number of in-flight transactions. A need exists for a system that enables data transactions to be re-routed in-flight in response to link/connection congestion.
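The relationship between latency and buffer demand noted above can be estimated with the general queueing relationship known as Little's law, under which the average number of in-flight transactions equals the transaction rate multiplied by the time each transaction spends in flight. The sketch below is a rough estimate under that assumption; the function names and the example figures are hypothetical.

```python
def in_flight_transactions(transactions_per_s, latency_s):
    """Little's law: average number of transactions outstanding on the link."""
    return transactions_per_s * latency_s


def buffer_bytes_needed(transactions_per_s, latency_s, bytes_per_transaction):
    """Memory required to hold every outstanding transaction until it completes."""
    return in_flight_transactions(transactions_per_s, latency_s) * bytes_per_transaction


# Assumed workload: 10,000 writes per second of 8 KiB each.
uncongested = buffer_bytes_needed(10_000, 0.001, 8192)  # 1 ms latency
congested = buffer_bytes_needed(10_000, 0.010, 8192)    # 10 ms latency
print(round(uncongested), round(congested))
```

At a fixed transaction rate, a tenfold increase in latency calls for roughly ten times the buffer memory in this model.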