1. Field of the Invention
This invention relates generally to communication protocols, and more particularly to lightweight communication protocols for efficiently communicating data between networked computers.
2. Description of the Related Art
The art of networking computers has evolved over the years to bring computer users a rich communication and data sharing experience. As is well known, the Internet has given rise to a new level of sophisticated communication technologies that enable users to share information and communicate via electronic mail with users all over the world. Most of the world's computers communicate using a well-established communication protocol referred to as TCP/IP. TCP/IP is a set of protocols developed to allow cooperating computers to share resources across a network. TCP (the "transmission control protocol") is responsible for breaking up a message into variable length segments, reassembling them at the other end, resending anything that gets lost, and putting things back in the right order. IP (the "internet protocol") is responsible for routing individual segments.
As originally designed, the TCP protocol was intended to be a very fault tolerant protocol that could withstand catastrophic failures of the communication network. TCP was also designed with long range communication and messaging in mind. As a result, TCP is inherently a protocol that has high overhead for handling the communication of variable length segments.
As a high level overview, take an exemplary data file that is selected for communication over a network using the TCP protocol. Initially, the TCP protocol will break up the data file into a plurality of variable length segments. Each variable length segment is then packaged with an associated TCP header. An IP header will also be added to the beginning of the packet. The packet data, TCP header, and part of the IP header are also processed through a checksum before transmission. The packets are then transmitted over the network, where each packet may travel a different path (e.g., through a plurality of routers and the like) on its journey to a destination. At the destination, the TCP protocol is charged with receiving the packets. However, the packets do not necessarily arrive at the destination in order.
Consequently, the TCP protocol is charged with managing the reordering of the received packets. The managing requires the TCP protocol to keep track of timing parameters and other variables to ascertain whether or not certain packets were lost during transmission. For example, the TCP protocol must calculate round trip times defined by when a packet is sent out and when an acknowledgement (ACK) is received. This timing must be continually monitored using complex timing algorithms and adjusted when necessary. If after a set amount of time no ACK is received, it is assumed that the packet is lost and thus must be resent.
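As an illustration of the kind of timing bookkeeping described above, the following is a minimal sketch of a retransmission timeout estimator in the style of the published TCP timer algorithm (RFC 6298). It is not taken from this specification; the class name RtoEstimator and its method names are hypothetical, and the constants are those recommended by that RFC.

```python
class RtoEstimator:
    """Smoothed round-trip-time estimator in the style of RFC 6298.

    Hypothetical illustration only; not part of the described invention.
    """

    ALPHA = 1 / 8   # gain applied to the smoothed RTT
    BETA = 1 / 4    # gain applied to the RTT variance
    MIN_RTO = 1.0   # seconds; lower bound recommended by RFC 6298

    def __init__(self):
        self.srtt = None    # smoothed round trip time
        self.rttvar = None  # round trip time variance
        self.rto = 1.0      # initial timeout before any measurement

    def on_ack(self, rtt_sample):
        """Update the timeout from one measured round trip time (seconds)."""
        if self.srtt is None:
            # First measurement seeds both estimates.
            self.srtt = rtt_sample
            self.rttvar = rtt_sample / 2
        else:
            # The variance is updated before the smoothed RTT.
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - rtt_sample))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt_sample
        self.rto = max(self.MIN_RTO, self.srtt + 4 * self.rttvar)
        return self.rto

    def on_timeout(self):
        """Back off the timer after a packet is presumed lost."""
        self.rto *= 2
        return self.rto
```

Every acknowledged packet feeds on_ack, and every expiry feeds on_timeout, so the sender performs this arithmetic continually for each connection, which is the per-packet software work the passage above refers to.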
As an example, assume that a sender begins to send packets to a given target. The sender will operate on a timer to determine when certain packets are not received. In some cases, a packet is received, but the target takes too long to acknowledge safe receipt. This situation tends to happen most often as congestion over a network path increases. To handle such situations, TCP utilizes what is referred to as a "slow start algorithm." The slow start algorithm is triggered when congestion reaches a point where packets are not being acknowledged (i.e., when the sender times out) and therefore the sending operation is restarted. The restarting therefore causes significant reductions in throughput performance. More importantly, because the sender is primarily relying on the time out to determine when packets are lost, the slow start algorithm will many times cause a restart when a packet is not necessarily lost. That is, the acknowledgment may just have taken slightly longer than the fixed time out. Consequently, if the acknowledgment was received just after the time out, the slow start algorithm will still cause a resend.
Although this type of processing works, the processing performed for lost packet detection can be computationally intensive. In order to handle the processing of the TCP protocol, the TCP protocol is commonly implemented in software. The handling in software is primarily necessitated to enable, among other things, the detection of lost packets, and the reordering of received packets.
To appreciate the amount of overhead needed to transmit data using standard TCP/IP, the following describes the sending and receiving of data with a SCSI host adapter and a typical NIC. When sending data, the host adapter driver is given a pointer to a buffer of data, which the driver converts to a scatter/gather list of physical memory segments that is passed to the host adapter. From there on, the adapter takes over and transfers all the data without further CPU intervention, posting an interrupt when it is finished with the whole buffer. The TCP/IP stack gets a similar pointer to a buffer, but has to generate headers for encapsulation. Part of that process involves reading every byte to form the TCP checksum. The user buffer is logically broken into 1500 byte (or so) chunks and passed to the driver, one chunk at a time. A good implementation doesn't actually have to copy anything; it just sends a set of scatter/gather pointers to the driver, one for the chunk of data to be sent in a packet, and a couple for the TCP and IP headers. These are converted to physical addresses by the driver, and sent to the NIC, which then transmits the data packet and then requests the next one (either through an interrupt or by pulling the information off of a linked list). The bottom line for a typical 4K data buffer is that a SCSI command block (SCB) and a one or two element scatter/gather (S/G) list are passed to the host adapter, while three separate packets, with a total of at least 9 S/G pointers, are passed to the NIC, plus every byte of the data and headers has to be read to form the TCP checksum.
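The packet and pointer counts quoted above follow from simple arithmetic, sketched here. The function name packets_for_buffer and its defaults are hypothetical; the defaults assume the roughly 1500 byte chunk size and the three scatter/gather pointers per packet (one for the data chunk, two for the TCP and IP headers) described in the text.

```python
import math


def packets_for_buffer(buffer_len, chunk_len=1500, sg_per_packet=3):
    """Return (packet count, S/G pointer count) for one user buffer.

    chunk_len and sg_per_packet are assumptions drawn from the
    surrounding description, not fixed properties of any NIC.
    """
    packets = math.ceil(buffer_len / chunk_len)
    return packets, packets * sg_per_packet


# A 4K buffer needs three packets and nine scatter/gather pointers.
packets, sg_pointers = packets_for_buffer(4096)
```

By comparison, the SCSI host adapter in the same example receives a single command block and a list of one or two scatter/gather elements for the whole buffer.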
It is even worse on the receive side, because it is not known which user buffer should receive the data until the headers are inspected. While some recent NIC chips can separate the TCP/IP header from the rest of the data packet and give the host a few microseconds to tell it where to put the rest of the packet, allowing direct DMA to the user buffer, most implementations put the received data in a driver buffer, then copy the user data to the user's buffer after inspecting the header. As with sending, a pass over the data and headers is also needed to verify the TCP checksum.
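To show why the checksum forces a pass over every byte, the following is a minimal sketch of the 16-bit one's complement Internet checksum (RFC 1071) on which the TCP checksum is based. The function name internet_checksum is hypothetical, and the sketch omits the pseudo-header drawn from the IP header that a real TCP checksum also covers.

```python
def internet_checksum(data: bytes) -> int:
    """Compute the 16-bit one's complement checksum of data (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    # Every byte is read once, two at a time, as big-endian 16-bit words.
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold the carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Sender side: compute the checksum over the outgoing bytes.
payload = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(payload)

# Receiver side: recomputing over data plus checksum yields zero when intact.
verified = internet_checksum(payload + checksum.to_bytes(2, "big")) == 0
```

The loop touches the entire buffer on both the sending and the receiving host, which is the cost the two preceding paragraphs describe.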
In view of the foregoing, there is a need for a networking protocol that removes the overhead issues produced by conventional TCP. There is also a need for a transport protocol that is optimized for storage and enables fast and efficient utilization in local area networks, wide area networks, and over the Internet.
Broadly speaking, the present invention fills these needs by providing an Ethernet storage protocol (ESP) that streamlines the processing and communication of storage data and removes the overhead associated with prior art communication protocol techniques. Preferably, the ESP is configured to encapsulate the data with an efficient lightweight transport protocol, and is configured to provide unlimited scalability to storage resources. In one preferred embodiment, the ESP is configured to encapsulate SCSI data and communicate the SCSI data over an Ethernet network. The communication is preferably accomplished by enabling host computers and target peripheral devices (e.g., storage drives) to operate using the ESP. It should be appreciated that the present invention can be implemented in numerous ways, including as a process, an apparatus, a system, a device, a method, or a computer readable medium. Several inventive embodiments of the present invention are described below.
In one embodiment, a method for processing storage data that is to be communicated over a network is disclosed. Initially, storage data to be transmitted over a network is provided. Once the data is provided, the method includes serializing the storage data using storage encapsulation protocol headers to generate serialized storage data. Then, the serialized storage data is encapsulated using a simple transport protocol to generate simple transport protocol data segments of the storage data. At this point, each of the simple transport protocol data segments is encapsulated into an Ethernet frame. The Ethernet frames can then be communicated over standard Ethernet hubs and switches to enable communication to a selected storage target.
In another embodiment, an Ethernet storage protocol (ESP) enabled network is disclosed. The ESP network includes a host computer having host interface circuitry for communicating data in an Ethernet network. The host interface circuitry is configured to receive parallel data from the host computer provided in accordance with a peripheral device protocol (e.g., SCSI). The parallel data is serialized and encapsulated into Ethernet frames for transmission over the Ethernet network. The ESP network also includes a target having target interface circuitry for communicating data in the Ethernet network. The target interface circuitry is configured to receive the encapsulated serialized parallel data and reconstruct the serialized parallel data into the peripheral device protocol. In this embodiment, the peripheral device protocol is one of a SCSI protocol, an ATAPI protocol, and a UDMA protocol. In addition, the serializing in this embodiment includes attaching storage encapsulation protocol (SEP) headers to portions of the parallel data and attaching simple transport protocol (STP) headers to one or more portions of the parallel data having the SEP headers, such that each STP header defines an STP packet. Thus, transmission over the Ethernet network does not require the TCP protocol, which is large on overhead and inefficient for storage data transfers in a local area network environment.
In yet a further embodiment, a method for communicating storage data over an Ethernet network using a non-TCP lightweight transport protocol is disclosed. The method includes providing data having a peripheral device protocol format, and the data is configured to be communicated over the Ethernet network. The method then proceeds to select portions of the data and attach storage encapsulation protocol (SEP) headers to the selected portions of the data. The method then attaches simple transport protocol (STP) headers to one or more of the selected portions having the SEP headers to produce STP packets. Once the STP packets are defined, the method moves to encapsulate the STP packets into Ethernet frames for communication over the Ethernet network. In this embodiment, the peripheral device protocol format is one of a SCSI format, an ATAPI format, and a UDMA format. Further, each of the STP headers is configured to include at least a handle field, a type field, a length field, a sequence number field, and an acknowledgment field.
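The layering described in this embodiment can be sketched as follows. This is an illustrative assumption, not the disclosed wire format: the summary names the STP header fields but not their widths or order, so the field sizes, the two-field SEP header, the function names, and the use of the IEEE local experimental EtherType 0x88B5 are all hypothetical. The frame check sequence and minimum frame padding, which Ethernet hardware normally supplies, are omitted.

```python
import struct

# Hypothetical layouts; widths and ordering are assumptions for illustration.
STP_HEADER = struct.Struct("!HBHII")  # handle, type, length, sequence, ack
SEP_HEADER = struct.Struct("!BH")     # SEP type, payload length
ETHERTYPE_PLACEHOLDER = 0x88B5        # IEEE local experimental EtherType


def sep_wrap(payload, sep_type=1):
    """Attach a SEP header to one selected portion of storage data."""
    return SEP_HEADER.pack(sep_type, len(payload)) + payload


def stp_wrap(sep_units, handle, seq, ack, stp_type=0):
    """Attach one STP header to one or more SEP units, forming an STP packet."""
    body = b"".join(sep_units)
    return STP_HEADER.pack(handle, stp_type, len(body), seq, ack) + body


def ethernet_frame(dst_mac, src_mac, stp_packet):
    """Encapsulate an STP packet in an Ethernet frame (no FCS or padding)."""
    return dst_mac + src_mac + struct.pack("!H", ETHERTYPE_PLACEHOLDER) + stp_packet


def stp_unwrap(frame):
    """Recover the STP header fields and body from a frame."""
    stp = frame[14:]  # skip destination MAC, source MAC, and EtherType
    handle, stp_type, length, seq, ack = STP_HEADER.unpack_from(stp)
    body = stp[STP_HEADER.size:STP_HEADER.size + length]
    return handle, stp_type, seq, ack, body


units = [sep_wrap(b"READ"), sep_wrap(b"DATA!")]
frame = ethernet_frame(b"\x02" * 6, b"\x04" * 6,
                       stp_wrap(units, handle=7, seq=1, ack=0))
```

Note that nothing in this path computes a checksum over the payload in software or splits the data by byte offset; each STP packet maps to exactly one Ethernet frame.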
In still a further embodiment, a network for efficiently communicating storage data is disclosed. The network includes a cluster server system having a plurality of host server systems, and each of the host server systems has a peripheral interface card for facilitating storage data communication in accordance with an Ethernet storage protocol (ESP). A storage box is also included having one or more storage peripheral devices. The storage box includes a bridge circuit for facilitating storage data communication in accordance with the ESP. The network is interconnected by way of an Ethernet switch which is configured to connect the cluster server system to the storage box. In this embodiment, the ESP is configured to: (i) select portions of the storage data; (ii) attach storage encapsulation protocol (SEP) headers to the selected portions of the storage data; (iii) attach simple transport protocol (STP) headers to one or more of the selected portions having the SEP headers to produce STP packets; and (iv) encapsulate the STP packets into Ethernet frames for communication over the Ethernet network including the cluster server system, the storage box, and the Ethernet switch. An added advantage of this embodiment is the ability of the ESP to add an IP header after the STP header for communication over an ISO level 3 router or level 3 switch.
In yet a further embodiment, a storage area network (SAN) is disclosed. The SAN includes a server system including one or more host computer systems. Each host computer system includes network interface circuitry and host peripheral interface circuitry. The network interface circuitry is configured to communicate data using a TCP protocol and the host peripheral interface circuitry is configured to communicate data using an Ethernet storage protocol (ESP). A storage box is also provided having one or more storage drives, and the storage box has a bridge circuit for communicating data using the ESP. The network also includes an Ethernet switch for connecting the server system to the storage box. The ESP is configured to: (i) select portions of the data; (ii) attach storage encapsulation protocol (SEP) headers to the selected portions of the data; (iii) attach simple transport protocol (STP) headers to one or more of the selected portions having the SEP headers to produce STP packets; and (iv) encapsulate the STP packets into Ethernet frames for communication over the network including the Ethernet switch. In this network environment, one or more desktop computers may be connected to the network interface circuitry of the server system. The desktop computers have standard network interface cards (NICs) for communicating standard Ethernet frames to and from the server system. The ESP is configured to add an IP header after the STP header of the STP packets for communication over one of a level 3 router and a level 3 switch. In this embodiment, each of the one or more host computer systems can be servers that are not necessarily homogeneous (i.e., each can operate using different operating systems like Windows™ NT, Windows™ 2000, UNIX, Linux, Sun Microsystems Inc. Solaris, etc.). Further, each host computer system can be a cluster if desired. Of course, the cluster will include two or more homogeneous computer systems (i.e., running the same operating system).
The advantages of the present invention are many and substantial. Most notably, the Ethernet storage protocol (ESP) of the present invention simplifies the communication elements needed to transfer data over a network and enables nearly unlimited scalability. The ESP preferably implements a simple transport protocol (STP) that requires less CPU processing than conventional TCP. It is estimated that CPU utilization for networks using ESP may be as low as 1/5 of that of networks using TCP. In a more preferred embodiment, the ESP will be implemented primarily using hardware and simple software drivers in order to further limit CPU requirements. The ESP also preferably takes advantage of a storage encapsulation protocol (SEP) which is configured to encapsulate portions of storage data, such as SCSI data, ATAPI data, UDMA data, etc. In communication, senders and targets establish communication sessions by exchanging handles, which are used to identify the senders and targets in subsequent communication transactions. Once a session is open, the session preferably will remain open for the entire time the target and host are connected to the ESP network. Another advantage of the present invention is that Ethernet frames are counted to determine whether packets have successfully been transferred. This is substantially more efficient than prior art techniques utilizing TCP, which rely on byte counting and complicated time-out calculations.
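The contrast between counting frames and counting bytes can be sketched as follows. This summary does not specify the acknowledgment rules, so the class FrameCounter and its behavior (cumulative acknowledgment, out-of-order frames simply not counted) are assumptions made purely for illustration.

```python
class FrameCounter:
    """Receiver-side tracker that numbers whole frames rather than bytes.

    Hypothetical sketch; the actual ESP acknowledgment rules are not
    given in this section.
    """

    def __init__(self):
        self.next_expected = 0  # sequence number of the next in-order frame

    def receive(self, seq):
        """Record one frame and return the count of frames received in order."""
        if seq == self.next_expected:
            self.next_expected += 1
        # A frame that arrives out of order is not counted in this sketch,
        # so the returned value stalls and exposes the missing frame.
        return self.next_expected


counter = FrameCounter()
acks = [counter.receive(seq) for seq in (0, 1, 3, 2)]
```

Because the counter advances by one per frame, the receiver needs no per-byte offset arithmetic; a stalled count is enough to tell the sender which frame to resend.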
The ESP of the present invention opens up the ability to share large pools of storage utilizing standard Ethernet equipment. The ESP network preferably includes host computers having peripheral interface cards (PICs) and storage targets, each capable of operating the ESP. In one embodiment, the targets can be native ESP devices having circuitry for performing the ESP operations. In other embodiments, off-the-shelf storage drives can be used in conjunction with a bridge circuit that can run the ESP.