1. Field of the Invention
The present invention relates to transparent access to network attached storage devices, whether configured as SCSI over IP, NAS or NASD devices. In particular, the present invention provides a method and device for using a switch as a virtual storage device, with the advantage that physical devices can be added to, replaced on or removed from a network without reconfiguring network clients or applications running at levels above the network clients.
2. Description of Related Art
There is a trend towards use of cluster devices on networks to improve performance, fail over, load balancing, robustness and other characteristics of network devices. In a cluster device, multiple network devices share the workload of what was originally handled by one device, increasing capacity and scalability while minimizing vulnerability to a single point of failure or a single bottleneck. Transparency can be achieved when addresses are available via mechanisms such as round-robin domain name service ("DNS"), where the cluster shares a single fully qualified domain name ("FQDN") and the name resolution process returns different IP addresses for devices sharing the same FQDN. Protocols such as dynamic host configuration protocol ("DHCP") have been widely adopted for allocating available addresses. However, allocation of addresses as an approach to transparency requires available addresses and DNS services. It also may require modification of network client software and special or modified application software.
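Round-robin DNS, as described above, can be illustrated with a toy resolver. This is a minimal sketch, not a real DNS server; the class name `RoundRobinResolver`, the FQDN `storage.example.com`, and the addresses are hypothetical. Each lookup of the shared FQDN returns the next address in rotation, so successive clients land on different cluster devices.

```python
from itertools import cycle


class RoundRobinResolver:
    """Toy name service (hypothetical): one FQDN maps to several cluster
    addresses, and each lookup returns the next address in rotation."""

    def __init__(self, records):
        # records: FQDN -> list of IP address strings for the cluster devices
        self._rotations = {fqdn: cycle(addrs) for fqdn, addrs in records.items()}

    def resolve(self, fqdn):
        # Raises KeyError for an unknown name, standing in for a failed lookup.
        return next(self._rotations[fqdn])


resolver = RoundRobinResolver(
    {"storage.example.com": ["192.0.2.1", "192.0.2.2", "192.0.2.3"]}
)
answers = [resolver.resolve("storage.example.com") for _ in range(4)]
print(answers)  # ['192.0.2.1', '192.0.2.2', '192.0.2.3', '192.0.2.1']
```

Each device in the rotation still needs its own routable address, which is the cost of this approach noted above.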
Sometimes it is desirable, for backward compatibility with older network clients or because of a lack of available addresses, for a cluster of devices to share one virtual IP address. To maintain transparency when cluster devices share the same virtual IP address, the network client must believe that the transport session endpoint is the virtual IP address. The client must address a single logical device without being aware that there are multiple physical devices. It is desirable for the network transport session to be able to change the device or endpoint within a cluster which is communicating with the client without the client being aware of the change. Having a technology that facilitates such transparency, without any need to change existing client software or IP stacks, may significantly increase the rate of introduction of new clustering technologies. Transparency technology also may facilitate the development of wireless systems.
For networks relying on the transmission control protocol ("TCP"), there are three logical approaches for supporting cluster devices. The first approach is to replay transport connections from one device to the next. The second approach is for the server to instruct the client software to use a specific device within a cluster. The third, which is the approach of the present invention, is to hand off connections transparently among devices within a cluster. A disadvantage of the replay approach is that it generates additional traffic and introduces latency. A disadvantage of the second (EAP) approach is that it potentially requires significant structural changes to the IP stacks in the client and the use of IP options. In essence, the IP stack must be changed so that it understands the existence of a cluster and distinguishes among devices within the cluster. The handoff approach avoids these problems.
Handoffs clearly have benefits when working with clustered systems, server area networks, network attached storage, and other similar, loosely distributed models. Handoffs allow the system to appear as a virtual IP host through which transport connections are forwarded directly to the node being utilized; other nodes in the system are not affected. Resource utilization is more efficient, and transparent fail over is more easily accomplished. Handoffs may help solve problems with address-transparent leases, as in the proposed IP version 6 renumbering. Handoffs also may aid servers in communicating with a network address translation ("NAT") device, if the NAT is performing a cluster-like role.
A variety of network devices may benefit from virtual IP addressing. Disk drives with built-in file systems, sometimes referred to as network attached storage devices ("NAS"), are one type of device that would benefit from or function as a cluster. Web servers, database servers, networked computing clusters and load balancing servers also may benefit from virtual IP addressing. In general, any type of network device that would benefit from a cluster being addressed by a single virtual IP address may benefit from transparency technologies. Virtual addressing can be cascaded, so that a virtual IP cluster may appear as a single address within another virtual IP cluster. Network attached storage is prominent among the variety of network devices that may benefit from the present invention.
Network Attached Storage is a storage paradigm in which disks are detached from the server and placed on the network. Ideally, the server is removed from the datapath between client and data. The goal of a NAS system is to increase the overall performance of the system while reducing the total cost of ownership (TCO). New functionality, such as the appearance of infinite disk capacity and plug-and-play configuration, can be incorporated. Improved performance and functionality at a reduced TCO are made possible by the ever-increasing capabilities of disks and switches. These capabilities allow processing to be offloaded from a centralized server to smarter devices and possibly allow the server itself to be eliminated.
Three different strategies can be pursued to develop a NAS solution. The ultimate strategy would be serverless network attached storage. The strategy names relate to the client-disk datapath: Strategy 1, Server-centric NAS; Strategy 2, Serverless NAS; and Strategy 3, Master/Slave NAS.
A traditional file system is managed as a client/server system. The client accesses the server, which has all the required disks integrated with it. This storage is referred to as Server Integrated Disk (SID) storage. The server-centric strategy has begun migrating from the Server Integrated Disk model to the use of internal (SCSI) communication paths across a network. The new model allows disks to be arbitrarily placed on the network. It relies on a form of networked SCSI (SCSI over IP) to attach the disks logically to the server. In this context, SCSI over IP is used in an inclusive sense, with NetSCSI being a particular research implementation of SCSI over IP by the University of Southern California Information Sciences Institute. This scheme is referred to as Server Attached Disk (SAD). However, the Server Attached Disk model is not expected to yield performance gains. Its gains are expected in ease of use and total cost of ownership.
A "serverless" system is not truly serverless, as some central point of control needs to exist to reduce system complexity; however, the "server" may require insignificant resources. In a serverless system, the system's steady state is direct communication between client and disk. Server interaction is an insignificant percentage of the communications. In the serverless approach, the central control point, when required, would be a switch. A serverless approach requires significant changes to client systems, which is a major barrier to acceptance. A serverless system has been suggested by research work at Carnegie Mellon University (CMU) on Network Attached Secure Disks (NASD). In this context, NASD is used in an inclusive sense, to include Object Based Storage Devices (OBSD) and the proposed SCSI-4 standard. Carnegie Mellon's work also includes overlay systems to ensure backwards compatibility with existing networked file system protocols such as Sun's Network File System (NFS) and Microsoft's Common Internet File System (CIFS). Research results indicate high potential for system scalability. Use of a file overlay system, however, tends to defeat performance gains from the switch-based architecture.
NAS benefits from maintaining transparency to the client in how the back-end system (formerly a server) is implemented. The back-end system should appear as if it is one virtual host when, in actuality, it is composed of a number of different devices.
One aspect of the present invention is a method for handing off TCP sessions in a system including a client in communication with a switch and two or more devices. This method includes determining in the first device that a handoff should take place, identifying a second device to take over the session, sending handoff messages to and receiving an acknowledgment from the second device, and reporting the handoff to and receiving an acknowledgment from the switch. The devices applying this method may be disk drives, web servers, database servers, networked computing clusters, load balancing servers, switches, first devices which aggregate second devices, or any other device that benefits from being clustered. The determination that a handoff should take place may be based on the location of the data being processed in the TCP session, whether data returned to the client from a location other than the first device has reached a predetermined threshold, whether data being returned to the client from another device has reached a predetermined threshold, an evaluation of the relative amount of data being returned to the client from the first device and one or more other devices, or an evaluation of the workload of the first device and other devices. Evaluations may be based on the TCP session of immediate concern or on more than one TCP session involving the first device. Other evaluation criteria will be obvious to one of ordinary skill in the art. Determination of which device should receive the handoff depends upon a technique adapted to the nature of the devices in the cluster. For disk drives, a virtual root directory may be used. For databases, a table of database objects, transaction types, or database subsystems may be used. These techniques may be applied by the first device or by a third device in communication with the first device.
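The handoff steps above, together with one possible threshold criterion and a virtual root directory lookup, can be sketched in-process. This is a minimal sketch under stated assumptions: messages are modeled as direct method calls that always acknowledge, and the names `Switch`, `Device`, `should_hand_off`, `choose_target`, `hand_off`, and the 0.5 threshold are hypothetical, not taken from the invention.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    # (client_ip, client_port) -> name of the device currently serving that session
    routes: dict = field(default_factory=dict)

    def reprogram(self, session, device_name):
        self.routes[session] = device_name
        return True  # acknowledgment to the reporting device


@dataclass
class Device:
    name: str
    sessions: set = field(default_factory=set)

    def accept_handoff(self, session):
        self.sessions.add(session)
        return True  # acknowledgment to the handing-off device


def should_hand_off(bytes_local, bytes_remote, threshold=0.5):
    """Hypothetical criterion: hand off once the share of this session's data
    that had to be fetched from another device reaches `threshold`."""
    total = bytes_local + bytes_remote
    return total > 0 and bytes_remote / total >= threshold


def choose_target(virtual_root, path):
    """Hypothetical virtual root directory: top-level directory name -> device."""
    top_level = path.strip("/").split("/")[0]
    return virtual_root[top_level]


def hand_off(session, first, second, switch):
    # Step 1: send the handoff message to the second device; wait for its ack.
    if not second.accept_handoff(session):
        return False
    # Step 2: report the handoff to the switch; wait for its ack.
    if not switch.reprogram(session, second.name):
        second.sessions.discard(session)  # undo the half-completed handoff
        return False
    # Step 3: the first device releases the session.
    first.sessions.discard(session)
    return True


session = ("198.51.100.20", 49152)
d1, d2 = Device("disk1", {session}), Device("disk2")
switch = Switch({session: "disk1"})
virtual_root = {"home": d1, "projects": d2}

if should_hand_off(bytes_local=4096, bytes_remote=12288):
    target = choose_target(virtual_root, "/projects/report.txt")
    hand_off(session, d1, target, switch)
print(switch.routes[session])  # disk2
```

The ordering is deliberate in this sketch: the switch is reprogrammed only after the second device has acknowledged, so packets are never forwarded to a device that has not yet accepted the session.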
Another aspect of the present invention is that one of the handoff messages may include a kind field, a client port identification and a client IP address, preferably in the form of a TCP option. A sequence number and an acknowledgment number may also be passed as a TCP option. The TCP state machine running on the first and second devices may be modified with additional states to take into account handoffs and half-handoffs. A Set flag may cause transitions in TCP states on both the first and second devices.
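The handoff fields described above can be packed in the usual TCP option layout of kind, length and data. This is a sketch under assumptions: IPv4 addresses, the field order shown, and the kind values are hypothetical (253 and 254 are the experimental TCP option kinds reserved by RFC 4727); the invention does not specify a byte layout.

```python
import socket
import struct

# Hypothetical kind values: 253 and 254 are the experimental TCP option kinds
# reserved by RFC 4727; the actual kinds used for handoff are not specified here.
HANDOFF_CLIENT_KIND = 253
HANDOFF_SEQ_KIND = 254


def encode_client_option(client_ip, client_port, kind=HANDOFF_CLIENT_KIND):
    # kind (1 byte) | length (1 byte) | client port (2 bytes) | IPv4 address (4 bytes)
    return struct.pack("!BBH4s", kind, 8, client_port, socket.inet_aton(client_ip))


def decode_client_option(data):
    kind, length, port, raw_ip = struct.unpack("!BBH4s", data)
    if length != len(data):
        raise ValueError("option length field does not match payload")
    return kind, socket.inet_ntoa(raw_ip), port


def encode_seq_option(seq, ack, kind=HANDOFF_SEQ_KIND):
    # kind (1 byte) | length (1 byte) | sequence number (4 bytes) | ack number (4 bytes)
    return struct.pack("!BBII", kind, 10, seq, ack)


def decode_seq_option(data):
    kind, length, seq, ack = struct.unpack("!BBII", data)
    if length != len(data):
        raise ValueError("option length field does not match payload")
    return kind, seq, ack


client_opt = encode_client_option("198.51.100.20", 49152)
seq_opt = encode_seq_option(seq=1_000_000, ack=2_000_000)
print(len(client_opt) + len(seq_opt))  # 18, within the 40-byte TCP option space
print(decode_client_option(client_opt))  # (253, '198.51.100.20', 49152)
```

Together the two options occupy 18 bytes, leaving room in the 40-byte TCP option area for other options such as timestamps.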
Another aspect of the present invention is that predictive setups may be used to reduce the latency time for a handoff. According to this aspect of the invention, one or more handoff preparation messages may be sent by the first device to other devices and a handoff destination selected from among the devices which acknowledge the handoff preparation messages.
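The predictive setup can be sketched as a fan-out of preparation messages followed by a choice among the devices that acknowledged. This is a hedged illustration: `prepare_and_select`, the `send_prepare` callback, the load table, and the least-loaded selection rule are all hypothetical; the invention only requires that the destination be chosen from among acknowledging devices.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare_and_select(candidates, send_prepare, load):
    """Send a handoff-preparation message to every candidate in parallel and
    pick the least-loaded device among those that acknowledged.

    send_prepare(device) -> bool models the preparation message and its ack;
    load maps device -> current workload (both are hypothetical inputs).
    Returns (destination or None, list of devices that acknowledged).
    """
    if not candidates:
        return None, []
    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        acks = list(pool.map(send_prepare, candidates))  # results keep input order
    ready = [dev for dev, ok in zip(candidates, acks) if ok]
    if not ready:
        return None, []
    return min(ready, key=lambda dev: load[dev]), ready


loads = {"disk2": 0.7, "disk3": 0.1, "disk4": 0.4}
destination, prepared = prepare_and_select(
    ["disk2", "disk3", "disk4"],
    send_prepare=lambda dev: dev != "disk3",  # pretend disk3 declines
    load=loads,
)
print(destination, prepared)  # disk4 ['disk2', 'disk4']
```

Sending the preparation messages in parallel means the slowest peer, not the sum of all peers, bounds the setup time. Prepared devices that are not chosen would then be told to discard their state; that release message is not modeled here.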
Yet another aspect of the present invention is that only half of the TCP session may need to be handed off. In other words, the first device may allocate to a second device either sending messages to or receiving messages from the client, without allocating both roles.
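A half-handoff can be modeled by tracking the two directions of a session separately. In this minimal sketch the class `SessionRoles` and its method names are hypothetical; the example scenario, where a device holding requested data takes over sending while the original device keeps receiving client requests, is illustrative only.

```python
from dataclasses import dataclass


@dataclass
class SessionRoles:
    """Which device holds each half of one client TCP session (hypothetical model)."""
    sender: str    # device sending data to the client
    receiver: str  # device receiving data from the client

    def half_handoff(self, direction, new_device):
        # Reassign only one direction; the other role stays where it was.
        if direction == "send":
            self.sender = new_device
        elif direction == "receive":
            self.receiver = new_device
        else:
            raise ValueError("direction must be 'send' or 'receive'")

    def is_split(self):
        return self.sender != self.receiver


roles = SessionRoles(sender="disk1", receiver="disk1")
roles.half_handoff("send", "disk2")  # disk2 now streams data; disk1 still reads requests
print(roles, roles.is_split())  # SessionRoles(sender='disk2', receiver='disk1') True
```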
The present invention may be practiced as either a method or a device. A device according to the present invention may comprise: a switch including logic for routing messages among a client and a plurality of devices, and logic responsive to an instruction to reprogram its routing of messages and to confirm that the reprogramming is complete; and a first device including logic to determine when a TCP session should be handed off to another device, logic to instruct a second device to accept a handoff, and logic to instruct the switch to reprogram its routing of messages; wherein the second device is in communication with the switch and includes logic responsive to an instruction to accept a handoff and to confirm acceptance of the handoff. The first and second devices may be disk drives, web servers, database servers, networked computing clusters, load balancing servers or any other device that benefits from being clustered. Logic may be included to determine when to hand off a TCP session and to identify a second device to receive the handoff, consistent with the method of the present invention.
The present invention also includes a method of virtually addressing a plurality of storage devices through a switch. This method includes establishing a TCP session between a client and the switch, wherein the switch appears as a virtual storage device. The switch selects one of a plurality of storage devices to participate in the TCP session, and the logic of the switch is programmed to forward packets to the selected storage device. According to one aspect of this invention, the client includes TCP logic to participate in the TCP session, and this logic need not be modified in order for the client to recognize the switch as a virtual storage device. The virtual storage device may appear to be one of a variety of storage devices in accordance with any of several protocols, including SCSI over IP, NAS or NASD. The switch may inspect a TCP session packet and read information beyond the TCP/IP header for purposes of selecting a storage device to participate in the established TCP session. If the switch includes a file directory, the file directory may be accessed based on inspection of one or more packets. A storage device may be selected based on the files contained on the storage device or other characteristics of the storage device. Once the TCP session is underway, the selected storage device may determine that a different storage device should participate in the session. The present invention provides for handing off the TCP session and reprogramming the switch to forward packets to the other storage device, transparently to the client and its TCP logic.
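The switch's role as a virtual storage device can be sketched as a forwarding table keyed by client endpoint. This is a minimal sketch under assumptions: the class `VirtualStorageSwitch`, the `"<VERB> <path>"` request format read beyond the TCP/IP header, the path-to-device directory, and all addresses are hypothetical.

```python
class VirtualStorageSwitch:
    """Hypothetical switch that presents one virtual IP for several storage devices."""

    def __init__(self, virtual_ip, directory):
        self.virtual_ip = virtual_ip
        self.directory = directory   # file path -> storage device name
        self.forwarding = {}         # (client_ip, client_port) -> storage device name

    def select_device(self, client_ip, client_port, payload):
        # Look past the TCP/IP header into the request. The "<VERB> <path>"
        # format is an assumption made for this sketch.
        _verb, path = payload.decode("ascii").split(maxsplit=1)
        device = self.directory[path.strip()]
        self.forwarding[(client_ip, client_port)] = device
        return device

    def forward(self, client_ip, client_port):
        # Every later packet of the session goes to the programmed device.
        return self.forwarding[(client_ip, client_port)]

    def reprogram(self, client_ip, client_port, new_device):
        # Called after a device hands the session off; the client is unaware.
        self.forwarding[(client_ip, client_port)] = new_device


switch = VirtualStorageSwitch(
    "203.0.113.10", {"/video/a.mpg": "disk1", "/docs/b.txt": "disk2"}
)
client = ("198.51.100.20", 49152)
print(switch.select_device(*client, b"READ /video/a.mpg"))  # disk1
switch.reprogram(*client, "disk3")
print(switch.forward(*client))  # disk3
```

The client only ever addresses `203.0.113.10`; which device answers is decided entirely by the forwarding table, which is why no change to the client's TCP logic is needed.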
A significant aspect of the present invention is that the switch is configured so that virtual storage devices can be cascaded. That is, one or more of the plurality of storage devices coupled with the switch may be another switch configured to appear as a virtual storage device.
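Cascading works because a switch exposes the same interface as a storage device to whatever sits above it. In this hedged sketch the classes `StorageDevice` and `CascadingSwitch`, the `route` method, and the linear search standing in for a file directory are all hypothetical.

```python
class StorageDevice:
    def __init__(self, name, files):
        self.name = name
        self.files = set(files)

    def route(self, path):
        # A leaf device either holds the file or does not.
        return [self.name] if path in self.files else None


class CascadingSwitch:
    """Hypothetical switch that appears as one storage device to its parent."""

    def __init__(self, name, children):
        self.name = name
        self.children = children  # StorageDevice or CascadingSwitch instances

    def route(self, path):
        # Same interface as a device, so a switch can sit under another switch.
        for child in self.children:
            hops = child.route(path)
            if hops is not None:
                return [self.name] + hops
        return None


edge = CascadingSwitch("switch-b", [StorageDevice("disk3", ["/logs/x"])])
top = CascadingSwitch("switch-a", [StorageDevice("disk1", ["/home/u"]), edge])
print(top.route("/logs/x"))  # ['switch-a', 'switch-b', 'disk3']
```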
The present invention also allows for aggregation of storage devices connected to a network. The method for aggregating storage devices includes inserting a switch between the storage devices and the network, wherein the switch appears to be a virtual storage device. The switch accepts requests to establish file sessions between clients and storage devices, generally in accordance with the method described above. Both the method for virtually addressing a plurality of storage devices and the method for aggregating storage devices can be embodied in any device having a storage medium and a processor connected to the storage medium, the storage medium storing a program for controlling the processor and the processor being operative to carry out the methods described above.