Cluster-based architectures such as that shown in FIG. 1 are commonly used for high-performance I/O and storage systems. In such architectures, each “node” 102 in the cluster 100 provides an access point into storage 104, and storage content is cached and distributed across nodes according to some placement method. Co-pending U.S. application Ser. No. 11/365,474, commonly owned by the present assignee and incorporated by reference herein in its entirety, dramatically advanced the state of the art by providing a high-performance and highly-scalable caching solution with a cluster-based architecture. However, certain opportunities for improvement remain.
For example, in a client-server or initiator-target model (for example, a NAS filer), it is considered desirable to allow a client to connect to any node and be able to access any content from storage regardless of its placement among the nodes in the cluster. One common method of making this possible in IP-based clusters 100 is sometimes referred to as a TCP/IP “handoff operation” or TCP/IP “connection migration,” in which the TCP/IP connection is migrated to the node actually executing the I/O, transparently to the connecting client.
A connection migration operation is illustrated in more detail in FIGS. 2A to 2C. As shown in the illustrative example of FIG. 2A, when client 206 first makes a TCP connection with cluster 200, node 1 handles the connection, and TCP packets are sent back and forth between client 206 and node 1. In a basic TCP connection migration shown in FIG. 2B, for example to allow a different node in cluster 200 to handle an I/O request associated with the connection, the TCP connection is migrated from a “target” node (i.e. the original connection node, Node 1 in this example) to a “slave” node (i.e. the migrated node, Node 3 in this example). Any TCP packets sent by client 206 are then forwarded by target node 1 to slave node 3, and slave node 3 directly sends TCP packets related to the connection to client 206. The connection migration is completely transparent to client 206, because TCP packets being sent by node 3 to client 206 are specially written to appear as if they were sent from node 1. It should be noted that more than one migration can occur during the active lifetime of a given TCP connection. This is shown in FIG. 2C, where the connection is re-migrated from an “inactive slave” node (Node 3 in this example) to a new “slave” node (Node 5 in this example). Any TCP packets sent from the client 206 are then forwarded internally by target node 1 to slave node 5, and slave node 5 directly sends TCP packets related to the connection to client 206. The TCP connection can also migrate back to, and away again from, the target node.
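The forwarding behavior described for FIGS. 2A to 2C can be sketched roughly as follows. This is only an illustrative model, not an implementation from any cited application: the `TargetNode` class, its `handler` table, the `slave_reply` helper, and the IP addresses and port numbers are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# (client_ip, client_port, target_ip, target_port); addresses below are made up.
FourTuple = Tuple[str, int, str, int]


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    payload: bytes


class TargetNode:
    """Hypothetical model of the node that originally accepted the connection."""

    def __init__(self, name: str, ip: str) -> None:
        self.name = name
        self.ip = ip
        # Maps each connection to the node currently handling it.
        self.handler: Dict[FourTuple, str] = {}

    def migrate(self, conn: FourTuple, node_name: str) -> None:
        # Record a migration (or re-migration) of the connection.
        self.handler[conn] = node_name

    def route(self, pkt: Packet) -> str:
        # Client packets always arrive at the target node, which forwards
        # them to the current handler (itself if no migration has occurred).
        conn = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port)
        return self.handler.get(conn, self.name)


def slave_reply(conn: FourTuple, payload: bytes) -> Packet:
    # A slave node answers the client directly, but writes the target
    # node's address as the source so the migration stays transparent.
    client_ip, client_port, target_ip, target_port = conn
    return Packet(target_ip, target_port, client_ip, client_port, payload)


conn = ("10.0.0.6", 40000, "10.0.1.1", 2049)
node1 = TargetNode("node1", "10.0.1.1")
request = Packet(*conn, payload=b"READ")

before = node1.route(request)        # FIG. 2A: node 1 handles the connection
node1.migrate(conn, "node3")
after_first = node1.route(request)   # FIG. 2B: forwarded to node 3
node1.migrate(conn, "node5")
after_second = node1.route(request)  # FIG. 2C: forwarded to node 5
reply = slave_reply(conn, b"DATA")   # appears to come from node 1
```

In this sketch the client only ever sees the target node's address, both as the destination of its requests and as the source of the replies, which is what makes the migration transparent.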
Although the prior art discloses certain mechanisms for performing TCP/IP connection migrations such as those described above, there are many challenges to efficiency and performance arising from such operations that are not appreciated and/or adequately addressed by the prior art.
For example, lost packets can occur. More particularly, TCP packets sent by a client can be lost during connection migration and re-migration operations. This can happen when packets are temporarily sent by the target node to a wrong slave node, which ignores and/or drops them. When a packet is lost, the TCP stream can be slowed due to the need for re-transmission, which can adversely impact performance.
Another potential source of inefficiency is identifying TCP connections. For example, TCP connections are traditionally uniquely identified by a 4-tuple including the source and destination IP addresses and ports. The 4-tuple structure is sufficient to uniquely identify connections between two parties, as is the case with normal TCP. However, when TCP migrations take place, there are at least three parties participating in the connection: the client, the target node and one or more slave nodes. In this scenario, the original 4-tuple is not sufficient to uniquely identify a TCP connection to all parties. Accordingly, there is a need to include additional identifiers to support large numbers of connection migrations and/or simultaneous connection migrations without conflicts. This is not possible with conventional connection mechanisms.
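The identification problem can be illustrated with a small sketch. It assumes, purely for illustration, that a slave node keeps per-connection state in a table, and that the same client connection is migrated to that node twice during its lifetime. The `migration_id` field is a hypothetical additional identifier, not one defined by TCP or by this text.

```python
from typing import Dict, Tuple

FourTuple = Tuple[str, int, str, int]  # (src_ip, src_port, dst_ip, dst_port)
ExtendedKey = Tuple[FourTuple, int]    # (4-tuple, hypothetical migration_id)

# Made-up addresses for one client connection to the target node.
conn: FourTuple = ("10.0.0.6", 40000, "10.0.1.1", 2049)

# Keyed by the 4-tuple alone, the two migrations of the same connection
# are indistinguishable: the second entry silently replaces the first.
by_four_tuple: Dict[FourTuple, str] = {}
by_four_tuple[conn] = "state from first migration"
by_four_tuple[conn] = "state from second migration"

# With an extra identifier in the key, both entries coexist.
by_extended: Dict[ExtendedKey, str] = {}
by_extended[(conn, 1)] = "state from first migration"
by_extended[(conn, 2)] = "state from second migration"
```

Here `by_four_tuple` ends up with a single entry while `by_extended` keeps two, which is the kind of conflict an additional identifier is meant to avoid.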
A still further potential source of inefficiency is managing client-visible TCP timestamps. The local clocks in the target node and slave nodes may not be completely time synchronized. Meanwhile, the client expects coherent and monotonically increasing timestamps across a migrated TCP connection. For example, these timestamps are used in congestion control algorithms that impact performance. If the timestamps seen by the client are not monotonically increasing, performance can suffer and the client may choose to end the connection. One possible solution is to synchronize the clocks on every node that participates in the same connection, so that at any time a timestamp value based on the local clock of any such node is in sync and useful to both sender and receiver. However, it is very hard to achieve very fine-grained, cluster-wide time synchronization in an accurate and reliable way using only software approaches.
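The effect of unsynchronized clocks can be shown with a short sketch. It assumes millisecond timestamp units and a slave node whose clock runs 50 ms behind the target node's clock; both are illustrative assumptions, and the function names are hypothetical. TCP timestamp values are 32-bit, so the comparison is done modulo 2^32.

```python
from typing import List, Optional

TS_MOD = 2 ** 32  # TCP timestamp values are 32-bit and wrap around


def tsval(true_time_ms: int, clock_offset_ms: int) -> int:
    # Timestamp a node would emit, given its local clock offset (assumed).
    return (true_time_ms + clock_offset_ms) % TS_MOD


def ts_before(a: int, b: int) -> bool:
    # True if a is strictly earlier than b in 32-bit modular order.
    return 0 < (b - a) % TS_MOD < TS_MOD // 2


def first_regression(values: List[int]) -> Optional[int]:
    # Index of the first timestamp that goes backwards, or None.
    for i in range(1, len(values)):
        if ts_before(values[i], values[i - 1]):
            return i
    return None


# Two packets from the target node (offset 0), then two packets sent
# after migration by a slave node whose clock is 50 ms behind.
seen_by_client = [tsval(1000, 0), tsval(1010, 0),
                  tsval(1020, -50), tsval(1030, -50)]
regression_at = first_regression(seen_by_client)

# With perfectly synchronized clocks, no regression would be observed.
synced = [tsval(t, 0) for t in (1000, 1010, 1020, 1030)]
synced_regression = first_regression(synced)
```

In this example the third timestamp seen by the client (970) is earlier than the second (1010), so the client observes a non-monotonic sequence at the moment of migration even though real time moved forward.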
Accordingly, a need remains in the art for mechanisms that allow for more efficient delivery of data in a cluster-based architecture.