1. Field of the Invention
This invention relates to networking technology, and more particularly to apparatus and methods for verifying link integrity in fibre channel networks.
2. Background of the Invention
Peer-to-Peer Remote Copy (PPRC) is a protocol used to replicate a primary storage volume to a secondary storage volume located at a remote site. “Synchronous” PPRC is a configuration wherein each write to the primary storage device is also performed on a secondary storage device. In this scheme, an I/O is considered complete only when the I/O has successfully completed on both the primary and secondary storage devices.
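By way of illustration only, the completion rule for synchronous replication described above may be sketched as follows. The `Volume` class and the `synchronous_write` function are hypothetical in-memory stand-ins introduced for this example and do not represent any actual PPRC interface.

```python
class Volume:
    """Hypothetical in-memory stand-in for a storage volume."""

    def __init__(self, fail=False):
        self.blocks = {}   # logical block address -> data
        self.fail = fail   # when True, every write is rejected

    def write(self, lba, data):
        if self.fail:
            return False
        self.blocks[lba] = data
        return True


def synchronous_write(primary, secondary, lba, data):
    """Report completion only if BOTH volumes accepted the write."""
    if not primary.write(lba, data):
        return False
    return secondary.write(lba, data)
```

In this sketch, a write that succeeds on the primary but fails on the secondary is reported to the caller as incomplete, which is why problems on the replication path are visible to the host.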
When transferring data between primary and secondary storage devices, problems may occur that will cause data transfers to time out. A timeout occurs when the primary storage device sends an I/O command to the secondary storage device but does not receive an acknowledgement signal within a specified period of time. When a timeout occurs, the primary storage device can retry the operation in an attempt to successfully redrive the I/O. Timeouts can significantly degrade the performance of a host system that is writing to the primary storage device, because in a synchronous configuration the host write cannot complete until the secondary storage device responds.
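The timeout-and-retry behavior described above may be sketched, under stated assumptions, as follows. The `send` callable is hypothetical: it is assumed to return True if an acknowledgement arrives within `timeout_s` seconds and False otherwise. The `make_flaky_link` helper merely simulates a path that times out a fixed number of times before recovering.

```python
def send_with_retry(send, timeout_s, max_attempts):
    """Redrive an I/O until it is acknowledged or attempts run out.

    Returns a tuple (acknowledged, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        if send(timeout_s):
            return True, attempt
    return False, max_attempts


def make_flaky_link(failures):
    """Simulated path (assumption): times out `failures` times, then works."""
    remaining = [failures]

    def send(timeout_s):
        if remaining[0] > 0:
            remaining[0] -= 1
            return False  # simulated timeout: no acknowledgement in time
        return True

    return send
```

For example, `send_with_retry(make_flaky_link(2), 1.0, 5)` returns `(True, 3)`: two attempts time out and the third is acknowledged. Each timed-out attempt adds up to `timeout_s` of delay to the I/O.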
Timeouts may occur for a variety of reasons. They may be the result of transient failures (which may recover quickly without user intervention) or more persistent failures. Transient failures may be the result of network issues such as low bandwidth or high latency caused by workload spikes, congestion in a fibre channel network, or temporary slowdowns on a PPRC secondary storage device. More persistent failures may be the result of physical problems on a fibre channel link (e.g., bad cables), hardware problems (e.g., faulty optics), or connection issues along the path (e.g., loose plugs).
There are several known solutions for paths experiencing timeouts. For example, the primary storage device could do nothing and keep sending PPRC I/O down a path that has been experiencing timeouts. This solution may work if the problem is transient in nature. One drawback to this solution is that if the path continues to experience timeouts, the host system will continue to be impacted. Another solution is to stop using the path altogether. This solution may be effective if more paths are available. A drawback to this approach is that a transient failure may cause a path that is otherwise functioning correctly to become unusable. If all paths experience such transient failures, I/O may unnecessarily suspend between the primary and secondary storage devices.
Yet another solution is to configure the primary storage device to report problems to a user while continuing to transmit I/O over the path. This solution relies on the user to take corrective action. However, if the user does not respond quickly and the path is experiencing more persistent failures, the host system will continue to be impacted by the timeouts. Yet another solution is to implement a throttling mechanism to reduce the amount of I/O that is transmitted over a failing path until the path stops experiencing timeouts. The primary storage device may then resume sending a normal amount of I/O. This may reduce the impact on the host system because it will reduce the amount of I/O that will be affected by timeouts. However, timeouts that do occur will still undesirably impact the host system.
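A minimal sketch of the throttling approach described above follows. The `PathThrottle` class, its two limits, and the notion of a fixed monitoring interval are assumptions made for illustration; the known solutions are not limited to this form.

```python
class PathThrottle:
    """Hypothetical per-path limiter on outstanding I/O."""

    def __init__(self, normal_limit, reduced_limit):
        self.normal_limit = normal_limit
        self.reduced_limit = reduced_limit
        self.limit = normal_limit

    def record_interval(self, timeouts_seen):
        """Update the limit after one monitoring interval.

        Any timeout in the interval reduces the limit; an interval
        with no timeouts restores the normal limit.
        """
        if timeouts_seen > 0:
            self.limit = self.reduced_limit
        else:
            self.limit = self.normal_limit
        return self.limit
```

Even at the reduced limit, the I/O that is still sent over the path remains exposed to timeouts, which is the drawback noted above.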
Yet another solution is for the primary storage device to periodically “ping” the secondary storage device with a special command (e.g., a fibre channel link service or FCP command). If the pings are successful over a period of time, the primary storage device could resume normal I/O operations. This may be effective if failures on a link are very consistent. However, if link failures are random or inconsistent, there is a good chance that a “ping” would complete successfully, whereas a data transfer would fail. Therefore, this solution also has drawbacks.
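The weakness of the “ping” approach may be illustrated with a simple model. The model assumes, purely for illustration, that each frame on the link fails independently with a fixed probability; this assumption and the numbers used below are not taken from any measured link.

```python
def success_probability(frame_loss_rate, frames):
    """Probability that all `frames` frames arrive, assuming each frame
    fails independently with probability `frame_loss_rate`."""
    return (1.0 - frame_loss_rate) ** frames


# Illustrative values only: a 1% per-frame failure rate.
ping_ok = success_probability(0.01, 1)        # single-frame "ping"
transfer_ok = success_probability(0.01, 100)  # 100-frame data transfer
```

Under these assumptions, a single-frame ping succeeds about 99% of the time, while a 100-frame data transfer succeeds only about 37% of the time. A run of successful pings therefore says little about whether normal I/O would complete on the same path.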
In view of the foregoing, what is needed is a self-healing solution that can stop sending I/O down a path that experiences failures or timeouts, while also determining whether failures or timeouts on the path are transient. If the failures are transient, the solution would ideally be able to resume normal I/O on the path when the failures or timeouts end or subside.