The present invention centres on the interaction of two functions in this system—the forwarding layer and the cache used to provide write buffer resource—and how those functions handle I/O requests. A description of these is needed for an understanding of the invention.
The forwarding layer allows an I/O request to be received on any node in the system, and for that request to be forwarded to another node that will actually be responsible for servicing that request. In systems which can scale to include many nodes, this technique is commonly used to allow the work of the whole system to be shared among the member nodes, and to allow each of the nodes to only be concerned about a subset of the work of the whole system. This technique allows simpler algorithms to be used, and these algorithms also tend to scale to be operable in bigger systems more readily. Contrast this with algorithms that allow any node in the system to process any request, particularly where those requests need to be processed coherently with respect to other requests received on other nodes of the system.
When handling a forwarded I/O request, the forwarding node generally still remains involved in the I/O process. In particular the forwarding node is still responsible for performing the data transfer to/from the host, and sending completion status to the host, even though the forwarded-to node is the source and/or sink of that data and status, according to its handling of the I/O request. It is sometimes possible to hand off the request entirely, so that once the request is forwarded, the forwarding node has no further responsibility towards it, and the exchange becomes one purely between the request originator and the forwarded-to node. But this is not always possible, because of constraints imposed by the fabric infrastructure connecting the originator hosts and the forwarding/forwarded-to nodes, and/or constraints in the adapter technology that interfaces the forwarding node with that fabric.
The process for a write command in particular requires the forwarding node to request a transfer of the data from the host into a buffer within that node, and then transmit the contents of that buffer to a further buffer within the forwarded-to node. One scheme for achieving this transfer involves the following steps (with reference to FIG. 2):
    200. Host transmits I/O write request to first node
    202. First (forwarding) node forwards request to second (forwarded-to) node
    204. Second node decides to process, allocates buffer in which to receive data, and sends request for data to first node
    206. First node allocates buffer, and sends request for data to host
    208. Host transmits data, and data is received in first node in buffer defined at 206
    210. First node is notified of completion of data transfer, and starts data transfer to second node in buffer defined at 204
    212. Second node is notified of data transfer completion, and resumes processing of write I/O request using received data
Note that the pre-allocation of buffers into which to receive data is an important requirement of operation in a storage network, such as one based on Fibre Channel. Note also that these buffers are relatively expensive, which means they need to be explicitly assigned to an I/O request as it is processed, rather than being presumed to be available. Hence, in the sequence above, the host does not transmit the write data with the request at 200; instead it waits until it is asked for the data at 206. Similarly, the forwarding node does not send the data until the forwarded-to node asks for it. This behaviour helps to prevent congestion arising in the fabric, where data is transmitted but cannot be received because of a lack of buffering at the receiver, and is an important feature that tends to distinguish how data transfers are performed within storage networks from how they are performed in conventional networks.
One consequence of the scheme above, though, is that the whole I/O process involves more steps, and takes longer from start to finish, as compared to the equivalent process where the I/O is handled entirely within the first node, comprising the following steps (with reference to FIG. 3):
    300. Host transmits I/O write request to first node
    302. First node decides to process, allocates buffer in which to receive data, and sends request for data to host
    304. Host transmits data, and data is received in first node in buffer defined at 302
    306. First node is notified of completion of data transfer, and resumes processing of write I/O request using received data
The extra ‘ready for data’ exchange can have a significant impact on the total processing time experienced by the host, possibly as much as trebling the time it has to wait for the I/O request (as compared with the local processing case), and this can have a significant cost in terms of overall system performance.
The following sequence of steps can be used to mitigate this extra processing time (with reference to FIG. 4):
    400. Host transmits I/O write request to first node
    402. First node allocates buffer, and sends request for data to host
    404. Host transmits data, and data is received in first node in buffer defined at 402
    406. First (forwarding) node forwards request with data to second (forwarded-to) node
    408. Second node processes I/O request using the received data
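The three sequences can be compared by counting the messages that must cross the fabric before the servicing node is able to process the write. The following is a minimal sketch only: the tuples transcribe the steps of FIGS. 2, 3 and 4, the names `FIG2`, `FIG3`, `FIG4` and `summarise` are illustrative, and a message count is a crude proxy that does not model the latency of any individual hop.

```python
# Each entry is (sender, receiver, payload) for one message on the fabric.
# Notification steps (212, 306, 408) are local events, not messages.
FIG2 = [
    ("host", "node1", "write request"),      # 200
    ("node1", "node2", "forwarded request"), # 202
    ("node2", "node1", "request for data"),  # 204
    ("node1", "host", "request for data"),   # 206
    ("host", "node1", "data"),               # 208
    ("node1", "node2", "data"),              # 210
]
FIG3 = [
    ("host", "node1", "write request"),      # 300
    ("node1", "host", "request for data"),   # 302
    ("host", "node1", "data"),               # 304
]
FIG4 = [
    ("host", "node1", "write request"),                # 400
    ("node1", "host", "request for data"),             # 402
    ("host", "node1", "data"),                         # 404
    ("node1", "node2", "forwarded request with data"), # 406
]


def summarise(sequence):
    """Return (total messages, messages exchanged between the two nodes)."""
    inter_node = sum(
        1 for src, dst, _ in sequence if {src, dst} == {"node1", "node2"}
    )
    return len(sequence), inter_node


for name, seq in (("FIG. 2", FIG2), ("FIG. 3", FIG3), ("FIG. 4", FIG4)):
    total, inter_node = summarise(seq)
    print(f"{name}: {total} messages, {inter_node} between nodes")
```

Under this simplified view, the sequence of FIG. 4 needs one inter-node message where that of FIG. 2 needs three.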
The above sequence avoids an extra exchange of messages between first and second nodes to effect the data transfer during the I/O process, which significantly improves the situation compared to the first sequence. This more streamlined process does need some extra work to be performed before the I/O is processed, so as to honour the requirement that there is buffer space to perform the data transfer at 406. The forwarded-to node must transfer a permission, commonly termed a ‘credit’, to the forwarding node, which permits it to transmit a certain amount of write data in the future, and the forwarding node must be in receipt of such credit before it performs that transmission. The transmission consumes the credit, and so as the forwarded-to node executes and completes an I/O process, and buffer space becomes free again, it must create further credit and transmit it to the forwarding node in anticipation of further I/O.
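The credit exchange described above can be sketched as follows. This is an illustration under stated assumptions rather than a description of any particular implementation: the class names `CreditedLink` and `ForwardedToNode`, the use of bytes as the unit of credit, and the policy of re-issuing credit as soon as an I/O completes are all hypothetical.

```python
class CreditedLink:
    """Forwarding node's record of credit received from one forwarded-to node."""

    def __init__(self):
        self.credit = 0  # bytes the peer has guaranteed buffer space for

    def grant(self, nbytes):
        # Called when the forwarded-to node transmits fresh credit.
        self.credit += nbytes

    def try_send(self, nbytes):
        # Write data may only be transmitted against credit already held.
        if nbytes > self.credit:
            return False
        self.credit -= nbytes  # the transmission consumes the credit
        return True


class ForwardedToNode:
    """Owns the receive buffers that back any credit it hands out."""

    def __init__(self, buffer_bytes):
        self.free = buffer_bytes

    def issue_credit(self, link, nbytes):
        if nbytes > self.free:
            raise ValueError("cannot promise more buffer space than is free")
        self.free -= nbytes  # reserve the space the credit stands for
        link.grant(nbytes)

    def complete_io(self, link, nbytes):
        # Buffer space is released when the I/O completes, then advertised
        # again as new credit in anticipation of further I/O.
        self.free += nbytes
        self.issue_credit(link, nbytes)


node = ForwardedToNode(buffer_bytes=8192)
link = CreditedLink()
node.issue_credit(link, 8192)
```

With the 8192 bytes of credit issued here, two 4096-byte transfers succeed and a third is refused until `complete_io` returns credit to the link.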
The cache function within caching controllers such as the IBM SAN Volume Controller (hereinafter “SVC”) implements a non-volatile write cache, whereby it processes a write I/O by placing the request's data in non-volatile memory (most often within two nodes), and immediately completing the host I/O. At some later time, it will ‘destage’ the data, which involves sending a write command for that data to the disk which is the normal location for that data. When acknowledgement for that write command is received, the data can be removed from the non-volatile memory contents. The host perceives a much smaller response time for its I/O request than it would see if the request were sent directly to the disk, improving system performance. Non-volatile cache is suitably adapted to the provision of write buffer resource in data storage systems.
It is very common, though, to avoid issuing this destage write straight away. A number of advantages can be achieved through this. For example, if the host subsequently sends a further write I/O request for the same location, then that new write I/O request can be processed by replacing the existing data with the data from the later write. At some future time, when a destage write is performed, only the most recent revision of the data needs to be sent to the disk, saving on the number of disk operations that are performed.
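The saving from deferred destage can be illustrated with a small sketch. The class `WriteBackCache` and its methods are hypothetical names; the sketch keeps dirty data in an ordinary dictionary rather than non-volatile memory, and it treats the disk write as acknowledged immediately, which a real controller could not assume.

```python
class WriteBackCache:
    """Toy write-back cache: the latest write to a location replaces earlier ones."""

    def __init__(self):
        self.dirty = {}       # location -> most recent data not yet on disk
        self.disk_writes = 0  # number of destage writes issued

    def host_write(self, location, data):
        self.dirty[location] = data  # overwrite any earlier revision in place
        return "complete"            # host I/O is completed straight away

    def destage(self, disk):
        # Only the surviving revision of each location is sent to the disk.
        for location, data in self.dirty.items():
            disk[location] = data
            self.disk_writes += 1
        self.dirty.clear()


cache = WriteBackCache()
disk = {}
for revision in (b"a", b"b", b"c"):
    cache.host_write(7, revision)   # three host writes to the same location
cache.host_write(9, b"x")
cache.destage(disk)
```

Four host writes result in two disk writes here, because the first two revisions of location 7 were superseded before destage.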
Another important benefit is that when a host application generates a large burst of write I/O, this can be accepted into the non-volatile write cache quickly, and the burst of I/O is then forwarded to the disk, which might take much longer to process the entire burst. Therefore the host's burst of work is completed much more quickly than would be the case if it were required to wait for the disk, again improving system performance.
However, this scheme can cause problems if the host workload exceeds the capability of the backing disk subsystem for a long period of time. This can happen, for instance, where a disk subsystem suffers a failure and enters a degraded performance mode. The cache memory space within the controller can then become exhausted, and write I/O processing must wait for space to be made available by the completion of a destage write. Many of these writes will actually need to wait for the slow disk subsystem to process a write I/O (because it is the slow disk subsystem that is consuming the majority of the write cache), and so it is possible for all I/O being processed to become backlogged by slow I/O processing in just one backing disk.
The solution to this problem is to limit the amount of cache memory that can be consumed by any one backing disk subsystem. When this scheme operates, I/Os do not automatically get granted buffer space when they are received. In particular, if the write I/O is destined for a disk that is judged to have already consumed its fair share of system resources, then processing of that write I/O is suspended until the share of system resources consumed by that disk and/or its ability to process I/O changes, so that it is judged to be entitled to further resource. In the meantime, other I/O requests that are destined for disk subsystems which are processing I/O acceptably, and which are consuming less resource than they are entitled to, are allowed to continue.
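A minimal sketch of such a partitioning check follows. It assumes a fixed byte limit per backing disk as the measure of ‘fair share’, which is a simplification: the text above allows the judgement to depend on the disk's ability to process I/O as well. The class `PartitionedCache` and its method names are hypothetical.

```python
from collections import deque


class PartitionedCache:
    """Toy cache that caps the space any one backing disk may consume."""

    def __init__(self, limits):
        self.limits = dict(limits)                    # disk -> maximum bytes
        self.used = {d: 0 for d in limits}            # disk -> bytes in cache
        self.waiting = {d: deque() for d in limits}   # suspended write sizes

    def submit(self, disk, nbytes):
        """Return True if the write is granted space, False if suspended."""
        fits = self.used[disk] + nbytes <= self.limits[disk]
        if fits and not self.waiting[disk]:
            self.used[disk] += nbytes
            return True
        self.waiting[disk].append(nbytes)  # wait behind earlier suspended I/O
        return False

    def destage_complete(self, disk, nbytes):
        """Free space after a destage and resume any writes that now fit."""
        self.used[disk] -= nbytes
        resumed = 0
        queue = self.waiting[disk]
        while queue and self.used[disk] + queue[0] <= self.limits[disk]:
            self.used[disk] += queue.popleft()
            resumed += 1
        return resumed


cache = PartitionedCache({"slow": 100, "fast": 100})
first = cache.submit("slow", 80)    # granted
second = cache.submit("slow", 40)   # suspended: would exceed the limit
third = cache.submit("fast", 40)    # unaffected by the slow disk
```

A write to the slow disk is suspended while a write to the fast disk proceeds, and the suspended write resumes only when a destage to the slow disk frees space.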
The cache function implemented within SVC is typical of those of many caching controllers, in that for any given host volume (vdisk) it can support I/O on only one or two nodes of the system. The forwarding layer is used ‘above’ the cache layer (so that the forwarding layer processes a given host I/O before the cache layer), and so this allows all nodes in the system to receive I/O for a vdisk, and that I/O is then forwarded to one of the up to two nodes that are able to process that I/O.
Observe now what can happen when the optimised forwarding scheme above interacts with the cache partitioning algorithm described. The optimised forwarding scheme allocates relatively scarce buffering resource ahead of time, before the cache algorithm is able to judge whether the disk subsystem has consumed more than its fair share of resource. If the cache algorithm acts to delay I/O processing, it stops the I/O from consuming more cache resource, but that I/O request has already consumed buffer space within the forwarding node. This can quickly lead to the forwarding node running out of buffer space to service any I/O request.
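The exhaustion can be reproduced in a toy model. The sketch below assumes that the forwarding node holds its buffer for as long as the forwarded I/O remains suspended by the cache, and it reduces the cache's decision to a callback; `ForwardingNode`, `receive_write` and `cache_admits` are hypothetical names.

```python
class ForwardingNode:
    """Toy forwarding node that allocates its buffer before the cache decides."""

    def __init__(self, buffers):
        self.free_buffers = buffers
        self.held = []  # disks whose suspended I/Os are pinning a buffer

    def receive_write(self, disk, cache_admits):
        if self.free_buffers == 0:
            return "no buffer"
        self.free_buffers -= 1       # as at step 402: allocated up front
        if cache_admits(disk):
            self.free_buffers += 1   # I/O processed, buffer released
            return "processed"
        self.held.append(disk)       # cache delays the I/O, buffer stays held
        return "suspended"


def cache_admits(disk):
    # Stand-in for the partitioning decision: the slow disk is over its share.
    return disk != "slow"


node = ForwardingNode(buffers=3)
before = node.receive_write("fast", cache_admits)
slow_results = [node.receive_write("slow", cache_admits) for _ in range(3)]
after = node.receive_write("fast", cache_admits)
```

In this model three suspended writes to the slow disk pin every buffer, so a subsequent write to the healthy disk cannot be accepted even though the cache would have admitted it.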
This means that the same problem has arisen that the cache partitioning scheme was intended to solve, though the exhaustion here is suffered in the forwarding buffer resource of the forwarding node, rather than the cache buffer resource of the forwarded-to node.
The slower forwarding algorithm outlined above with reference to FIG. 2 does not exhibit this problem. It waits for the cache to decide to process the I/O before committing buffer resource to the request at step 204, and so it only allocates buffer resource to I/Os whose disk subsystem is judged to deserve more resource. But this scheme greatly increases the processing time for the I/O.
What is needed is a technique by which forwarded write I/Os can be processed with minimum response time, but without leading to problems from resource exhaustion when a subset of those I/Os is running slowly.