Clusters are groups of computers that use groups of redundant computing resources in order to provide continued service when individual system components fail. More specifically, clusters eliminate single points of failure by providing multiple servers, multiple network connections, redundant data storage, etc. Clustering systems are often combined with storage management products that provide additional useful features, such as journaling file systems, logical volume management, data volume replication, multi-path input/output (I/O) functionality, etc.
Where a cluster is implemented in conjunction with a storage management environment, the computer systems (nodes) of the cluster can access shared storage, such that the shared storage looks the same to each node. The shared storage is typically implemented with multiple underlying physical storage devices, which are managed by the clustering and storage system so as to appear as a single storage device to the nodes of the cluster. The multiple physical storage media can be grouped into a single logical unit which is referred to as a LUN (for “logical unit number”), and appears as a single storage device to an accessing node.
The management of underlying physical storage devices can also involve software level logical volume management, in which multiple physical storage devices are made to appear as a single logical volume to accessing nodes. A logical volume can be constructed from multiple physical storage devices directly, or on top of a LUN, which is in turn logically constructed from multiple physical storage devices. A volume manager can concatenate, stripe together or otherwise combine underlying physical partitions into larger, virtual ones. In a clustering environment, a cluster volume manager extends volume management across the multiple nodes of a cluster, such that each node recognizes the same logical volume layout, and the same state of all volume resources at all nodes.
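As an illustration of how a volume manager might combine physical devices, the following minimal sketch maps a logical block address to a device and an offset for both striping and concatenation. The function names, the round-robin stripe layout, and the parameters are assumptions made for this example and do not describe any particular volume manager.

```python
def stripe_map(logical_block, num_disks, stripe_unit):
    """Map a logical block to (disk index, block on that disk) for simple striping.

    Assumes a hypothetical round-robin layout with `stripe_unit` blocks per stripe unit.
    """
    stripe_index, offset = divmod(logical_block, stripe_unit)
    disk = stripe_index % num_disks   # stripe units rotate across the disks
    row = stripe_index // num_disks   # number of full rows below this stripe unit
    return disk, row * stripe_unit + offset


def concat_map(logical_block, disk_sizes):
    """Map a logical block to (disk index, block on that disk) for concatenation."""
    for disk, size in enumerate(disk_sizes):
        if logical_block < size:
            return disk, logical_block
        logical_block -= size         # skip past this disk's address range
    raise ValueError("logical block beyond end of volume")
```

In both cases the accessing node sees one contiguous range of logical blocks, while the mapping decides which underlying device actually services each block.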
Data volumes can also be replicated over a network to a remote site. Volume replication enables continuous data replication from a primary site to a secondary site, for disaster recovery or off-host processing. In order for the secondary to be usable, the order of write operations occurring at the primary must be maintained (write-order fidelity). Therefore, for volume replication in a clustering environment, the order of writes is typically maintained in a log (the replication log), and one of the nodes in the cluster is designated as the logowner.
When a node in the cluster other than the logowner wishes to write to the shared storage, the node first sends a request to the logowner node. The logowner assigns a position in the replication log for the write, and responds to the requesting node with a message indicating the assigned position. After receiving the response from the logowner, the node writes to the assigned position in the replication log, and then to the target data volume. When the logowner itself performs a write, it assigns itself a position in the replication log, writes to that position and then writes to the data volume. Thus, the order of the write operations to the volumes of the primary is preserved in the replication log. Because the log is used to replicate the writes to the secondary in first in first out order, write-order fidelity is preserved in the replication of the data volumes.
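The write sequence described above can be sketched as follows. This is a simplified, single-process model: the class names, the in-memory dictionaries standing in for the replication log and data volume, and the direct function call standing in for the request/response exchange with the logowner are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LogOwner:
    """Hypothetical logowner that hands out replication log positions in arrival order."""
    next_slot: int = 0

    def assign_slot(self, node_id):
        slot = self.next_slot
        self.next_slot += 1
        return slot


@dataclass
class SharedStorage:
    replication_log: dict = field(default_factory=dict)  # slot -> (node, block, data)
    data_volume: dict = field(default_factory=dict)      # block -> data


def write(node_id, block, data, logowner, storage):
    """Perform one write: obtain a slot, write the log entry, then write the data volume."""
    slot = logowner.assign_slot(node_id)  # stands in for the request/response message pair
    storage.replication_log[slot] = (node_id, block, data)
    storage.data_volume[block] = data
    return slot


def replicate(storage):
    """Replay the log in slot order (first in, first out) to build the secondary's contents."""
    secondary = {}
    for slot in sorted(storage.replication_log):
        _, block, data = storage.replication_log[slot]
        secondary[block] = data
    return secondary
```

Because every write passes through `assign_slot` before touching the data volume, replaying the log in slot order reproduces the primary's final state on the secondary, including the case where two nodes overwrite the same block.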
Different nodes in a cluster have different storage performance characteristics, depending upon hardware, software, the paths between the node and the storage devices and other layers in the node's storage stack. Some of these factors can also vary dynamically, depending upon the I/O load, available CPU and memory, etc. Thus, different individual nodes have different upper limits of how many outstanding I/O requests can be managed at any given time. If the number of outstanding requests reaches the upper limit, new I/O requests on that node are throttled (e.g., by the SCSI layer), thereby significantly slowing down the node's storage I/O. However, because the logowner node processes incoming write requests in first-in, first-out order, an individual node making a large number of requests can be assigned more slots in the replication log than it can process without self-throttling. Because writes are made to the replication log before the shared storage in order to preserve write-order fidelity, this node level throttling can become a bottleneck that negatively impacts cluster wide I/O performance. In other words, other nodes can be delayed from executing their own write operations while waiting for a self-throttled node to process its delayed operations, which exceed the limit of what it can simultaneously manage, even where the storage media could handle a greater I/O load.
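The mismatch between first-in, first-out slot assignment and per-node limits can be illustrated with a small sketch. The function names, the node labels, and the numeric limits below are hypothetical and chosen only to make the effect visible.

```python
from collections import Counter


def fifo_assign(requests, log_slots):
    """Grant the first `log_slots` requests in arrival order; return slots granted per node."""
    return Counter(requests[:log_slots])


def excess_over_limit(granted, limits):
    """For each node, count granted slots beyond what it can have outstanding at once."""
    return {node: max(0, granted[node] - limit) for node, limit in limits.items()}


# Node "A" issues a burst ahead of nodes "B" and "C" (illustrative values only).
requests = ["A"] * 6 + ["B", "C"]
limits = {"A": 2, "B": 4, "C": 4}  # assumed per-node outstanding I/O limits

granted = fifo_assign(requests, log_slots=6)
excess = excess_over_limit(granted, limits)
```

In this example node A is granted all six slots although it can only drive two writes at a time, so four granted slots sit behind A's own throttle while B and C, which have spare capacity, hold none.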
It is also of note that the replication log typically resides on storage hardware that is faster than the storage devices backing the data volumes (e.g., a solid state drive as opposed to slower magnetic media). This is the case because the log must be fast enough to handle writes to multiple replicated volumes. Additionally, because the replication log is considerably smaller than the data volumes, it is economically feasible to use more expensive storage with better access times to back the replication log. However, the difference in performance between the replication log and data volumes causes the writes to the latter to lag behind, creating a bottleneck. The replication log contains a limited number of slots for writes, and when all of these slots are in use, incoming writes from any node must be throttled until the logged writes have been flushed to the replicated volumes. When a particular node (or a given subset of the nodes) of the cluster performs continuous I/O operations, other nodes can have their writes throttled for unacceptably long periods of time.
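A minimal sketch of a bounded replication log shows how one busy node can crowd out the others. The tick-based model, the fixed flush rate, and the choice to count rather than retry throttled writes are simplifying assumptions for this example.

```python
from collections import deque


def simulate(arrivals, capacity, flush_per_tick):
    """Count throttled writes per node for a log with `capacity` slots.

    `arrivals` is a list of ticks; each tick is a list of node ids requesting one write each.
    At the start of every tick, up to `flush_per_tick` of the oldest entries are flushed.
    """
    log = deque()
    throttled = {}
    for tick_requests in arrivals:
        for _ in range(min(flush_per_tick, len(log))):
            log.popleft()                  # oldest logged write reaches the data volume
        for node in tick_requests:
            if len(log) < capacity:
                log.append(node)           # slot available: write is logged
            else:
                throttled[node] = throttled.get(node, 0) + 1  # log full: write must wait
    return throttled


# Node "A" submits four writes per tick, node "B" only one (illustrative values).
arrivals = [["A"] * 4 + ["B"], ["A"] * 4 + ["B"]]
result = simulate(arrivals, capacity=4, flush_per_tick=1)
```

With these values node B has both of its writes throttled even though it contributes only a small share of the load, because node A's writes fill the log faster than the slower data volumes can drain it.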
In either of these scenarios, a heavy I/O load from a given node of the cluster can cause the problem of I/O starvation for the other nodes. More specifically, a given node attempting to execute a sufficient number of write operations can result in self-throttling as described above. If the node is allocated more slots in the replication log than it can efficiently process, other nodes of the cluster are unable to execute their own write operations while waiting for the self-throttled node to process its delayed operations. Thus the other nodes become I/O starved, even though the storage media could handle a greater I/O load. Additionally, when the heavy I/O operations of a particular node tie up the limited capacity of the replication log, the other nodes are starved until the logged operations from the monopolizing node have been flushed to the underlying storage volumes. It is clearly undesirable for the other nodes of the cluster to be I/O starved while an individual node monopolizes the replication log.
Another issue is that because of the master/slave relationship between the logowner node and the other nodes of the cluster, the logowner node typically has less write latency than the slave nodes. Whereas the logowner can assign itself a slot and complete its own writes directly, other nodes must make requests to the logowner and be granted slots in the replication log as part of the write process. Yet, many applications rely on reasonably uniform throughput from all the nodes of the cluster.
Additionally, some write operations are synchronous or otherwise highly latency sensitive, whereas others are asynchronous. For operation continuity, applications can require guaranteed completion of their latency sensitive I/Os at higher levels of priority.
It would be desirable to address these issues.