Scalability is an important requirement in all data storage systems. Different types of storage systems provide diverse methods of seamless scalability through capacity expansion. In some storage systems, such as systems utilizing redundant array of inexpensive disks (“RAID”) controllers, it is often possible to add disk drives (or other types of mass storage devices) to a storage system while the system is in operation. In such a system, the RAID controller re-stripes existing data onto the new disk and makes the capacity of the other disks available for new input/output (“I/O”) operations. This methodology, known as “vertical capacity expansion,” is common. However, this methodology has at least one drawback in that it scales only data storage capacity, without improving other factors such as the processing power, main memory, or bandwidth of the system.
In other data storage systems, it is possible to add capacity by “virtualization.” In this type of system, multiple storage servers are utilized to field I/O operations independently, but are exposed to the initiator of the I/O operation as a single device, called a “storage cluster.” Each storage server in a cluster is called a “storage node” or just a “node.” When data storage capacity becomes low, a new server may be added as a new node in the data storage system. In addition to contributing increased storage capacity, the new storage node contributes other computing resources to the system, leading to true scalability. This methodology is known as “horizontal capacity expansion.” Some storage systems support vertical expansion of individual nodes, as well as horizontal expansion by the addition of storage nodes.
Systems implementing horizontal capacity expansion may choose to concatenate the capacity that is contributed by each node. However, in order to achieve the maximum benefit of horizontal capacity expansion, it is necessary to stripe data across the nodes in much the same way as data is striped across disks in RAID arrays. While striping data across nodes, the data should be stored in a manner that ensures that different I/O operations are fielded by different nodes, thereby utilizing all of the nodes simultaneously. It is also desirable not to split I/O operations between multiple nodes, so that the I/O latency is low. Striping the data in this manner provides a boost to random I/O performance without decreasing sequential I/O performance. The stripe size is calculated with this consideration, and is called the “zone size.”
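The striping described above can be illustrated with a minimal sketch. It assumes a simple round-robin placement of zones onto nodes and a hypothetical zone size of one megabyte; neither the placement rule nor the value is specified by this description, and the function names are illustrative only.

```python
from __future__ import annotations

ZONE_SIZE = 1024 * 1024  # hypothetical zone size in bytes; no value is given in the text


def locate(offset: int, num_nodes: int, zone_size: int = ZONE_SIZE) -> tuple[int, int]:
    """Map a logical byte offset to (node index, byte offset on that node)."""
    zone = offset // zone_size        # zone that contains the offset
    node = zone % num_nodes           # assumed round-robin rotation across nodes
    local_zone = zone // num_nodes    # position of that zone on its node
    return node, local_zone * zone_size + offset % zone_size


def nodes_touched(offset: int, length: int, num_nodes: int,
                  zone_size: int = ZONE_SIZE) -> set[int]:
    """Return the set of nodes that an I/O of `length` bytes at `offset` must visit."""
    first_zone = offset // zone_size
    last_zone = (offset + length - 1) // zone_size
    return {zone % num_nodes for zone in range(first_zone, last_zone + 1)}
```

In this sketch, an I/O operation that fits inside one zone is fielded by a single node, while an operation that crosses a zone boundary is split between two nodes. This is why the zone size is chosen with typical I/O sizes in mind.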
When data is striped across multiple nodes, the process of re-striping data when a new node is added is lengthy and inefficient in most contemporary storage systems. In particular, current storage systems require the movement of a massive amount of data in order to add a new node. As an example, in order to expand a four-node cluster to a five-node cluster using current data migration methodologies, only one in twenty storage zones (referred to herein as “zones”) remains on the same node, and even those zones are in a different position on the node. Hence, the current process of migration is effectively a process of reading the entire body of data in the system according to its unexpanded configuration, and then writing it in its entirety according to the expanded configuration of the cluster.
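The cost of re-striping can be explored with a short sketch. It assumes naive round-robin placement, in which zone z resides on node z mod N at position z div N. This is a simplification chosen for illustration, not necessarily the migration scheme used by any particular system, so the share of zones that keep their node in this model need not match the figure quoted above.

```python
from __future__ import annotations


def zones_that_stay(old_nodes: int, new_nodes: int, num_zones: int) -> list[int]:
    """Zones whose node index is unchanged after naive round-robin re-striping."""
    return [z for z in range(num_zones) if z % old_nodes == z % new_nodes]


def zones_unmoved(old_nodes: int, new_nodes: int, num_zones: int) -> list[int]:
    """Zones that keep both their node and their position on that node."""
    return [
        z for z in range(num_zones)
        if z % old_nodes == z % new_nodes and z // old_nodes == z // new_nodes
    ]


if __name__ == "__main__":
    total = 40
    stay = zones_that_stay(4, 5, total)
    fixed = zones_unmoved(4, 5, total)
    print(f"{len(stay)} of {total} zones keep their node")
    print(f"{len(fixed)} of {total} zones keep both node and position")
```

Under these assumptions, only the first few zones of the volume remain where they were. Every other zone must be read from its old location and rewritten, which is consistent with the observation that migration amounts to rewriting nearly the entire body of data.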
Such a migration process typically takes several days. During this time, the performance of the cluster is drastically decreased due to the presence of these extra migration I/O operations. A complicated method of locking is also required to prevent data corruption during the data migration process. The storage capacity and processing resources of the newly added node also do not contribute to the cluster until the entire migration process has completed; if an administrator is expanding the cluster in order to mitigate an impending capacity crunch, there is a good likelihood that the existing capacity will be exceeded before the migration completes. In all cases, the migration process is cumbersome, disruptive and tedious.
In addition to scaling storage resources, a storage cluster can also be utilized to provide redundancy and protect against data loss due to the failure of a node. The administrator may configure the cluster so that each zone of data is stored on two or more nodes. In this way, if a single node fails, all of the data that is contained in it can be accessed from another node. One cluster arrangement that is commonly used for this purpose is called chained declustering. In a chained declustered storage system, zones are striped across all of the nodes, and they are also mirrored on at least two nodes.
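A chained declustered layout can be sketched as follows. The sketch assumes the common formulation in which each zone has a primary copy on one node and a mirror copy on the next node in the chain. The description above requires only that each zone be mirrored on at least two nodes, so the specific placement rule and the function names are assumptions made for illustration.

```python
from __future__ import annotations


def replicas(zone: int, num_nodes: int) -> tuple[int, int]:
    """Return (primary node, mirror node) for a zone; requires num_nodes >= 2."""
    primary = zone % num_nodes            # zones are striped across all nodes
    mirror = (primary + 1) % num_nodes    # assumed: mirror on the next node in the chain
    return primary, mirror


def readable_after_failure(num_zones: int, num_nodes: int, failed: set[int]) -> bool:
    """True if every zone still has at least one copy on a surviving node."""
    return all(
        any(node not in failed for node in replicas(zone, num_nodes))
        for zone in range(num_zones)
    )
```

In this model, any single node failure leaves every zone readable from its surviving copy. The failure of two adjacent nodes in the chain removes both copies of some zones.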
In a cluster which is configured to provide redundancy, either through chained declustering or otherwise, a single node failure may occur without data loss, and the event of dropping the failed node and recovering its data from the other nodes can be handled in a manner that is transparent to the user. However, during the time that the failed node is down, the system is vulnerable to a second node failure. Two node failures will most likely cause data loss, even in a storage system that has redundancy. The only way to mitigate this possibility of data loss is to ensure that the failed node is repaired or rebuilt as soon as possible. Several attempts have been made to make this process automatic, so that administrator error does not expose the system to the possibility of data loss. One of the most common solutions is to keep a hot-spare storage node in the system. When a node fails, and the data on it loses redundancy, the hot-spare is deployed by the system and the data that was present on the failed node is rebuilt onto it. When the hot-spare rebuild has been completed, the system regains redundancy. When the failed node is replaced or repaired, it may either function as a new hot-spare, or the cluster may be transformed back to its original configuration, releasing the original hot-spare.
Some storage clusters utilize a dedicated hot-spare storage node. A dedicated hot-spare is a separate storage node that is present in the storage cluster, and possibly powered on, ready to receive I/O requests. When any node in a cluster with a dedicated hot-spare fails, the other nodes immediately identify the hot-spare as the rejoining node and rebuild it. In this manner, the cluster is re-formed with redundancy, and a node failure can still be tolerated. However, unless another hot-spare is added, it is not possible to further re-form the cluster.
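The behavior of a dedicated hot-spare can be modeled with a small sketch. The `Cluster` class and its `fail` method are hypothetical names introduced for illustration. The sketch tracks only cluster membership and omits the rebuild of data onto the hot-spare.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cluster:
    """Hypothetical membership model: active nodes plus idle dedicated hot-spares."""
    active: list[str]
    spares: list[str] = field(default_factory=list)

    def fail(self, node: str) -> Optional[str]:
        """Drop a failed node; return the hot-spare promoted in its place, if any."""
        slot = self.active.index(node)
        if self.spares:
            replacement = self.spares.pop(0)
            # The spare takes over the failed node's slot; in a real system the
            # surviving nodes would now rebuild the lost data onto it.
            self.active[slot] = replacement
            return replacement
        # No spare remains, so the cluster cannot be re-formed and runs degraded.
        del self.active[slot]
        return None
```

For example, a cluster created with `Cluster(active=["n0", "n1", "n2"], spares=["spare0"])` promotes `spare0` on the first failure. A second failure finds no spare available, which corresponds to the limitation noted above.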
While the utilization of dedicated hot-spares is popular in the RAID field and in the virtualization space, this solution is a costly one. This is because the resources that are required for hot-spare storage nodes are unused until a node fails. However, in order to prevent availability from being compromised, the hot-spare nodes must be powered on and ready all the time, contributing to cost without contributing to performance.
It is with respect to these considerations and others that the following disclosure is presented.