Scalability is an important requirement in many data storage systems, particularly in network-oriented storage systems such as network attached storage (NAS) systems and storage area network (SAN) systems. Different types of storage systems provide diverse methods of seamless scalability through storage capacity expansion. In some storage systems, such as systems utilizing redundant array of inexpensive disks (“RAID”) controllers, it is often possible to add disk drives (or other types of mass storage devices) to a storage system while the system is in operation. In such a system, a RAID controller re-stripes existing data onto a new disk and makes the capacity of the other disks available for new input/output (“I/O”) operations. This methodology, known as “vertical capacity expansion,” is common. However, this methodology has at least one drawback: it scales only data storage capacity, without improving other performance factors such as the processing power, main memory, or bandwidth of the system.
In other data storage systems, it is possible to add capacity by “virtualization.” In this type of system, multiple storage servers are utilized to field input/output (I/O) operations (i.e., reads and writes) independently, but are exposed to the initiator of the I/O operation as a single device, called a “storage cluster.” Each storage server in a cluster is called a “storage node,” a “data node,” or just a “node.” When available data storage capacity becomes low, a new server may be added as a new node in the data storage system. In addition to contributing increased storage capacity, the new storage node contributes other computing resources to the system, leading to true scalability. This methodology is known as “horizontal capacity expansion.” Some storage systems support vertical expansion of individual nodes as well as horizontal expansion by the addition of storage nodes.
Systems implementing horizontal capacity expansion may concatenate the capacity that is contributed by each node. However, in order to achieve the maximum benefit of horizontal capacity expansion, it is common to stripe data across the nodes in a similar manner to how data is striped across disks in RAID arrays. While striping data across nodes, the data is stored in a manner that ensures that different I/O operations are fielded by different nodes, thereby utilizing all of the nodes simultaneously. It is also desirable to avoid splitting I/O operations between multiple nodes, so that the I/O latency is low. Striping the data in this manner provides a boost to random I/O performance without decreasing sequential I/O performance. Each stripe in this type of implementation is called a “storage zone”, “data zone”, or just “zone.” Each node may contain multiple zones.
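As an illustration of striping across nodes, the following minimal sketch maps a logical byte offset to a zone and then to a node. The round-robin placement rule, the 1 MiB zone size, and the function names locate and nodes_touched are assumptions made for this example only; the description above does not prescribe a particular placement rule.

```python
ZONE_SIZE = 1 << 20  # hypothetical zone size of 1 MiB (an assumption)


def locate(offset, num_nodes, zone_size=ZONE_SIZE):
    """Return (zone index, node index) for a logical byte offset.

    Assumes round-robin placement: zone i is stored on node i % num_nodes.
    """
    zone = offset // zone_size
    return zone, zone % num_nodes


def nodes_touched(offset, length, num_nodes, zone_size=ZONE_SIZE):
    """Return the sorted node indices that one I/O operation would involve."""
    first_zone = offset // zone_size
    last_zone = (offset + length - 1) // zone_size
    return sorted({z % num_nodes for z in range(first_zone, last_zone + 1)})
```

Under these assumptions, an I/O operation that fits within one zone is fielded by a single node, while operations that fall in different zones tend to land on different nodes. An operation that crosses a zone boundary is split between nodes, which is the case the striping scheme seeks to avoid.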
In order to provide data reliability, multiple data zones can be grouped as a reliability group. In addition to the data zones, a reliability group includes one or more parity zones, which provide data reliability for the data zones of the group. Each data zone in the reliability group may reside on a separate node; alternatively, some data zones in the reliability group may reside on the same node. The parity zones may also reside on separate nodes. A parity zone contains reliability data encoded from the data of the data zones of its reliability group. Similar to the parity concept in RAID systems, the parity zones provide an error protection scheme for the data within the reliability group. In case one or more data zones of the reliability group are inaccessible or contain erroneous data, the reliability data in the parity zones may be utilized in combination with data in the still-accessible zones to correct the error or restore a copy of the data in the inaccessible data zone(s).
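The recovery described above can be sketched with single XOR parity, by analogy with RAID. The choice of XOR as the encoding is an assumption for illustration; the reliability data could be produced by other codes. The helper names xor_blocks, make_parity, and rebuild_zone are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_zones):
    """Encode one parity zone from the data zones of a reliability group."""
    return xor_blocks(data_zones)


def rebuild_zone(surviving_zones, parity):
    """Restore the single missing data zone from the survivors and the parity."""
    return xor_blocks(surviving_zones + [parity])


# Example: three small data zones; the second one becomes inaccessible.
zones = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = make_parity(zones)
restored = rebuild_zone([zones[0], zones[2]], parity)
```

With a single XOR parity zone, only one inaccessible data zone per reliability group can be restored; tolerating more failures would require additional parity zones and a different encoding.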
However, the data zones and parity zones of a reliability group typically reside on separate nodes. In order to restore data or correct an error using the reliability data in a parity zone, the other data nodes need to transmit the data in their data zones to the node holding the reliability data. This recovery process involves a large number of network requests for exchanging data between nodes and places a serious I/O burden on the data nodes. For a data storage cluster containing a large number of nodes, this can cause severe performance issues.