A storage apparatus including a plurality of storage devices such as hard disk drives (HDDs) and solid state drives (SSDs) can recover data by means of various erasure correction techniques even when a storage device has failed.
FIG. 21 illustrates an example of a method of recovering from a disk failure in a storage system 100 including a storage apparatus. The storage system 100 illustrated in FIG. 21 includes a disk group in which three stripes (stripe-1 to stripe-3) are established for six HDDs (HDD-1 to HDD-6). In the disk group, information encoded with an erasure code is stored in each of the stripes.
For example, as illustrated in the upper column of FIG. 21, for stripe-2, data (2D1 to 2D4) is stored in four HDDs (HDD-2 to HDD-5) in a distributed manner, and a parity (2P) of stripe-2 is stored in HDD-6. For stripe-2, the block in HDD-1 is a free block, which is used as an alternative block when any one of HDD-2 to HDD-6 fails. “2D1” represents a first block among data in stripe-2, and “2P” represents a parity block of stripe-2. In the description below, representation of “free” for an alternative block may be omitted.
When HDD-5 in the above-described storage system 100 fails, as illustrated in the lower column of FIG. 21, a control device of the storage apparatus, such as a controller module (hereinafter referred to as CM, not illustrated), performs rebuild processing to recover the data stored in HDD-5. For example, with respect to stripe-1, the CM acquires data of 1D1 to 1D4 from HDD-1 to HDD-4, performs parity calculation for 1D1 to 1D4 to generate 1P, and writes the generated 1P into an alternative block in HDD-6. A similar operation is performed for the other stripes. With respect to stripe-2, the CM generates 2D4 on the basis of data of 2D1 to 2D3 and 2P and writes the generated 2D4 into an alternative block in HDD-1. With respect to stripe-3, the CM generates 3D3 on the basis of data of 3D1, 3D2, 3D4, and 3P and writes the generated 3D3 into an alternative block in HDD-2.
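The recovery of 2D4 described above can be sketched as follows. This is a minimal illustration only, assuming that the erasure code is a simple byte-wise XOR parity as in RAID5; the description itself does not fix the code used. The helper name xor_blocks and the 4-byte payloads are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks (assumed XOR parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical payloads for the data blocks of stripe-2 (2D1 to 2D4).
data = {
    "2D1": b"\x01\x02\x03\x04",
    "2D2": b"\x10\x20\x30\x40",
    "2D3": b"\xaa\xbb\xcc\xdd",
    "2D4": b"\x0f\x0e\x0d\x0c",
}

# Parity 2P as it would be stored in HDD-6 before the failure.
parity_2p = xor_blocks(list(data.values()))

# HDD-5, which holds 2D4, fails. The lost block is regenerated from the
# surviving data blocks and the parity, then written to the alternative block.
survivors = [data["2D1"], data["2D2"], data["2D3"], parity_2p]
rebuilt_2d4 = xor_blocks(survivors)

print(rebuilt_2d4 == data["2D4"])
```

Under the same assumption, the regeneration of 1P for stripe-1 is the XOR of 1D1 to 1D4, and the regeneration of 3D3 for stripe-3 is the XOR of 3D1, 3D2, 3D4, and 3P.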
As a related technique, a technique is known in which each block of data is encoded with an erasure code and distributed into a group of storage nodes. When a failure occurs in a storage node, data in the failed node is rebuilt in an unused node on the basis of data in other nodes.
A related technique is disclosed in, for example, Japanese Laid-open Patent Publication No. 2010-79886.
In the storage system 100, block arrangement on each HDD varies in accordance with the management of a free area on the HDD by the CM or the like.
Therefore, as illustrated in the lower column of FIG. 22, when data release and reallocation are repeated in a unit of stripe, data or parity information is stored into blocks of each HDD not in the order of stripes but in a random order. For example, in HDD-3, blocks 2D2 (stripe-2), 1D3 (stripe-1), and 3D1 (stripe-3) are stored in this order from the head of the storage area of the HDD.
A case where HDD-5 is in failure under the state illustrated in the lower column of FIG. 22 is discussed. In this case, constituent blocks of stripe-1 to stripe-3 are arranged randomly in each HDD. Thus, when the rebuild processing is performed by the CM, accesses to the HDDs occur in an order different from the order of addresses on the HDDs, as illustrated in the lower column of FIG. 23.
For example, in HDD-3, the CM first reads 1D3 at an address around the center of the storage area for recovery of stripe-1, and then reads 2D2 at an address around the head of the storage area for recovery of stripe-2. Then, the CM finally reads 3D1 at an address around the end (last address) of the storage area for recovery of stripe-3.
Also, for example, in HDD-1 including an alternative block, the CM first reads 1D1 at an address around the end of the storage area for recovery of stripe-1, and then writes 2D4 generated on the basis of information in other HDDs at an address around the center of the storage area for recovery of stripe-2. Then, the CM finally reads 3P at an address around the head of the storage area for recovery of stripe-3.
Thus, when data release or reallocation is repeated in a unit of stripe in the storage system 100, accesses to each HDD during the rebuild processing are performed in the order of the stripes to be recovered rather than in the order of addresses on the HDD, and are therefore made in a random order.
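The access pattern on HDD-3 described above can be sketched as follows. This is an illustrative model only: block positions are represented as slot indices 0 (head), 1 (center), and 2 (end), and the function names visit_offsets and head_travel are hypothetical.

```python
def visit_offsets(layout, stripe_order):
    """Return the slots visited when stripes are rebuilt in stripe_order.

    layout[i] is the stripe number of the block stored in slot i.
    """
    slot_of = {stripe: slot for slot, stripe in enumerate(layout)}
    return [slot_of[stripe] for stripe in stripe_order]


def head_travel(offsets):
    """Sum of the distances, in slots, between consecutive accesses."""
    return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))


# HDD-3 holds 2D2, 1D3, and 3D1 in this order from the head of the storage area.
hdd3_layout = [2, 1, 3]

# Rebuild in stripe order: stripe-1, then stripe-2, then stripe-3.
stripe_order_offsets = visit_offsets(hdd3_layout, [1, 2, 3])

# For comparison: the same blocks read in ascending address order.
address_order_offsets = sorted(stripe_order_offsets)

print(stripe_order_offsets, head_travel(stripe_order_offsets))    # [1, 0, 2] 3
print(address_order_offsets, head_travel(address_order_offsets))  # [0, 1, 2] 2
```

With only three blocks per HDD the difference is small, but the stripe-order sequence is already non-monotonic, which is the property that makes the accesses random when many stripes are managed.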
In the example of FIG. 23, the storage system 100 manages three stripes, and each HDD stores three blocks at the maximum. In practice, however, many more stripes are managed, and a great number of blocks are stored in one HDD. For example, when each HDD in FIG. 23 is a storage device compatible with the Serial Attached Small Computer System Interface (SAS) having readout performance of about 100 MB/s and each block has a small size of about 4 KB, accesses to each HDD are performed as random accesses at about 10 MB/s.
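The effect of these figures on rebuild time can be estimated with a short calculation. The throughput values and the block size are those given above; the amount of data to be read per HDD (1 TB) is a hypothetical value chosen only for illustration, and decimal units (1 MB = 10^6 bytes) are assumed.

```python
SEQUENTIAL_BYTES_PER_S = 100 * 10**6  # about 100 MB/s sequential readout
RANDOM_BYTES_PER_S = 10 * 10**6       # about 10 MB/s with random 4 KB accesses
BLOCK_BYTES = 4 * 10**3               # about 4 KB per block

# Hypothetical amount of data read from one HDD during the rebuild.
DATA_BYTES = 10**12                   # 1 TB, an assumption for illustration

random_blocks_per_s = RANDOM_BYTES_PER_S // BLOCK_BYTES
sequential_rebuild_s = DATA_BYTES // SEQUENTIAL_BYTES_PER_S
random_rebuild_s = DATA_BYTES // RANDOM_BYTES_PER_S
slowdown = random_rebuild_s / sequential_rebuild_s

print(random_blocks_per_s)   # 2500 blocks per second
print(sequential_rebuild_s)  # 10000 seconds
print(random_rebuild_s)      # 100000 seconds
print(slowdown)              # 10.0
```

Under these assumptions, reading the same amount of data takes about ten times longer when the accesses are random than when they are sequential.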
Thus, in a storage system including one or more storage apparatuses, when blocks stored in an HDD are not arranged in the order of the stripes, readout from the HDD is randomized during the rebuild processing in the storage apparatuses, and the performance of the rebuild processing deteriorates accordingly.
The above problem may occur in a storage system storing information encoded with an erasure code as described above. For example, such a storage system has a configuration formed by combining redundant arrays of inexpensive disks 5 (RAID5) and a wide stripe configured to extend the stripes established for the RAID5 to an additional disk. The above problem may occur similarly in a configuration using another RAID technology (for example, RAID6) in place of RAID5 or using a combination of a plurality of RAID technologies, and in a storage system using another erasure code.
The above problem may occur similarly even in a storage system not including an alternative block, for example, a storage system using a conventional RAID5 or RAID6, when blocks stored in the HDD are not arranged in the order of the stripe.