In a data storage array comprising multiple hard disks, random duplicate allocation (RDA), e.g. as described by Sanders et al. in “Fast concurrent access to parallel disks”, SODA '00: Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 849-858, is a technique for allocating data to the disks redundantly and randomly, so as to avoid the possibility that a small number of the disks will become disproportionately heavily loaded compared to the rest, which could bottleneck performance.
In a standard deployment of RDA, requests for large objects, which comprise many blocks, may not map to a set of parallel sequential I/Os from all disks in the array. As a result, the full sequential bandwidth of the array is not generally achieved. In fact, for standard block sizes, the seek time is usually greater than the data transfer time. Therefore, a standard RDA layout achieves no better than half the sequential bandwidth for large reads. To see this, consider a three-disk 2-RDA array to which a single 6 MB record, subdivided into 1 MB blocks, has been written. Table 1 below shows an example RDA layout for that record. “Xn” denotes the location of a physical block, where “X” is the disk label from the set of A, B and C, and “n” is the block number on that disk.
TABLE 1
Logical block #              1    2    3    4    5    6
Copy 1, physical location    B4   A1   A2   C1   B3   A6
Copy 2, physical location    A3   C2   B2   A4   A5   C3
Suppose that the average seek time for a block plus the rotational latency is the same as the block data transfer time. For simplicity of calculation, let this time equal 1 unit. Read and write operations will now be considered separately. For reads, the theoretical optimum is to read two sequential blocks from each of the three disks (A, B, C) in parallel. This would take 3 time units (one initial seek plus two block transfers). A basic algorithm which reads the logical blocks one by one would take 6 units for data transfer plus 6 units for seeks, i.e. 12 units in total, or 4 time units once divided by 3, if the load is shared equally between all physical disks and they are accessed in parallel. Taking 4 time units instead of the optimal 3 is equivalent to 75% of the array bandwidth.
A better algorithm could look ahead in the logical block space and detect the opportunity to read logical blocks 2 and 3 in sequence from physical disk A (A1, A2), logical blocks 3 and 5 from disk B (B2, B3) and finally logical blocks 4, 5 and 6 from disk A (A4, A5, A6). However, it is undesirable to take advantage of both sequential reads from disk A, since this would put too much load on this one disk. Also, the sequence on disk B overlaps in the logical address space with the two sequences on disk A, and reading overlapping sequences would read some of the logical blocks twice. Therefore, the algorithm would schedule a sequential read for one two-block subsequence (from disk A or B), and let that disk complete its I/O one time unit earlier (due to one removed seek). This leaves 11 units of work, or about 3.67 time units when shared between the three disks, which improves the average array bandwidth to around 82%.
An optimal algorithm, which considers the entire record and buffers up logical blocks read ahead of time, could remove yet another seek (for example by reading A1, A2; B3, B4; C1, C3), improving the bandwidth to 90% (10 units of work, or about 3.33 time units). However, there is no way to achieve the full bandwidth for that particular layout. Also, the optimal algorithm relies on O(record size) buffers, and may not deliver the first logical block until the very last block has been read.
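The read-bandwidth figures above can be checked with a short calculation. The following Python sketch is illustrative only: it assumes the simplified cost model used in the text (one time unit per seek, one time unit per block transfer, and the total work shared equally between the three disks), and the names `bandwidth_fraction`, `naive`, `lookahead` and `buffered` are hypothetical labels chosen for this example.

```python
from fractions import Fraction

NUM_DISKS = 3    # disks A, B and C
NUM_BLOCKS = 6   # 6 MB record split into 1 MB blocks

# Theoretical optimum: each disk performs one seek and then transfers
# its share of the blocks sequentially (1 + 2 = 3 time units).
OPTIMUM = 1 + Fraction(NUM_BLOCKS, NUM_DISKS)

def bandwidth_fraction(seeks):
    """Fraction of the array bandwidth achieved with the given number of seeks.

    Assumes the total work (transfers plus seeks) is spread evenly over
    all disks, as in the simplified model of the text.
    """
    average_time = Fraction(NUM_BLOCKS + seeks, NUM_DISKS)
    return OPTIMUM / average_time

naive = bandwidth_fraction(6)      # one seek per logical block
lookahead = bandwidth_fraction(5)  # one seek removed by a two-block sequential read
buffered = bandwidth_fraction(4)   # a further seek removed by reordering and buffering

for name, value in [("naive", naive), ("lookahead", lookahead), ("buffered", buffered)]:
    print(f"{name}: {float(value):.1%}")
```

Under these assumptions the three strategies come out at 3/4, 9/11 and 9/10 of the optimum, matching the 75%, roughly 82% and 90% figures quoted above.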
For writes, as for reads, the naive algorithm which writes one logical block at a time would fail to take advantage of possible sequential writes, except for the sequence A1-A2. The lookahead algorithm would write the following sequentially: A1-A2, B2-B3 and A4-A6. Finally, the optimal, reordering algorithm could dispatch the sequences A1-A6, B2-B4 and C1-C3. However, this suffers from two major disadvantages: first, the entire record may have to be available ahead of the write; and second, the loads handled by the disks may not be equal. In the above example, disk A stores significantly more of the record (six blocks) than disks B and C (three blocks each).
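The sequences dispatched by the reordering write algorithm, and the resulting load imbalance, can be derived mechanically from Table 1. The Python sketch below is a minimal illustration, not an implementation of any particular RDA system: it groups both copies of every block by disk, sorts the physical block numbers and merges consecutive numbers into runs. The helper names `blocks_per_disk` and `sequential_runs` are hypothetical.

```python
from collections import defaultdict

# Physical locations from Table 1, indexed by logical block 1..6.
COPY_1 = ["B4", "A1", "A2", "C1", "B3", "A6"]
COPY_2 = ["A3", "C2", "B2", "A4", "A5", "C3"]

def blocks_per_disk(*copies):
    """Group physical block numbers by disk label, sorted in ascending order."""
    per_disk = defaultdict(list)
    for copy in copies:
        for location in copy:
            disk, number = location[0], int(location[1:])
            per_disk[disk].append(number)
    return {disk: sorted(numbers) for disk, numbers in sorted(per_disk.items())}

def sequential_runs(numbers):
    """Merge sorted block numbers into (first, last) runs of consecutive blocks."""
    runs = []
    for n in numbers:
        if runs and n == runs[-1][1] + 1:
            runs[-1][1] = n          # extend the current run
        else:
            runs.append([n, n])      # start a new run (costs one seek)
    return [tuple(run) for run in runs]

layout = blocks_per_disk(COPY_1, COPY_2)
runs = {disk: sequential_runs(numbers) for disk, numbers in layout.items()}
load = {disk: len(numbers) for disk, numbers in layout.items()}

for disk in layout:
    print(disk, runs[disk], "blocks written:", load[disk])
```

For the layout of Table 1 this yields a single run per disk (A1-A6, B2-B4 and C1-C3), with disk A receiving six of the twelve block writes and disks B and C three each, which is the imbalance noted above.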