In modern storage systems, RAID (Redundant Array of Independent Disks) is the preferred technique for achieving high performance and reliability. Among the well-known RAID levels, RAID-6, which tolerates two disk failures, offers the best balance between storage efficiency and reliability. Erasure-coding technologies can provide both high fault tolerance and high storage efficiency.
While all the erasure coding techniques are feasible in practice, coding schemes based on the Reed-Solomon (RS) code are the most popular because of their MDS (Maximum Distance Separable) property. The information dispersal algorithm [9] and Parchive, adopted in some schemes or systems [6], are in fact derived from RS codes. To date, several classes of MDS array codes have been designed to recover from double storage-node failures, including the EVENODD code [1], [2], X-code [11], the RDP (Row-Diagonal Parity) scheme [3], Libertine Code [7], and their derivative schemes [5], [10]. Although X-code [11] is an elegant two-erasure code, it is not a RAID code, since it is a vertical code and does not fit the RAID-6 specification of having coding devices P and Q, where P is a simple parity device [7]. In fact, no non-MDS code or vertical code can be implemented in a RAID-6 system [7]. A recent examination of the performance of codes for RAID-6, using Classic Reed-Solomon codes and Cauchy Reed-Solomon codes based on Open-Source Erasure Coding Libraries, concluded that special-purpose RAID-6 codes vastly outperform their general-purpose counterparts and that RDP performs the best of these by a narrow margin [8].
See U.S. Pat. No. 7,577,866, expressly incorporated herein by reference.
Redundant array of independent disks (RAID) is a technology that provides increased storage reliability through redundancy. In RAID, data is replicated across multiple hard drives in a redundant manner so as to facilitate error correction and data recovery in the event of failure of one of the hard drives. See Wikipedia: RAID, en.wikipedia.org/wiki/RAID, last accessed Jun. 29, 2010.
See U.S. Pat. Nos. 7,747,819, 7,743,308, 7,734,868, 7,734,865, 7,725,765, 7,721,146, 7,721,143, 7,716,421, 7,711,793, 7,702,948, 7,702,877, 7,698,592, 7,694,072, 7,685,499, 7,685,462, 7,681,111, 7,681,072, 7,673,167, 7,669,007, 7,664,915, 7,661,058, 7,660,966, 7,657,823, 7,657,705, 7,653,781, 7,650,463, 7,644,304, 7,640,452, 7,634,686, 7,634,618, 7,634,617, 7,631,219, 7,631,218, 7,624,206, 7,617,361, 7,610,446, 7,603,529, 7,601,189, 7,600,151, 7,600,132, 7,594,077, 7,590,780, 7,587,631, 7,587,630, 7,584,325, 7,574,623, 7,574,560, 7,574,542, 7,567,514, 7,565,488, US Pat. App. Nos. 2010/0162088, 20100161898, 2010/0158241, 2010/0153641, 2010/0153640, 2010/0138691, 2010/0138672, 2010/0115331, 2010/0115198, 2010/0106906, 2010/0106904, 2010/0095187, 2010/0095150, 2010/0095060, 2010/0083039, 2010/0079885, 2010/0070705, 2010/0070703, 2010/0064161, 2010/0064103, 2010/0057987, 2010/0050016, 2010/0049914, 2010/0037091, 2010/0037023, 2010/0037022, 2010/0031060, 2010/0030960, 2010/0023814, 2010/0023686, 2010/0011162, 2010/0005482, 2009/0327606, 2009/0327603, 2009/0307426, 2009/0307422, 2009/0307421, 2009/0300282, 2009/0287882, 2009/0287880, 2009/0276566, 2009/0271659, 2009/0249111, 2009/0228650, 2009/0222829, 2009/0210744, 2009/0210742, 2009/0210620, 2009/0210619, and 2009/0204758, the disclosure of each of which is expressly incorporated herein by reference.
A row diagonal parity (RDP) technique reduces the overhead of computing diagonal parity for a storage array adapted to enable efficient recovery from the concurrent failure of two storage devices in the array. The diagonal parity is computed along diagonal parity sets that collectively span all data disks and a row parity disk of the array. The parity for all of the diagonal parity sets except one is stored on the diagonal parity disk. The RDP technique provides a uniform stripe depth and an optimal amount of parity information. See U.S. Pat. No. 7,203,892, the disclosure of which is expressly incorporated herein by reference in its entirety. See also U.S. Pat. No. 7,409,625, U.S. Pat. No. 6,993,701, 2010/0146127, 2008/0227899, 2008/0201457, 2007/0180348, 2006/0142878, 2006/0107135, and 2003/0126523, each of which is expressly incorporated herein by reference.
Definitions of Terms
Assume that the input consists of n symbols (called the n information symbols and denoted d) and that the output consists of n+r symbols, where r is the number of redundant (parity) symbols generated by the erasure code; the parity symbols are denoted c.
Symbol: the fundamental unit, normally a unit of stored data or parity. In practice it could be a disk or a disk sector, or a byte or a bit within a code. Symbols are not necessarily binary.
Chain: a set of data symbols and parity symbols that are completely connected, that is, interdependently related through parity computation relations.
Container: a virtual storage medium that holds all symbols belonging to one column of the array; normally it is a disk.
Optimal Erasure Code: a code in which the original data symbols can be recovered from any n of the n+r output symbols. An optimal code is also known as a Maximum Distance Separable (MDS) code.
Non-Optimal Erasure Code: the original symbols can only be recovered from n+e of the n+r output symbols, where e is the number of extra redundant symbols that must be retrieved from the chain.
Systematic Erasure Code: in this type of coding scheme the first n output code symbols are the original n input information symbols.
Non-Systematic Erasure Code: all the code schemes that are not in the systematic code category.
Update Complexity: the average number of parity symbols affected by a change to an individual information symbol.
Storage Efficiency: the ratio of information symbols to the whole symbol set, n/(n+r) in the notation above (often written k/n for k information symbols out of n total symbols).
Regular Structure: the coding and decoding procedures can be expressed as explicit equations. This regularity makes the code easy to implement.
Irregular Structure: no explicit equation describes the coding or decoding algorithm. Such codes may nevertheless achieve better performance in these procedures.
Array Codes: two-dimensional codes in which the parity check equations are given by XOR operations over lines in one or more directions. In fact, all XOR erasure codes can be viewed as a family of array codes.
Horizontal Code: each symbol of a chain is located in a different container.
Vertical Code: in contrast to a horizontal code, all the information symbols and redundant symbols that belong to a chain reside in the same column and are stored in one container.
Out-degree: the number of parity symbols to which an information symbol contributes.
In-degree: the number of information symbols that are involved in computing a parity symbol.
Since all erasure codes operate on a ring or finite field, for convenience we define <m>_n = j if and only if j ≡ m (mod n) and 0≦j≦n−1. For instance, <7>_5 = 2 and <−2>_5 = 3. For brevity, we write <m>, or simply m, in place of <m>_n when there is no confusion.
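As a small illustration of the bracket notation, the following Python sketch maps any integer m to its residue in the range 0 to n−1. The function name bracket is hypothetical and is used only for this illustration; the sketch relies on Python's % operator returning a non-negative result for a positive modulus.

```python
def bracket(m: int, n: int) -> int:
    """Return <m>_n: the unique j with j congruent to m (mod n) and 0 <= j <= n - 1."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # For a positive modulus, Python's % never returns a negative value.
    return m % n


print(bracket(7, 5))   # 2
print(bracket(-2, 5))  # 3
```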
The RDP scheme is exemplified schematically in FIGS. 6A-6C. Assuming n=5 and r=2, the information symbols in the data array are d0, d1, d2, d3, d4, and the redundant symbols are c0 and c1. There are two chains: d0d1d2d3d4c0 and d0d1d2d3d4c1. If the information symbols can be reconstructed whenever any two symbols are lost, the code is an optimal erasure code, or MDS code; otherwise, it is a non-optimal erasure code.
If the first five symbols of the output code are d0d1d2d3d4, the code is systematic; otherwise, it is non-systematic. Since each information symbol contributes to two parity symbols, the update complexity (out-degree) is 2. Since each parity symbol is connected to five information symbols, the in-degree is 5. The storage efficiency is 5/7.
If the encoding and decoding method can be described explicitly by an equation, for example c0 being the XOR of all information symbols (see FIG. 6A), the code has a regular structure; otherwise, it is irregular. If more parity symbols can be added to tolerate the loss of more symbols, for example r growing to 20 or more as n increases, and the information symbols can still be recovered when any r symbols are lost, the code is said to be resilient; otherwise, it is limited-tolerant. If the output code is placed as in FIG. 6B, so that each disk holds either parity or information symbols, it is a horizontal code; otherwise, it is a vertical code, as shown in FIG. 6C.
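The metrics of the n=5, r=2 example can be checked with a short Python sketch. The dictionary layout and the function names below are assumptions made for this illustration only; each chain is represented as a parity symbol mapped to the information symbols it protects.

```python
from fractions import Fraction

# Toy layout from the example: five information symbols, two parity symbols.
information = ["d0", "d1", "d2", "d3", "d4"]
chains = {
    "c0": ["d0", "d1", "d2", "d3", "d4"],
    "c1": ["d0", "d1", "d2", "d3", "d4"],
}


def out_degree(symbol, chains):
    """Number of parity symbols to which an information symbol contributes."""
    return sum(symbol in members for members in chains.values())


def in_degree(parity, chains):
    """Number of information symbols involved in computing a parity symbol."""
    return len(chains[parity])


def update_complexity(information, chains):
    """Average out-degree over all information symbols."""
    total = sum(out_degree(sym, chains) for sym in information)
    return Fraction(total, len(information))


def storage_efficiency(information, chains):
    """Information symbols divided by all symbols, i.e. n / (n + r)."""
    return Fraction(len(information), len(information) + len(chains))


print(update_complexity(information, chains))   # 2
print(in_degree("c0", chains))                  # 5
print(storage_efficiency(information, chains))  # 5/7
```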
Although other methods for distributed fault-tolerant storage, such as RAID-4 and RAID-5, were developed earlier, EVENODD is the milestone that marked the application of the XOR method to the symbol storage area. In EVENODD, the information symbols occupy n disks, where n is a prime number. The redundant symbols are placed on two additional disks, so the total number of disks is n+2. For simplicity, we assume that each disk stores only n−1 symbols. The ith symbol on the jth disk is referred to as a_{i,j} (also written d_{i,j} below), where 0≦i≦n−2 and 0≦j≦n+1. The redundant symbols are stored on the last two disks. The EVENODD code can be specified as follows: in an (n−1)×(n+2) array, compute the content of the redundant part from the information symbols such that the information contained in any two erased disks can be reconstructed from the other n disks. The encoding algorithm of the EVENODD code solves this problem and requires only XOR operations to compute the redundancy.
Before formally describing the encoding procedure, the following assumption is made. Suppose that there is an imaginary all-zero row after the last row, i.e., d_{n−1,j}=0 for 0≦j≦n−1. With this assumption, the array becomes an n×(n+2) array. Although the assumption is not necessary, it is useful.
Let
\[ S = \bigoplus_{t=1}^{n-1} d_{n-1-t,\,t} \]
For each row x, 0≦x≦n−2, the redundant symbols are obtained as follows:
\[ c_{x,0} = \bigoplus_{t=0}^{n-1} d_{x,t}, \]
and
\[ c_{x,1} = S \oplus \left( \bigoplus_{t=0}^{n-1} d_{\langle x-t \rangle_n,\,t} \right) \]
As the equations show, two types of redundancy are obtained: horizontal redundancy and diagonal (slope=−1) redundancy. The first redundant disk is simply the XOR of disks 0, 1, . . . , n−1. In fact, its contents are exactly the same as those of the parity disk in an equivalent RAID-5 array with one less disk.
The contents of the other redundant disk come from the diagonal redundancy, calculated using the diagonal-parity equation above, in which the parity is determined by S. When S is 0, the diagonal parity is an even parity check; when S is 1, it is an odd parity check. Because of this parity-check characteristic, the researchers named the scheme the EVENODD code.
The (n−1)×(n+2) array defined above can recover the information symbols lost in any two columns. Therefore, the minimum distance of the code is three. The encoding procedure is very simple, and implementing the two parity equations in digital circuits is straightforward. More generally, the equations can be implemented in the RAID controller using XOR hardware.
To decode, consider the (n−1)×(n+2) array of symbols whose last two columns are redundant according to the two parity-encoding equations above. Assume that columns (disks) i and j have failed, where 0≦i<j≦n+1. Consider the following four scenarios:
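The encoding procedure can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function names evenodd_encode and xor_all are hypothetical, symbols are modeled as Python integers so that XOR acts bitwise, n is assumed to be an odd prime, and diagonal indices are taken modulo n with the imaginary all-zero row appended.

```python
from functools import reduce
from operator import xor


def xor_all(values):
    """XOR of an iterable of integers (0 for an empty iterable)."""
    return reduce(xor, values, 0)


def evenodd_encode(data, n):
    """Return (row_parity, diag_parity) for an (n-1) x n block of symbols.

    data[x][t] is the symbol in row x of data disk t; n is assumed to be an
    odd prime.
    """
    if len(data) != n - 1 or any(len(row) != n for row in data):
        raise ValueError("data must have n-1 rows and n columns")
    # Append the imaginary all-zero row n-1 so diagonals can wrap modulo n.
    d = [list(row) for row in data] + [[0] * n]
    # S is the XOR along the special diagonal d[n-1-t][t], t = 1..n-1.
    s = xor_all(d[n - 1 - t][t] for t in range(1, n))
    # Row parity: XOR of the n data symbols in each row.
    row_parity = [xor_all(d[x]) for x in range(n - 1)]
    # Diagonal parity: S XOR the symbols on the diagonal through (x, 0).
    diag_parity = [
        s ^ xor_all(d[(x - t) % n][t] for t in range(n)) for x in range(n - 1)
    ]
    return row_parity, diag_parity


# n = 3: two rows on three data disks.
print(evenodd_encode([[1, 0, 0], [0, 1, 0]], 3))  # ([1, 1], [0, 1])
```

In this small example S is 1, which is why the second diagonal parity symbol is 1 even though every data symbol on that diagonal is 0.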
1. i=n and j=n+1
Both redundant disks have failed. Disk n can be reconstructed using the row-parity equation and disk n+1 using the diagonal-parity equation. In fact, the recovery procedure is equivalent to the encoding procedure.
2. i<n and j=n
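The row-parity disk and one information disk have failed. Disk i can be recovered as follows:
Let
\[ S = d_{\langle i-1 \rangle_n,\,n+1} \oplus \left( \bigoplus_{y=0}^{n-1} d_{\langle i-y-1 \rangle_n,\,y} \right) \]
where d_{n−1,y}=0 for 0≦y≦n+1 and column n+1 holds the diagonal parity. Then
\[ d_{k,i} = S \oplus d_{\langle k+i \rangle_n,\,n+1} \oplus \left( \bigoplus_{y=0,\ y \neq i}^{n-1} d_{\langle k+i-y \rangle_n,\,y} \right), \]
where 0≦k≦n−2. Once disk i is known, d_{k,n}, 0≦k≦n−2, can be obtained from the row-parity equation.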
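Scenario 2 can be sketched in Python as follows. The function names are hypothetical and n is assumed to be an odd prime. The sketch assumes that the diagonal-parity symbol used for d_{k,i} is the one on the diagonal passing through cell (k, i), that is, row <k+i>_n of disk n+1, and that the inner XOR skips the failed column y=i; S is taken from the diagonal <i−1>_n, which meets disk i only in the imaginary zero row.

```python
from functools import reduce
from operator import xor


def xor_all(values):
    return reduce(xor, values, 0)


def evenodd_encode(data, n):
    """Compact encoder, included only to build an example stripe."""
    d = [list(row) for row in data] + [[0] * n]
    s = xor_all(d[n - 1 - t][t] for t in range(1, n))
    row_parity = [xor_all(d[x]) for x in range(n - 1)]
    diag_parity = [s ^ xor_all(d[(x - t) % n][t] for t in range(n))
                   for x in range(n - 1)]
    return row_parity, diag_parity


def recover_data_and_row_parity(data, diag_parity, i, n):
    """Rebuild data disk i and the row-parity disk from the survivors.

    The entries data[k][i] are never read, so they may hold any value.
    """
    d = [list(row) for row in data] + [[0] * n]  # imaginary zero row n-1
    q = list(diag_parity) + [0]                  # imaginary d[n-1][n+1] = 0
    # Diagonal <i-1>_n touches disk i only in the imaginary row, so it gives S.
    s = q[(i - 1) % n] ^ xor_all(d[(i - y - 1) % n][y] for y in range(n))
    column = [
        s ^ q[(k + i) % n]
        ^ xor_all(d[(k + i - y) % n][y] for y in range(n) if y != i)
        for k in range(n - 1)
    ]
    for k in range(n - 1):
        d[k][i] = column[k]
    # With disk i restored, the row parity is recomputed as in encoding.
    row_parity = [xor_all(d[k]) for k in range(n - 1)]
    return column, row_parity
```

A caller would pass the surviving data block (with arbitrary values in column i) together with the surviving diagonal-parity column.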
3. i<n and j=n+1
The diagonal-parity disk and one information disk have failed. Disk i can be recovered using the row-parity equation; once disk i has been reconstructed, disk n+1 can be recomputed using S and the diagonal-parity equation.
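Scenario 3 reduces to a RAID-5 style row repair followed by re-encoding, as the following Python sketch suggests. The function names are hypothetical, symbols are modeled as integers, and n is assumed to be an odd prime.

```python
from functools import reduce
from operator import xor


def xor_all(values):
    return reduce(xor, values, 0)


def evenodd_encode(data, n):
    """Compact encoder used to recompute the diagonal parity."""
    d = [list(row) for row in data] + [[0] * n]
    s = xor_all(d[n - 1 - t][t] for t in range(1, n))
    row_parity = [xor_all(d[x]) for x in range(n - 1)]
    diag_parity = [s ^ xor_all(d[(x - t) % n][t] for t in range(n))
                   for x in range(n - 1)]
    return row_parity, diag_parity


def recover_data_and_diag_parity(data, row_parity, i, n):
    """Rebuild data disk i from row parity, then recompute disk n+1.

    The entries data[k][i] are never read, so they may hold any value.
    """
    d = [list(row) for row in data]
    for k in range(n - 1):
        # Each row XORs to its row-parity symbol, so the missing symbol is
        # the XOR of the parity with the surviving symbols of that row.
        d[k][i] = row_parity[k] ^ xor_all(d[k][t] for t in range(n) if t != i)
    _, diag_parity = evenodd_encode(d, n)
    return [d[k][i] for k in range(n - 1)], diag_parity
```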
4. i<n and j<n
Both failed disks carry information, so they cannot be retrieved using the parities separately as in the previous three cases. The information can be recovered through the following steps.
Assuming that the imaginary row d_{n−1,y}=0 for 0≦y≦n−1, the diagonal parity S is:
\[ S = \left( \bigoplus_{x=0}^{n-2} c_{x,0} \right) \oplus \left( \bigoplus_{x=0}^{n-2} c_{x,1} \right) \]
that is, S is the XOR of all symbols in the two parity columns. Define the horizontal and diagonal syndromes as:
\[ S_u^0 = \bigoplus_{x=0,\ x \neq i,j}^{n} d_{u,x} \]
and
\[ S_u^1 = S \oplus d_{u,n+1} \oplus \left( \bigoplus_{x=0,\ x \neq i,j}^{n-1} d_{\langle u-x \rangle_n,\,x} \right) \]
where 0≦u≦n−1, and columns n and n+1 hold the row parity c_{x,0} and the diagonal parity c_{x,1}, respectively. Then, the symbols in columns i and j can be retrieved through the following steps:
i) Initialization:
Set s←<−(j−i)−1>_n, and d_{n−1,x}←0 for 0≦x≦n−1.
ii) Calculate the symbols:
Let d_{s,j}←S^1_{<j+s>_n}⊕d_{<s+(j−i)>_n,i}, and d_{s,i}←S^0_s⊕d_{s,j}.
iii) Execute iteratively:
Set s←<s−(j−i)>_n. If s=n−1 then stop; else go to step ii).
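The three steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: the function names are hypothetical, symbols are modeled as integers, n is assumed to be an odd prime, all row indices are reduced modulo n, and the imaginary row n−1 (including its two parity entries) is taken to be zero.

```python
from functools import reduce
from operator import xor


def xor_all(values):
    return reduce(xor, values, 0)


def evenodd_encode(data, n):
    """Compact encoder, included only to build an example stripe."""
    d = [list(row) for row in data] + [[0] * n]
    s = xor_all(d[n - 1 - t][t] for t in range(1, n))
    row_parity = [xor_all(d[x]) for x in range(n - 1)]
    diag_parity = [s ^ xor_all(d[(x - t) % n][t] for t in range(n))
                   for x in range(n - 1)]
    return row_parity, diag_parity


def recover_two_data_columns(data, row_parity, diag_parity, i, j, n):
    """Rebuild data disks i < j < n; their current contents are never read."""
    d = [list(row) for row in data] + [[0] * n]  # imaginary zero row n-1
    p = list(row_parity) + [0]                   # column n, imaginary entry 0
    q = list(diag_parity) + [0]                  # column n+1, imaginary entry 0
    # S is the XOR of all symbols in the two parity columns.
    s_total = xor_all(row_parity) ^ xor_all(diag_parity)
    keep = [x for x in range(n) if x not in (i, j)]
    # Horizontal syndromes: surviving data of row u plus its row parity.
    s0 = [p[u] ^ xor_all(d[u][x] for x in keep) for u in range(n)]
    # Diagonal syndromes: S, the diagonal parity, and surviving data of diagonal u.
    s1 = [s_total ^ q[u] ^ xor_all(d[(u - x) % n][x] for x in keep)
          for u in range(n)]
    # Step i): start on the diagonal whose column-i cell is imaginary.
    s = (-(j - i) - 1) % n
    while s != n - 1:
        # Step ii): the diagonal gives d[s][j], then the row gives d[s][i].
        d[s][j] = s1[(j + s) % n] ^ d[(s + (j - i)) % n][i]
        d[s][i] = s0[s] ^ d[s][j]
        # Step iii): move to the row whose diagonal partner is now known.
        s = (s - (j - i)) % n
    return ([d[k][i] for k in range(n - 1)], [d[k][j] for k in range(n - 1)])
```

Because n is prime and 0 < j−i < n, the walk over s visits each of the rows 0 to n−2 exactly once before reaching n−1, so both columns are fully rebuilt when the loop stops.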
Redundant array of independent disks level 6 (RAID-6) based on XOR operations outperforms other RAID systems thanks to row diagonal parity (RDP) coding, which has a faster decoding procedure than other codes. However, when the RDP code was designed, the advantages of parallel processing and multi-core processors were not considered, so it cannot take full advantage of hardware implementations of XOR codes. This patent application presents an optimal parallel decoding scheme called extended row diagonal parity (EDP), an extension of the double-erasure-correcting RDP code. EDP coding can improve the decoding speed of the RDP scheme by 50% without any change to the current RDP storage configuration.
FIG. 7 (see U.S. Pat. No. 7,702,660, expressly incorporated herein by reference), shows a block diagram that illustrates a computer system 400 upon which an embodiment of the invention may be implemented. Computer system 400 includes a bus 402 or other communication mechanism for communicating information, and a processor 404 coupled with bus 402 for processing information. Computer system 400 also includes a main memory 406, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 402 for storing information and instructions to be executed by processor 404. Main memory 406 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 404. Computer system 400 further includes a read only memory (ROM) 408 or other static storage device coupled to bus 402 for storing static information and instructions for processor 404. A storage device 410, such as a magnetic disk or optical disk, is provided and coupled to bus 402 for storing information and instructions.
Computer system 400 may be coupled via bus 402 to a display 412, such as a liquid crystal display (LCD), for displaying information to a computer user. An input device 414, including alphanumeric and other keys, is coupled to bus 402 for communicating information and command selections to processor 404. Another type of user input device is cursor control 416, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 404 and for controlling cursor movement on display 412. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane. Of course, other user input devices may be employed.
According to one embodiment of the invention, various techniques are performed by computer system 400 in response to processor 404 executing one or more sequences of one or more instructions contained in main memory 406. Such instructions may be read into main memory 406 from another machine-readable medium, such as storage device 410. Execution of the sequences of instructions contained in main memory 406 causes processor 404 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software. The processor may be a single-core processor, a multi-core processor, a single-instruction multiple-data (SIMD) processor, a CPLD, an ASIC, or another type of processor.
The term “machine-readable medium” as used herein refers to any medium that participates in providing data that causes a machine to operate in a specific fashion. In an embodiment implemented using computer system 400, various machine-readable media are involved, for example, in providing instructions to processor 404 for execution. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 410. Volatile media includes dynamic memory, such as main memory 406. Transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 402. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications. The media may be tangible to enable the instructions carried by the media to be detected by a physical mechanism that reads the instructions into a machine.
Common forms of machine-readable media include, for example, a flash drive, a hard disk, or any other magnetic medium, a CD-ROM, a DVD-ROM, any other optical medium, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.
Various forms of machine-readable media may be involved in carrying one or more sequences of one or more instructions to processor 404 for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a copper wire or cable line using a modulator/demodulator (modem). A modem local to computer system 400 can receive the data on the wire or cable and use an infra-red or radio frequency transmitter to convert the data to an electromagnetic signal. An electromagnetic signal detector can receive the data carried in the signal and appropriate circuitry can place the data on bus 402. Bus 402 carries the data to main memory 406, from which processor 404 retrieves and executes the instructions. The instructions received by main memory 406 may optionally be stored on storage device 410 either before or after execution by processor 404.
Computer system 400 also includes a communication interface 418 coupled to bus 402. Communication interface 418 provides a two-way data communication coupling to a network link 420 that is connected to a local network 422. For example, communication interface 418 may be an integrated services digital network (ISDN) card or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 418 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 418 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
Network link 420 typically provides data communication through one or more networks to other data devices. For example, network link 420 may provide a connection through local network 422 to a host computer 424 or to data equipment operated by an Internet Service Provider (ISP) 426. ISP 426 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the “Internet” 428. Local network 422 and Internet 428 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 420 and through communication interface 418, which carry the digital data to and from computer system 400, are exemplary forms of carrier waves transporting the information.
Computer system 400 can send messages and receive data, including program code, through the network(s), network link 420 and communication interface 418. In the Internet example, a server 430 might transmit a requested code for an application program through Internet 428, ISP 426, local network 422 and communication interface 418.
The received code may be executed by processor 404 as it is received, and/or stored in storage device 410, or other non-volatile storage for later execution.