1. Field of the Invention
This invention relates to computer data storage systems, and more particularly, to Redundant Array of Inexpensive Disks (RAID) systems and data striping techniques.
2. Description of the Related Art
A continuing desire exists in the computer industry to improve the performance and reliability of computer systems. For the most part, the desire for improved performance has been met for the microprocessor components of computer systems, whose performance has steadily improved over the years. However, the performance of the microprocessor or processors in a computer system is only one component of the overall performance of the computer system. For example, the computer memory system must be able to keep up with the demands of the processor, or the processor will stall waiting for data from the memory system. Generally, computer memory systems have been able to keep up with processor performance through increased capacities, lower access times, new memory architectures, caching, interleaving and other techniques.
Another critical component of the overall performance of a computer system is I/O system performance. For most applications, the performance of the mass storage system or disk storage system is the critical performance component of a computer's I/O system. For example, when an application requires access to more data or information than fits in its allocated system memory, the data may be paged in/out of disk storage to/from the system memory. A page may be a unit of data (e.g. a fixed number of bytes recognized by the operating system) that is brought into system memory from disk storage when a requested item of data is not already in system memory. Typically, the computer system's operating system copies one or more pages from the disk storage system to system memory. When a program needs a page that is not in main memory, the operating system copies the required page into main memory and copies another page back to the disk system. Processing may be stalled while the program is waiting for the page to be copied. If storage system performance does not keep pace with performance gains in other components of a computer system, then delays in storage system accesses may overshadow performance gains elsewhere. Computer storage systems must also reliably store data. Many computer applications cannot tolerate data storage errors. Even if data errors are recoverable, data recovery operations may have a negative impact on performance.
One method that has been employed to increase the capacity, performance and reliability of disk storage systems is to employ an array of storage devices. An example of such an array of storage devices is a Redundant Array of Independent (or Inexpensive) Disks (RAID). A RAID system improves storage performance by providing parallel data paths to read and write information over an array of disks. By reading and writing multiple disks simultaneously, the storage system performance may be greatly improved. For example, an array of four disks that can be read and written simultaneously may provide a data rate almost four times that of a single disk. However, using arrays of multiple disks comes with the disadvantage of increased failure rates. In the example of a four disk array above, the mean time between failures (MTBF) for the array will be approximately one-fourth that of a single disk, assuming the disks fail independently. It is not uncommon for storage device arrays to include many more than four disks, shortening the mean time between failures from years to months or even weeks. RAID systems may address this reliability issue by employing parity or redundancy so that data lost from a device failure may be recovered.
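The one-fourth figure can be checked with simple arithmetic. The following is a minimal sketch, assuming identical disks that fail independently at a constant rate, so that per-disk failure rates add across the array. The function name and the per-disk MTBF value are hypothetical and chosen only for illustration.

```python
def array_mtbf(disk_mtbf_hours: float, num_disks: int) -> float:
    """Approximate mean time to the first disk failure in an array.

    Assumes identical disks failing independently at a constant rate,
    so the array failure rate is num_disks times the per-disk rate.
    """
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    return disk_mtbf_hours / num_disks


# Hypothetical per-disk figure, used only for illustration.
DISK_MTBF_HOURS = 200_000.0

four_disk = array_mtbf(DISK_MTBF_HOURS, 4)       # one-fourth of a single disk
hundred_disk = array_mtbf(DISK_MTBF_HOURS, 100)  # a much larger array
```

Under these assumptions a single disk rated in decades yields a 100-disk array whose first failure is expected within a few months, which is consistent with the trend described above.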
One common RAID technique or algorithm is referred to as RAID 0. RAID 0 is an example of a RAID algorithm used to improve performance by attempting to balance the storage system load over as many of the disks as possible. RAID 0 implements a striped disk array in which data is broken down into blocks and each block is written to a separate disk drive. Thus, this technique may be referred to as striping. I/O performance may be improved by spreading the I/O load across multiple drives since blocks of data will not be concentrated on any one particular drive. However, a disadvantage of RAID 0 systems is that they do not provide for any data redundancy and are thus not fault tolerant.
RAID 5 is an example of a RAID algorithm that provides some fault tolerance and load balancing. FIG. 1 illustrates a RAID 5 system, in which both data and parity information are striped across the storage device array. In a RAID 5 system, the parity information is computed over fixed-size, fixed-location stripes of data that span all the disks of the array. Together, each such stripe of data and its parity block form a fixed-size, fixed-location parity group. When a subset of the data blocks within a parity group is updated, the parity must also be updated. The parity may be updated in either of two ways: by reading the remaining unchanged data blocks and computing a new parity in conjunction with the new blocks, or by reading the old version of the changed data blocks, comparing them with the new data blocks, and applying the difference to the parity. In either case, the additional read and write operations can limit performance. This limitation is known as the small-write penalty. RAID 5 systems can withstand a single device failure by using the parity information to rebuild a failed disk.
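The two parity update paths described above may be easier to follow in code. The following is a minimal sketch, assuming XOR parity over equally sized blocks held in memory. The function names and the sample stripe contents are hypothetical and are not taken from any particular RAID implementation.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Bytewise XOR of one or more equally sized blocks."""
    if not blocks or any(len(b) != len(blocks[0]) for b in blocks):
        raise ValueError("need at least one block, all of the same size")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_by_reconstruction(unchanged: Sequence[bytes],
                             new_blocks: Sequence[bytes]) -> bytes:
    """First way: read the unchanged blocks and recompute parity with the new ones."""
    return xor_blocks([*unchanged, *new_blocks])


def parity_by_delta(old_parity: bytes,
                    old_blocks: Sequence[bytes],
                    new_blocks: Sequence[bytes]) -> bytes:
    """Second way: XOR the old and new versions of the changed blocks,
    then apply that difference to the old parity."""
    delta = xor_blocks([*old_blocks, *new_blocks])
    return xor_blocks([old_parity, delta])


# Hypothetical stripe of four two-byte data blocks.
stripe = [b"\x11\x11", b"\x22\x22", b"\x33\x33", b"\x44\x44"]
old_parity = xor_blocks(stripe)

# Update only the second data block.
new_d1 = b"\xa5\x5a"
p_reconstruct = parity_by_reconstruction([stripe[0], stripe[2], stripe[3]], [new_d1])
p_delta = parity_by_delta(old_parity, [stripe[1]], [new_d1])

# Rebuild the third data block from the survivors and parity,
# as would be done after a single device failure.
updated = [stripe[0], new_d1, stripe[2], stripe[3]]
rebuilt_d2 = xor_blocks([updated[0], updated[1], updated[3], p_delta])
```

Both paths produce the same parity block. They differ only in which blocks must be read first, and that extra reading is the source of the small-write penalty.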
Additionally, a further enhancement to the several levels of RAID architecture is an algorithm known as write-anywhere. As noted above in the RAID 5 system, once the data striping is performed, that data stays in the same fixed, physical location on the disks. Thus, the parity information as well as the data is read from and written to the same place. In systems that employ the write-anywhere algorithm, when an update occurs, the new data is not immediately merged with the old data. The new data is cached and the system reads the unmodified data. The unmodified data and the new data are merged, the new parity is calculated and the new data and parity are written to new locations on the disks within the array group. The write-anywhere technique may reduce overhead associated with head seek and disk rotational latencies caused by having to wait for the head to get to the location of the data and parity stripes on the disks in the arrays. Although the write-anywhere technique may alleviate some of the efficiency overhead mentioned above, it is desirable to make further improvements to the system efficiency.
Another problem encountered with disk storage systems is that disk drives may occasionally corrupt data. The corruptions may occur for various reasons. For example, bugs in the disk drive controller's firmware may cause bits in a sector to be modified or may cause blocks to be written to the wrong address. Such bugs may cause storage drives to write the wrong data, write the correct data to the wrong place, or not write at all. Another source of errors may be a drive's write cache. Many disk drives employ write caches to quickly accept writes so that the host or array controller can continue with other commands. The data is later copied from the write cache to the disk media. However, write cache errors may cause some acknowledged writes to never reach the disk media. The end result of such bugs or errors is that the data at a given block may be corrupted or stale (e.g. not the current version). These types of errors may be "silent" because the drive may not realize that it has erred. If left undetected, such errors may have detrimental consequences, such as undetected long term data corruption. Depending on how long backup copies are kept, or if they are even kept at all, such undetected errors may not even be fixable via backup.
Conventional RAID organizations do not offer protection against such silent errors. Typical RAID systems may recover well from an "identifiable failure", such as a broken disk drive (e.g. a disk drive not responding to commands). However, typical RAID systems may not be able to easily or efficiently recover from silent disk drive errors. A RAID stripe's integrity could be checked upon each read or update to detect such errors. However, this option would generate a great many I/O operations. For example, if only a single block were read or updated, all blocks of the stripe including the parity block would have to be read, parity calculated, and then checked against the old parity. Also, if the stripe is incorrect (e.g. the XOR of all data blocks does not match the parity block), there is no way to know which block or blocks are wrong.
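The limits of a parity-only integrity check can be illustrated with a small example. The following is a minimal sketch, assuming XOR parity over in-memory blocks. The helper names and block contents are hypothetical. It shows that the same bit error in two different blocks produces an identical parity mismatch, so the check reveals that the stripe is wrong but not which block is wrong.

```python
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def stripe_syndrome(data_blocks: List[bytes], parity: bytes) -> bytes:
    """XOR of every block in the stripe, parity included.

    The result is all zeros when the stripe is consistent.
    """
    return xor_blocks(data_blocks + [parity])


def flip_bit(block: bytes, byte_index: int, mask: int) -> bytes:
    """Return a copy of the block with the masked bits inverted (a simulated silent error)."""
    damaged = bytearray(block)
    damaged[byte_index] ^= mask
    return bytes(damaged)


# Hypothetical stripe of three data blocks plus parity.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_blocks(data)

# Verifying even a single block requires reading the whole stripe.
blocks_read_for_check = len(data) + 1

# The same silent error, placed in two different blocks.
bad_first = [flip_bit(data[0], 0, 0x04), data[1], data[2]]
bad_last = [data[0], data[1], flip_bit(data[2], 0, 0x04)]

syndrome_clean = stripe_syndrome(data, parity)
syndrome_first = stripe_syndrome(bad_first, parity)
syndrome_last = stripe_syndrome(bad_last, parity)
```

Because `syndrome_first` and `syndrome_last` are identical, the parity check alone cannot distinguish the two cases.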
In a system in which a host computer interacts with a storage array via virtual addresses, each data block may have a virtual block address. When a data block is written to the storage system, a physical location may be chosen by the storage system at which the data block is stored within the storage system. An indirection map may be maintained which maps each virtual block address (used by the host system or file system) to a physical block address (e.g. the address of the actual location on a storage device of the storage array where a data block is stored). Data blocks may be organized within the storage system as stripes in which the blocks of a stripe are stored across multiple different storage devices of a storage array. A stripe may be a parity group in which multiple data blocks and a parity block for the data blocks are stored as a stripe across the storage devices. Dynamic striping may be employed so that new writes form new parity groups. Thus, stripes of various sizes may be supported by the storage system. For example, if a subset of data blocks of a current parity group is modified by a write transaction, instead of recalculating the parity for the current stripe and rewriting the modified data blocks and parity block of the current stripe, a new parity group is created of only the modified blocks and a new parity block may be calculated and stored for the new parity group.
An indirection map is maintained for mapping virtual addresses to physical addresses. The indirection map may also include a parity group pointer for each data block that points to a next member of that parity group, thus linking all the blocks of a particular stripe together. With dynamic striping, if a particular stripe is written and then later a part of that stripe is updated, there is no need to perform the partial stripe write and recalculation of parity that is so inefficient in conventional systems. Instead, the newly written blocks become part of a new stripe. The unmodified blocks in the original stripe and the newly modified blocks may later be coalesced into a new stripe having a default number of blocks. The recoalescing of different size stripes may be accomplished via pointer adjustment in the indirection map.
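The interaction between dynamic striping and the parity group pointers may be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described herein: devices are modeled as in-memory dictionaries, each write always places its blocks starting at the first device, coalescing and space reclamation are omitted, and the parity group pointers are kept in a separate table keyed by physical address rather than inside each indirection map entry. All class, method and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

PhysAddr = Tuple[int, int]  # (device index, block offset): hypothetical layout


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


class DynamicStripeStore:
    def __init__(self, num_devices: int) -> None:
        self.devices: List[Dict[int, bytes]] = [{} for _ in range(num_devices)]
        self.next_offset = [0] * num_devices
        self.vmap: Dict[int, PhysAddr] = {}              # virtual -> physical
        self.group_next: Dict[PhysAddr, PhysAddr] = {}   # parity group pointers

    def _store(self, device: int, data: bytes) -> PhysAddr:
        offset = self.next_offset[device]
        self.next_offset[device] += 1
        self.devices[device][offset] = data
        return (device, offset)

    def write(self, blocks: Dict[int, bytes]) -> None:
        """Store the given blocks and their parity as a NEW parity group.

        The old stripe is not read, rewritten, or re-paritied.
        """
        if not 0 < len(blocks) < len(self.devices):
            raise ValueError("need one device per block plus one for parity")
        placed = []
        for device, (vaddr, data) in enumerate(blocks.items()):
            addr = self._store(device, data)
            self.vmap[vaddr] = addr          # remap the virtual address
            placed.append(addr)
        parity_addr = self._store(len(blocks), xor_blocks(list(blocks.values())))
        ring = placed + [parity_addr]
        # Link each member to the next one, closing the chain into a ring.
        for here, there in zip(ring, ring[1:] + ring[:1]):
            self.group_next[here] = there

    def read(self, vaddr: int) -> bytes:
        device, offset = self.vmap[vaddr]
        return self.devices[device][offset]

    def group_of(self, vaddr: int) -> List[PhysAddr]:
        """Follow the parity group pointers to list every member of the stripe."""
        start = self.vmap[vaddr]
        members, addr = [start], self.group_next[start]
        while addr != start:
            members.append(addr)
            addr = self.group_next[addr]
        return members


store = DynamicStripeStore(num_devices=5)
store.write({0: b"AAAA", 1: b"BBBB", 2: b"CCCC", 3: b"DDDD"})  # four data blocks plus parity
store.write({1: b"bbbb", 2: b"cccc"})  # update forms a new, smaller parity group
```

After the second write, virtual addresses 1 and 2 resolve to a three-block parity group at new physical addresses, while virtual addresses 0 and 3 still belong to the original five-block group, whose superseded blocks remain in place so that its parity stays valid until the stripes are coalesced.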
Each block's entry in the indirection map may also include a checksum for that block. In some embodiments the checksum may be relatively small, e.g. only a few bytes. Thus, its inclusion in the indirection map does not significantly change the size of the map. Furthermore, no extra I/O operations are needed to read the checksum since checksum lookup may be combined with the physical address lookup. When a block is read, its indirection map entry is read to find the block's physical address and retrieve the block's checksum. If a block is written, its indirection map entry is updated to indicate the new physical address for the block and the new checksum is also written to the indirection map entry. Any mechanism used to cache and manage indirection map entries will also cache and manage the checksums.
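A map entry that carries both a physical address and a checksum may be sketched as follows. This is a minimal illustration under stated assumptions: CRC-32 stands in for the checksum, which is described above only as a few bytes, the storage media is modeled as an in-memory dictionary, and comparing the stored checksum against the data on every read is one assumed use of the retrieved checksum. All names are hypothetical.

```python
import zlib
from dataclasses import dataclass
from typing import Dict


class ChecksumMismatch(Exception):
    """Raised when the data read does not match the checksum in its map entry."""


@dataclass
class MapEntry:
    physical: int   # physical block address
    checksum: int   # CRC-32 here (an assumption); four bytes per entry


class ChecksummedStore:
    def __init__(self) -> None:
        self.media: Dict[int, bytes] = {}       # stands in for the disks
        self.imap: Dict[int, MapEntry] = {}     # indirection map
        self.next_physical = 0

    def write(self, vaddr: int, data: bytes) -> None:
        # Choose a new physical location, then record the new address
        # and the new checksum together in the same map entry.
        physical = self.next_physical
        self.next_physical += 1
        self.media[physical] = data
        self.imap[vaddr] = MapEntry(physical, zlib.crc32(data))

    def read(self, vaddr: int) -> bytes:
        # One map lookup yields both the physical address and the checksum,
        # so no extra I/O is needed to fetch the checksum.
        entry = self.imap[vaddr]
        data = self.media[entry.physical]
        if zlib.crc32(data) != entry.checksum:
            raise ChecksumMismatch(f"virtual block {vaddr} is corrupt or stale")
        return data


store = ChecksummedStore()
store.write(7, b"current data")
first_read = store.read(7)

# Simulate a silent error: the media holds stale contents,
# but the drive reports no failure.
store.media[store.imap[7].physical] = b"stale data!!"
try:
    store.read(7)
    silent_error_detected = False
except ChecksumMismatch:
    silent_error_detected = True
```

Because the checksum lives outside the block it describes, a block that is stale or written to the wrong place no longer matches its map entry, and the mismatch identifies the specific block rather than only the stripe.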
A storage system may include a plurality of storage devices each having a plurality of addressable locations for storing data. A storage controller may be coupled to the storage devices and configured to store and retrieve data from the storage devices. An indirection map may be stored within the system having a plurality of map entries each configured to map a virtual address to a physical address on the storage devices. Each map entry may also store a checksum for data stored at the physical address indicated by the map entry. The storage controller may receive storage requests specifying a virtual address and may access the indirection map for each storage request to obtain the corresponding physical address and checksum.
In one embodiment, a storage controller or array controller may be configured to store a stripe of data as a parity group across a number of the storage devices. The parity group includes a plurality of data blocks and a parity block calculated for the data blocks. The storage controller may receive a write transaction modifying a subset of the data blocks. The controller may calculate a new parity block for the subset of data blocks and store the modified subset of blocks and new parity block as a new parity group at new physical addresses striped across the storage devices. The controller also stores checksums for each block of the parity groups.
A method for storing data in a storage system may include storing a stripe of data across a plurality of storage devices. The data stripe includes a plurality of data blocks and a parity block calculated for the data blocks. The method may further include storing entries in an indirection map for each data stripe unit, and each entry may map a virtual address to a physical address for one of the data stripe units and store a checksum for that data stripe unit. The method may further include receiving a write transaction specifying the virtual addresses of a subset of the data blocks of a data stripe. A new parity block may be calculated for the subset of the data blocks, and the method may include storing only that subset of data blocks and the new parity block as a new parity group to new physical addresses striped across the storage devices. The method may also include updating the entries in the indirection map for the data blocks modified by the write transaction to indicate the new physical address and checksum for each modified data block.
A method for storing data in a storage system may include storing data stripe units across a plurality of storage devices and storing entries in an indirection map for each data stripe unit. Each indirection map entry maps a virtual address to a physical address and further stores a checksum for the stripe unit corresponding to that entry. A read request may be received specifying the virtual address of one of the stripe units and the indirection map entry corresponding to the virtual address may be accessed to obtain the physical address and corresponding checksum. In response to the read request the stripe unit at the physical address mapped to the virtual address indicated by the read request and the corresponding checksum may be returned.