1. Field of the Invention
This invention relates to computer data storage systems, and more particularly, to Redundant Array of Inexpensive Disks (RAID) systems and data striping techniques.
2. Description of the Related Art
A continuing desire exists in the computer industry to consistently improve the performance of computer systems over time. For the most part, this desire has been achieved for the processing or microprocessor components of computer systems. Microprocessor performance has steadily improved over the years. However, the performance of the microprocessor or processors in a computer system is only one component of the overall performance of the computer system. For example, the computer memory system must be able to keep up with the demands of the processor or the processor will become stalled waiting for data from the memory system. Generally computer memory systems have been able to keep up with processor performance through increased capacities, lower access times, new memory architectures, caching, interleaving and other techniques.
Another critical component of the overall performance of a computer system is the I/O system performance. For most applications, the performance of the mass storage system or disk storage system is the critical performance component of a computer's I/O system. For example, when an application requires access to more data or information than fits in its allocated system memory, the data may be paged in/out of disk storage to/from the system memory. Typically the computer system's operating system copies a certain number of pages from the disk storage system to main memory. When a program needs a page that is not in main memory, the operating system copies the required page into main memory and copies another page back to the disk system. Processing may be stalled while the program is waiting for the page to be copied. If storage system performance does not keep pace with performance gains in other components of a computer system, then delays in storage system accesses may overshadow performance gains elsewhere.
One method that has been employed to increase the capacity and performance of disk storage systems is to employ an array of storage devices. An example of such an array of storage devices is a Redundant Array of Independent (or Inexpensive) Disks (RAID). A RAID system improves storage performance by providing parallel data paths to read and write information over an array of disks. By reading and writing multiple disks simultaneously, the storage system performance may be greatly improved. For example, an array of four disks that can be read and written simultaneously may provide a data rate almost four times that of a single disk. However, using arrays of multiple disks comes with the disadvantage of increasing failure rates. In the example of a four disk array above, assuming the disks fail independently and at the same rate, the mean time between failures (MTBF) for the array will be approximately one-fourth that of a single disk. It is not uncommon for storage device arrays to include many more than four disks, shortening the mean time between failures from years to months or even weeks. RAID systems address this reliability issue by employing parity or redundancy so that data lost from a device failure may be recovered.
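By way of illustration only, the MTBF arithmetic above can be sketched as follows. The sketch assumes independent disks with identical, constant failure rates and no redundancy; the function name `array_mtbf_hours` and the 40,000 hour per-disk figure are hypothetical and are not taken from any particular device.

```python
def array_mtbf_hours(disk_mtbf_hours: float, num_disks: int) -> float:
    """Approximate MTBF of a disk array that has no redundancy.

    Assumes independent disks with identical, constant failure rates, so the
    array failure rate is the sum of the per-disk failure rates.
    """
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    return disk_mtbf_hours / num_disks


# Hypothetical per-disk MTBF, chosen only to make the arithmetic easy to follow.
DISK_MTBF_HOURS = 40_000.0

for n in (1, 4, 50):
    print(n, "disks:", array_mtbf_hours(DISK_MTBF_HOURS, n), "hours")
```

Under these assumed numbers, a four disk array has an MTBF of 10,000 hours and a fifty disk array has an MTBF of 800 hours, a little under five weeks.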
One common RAID technique or algorithm is referred to as RAID 0. RAID 0 is an example of a RAID algorithm used to improve performance by attempting to balance the storage system load over as many of the disks as possible. RAID 0 implements a striped disk array in which data is broken down into blocks and each block is written to a separate disk drive. Thus, this technique may be referred to as striping. Typically, I/O performance is improved by spreading the I/O load across multiple drives since blocks of data will not be concentrated on any one particular drive. However, a disadvantage of RAID 0 systems is that they do not provide for any data redundancy and are thus not fault tolerant.
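By way of illustration only, the striping described above may be sketched as a round-robin distribution of fixed-size blocks over the disks. The function name `stripe_blocks` and the round-robin placement order are assumptions made for this sketch; each disk is modeled as a simple list of blocks.

```python
from typing import List


def stripe_blocks(data: bytes, num_disks: int, block_size: int) -> List[List[bytes]]:
    """Split data into fixed-size blocks and place them round-robin on the disks."""
    if num_disks < 1 or block_size < 1:
        raise ValueError("num_disks and block_size must be positive")
    disks: List[List[bytes]] = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        # Consecutive blocks land on consecutive disks, wrapping around.
        disks[block_index % num_disks].append(data[offset:offset + block_size])
    return disks


layout = stripe_blocks(b"ABCDEFGH", num_disks=4, block_size=1)
for disk_number, blocks in enumerate(layout):
    print("disk", disk_number, blocks)
```

Because no parity or duplicate copy is stored in this layout, the loss of any one disk loses a block from every stripe, which is the lack of fault tolerance noted above.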
RAID 5 is an example of a RAID algorithm that provides some fault tolerance and load balancing. FIG. 1 illustrates a RAID 5 system, in which both data and parity information are striped across the storage device array. In a RAID 5 system, the parity information is computed over fixed size and fixed location stripes of data that span all the disks of the array. Together, each such stripe of data and its parity block form a fixed size, fixed location parity group. When a subset of the data blocks within a parity group is updated, the parity must also be updated. The parity may be updated in either of two ways: by reading the remaining unchanged data blocks and computing a new parity in conjunction with the new blocks, or by reading the old version of the changed data blocks, comparing them with the new data blocks, and applying the difference to the parity. However, in either case, the additional read and write operations can limit performance. This limitation is known as the small-write penalty. RAID 5 systems can withstand a single device failure by using the parity information to rebuild a failed disk.
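By way of illustration only, the two parity update methods and the rebuild of a lost block may be sketched as follows. The sketch assumes that parity is the bytewise exclusive-OR (XOR) of equal-length data blocks, as is conventional for RAID 5; the function names are hypothetical.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length blocks (assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_from_unchanged(unchanged: Sequence[bytes], new_blocks: Sequence[bytes]) -> bytes:
    """First method: read the unchanged blocks and recompute parity with the new blocks."""
    return xor_blocks(*unchanged, *new_blocks)


def parity_from_difference(old_parity: bytes, old_blocks: Sequence[bytes],
                           new_blocks: Sequence[bytes]) -> bytes:
    """Second method: apply the old/new difference of each changed block to the old parity."""
    parity = old_parity
    for old, new in zip(old_blocks, new_blocks):
        parity = xor_blocks(parity, old, new)
    return parity


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
old_parity = xor_blocks(*data)
new_block_1 = b"\xaa\xbb"  # replacement for data[1]

p_a = parity_from_unchanged([data[0], data[2]], [new_block_1])
p_b = parity_from_difference(old_parity, [data[1]], [new_block_1])
print(p_a == p_b)

# Rebuild after a single failure: XOR the parity with the surviving blocks.
print(xor_blocks(p_a, data[0], data[2]) == new_block_1)
```

Either method requires reading from disk before the new parity can be written, which is the source of the small-write penalty.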
Additionally, a further enhancement to the several levels of RAID architecture is an algorithm known as write-anywhere. As noted above in the RAID 5 system, once the data striping is performed, that data stays in the same fixed, physical location on the disks. Thus, the parity information as well as the data is read from and written to the same place. In systems that employ the write-anywhere algorithm, when an update occurs, the parity information is not computed immediately for the new data. The new data is cached and the system reads the unmodified data. The unmodified data and the new data are merged, the new parity is calculated and the new data and parity are written to new locations on the disks within the array group. One system that employs a write-anywhere algorithm is the Iceberg™ system from the Storage Technology Corporation. The write-anywhere technique reduces the overhead associated with head seek and disk rotational latencies caused by having to wait for the head to get to the location of the data and parity stripes on the disks in the arrays.
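By way of illustration only, the write-anywhere update sequence described above (cache the new data, read the unmodified data, merge, calculate the new parity, and write to new locations) may be sketched as follows. This is a simplified model and not a description of any particular product: the class `WriteAnywhereArray` is hypothetical, storage is modeled as a single flat, append-only address space rather than separate disks, and parity is assumed to be a bytewise XOR.

```python
from functools import reduce
from typing import Dict, List, Tuple


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length blocks (assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class WriteAnywhereArray:
    """Hypothetical model: every write goes to the next free location."""

    def __init__(self) -> None:
        self.segments: Dict[int, bytes] = {}  # physical location -> block
        self.next_free = 0
        # stripe id -> (data block locations, parity block location)
        self.stripe_locations: Dict[str, Tuple[List[int], int]] = {}

    def _append(self, block: bytes) -> int:
        location = self.next_free
        self.segments[location] = block
        self.next_free += 1
        return location

    def write_stripe(self, stripe_id: str, blocks: List[bytes]) -> None:
        data_locations = [self._append(block) for block in blocks]
        parity_location = self._append(xor_blocks(*blocks))
        self.stripe_locations[stripe_id] = (data_locations, parity_location)

    def update(self, stripe_id: str, cached_updates: Dict[int, bytes]) -> None:
        data_locations, _ = self.stripe_locations[stripe_id]
        # Read only the unmodified blocks and merge them with the cached new data.
        merged = [
            cached_updates[i] if i in cached_updates else self.segments[location]
            for i, location in enumerate(data_locations)
        ]
        # New parity is calculated; data and parity are written to new locations.
        self.write_stripe(stripe_id, merged)


array = WriteAnywhereArray()
array.write_stripe("s0", [b"\x01", b"\x02", b"\x04"])
array.update("s0", {1: b"\x08"})
print(array.stripe_locations["s0"])
```

Note that the unmodified blocks are still read and rewritten in this scheme, so an update to one block costs a full-stripe write.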
Although the write-anywhere technique reduces the overhead mentioned above, it is desirable to make further improvements to the system efficiency.
The problems outlined above may in large part be solved by a data storage subsystem including a storage disk array employing dynamic data striping.
In one embodiment, a data storage subsystem includes a plurality of storage devices configured in an array and a storage controller coupled to the storage devices. The storage controller is configured to store a first stripe of data as a plurality of data stripe units across the plurality of storage devices. The plurality of data stripe units includes a plurality of data blocks and a parity block which is calculated for the plurality of data blocks. The storage controller is further configured to store a second stripe of data as a plurality of data stripe units across the storage devices. The second plurality of data stripe units includes another plurality of data blocks, which is different in number than the first plurality of data blocks, and a second parity block calculated for the second plurality of data blocks. Furthermore, the second plurality of data blocks may be a modified subset of the first plurality of data blocks. The storage controller is also configured to store the second plurality of data blocks and the second parity block to new locations.
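By way of illustration only, the storing of stripes having differing numbers of data blocks may be sketched as follows. The class `DynamicStripeController`, the append-only device model, and the XOR parity function are assumptions made for this sketch; the placement policy shown (always starting at the first device) is a simplification and is not taken from the embodiment.

```python
from functools import reduce
from typing import List, Tuple


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length blocks (assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class DynamicStripeController:
    """Hypothetical controller that writes variable-width stripes to new locations."""

    def __init__(self, num_devices: int) -> None:
        # Each device is modeled as an append-only list of stripe units.
        self.devices: List[List[bytes]] = [[] for _ in range(num_devices)]

    def write_stripe(self, data_blocks: List[bytes]) -> List[Tuple[int, int]]:
        """Store the data blocks plus one parity block, one unit per device.

        Returns the (device, offset) location of each unit; parity is last.
        No existing block is read: parity covers only the blocks being written.
        """
        if not 1 <= len(data_blocks) < len(self.devices):
            raise ValueError("need at least one data block and room for parity")
        units = list(data_blocks) + [xor_blocks(*data_blocks)]
        locations = []
        for device_index, unit in enumerate(units):
            self.devices[device_index].append(unit)
            locations.append((device_index, len(self.devices[device_index]) - 1))
        return locations


controller = DynamicStripeController(num_devices=5)
first = controller.write_stripe([b"\x01", b"\x02", b"\x04", b"\x08"])
# Two of the four blocks are modified; they form a smaller stripe of their own.
second = controller.write_stripe([b"\x10", b"\x20"])
print(len(first), len(second))
```

In this sketch the second stripe holds two data blocks and its own parity block, and none of its units overwrite the locations used by the first stripe.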
In various additional embodiments, the storage controller may be configured to keep track of the storage locations and parity group membership. For example, a free segment bitmap may be maintained, which is a listing of the physical segments of the storage devices. The bitmap may include indications of whether the physical segments contain data or not and a pointer indicating where a disk head is currently located. Additionally, a block remapping table consisting of a hashed indirection table and a parity group table may be maintained. The block remapping table maps entries representing logical data blocks to physical segments. The table also maps the membership of the various segments to their respective parity groups.
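By way of illustration only, the free segment bitmap and the block remapping table may be sketched as the data structures below. All class, field, and method names are hypothetical. The sketch models a single device, uses a Python dictionary to stand in for the hashed indirection table, and assumes an allocation policy (the first free segment at or after the recorded head position) that is not specified above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FreeSegmentBitmap:
    in_use: List[bool]      # True if the physical segment contains data
    head_position: int = 0  # segment where the disk head is currently located

    def allocate(self) -> int:
        """Claim the first free segment at or after the head, wrapping around."""
        count = len(self.in_use)
        for step in range(count):
            segment = (self.head_position + step) % count
            if not self.in_use[segment]:
                self.in_use[segment] = True
                self.head_position = segment
                return segment
        raise RuntimeError("no free segments")


@dataclass
class BlockRemappingTable:
    # Stand-in for the hashed indirection table: logical block -> physical segment.
    indirection: Dict[int, int] = field(default_factory=dict)
    # Stand-in for the parity group table: parity group id -> member segments.
    parity_groups: Dict[int, List[int]] = field(default_factory=dict)

    def map_block(self, logical_block: int, segment: int, group_id: int) -> None:
        self.indirection[logical_block] = segment
        self.parity_groups.setdefault(group_id, []).append(segment)

    def lookup(self, logical_block: int) -> int:
        return self.indirection[logical_block]


bitmap = FreeSegmentBitmap(in_use=[True, False, True, False], head_position=2)
table = BlockRemappingTable()
table.map_block(logical_block=7, segment=bitmap.allocate(), group_id=0)
table.map_block(logical_block=8, segment=bitmap.allocate(), group_id=0)
print(table.lookup(7), table.lookup(8), table.parity_groups[0])
```

When a logical block is later rewritten, a further call to `map_block` points it at its new segment, while the old segment stays listed in its original parity group.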
In another embodiment, the storage controller is configured to realign parity groups by collecting the existing parity groups, which may be of different sizes, and forming new parity groups which are uniformly sized according to a default size. The storage controller calculates new parity blocks for each new parity group and subsequently stores both the new parity groups and the new parity blocks to new locations. Additionally, the storage controller may be further configured to maintain older versions of the existing parity groups.
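By way of illustration only, the realignment of parity groups may be sketched as follows. The function name `realign_parity_groups` is hypothetical, parity is assumed to be a bytewise XOR, and the input is assumed to hold only the data blocks of each existing group. Writing the results to new locations and retaining older versions of the groups are not modeled.

```python
from functools import reduce
from typing import List, Tuple


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length blocks (assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def realign_parity_groups(existing_groups: List[List[bytes]],
                          default_size: int) -> List[Tuple[List[bytes], bytes]]:
    """Regroup data blocks into groups of default_size and compute new parity.

    Returns a list of (data blocks, parity block) pairs. The final group may be
    smaller than default_size when the block count is not an exact multiple.
    """
    if default_size < 1:
        raise ValueError("default_size must be positive")
    # Collect the data blocks of all existing groups, whatever their sizes.
    collected = [block for group in existing_groups for block in group]
    realigned = []
    for start in range(0, len(collected), default_size):
        members = collected[start:start + default_size]
        realigned.append((members, xor_blocks(*members)))
    return realigned


existing = [[b"\x01", b"\x02", b"\x04"], [b"\x08"], [b"\x10", b"\x20"]]
for members, parity in realign_parity_groups(existing, default_size=2):
    print(members, parity)
```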
The data storage subsystem may advantageously improve overall storage system efficiency by calculating a new parity block for the new data and writing just the new data and new parity block to new locations, thereby eliminating the need to read existing data blocks in a parity group prior to modifying any data blocks in the parity group.