1. Field of the Invention
The present invention relates generally to block service storage disk arrays, and more particularly to systems and methods for promoting the reliability and performance of block service computer storage disk arrays.
2. Description of the Related Art
Disk array systems such as redundant arrays of independent disks (RAID) are used to reliably store data by essentially spreading the data over plural disk drives operating in concert. When the below-mentioned technique known as “parity” is used among the data stored on the disks, one disk drive can malfunction but the data temporarily lost thereby can nevertheless be recovered.
The following discussion, particularly related to a first embodiment of the present invention, illuminates how disk arrays promote reliability and data recoverability. When data is written to the array, it is not written to a single drive in the array. Instead, the data is “striped” across the array. During manufacturing, each drive is divided (in logic) into sequentially numbered blocks, and when the drives are configured into an array, blocks of the drives having the same logical numbers comprise a “stripe”. A mathematical operation referred to as “XOR” is performed on the blocks of a stripe to yield a parity strip. Should one of the drives subsequently malfunction, each lost strip of data can be recovered by executing an XOR operation on the remaining data blocks of its stripe, along with the parity strip that had been derived from the stripe, to thereby recover the lost data. In addition to the above consideration of reliability, striping data across the drives of a disk drive array can enhance performance by promoting efficient and rapid data access.
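By way of non-limiting illustration, the parity and recovery operations described above can be sketched as follows. The helper name `xor_strips`, the four-byte strip size, and the sample contents are assumptions made solely for this example.

```python
from functools import reduce


def xor_strips(strips):
    """Bytewise XOR of any number of equal-length strips."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


# Three data strips of one stripe, one per (hypothetical) data drive.
data = [b"\x0f\xf0\xaa\x55", b"\x01\x02\x03\x04", b"\xff\x00\xff\x00"]

# The parity strip is the XOR of all data strips in the stripe.
parity = xor_strips(data)

# Simulate losing the drive that held data[1], then rebuild its strip
# by XOR-ing the surviving data strips with the parity strip.
survivors = [data[0], data[2], parity]
recovered = xor_strips(survivors)
assert recovered == data[1]
```

Because XOR is its own inverse, the same routine serves both to generate the parity strip and to reconstruct any single missing strip.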
As recognized by the present invention, prior art array systems address reliability concerns either by requiring external user applications to act to ensure reliability, or by entering changes to data in special-purpose, high-performance persistent storage, in addition to physically making the changes. Requiring user applications to undertake the reliability function is onerous on the applications, while entering (in persistent storage) all data to be written, as well as physically writing the data to disk, is duplicative. In other words, as recognized herein, maintaining duplicate records of the data, one logically and one physically, requires the presence of persistent storage, and can degrade performance.
Database systems that store data on disks address the reliability issue by inserting flags in the data as it is stored. This is possible for database systems to do, because database systems typically format the data to be stored in accordance with their own internal formatting protocol. In the event of a subsequent malfunction, the flags can be used to ensure internal data consistency and integrity.
On the other hand, in the case of a block service, to which the present invention is directed, it is impractical to insert such flags in the data. This is because a block service typically does not reformat data received from, e.g., an operating system for storage. Rather, a block service stores the data as received from the operating system, which generally assumes that the block service will store data in 512-byte sectors. Consequently, were database-like flags to be used by a block service, an entire new sector would be required to store the flags, resulting in wasted space and degraded performance attributable to increased input/output (I/O) operations. Fortunately, the present invention recognizes that it is possible to minimize recording data to improve performance while ensuring data recoverability in the event that one drive of an array malfunctions in a block service device.
With further respect to current RAID systems as considered by a second embodiment of the invention, so-called RAID-1 storage is designed to efficiently execute small writes to the storage medium, whereas so-called RAID-5 storage is designed with reliability and efficient execution of large reads and writes in mind. In RAID-1 storage, also referred to as “mirror set” storage, two identical copies of data are maintained on a disk array, whereas in RAID-5 storage, also referred to as “stripe set with parity”, data is striped across the disks of an array as described above.
The present invention recognizes, however, that it is not sufficient or trivial to simply combine RAID-1 principles with RAID-5 principles in a single system, without also accounting for heretofore unrecognized hurdles in doing so. For example, in the “AutoRAID” system marketed by Hewlett-Packard, elements of RAID-1 storage are combined with elements of RAID-5 storage, but because writes to the RAID-5 storage are undertaken using log-structured write principles to promote efficiency, the writes are always relatively large and are always appended to the end of a log. Unfortunately, as recognized herein, this requires significant post-processing (colloquially referred to as “garbage collection”) and can also destroy the data layout semantics, resulting in degraded performance during subsequent reads. The present invention understands these drawbacks and provides the solutions below.
The invention is a general purpose computer programmed according to the inventive steps herein to update a block service disk array with new data, reliably and with high performance. “Reliability” includes fault tolerance. The invention can also be embodied as an article of manufacture (a machine component) that is used by a digital processing apparatus and which tangibly embodies a program of instructions that are executable by the digital processing apparatus to execute the present logic. This invention is realized in a critical machine component that causes a digital processing apparatus to perform the inventive method steps herein.
Accordingly, a general purpose computer includes at least one memory and at least one computer usable medium that has computer usable code means for storing data on a data storage device having an old data set stored thereon. As disclosed further below, the computer usable code means includes computer readable code means for receiving an update of at least a portion of the old data set. Also, computer readable code means modify, in memory, the old data set using the update, to render a modification. Moreover, computer readable code means write at least a commit record of the modification to a log, and computer readable code means write at least a portion of the modification to the data storage device.
In a first preferred embodiment, the data storage device is used as a block service, and it includes at least one disk array on which data is stored in strides. The strides establish respective data sets, with each stride defining plural strips. Also, the portion of the old data is at least one old strip and the update is at least one new strip, and computer readable code means generate at least one delta parity strip using the old strip and an old parity strip. Furthermore, computer readable code means generate a new parity strip using the delta parity strip and the modification.
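As a minimal sketch of the parity computation of this embodiment, and assuming that XOR is the operation used to combine strips, the delta parity strip and the new parity strip might be generated as shown below. The function names and the two-byte strips are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length strips."""
    return bytes(x ^ y for x, y in zip(a, b))


def update_parity(old_strip: bytes, old_parity: bytes, new_strip: bytes):
    """Return (delta_parity, new_parity) for a single-strip update."""
    # Delta parity: the old parity with the old strip's contribution removed.
    delta_parity = xor_bytes(old_strip, old_parity)
    # New parity: fold the new strip (the modification) into the delta parity.
    new_parity = xor_bytes(delta_parity, new_strip)
    return delta_parity, new_parity


# A stride of three data strips and its parity strip.
strips = [b"\x10\x20", b"\x0a\x0b", b"\xf0\x0f"]
old_parity = xor_bytes(xor_bytes(strips[0], strips[1]), strips[2])

new_strip = b"\x55\xaa"  # replaces strips[1]
_, new_parity = update_parity(strips[1], old_parity, new_strip)

# Cross-check against recomputing parity over the whole updated stride.
full = xor_bytes(xor_bytes(strips[0], new_strip), strips[2])
assert new_parity == full
```

Under this assumption only the old strip and the old parity strip need be read; the unchanged strips of the stride are not touched.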
In one implementation of the first preferred embodiment, the means for writing the commit record to the log also writes the modification and the new parity strip to the log. Further, the new parity strip and modification are written to the data storage device, with the modification being written to the physical location of the old data set. The parity strips and the modification can be discarded from memory after the parity strips and the modification have been written to the data storage device.
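The ordering of this implementation (log first, then write in place) can be illustrated with the following toy model. It is a sketch under stated assumptions: an in-memory list stands in for the log, a dictionary stands in for the disk array, and the `replay` step and the `crash_before_disk_write` flag are hypothetical additions used only to show why the logged copy is useful.

```python
class InPlaceUpdater:
    """Toy model: a list and a dict stand in for the log and the disk array."""

    def __init__(self):
        self.log = []    # hypothetical persistent log, oldest record first
        self.disk = {}   # physical address -> strip contents

    def update(self, data_addr, parity_addr, new_strip, new_parity,
               crash_before_disk_write=False):
        # 1. Write ahead: commit record plus the modification and new parity.
        self.log.append({
            "commit": True,
            "writes": {data_addr: new_strip, parity_addr: new_parity},
        })
        if crash_before_disk_write:
            return  # simulate a failure after logging but before the disk write
        # 2. Write in place, over the physical location of the old data.
        self.disk[data_addr] = new_strip
        self.disk[parity_addr] = new_parity

    def replay(self):
        """Reapply committed log records (an assumed recovery step)."""
        for record in self.log:
            if record["commit"]:
                self.disk.update(record["writes"])


array = InPlaceUpdater()
array.disk = {"d1": b"old", "p": b"oldp"}
array.update("d1", "p", b"new", b"newp", crash_before_disk_write=True)
assert array.disk["d1"] == b"old"   # the disk was not reached before the crash
array.replay()
assert array.disk == {"d1": b"new", "p": b"newp"}
```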
In a second implementation of the first preferred embodiment, the modification is written to a new physical location on the data storage device that is different from the physical location of the old data set. As intended herein, the new physical location is determined using a stride mapping table. In this implementation, the portion of the old data includes plural old strips of a stride, the update is established by plural new strips, and the computer further includes computer readable code means for generating at least one new parity strip using the new strips. The address of the new physical location and the address of the physical location of the old data set are written to the log, without writing the modification and the new parity strip to the log. The entries for the old and new locations in the stride mapping table are exchanged for each other. If desired, the addresses of the physical locations can be discarded from the log after the modification has been written to the data storage device.
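One possible sketch of this second implementation follows. The table layout (named entries mapping to physical stride addresses, with one entry designated as a spare), the record format, and the order of the disk write relative to the log write are all assumptions for illustration, as is the use of XOR for parity.

```python
from functools import reduce


def parity_of(strips):
    """XOR parity over the strips of one stride."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))


class StrideRemapper:
    """Toy stride mapping table: table entry -> physical stride address."""

    def __init__(self, stride_map):
        self.stride_map = dict(stride_map)
        self.disk = {}   # physical stride address -> (strips, parity)
        self.log = []    # holds addresses only, never strip contents

    def write_stride(self, entry, spare_entry, new_strips):
        old_addr = self.stride_map[entry]
        new_addr = self.stride_map[spare_entry]
        # New parity is generated from the new strips alone.
        new_parity = parity_of(new_strips)
        # The modification goes to the new physical location.
        self.disk[new_addr] = (list(new_strips), new_parity)
        # Only the old and new addresses are logged.
        self.log.append({"commit": True, "old": old_addr, "new": new_addr})
        # Exchange the two table entries for each other.
        self.stride_map[entry], self.stride_map[spare_entry] = new_addr, old_addr
        return new_addr


remap = StrideRemapper({"stride0": 100, "spare": 200})
remap.disk[100] = ([b"\x01", b"\x02"], b"\x03")
remap.write_stride("stride0", "spare", [b"\x04", b"\x08"])
assert remap.stride_map == {"stride0": 200, "spare": 100}
```

Because the old stride is left intact at its original address until the table entries are exchanged, the log need carry only the two addresses rather than the data itself.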
In another aspect, for a block service disk array across which data is arranged in strides, with each stride defining a respective strip on a respective disk of the array, a computer-implemented method includes logically writing all stride changes while physically writing ahead to a log only a subset of the changes.
In still another aspect, a computer program device includes a computer program storage device that is readable by a digital processing apparatus. A program means is on the program storage device, and the program includes instructions that can be executed by the digital processing apparatus for performing method acts for storing data on a data storage device. The method acts embodied in the program include receiving an update of at least a portion of an existing stride of data stored on a block service disk array. Also, the method acts embodied by the program include generating a parity data element based at least in part on the update, and determining whether to write just the update to disk or to write a modified version of the entire stride to disk. Still further, the method acts include, if the modified version of the entire stride is to be written to disk, determining a new location to which the modified version of the stride is to be written, it being understood that the new location is different from an old location at which the existing (unmodified) stride is stored. A commit record of the modification is written to a log along with at least the new location, when the modified version of the entire stride is to be or has been written to disk, and otherwise a commit record of the modification is written to a log along with at least the update, when just the update is to be written to disk.
With particular regard to a second embodiment of the present invention, a data storage system includes at least one disk array, at least one RAID-5 area on the disk array for holding data, and at least one RAID-1 area on the disk array. The RAID-5 area defines home locations for data blocks and the RAID-1 area defines temporary locations for data blocks. At least one map is in memory to correlate data blocks having home locations to temporary locations.
Preferably, the system also includes logic means for receiving an in-line write request to write first blocks to disk, and logic means for determining whether prior versions of the first blocks are in temporary locations in the RAID-1 area. If so, the temporary locations are overwritten with the first blocks. Otherwise, it is determined whether sufficient storage space exists in the RAID-1 area to hold the first blocks. If there is, the first blocks are written to the RAID-1 area, and otherwise are written to the RAID-5 area. The temporary locations to which the first blocks are written are recorded in an in-memory map. Also, logic means append map information to a log on the disk array in response to the updating. The preferred map is a hash table.
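The write routing just described might be sketched as follows, with a Python dictionary serving as the hash table. This is an illustrative model only: the class name, the per-block granularity, the slot allocator, and the return strings are assumptions, and dictionaries stand in for the RAID-1 and RAID-5 areas.

```python
class HybridStore:
    """Toy model of a RAID-5 home area with a RAID-1 temporary area."""

    def __init__(self, raid1_capacity):
        self.raid1 = {}                 # temporary slot -> block data
        self.raid5 = {}                 # home location -> block data
        self.temp_map = {}              # hash table: home location -> slot
        self.free_slots = list(range(raid1_capacity))
        self.log = []                   # map updates appended here

    def write(self, home, data):
        slot = self.temp_map.get(home)
        if slot is not None:
            # A prior version is already in the RAID-1 area: overwrite it.
            self.raid1[slot] = data
            return "raid1-overwrite"
        if self.free_slots:
            # Space remains in the RAID-1 area: use a new temporary location.
            slot = self.free_slots.pop(0)
            self.raid1[slot] = data
            self.temp_map[home] = slot
            self.log.append((home, slot))   # record the map update in the log
            return "raid1-new"
        # Otherwise write straight to the home location in the RAID-5 area.
        self.raid5[home] = data
        return "raid5"


store = HybridStore(raid1_capacity=1)
assert store.write(7, b"a") == "raid1-new"
assert store.write(7, b"b") == "raid1-overwrite"
assert store.write(9, b"c") == "raid5"
```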
When the system is idle, logic means move blocks in the RAID-1 area to their home locations in the RAID-5 area. Further, logic means checkpoint the log in response to the means for moving, and logic means retrieve home locations and temporary locations from the log between the end of the log and the latest checkpoint after a controller crash. Block mappings to RAID-1 temporary locations are inserted into a reconstituted map in response to the means for retrieving.
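A hedged sketch of the idle-time migration, checkpointing, and post-crash map reconstitution follows. The record shapes `("map", home, slot)` and `("checkpoint",)` and both function names are hypothetical; the source does not specify a log format.

```python
def migrate_when_idle(temp_map, raid1, raid5, log):
    """Move every block in the RAID-1 area home, then checkpoint the log."""
    for home, slot in list(temp_map.items()):
        raid5[home] = raid1.pop(slot)   # block returns to its home location
        del temp_map[home]
    log.append(("checkpoint",))


def rebuild_map(log):
    """Reconstitute the in-memory map from records after the latest checkpoint."""
    start = 0
    for i, record in enumerate(log):
        if record[0] == "checkpoint":
            start = i + 1
    rebuilt = {}
    for _, home, slot in log[start:]:
        rebuilt[home] = slot            # later records win for the same block
    return rebuilt


log = [("map", 7, 0)]
temp_map, raid1, raid5 = {7: 0}, {0: b"a"}, {}
migrate_when_idle(temp_map, raid1, raid5, log)

# A write after the checkpoint lands in a temporary location again.
log.append(("map", 9, 0))

# After a simulated controller crash, only post-checkpoint mappings are restored.
assert rebuild_map(log) == {9: 0}
```

Checkpointing after migration bounds the portion of the log that must be scanned on recovery, since mappings recorded before the checkpoint no longer refer to live temporary locations.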
In another aspect, a computer-implemented method for storing data includes receiving a request for a write to disk of first blocks, and, when previous versions of the blocks are in a temporary storage area on disk, overwriting the previous versions in response to the receiving act. Otherwise, the method determines whether sufficient storage space exists in the temporary area to hold the first blocks, and if so, the first blocks are written to the temporary area. If not, the blocks are written to a home area on disk.
In yet another aspect, a computer program product includes a computer program storage device including computer instructions to cause a computer to undertake method acts for storing data. The method acts embodied by the instructions include writing first blocks to a RAID-1 area on a disk array when sufficient storage space exists in the RAID-1 area or when previous versions of the blocks are present in the RAID-1 area, and otherwise writing the first blocks to a RAID-5 area on the disk. Blocks in the RAID-1 area are periodically moved to home locations in the RAID-5 area.
The details of the present invention, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which: