The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for data integrity and acceleration in compressed storage environments in combination with software defined native RAID.
RAID (redundant array of independent disks) is a data storage virtualization technology that combines multiple disk drive components into a single logical unit for the purposes of data redundancy or performance improvement. RAID systems distribute data across the drives in one of several ways, referred to as RAID levels, depending on the specific level of redundancy and performance required.
A number of standard schemes, called levels, have evolved. Originally there were five RAID levels, but many variations have since emerged, notably several nested levels and many non-standard levels. RAID levels and their associated data formats are standardized by the Storage Networking Industry Association (SNIA) in the Common RAID Disk Drive Format (DDF) standard:
RAID 0 consists of striping, without mirroring or parity. The capacity of a RAID 0 volume is the sum of the capacities of the disks in the set, the same as with a spanned volume. There is no added redundancy for handling disk failures, just as with a spanned volume. Thus, failure of one disk causes the loss of the entire RAID 0 volume, with reduced possibilities of data recovery when compared to a broken spanned volume. Striping distributes the contents of files roughly equally among all disks in the set, which makes concurrent read or write operations on the multiple disks almost inevitable and results in performance improvements. The concurrent operations make the throughput of most read and write operations equal to the throughput of one disk multiplied by the number of disks. Increased throughput is the main benefit of RAID 0 over a spanned volume.
RAID 1 consists of data mirroring, without parity or striping. Data is written identically to two or more drives, thereby producing a “mirrored set” of drives. Thus, any read request can be serviced by any drive in the set. If a request is broadcast to every drive in the set, it can be serviced by the drive that accesses the data first (depending on seek time and rotational latency), improving performance. Sustained read throughput, if the controller or software is optimized for it, approaches the sum of throughputs of every drive in the set, just as for RAID 0. Actual read throughput of most RAID 1 implementations is slower than the fastest drive. Write throughput is always slower because every drive must be updated, and the slowest drive limits the write performance. The array continues to operate as long as at least one drive is functioning.
RAID 5 consists of block-level striping with distributed parity. RAID 5 requires that all drives but one be present to operate. Upon failure of a single drive, subsequent reads can be calculated from the distributed parity such that no data are lost. RAID 5 requires at least three disks.
RAID 6 consists of block-level striping with double distributed parity. Double parity provides fault tolerance up to two failed drives. This makes larger RAID groups more practical, especially for high-availability systems, as large-capacity drives take longer to restore. RAID 6 requires a minimum of four disks. As with RAID 5, a single drive failure results in reduced performance of the entire array until the failed drive has been replaced. With a RAID 6 array, using drives from multiple sources and manufacturers, it is possible to mitigate most of the problems associated with RAID 5. The larger the drive capacities and the larger the array size, the more important it becomes to choose RAID 6 instead of RAID 5.
RAID 1+0, also referred to as RAID 10, creates a striped set from a series of mirrored drives. The array can sustain multiple drive losses so long as no mirror loses all its drives.
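The RAID levels described above can be illustrated with a minimal Python sketch. It is not drawn from any particular RAID implementation: the function names (`raid0_locate`, `xor_blocks`, `raid5_reconstruct`, `raid10_survives`) are hypothetical, blocks are modeled as equal-length byte strings, and the RAID 5 parity is assumed to be a simple byte-wise XOR across the data blocks of one stripe.

```python
from functools import reduce


def raid0_locate(logical_block: int, num_disks: int) -> tuple[int, int]:
    """Map a logical block number to (disk index, block offset on that disk).

    Assumes simple round-robin striping with one block per stripe unit.
    """
    return logical_block % num_disks, logical_block // num_disks


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks (used here as RAID 5 parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def raid5_reconstruct(surviving_blocks: list[bytes]) -> bytes:
    """Rebuild the single missing block of a stripe.

    XOR-ing all surviving blocks (data and parity) yields the lost block,
    because every byte of the stripe XORs to zero when parity is included.
    """
    return xor_blocks(surviving_blocks)


def raid10_survives(mirror_sets: list[tuple[int, ...]], failed: set[int]) -> bool:
    """Return True while every mirror set still has at least one working drive."""
    return all(any(d not in failed for d in mirror) for mirror in mirror_sets)


# One stripe with three data blocks and one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second data block and rebuilding it from the rest.
rebuilt = raid5_reconstruct([data[0], data[2], parity])
```

In this sketch, `rebuilt` equals the lost block `b"BBBB"`, and `raid10_survives([(0, 1), (2, 3)], {0, 2})` is true because each mirror keeps one working drive, whereas losing both drives 0 and 1 would fail the array.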
Software RAID implementations are now provided by many operating systems. Software RAID can be implemented as a layer that abstracts multiple devices to provide a single virtual device, as a more generic logical volume manager, as a component of the file system, or as a layer that sits above any file system and provides parity protection to user data. Some advanced file systems are designed to organize data across multiple storage devices directly, without needing the help of a third-party logical volume manager. The General Parallel File System (GPFS), initially developed by IBM for media streaming and scalable analytics, supports de-clustered RAID protection schemes up to n+3. A distinguishing feature is its dynamic rebuild priority: rebuilding runs in the background with low impact until a data chunk drops to n+0 redundancy, at which point that chunk is quickly rebuilt to at least n+1. In addition, GPFS supports metro-distance RAID 1.
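The dynamic rebuild priority described above can be sketched as a simple ordering rule. This is only an illustration of the idea, not GPFS code: the function `plan_rebuild`, the `remaining` mapping of chunk identifiers to their remaining redundancy, and the "urgent" and "background" labels are all assumptions introduced for the example.

```python
def plan_rebuild(remaining: dict[str, int], target: int) -> list[tuple[str, str]]:
    """Order degraded chunks for rebuild, least-protected first.

    remaining maps a chunk id to its current redundancy (0 means n+0).
    target is the configured redundancy (for example 3 for n+3).
    Chunks at n+0 are marked "urgent"; other degraded chunks are
    rebuilt in the "background".
    """
    plan = []
    for chunk_id, level in sorted(remaining.items(), key=lambda kv: kv[1]):
        if level >= target:
            continue  # fully protected, nothing to rebuild
        priority = "urgent" if level == 0 else "background"
        plan.append((chunk_id, priority))
    return plan


# Hypothetical state of four chunks under an n+3 scheme.
plan = plan_rebuild({"a": 3, "b": 0, "c": 2, "d": 1}, target=3)
```

Here chunk `b`, which has no redundancy left, is placed first and marked urgent, while `d` and `c` follow as background work and the fully protected chunk `a` is skipped.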
Data optimization for primary storage is a key initiative for data center managers today. Data center managers are looking for ways to improve storage utilization as well as trying to reduce one of the largest line items in the Information Technology (IT) budget: the cost to maintain a storage environment. Optimizing data on the primary storage tier also has a ripple effect, as cost savings then permeate throughout the data lifecycle. While deduplication captures most of the headlines, it is not the sole option to be considered. An alternative or even potential complement to deduplication is real-time compression.
Real-time compression is an in-line storage optimization technology, often implemented on an appliance that is deployed into the storage environment. Logically, the appliance sits in front of the storage, so that all data written to or read from the storage passes through the real-time compression technology.
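The in-line placement described above can be pictured with a minimal Python sketch. It assumes a hypothetical `InlineCompressor` class standing in for the appliance, an in-memory dictionary standing in for the storage, and zlib as the compression algorithm; none of these choices come from the present application.

```python
import zlib


class InlineCompressor:
    """Hypothetical stand-in for an appliance placed in front of storage."""

    def __init__(self, backend=None):
        # The backend is a plain dict here; a real system would use a storage device.
        self.backend = {} if backend is None else backend

    def write(self, key: str, data: bytes) -> None:
        """Compress data on its way into the storage."""
        self.backend[key] = zlib.compress(data)

    def read(self, key: str) -> bytes:
        """Decompress data on its way out of the storage."""
        return zlib.decompress(self.backend[key])

    def stored_size(self, key: str) -> int:
        """Number of bytes actually held by the backend for this key."""
        return len(self.backend[key])


appliance = InlineCompressor()
payload = b"abc" * 1000
appliance.write("volume0/block0", payload)
restored = appliance.read("volume0/block0")
```

The caller sees the original bytes on every read, while the backend holds only the compressed form, which for this repetitive payload is much smaller than the 3000 bytes written.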