1. Field
The present invention generally relates to storage technology and more particularly to a transaction-based storage system and method for managing file and block data, which uses variable sized objects to store data.
2. Description of Related Information
Historically, computer storage has followed an approach as shown generally in FIG. 1. Physically, a computer 10 contains a disk controller 20—a piece of hardware which provides an electrical connection to a disk. Normally, the disk controller 20 is a chip or card in the system. The controller is electrically connected to one or more disk drives 30 which are used to store and retrieve data.
RAID (redundant array of independent disks) is a way of storing the same data in different places (thus, redundantly) on multiple disks. By placing data on multiple disks, I/O operations can overlap in a balanced way, improving performance. Since multiple disks increase the mean time between failures (MTBF), storing data redundantly also increases fault tolerance. A RAID appears to the operating system of the computer to be a single logical hard disk. As discussed below in greater detail, RAID employs the technique of striping, which involves partitioning each drive's storage space into units whose size depends on the implementation.
The stripes of all the disks are typically interleaved and addressed in order. Some important abstractions are associated with RAID. (These functions are sometimes implemented in hardware in the controllers, in software in the volume managers, or in out-of-the-box devices which present themselves to the disk controller as very large disks.) The following discussion covers some of the more relevant types of RAID.
RAID 0 is actually a fairly old technique. It was originally known as striping. It operates by taking several identical disks and remapping the logical disk addresses such that sequential transfers follow this pattern: on the first disk, read all sectors from a cylinder (track by track). Next, read all sectors from the corresponding cylinder on the second disk. Repeat this until all disks are visited. (This is called a stripe.) Then seek to the next cylinder on the first disk and repeat. (The actual definition of stripe varies in detail from implementation to implementation. However, the key point is that a stripe contains data components which, when written or read, involve all data disks.)
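The address remapping described above can be sketched as follows. This is only an illustration, assuming a fixed stripe unit measured in sectors as a stand-in for the cylinder; the function name and its parameters are hypothetical and are not taken from any particular implementation.

```python
def raid0_map(logical_sector, num_disks, unit_sectors):
    """Map a logical sector to (disk index, sector on that disk).

    Assumes identical disks and a fixed stripe unit of `unit_sectors`
    sectors (a stand-in for the "cylinder" in the description above).
    """
    # Which stripe unit the sector falls in, and where inside that unit.
    unit_index, offset_in_unit = divmod(logical_sector, unit_sectors)
    # Units are dealt out to the disks in turn; one full turn is a stripe.
    stripe_index, disk = divmod(unit_index, num_disks)
    return disk, stripe_index * unit_sectors + offset_in_unit
```

With 4 disks and 8-sector units, logical sectors 0 to 7 land on disk 0, sectors 8 to 15 on disk 1, and sector 32 wraps back to disk 0 at the start of the second stripe.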
RAID 1 was originally known as mirroring. In this technique, two (or more) identical disks are kept as exact duplicates. Read operations can be dispatched to any available disk. This makes read operations run faster when there are enough outstanding requests to keep all of the disks busy. Write operations must write on all disks which makes write operations somewhat slower than the single disk scenario. However, most modern disk subsystems have enough buffering to minimize this penalty. Sequential reads are really no faster than a single disk. Sequential writes have analogous overhead since all disks must be updated at once.
RAID 4 is a technique applied to arrays with 3 or more identical disks. One disk is designated the parity disk and the remainder are data disks. In essence, the data disks are arranged in a RAID 0 configuration. As a result, read operations have similar performance characteristics as a RAID 0 configuration with n−1 disks. However, the parity disk contains redundant information—information which is “extra” and allows the contents of one of the other drives to be deduced in case of failure. Updating the data disks requires updating the parity disk so that at any time any one disk can be lost and have the RAID 4 continue to operate (at a degraded level) without loss of data.
Parity is a binary operation calculated through the use of XOR operations. In essence it is a count of whether the total number of ‘1’ bits is even or odd. In the case of RAID 4, the parity is calculated across the disks. For example, the parity disk's sector 0 is the parity calculated from the data disks' sector 0. The parity is calculated by taking the first bit in sector 0 on each data disk, XORing the bits together. The result is the first bit in the parity disk's sector 0. This process is repeated for each bit in the sector. A 512 byte sector contains 4096 bits which could consume quite a bit of time. However, modern 64-bit CPUs can typically perform the calculation on 64 bits at a time reducing the effort to perform the parity calculations dramatically. FIG. 2 is a chart showing representative CPU clock counts for parity calculations for various widths of RAID 4 implementations using a Pentium III (and not well optimized code).
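The parity calculation can be sketched as below. This is a minimal illustration, not an optimized routine: the function name is hypothetical, and Python's arbitrary-width integers are used here as a stand-in for the wide machine words mentioned above.

```python
def parity_sector(data_sectors):
    """Return the XOR parity of equally sized sectors (bytes objects)."""
    size = len(data_sectors[0])
    if any(len(sector) != size for sector in data_sectors):
        raise ValueError("all sectors must be the same size")
    acc = 0
    for sector in data_sectors:
        # Treat each sector as one large integer so that a single XOR
        # covers every bit position of the sector at once.
        acc ^= int.from_bytes(sector, "big")
    return acc.to_bytes(size, "big")
```

Each bit of the result is 1 exactly when an odd number of the data sectors have a 1 in that position.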
If a disk drive in a RAID 4 fails for any reason, the parity information makes it possible to calculate the contents of the failed disk. For example, assume that the host wishes to access a particular sector in the array which happens to map to a drive which has failed. The RAID 4 subsystem would instead read the corresponding sectors in all of the other disks and calculate the parity of these sectors. The result of the parity calculation is the original contents of the data in the failed disk. This technique can be used either online—to allow the RAID 4 to continue to operate in the face of a failure or offline—to rebuild the contents of the lost disk into a fresh new disk installed into the array. (Most arrays can continue to operate online but some must go offline to rebuild a new disk once it is available.)
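The recovery step can be illustrated with a short sketch. The helper names are hypothetical; the only assumption is that the surviving sectors (data and parity) for the same sector number are available as byte strings of equal length.

```python
from functools import reduce


def xor_sectors(sectors):
    """XOR equally sized byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*sectors))


def rebuild_missing(surviving_sectors):
    """Recompute the sector of the failed disk.

    `surviving_sectors` holds the corresponding sector from every
    remaining disk, including the parity disk if it survived.
    """
    return xor_sectors(surviving_sectors)
```

For example, if the parity was computed over three data sectors and the second data disk fails, passing the first and third data sectors together with the parity sector returns the lost sector.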
Some advantages of RAID 4 include:
Reliability: RAID 4 can survive the complete failure of any one of its component disks.
Space Efficiency: RAID 4 consumes only 1/n of the storage for redundant storage, which is less than mirroring. Common implementations will set n to values in the 3 to 8 range, so the corresponding savings in space can be large and the cost savings important.
Expandability: RAID 4 arrays can be expanded the same way RAID 0s can be expanded. In fact, if the new disk is already initialized to all 0's, it can be inserted without revisiting the parity information.
Sequential Read Performance: RAID 4 can provide sequential bandwidth proportional to n−1 times the throughput of a single disk. For some classes of applications (such as streaming media) this can be extremely valuable.
Some disadvantages of RAID 4 include:
Slow Writes: The RAID write bottleneck is a huge problem for most environments. A RAID 4 can process on the order of ½ the number of small write operations per unit time as a single disk. For a RAID 4 built from 5400 RPM disks, this translates into a peak of approximately 45 write operations per second.
Added Complexity: RAID 4 is more complex than RAID 0 or RAID 1.
Identical Disks: RAID 4 requires all disks to be of identical size.
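The figure of approximately 45 write operations per second can be reproduced with simple arithmetic. The sketch below rests on an assumption that is not stated in the text: that a single disk completes roughly one small write per revolution. The function name and parameters are hypothetical.

```python
def raid4_peak_small_writes(rpm, fraction_of_single_disk=0.5):
    """Estimate peak small writes per second for a RAID 4.

    Assumption (not from the text): a single disk completes about one
    small write per revolution. The RAID 4 then manages roughly
    `fraction_of_single_disk` of that rate.
    """
    revolutions_per_second = rpm / 60
    return revolutions_per_second * fraction_of_single_disk
```

For 5400 RPM disks this gives 90 revolutions per second and therefore about 45 small writes per second for the array.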
RAID 5 is a seemingly small modification to RAID 4 but it completely changes the result. Where RAID 4 has a dedicated parity disk, RAID 5 uses a “distributed” parity approach. RAID 5 decides to abandon the dedicated parity disk and instead to spread the parity information throughout all n disks. For example, the parity information for the first stripe could be on drive 0, the second stripe on drive 1, etc. The most common pattern is a ‘barber pole’ whereby the parity for each stripe moves to a higher disk drive from the previous stripe.
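The 'barber pole' placement can be sketched as follows. This is one possible rotation, assuming the parity simply advances by one drive per stripe and wraps around; the function names are hypothetical and real implementations may rotate differently.

```python
def raid5_parity_disk(stripe, num_disks):
    """Return the drive holding parity for a stripe (simple rotation)."""
    return stripe % num_disks


def raid5_layout(stripe, num_disks):
    """Return a list marking each drive as parity ('P') or data ('D')."""
    parity = raid5_parity_disk(stripe, num_disks)
    return ["P" if disk == parity else "D" for disk in range(num_disks)]
```

With 4 drives, stripe 0 places parity on drive 0, stripe 1 on drive 1, and stripe 4 wraps back to drive 0.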
RAID 10 is really RAID 1+RAID 0. It is simply a RAID 0 created out of mirrored disks (or if you prefer, a mirrored RAID 0). This approach is used where maximum reliability and throughput are required and cost is not a concern. However, RAID 10 cannot survive the loss of an arbitrary pair of disks (losing both disks of one mirror loses data), so it is actually not much more reliable than RAID 4 or RAID 5. RAID 10 does not have the same write bottleneck as RAID 4 or RAID 5, but it wastes 50% of its disk storage.
RAID 41 or Mirrored RAID 4s is extremely uncommon, but is relevant to the present discussion. In essence, it is a RAID 4 created out of mirrored disks. The result is extremely robust at the cost of storage efficiency. RAID 41 can survive multiple disk failures. In fact, under some circumstances it can lose more than 50% of the disks and still operate without loss of data. In most configurations, a RAID 41 can recover from the loss of at least any 2 disks and often more. Some drawbacks to RAID 41 are that it requires lots of disks (minimum 6) and has low space utilization. The space efficiency of RAID 41 will never achieve 50%. RAID 41 has similar performance characteristics to RAID 4.
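The failure tolerance of RAID 41 can be explored by brute force. The sketch below assumes two-way mirrors, with disks 2i and 2i+1 forming mirrored pair i, and a RAID 4 built across the pairs that tolerates the loss of one whole pair. The function names and the disk numbering are hypothetical.

```python
from itertools import combinations


def raid41_survives(failed, num_pairs):
    """True if the array still holds all data with `failed` disks lost.

    A mirrored pair is lost only when both of its disks have failed,
    and the RAID 4 across the pairs tolerates one lost pair.
    """
    dead_pairs = sum(
        1 for pair in range(num_pairs)
        if 2 * pair in failed and 2 * pair + 1 in failed
    )
    return dead_pairs <= 1


def survives_any(k, num_pairs):
    """True if every possible combination of k failed disks is survivable."""
    disks = range(2 * num_pairs)
    return all(
        raid41_survives(set(combo), num_pairs)
        for combo in combinations(disks, k)
    )
```

Under these assumptions, the minimum 6-disk array survives some 4-disk failures (for example disks 0, 1, 2 and 4), which is more than 50% of the disks, but not every 4-disk combination.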
ECC technology is used within disks to determine and correct read errors. The common ECC technology used today is derived from Reed-Solomon codes.
There is a little known variant of these error correcting codes known as erasure codes, or Reed-Solomon Erasure Code-based RAID (RS-RAID). These codes do not have the ability to detect an error; they simply recover the data once the error is detected. In essence, they recover “erased” data. The value of these codes is that one can create a RAID-like array which contains n data disks and m “parity” disks. This array can survive the failure of any combination of m disks.
FIG. 3 provides a graph showing the overall storage efficiency for different RAID configurations over a reasonable range of array sizes. This section provides some explanation of this graph. RAID 0 has no overhead so it is always 100% efficient. RAID 1 mirrors the same data on more and more disks so its efficiency goes down as more disks are added. RAID 4 and RAID 5 have a single parity disk's worth of overhead so this grows proportionally smaller as the number of disks is increased. RAID 10 requires an even number of disks so odd disks are assumed to be spares (hence the “zigzag”). RAID 41 similarly requires even numbers of disks so odd disks are considered spares. RS-RAID can have any number of parity disks, and is plotted with m=3 so that the RS-RAID configuration can survive 3 failures. If m were set equal to 1, the curve would have been the same as RAID 4/5.
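The curves described above follow from simple ratios of data disks to total disks. The sketch below is an illustration under the stated conventions (odd disks in RAID 10 and RAID 41 are treated as spares, and mirrors are assumed to be two-way); the function name and the level labels are hypothetical.

```python
def storage_efficiency(level, disks, m=3):
    """Fraction of raw capacity available for data.

    `disks` is the total number of disks in the array, and `m` is the
    number of parity disks for RS-RAID.
    """
    if level == "RAID 0":
        return 1.0
    if level == "RAID 1":
        return 1 / disks                 # every disk holds the same data
    if level in ("RAID 4", "RAID 5"):
        return (disks - 1) / disks       # one disk's worth of parity
    if level == "RAID 10":
        return (disks // 2) / disks      # an odd disk is a spare
    if level == "RAID 41":
        return (disks // 2 - 1) / disks  # mirrored pairs, one pair is parity
    if level == "RS-RAID":
        return (disks - m) / disks
    raise ValueError(f"unknown RAID level: {level}")
```

For 8 disks this gives 87.5% for RAID 4 or RAID 5, 50% for RAID 10, 37.5% for RAID 41, and 62.5% for RS-RAID with m=3. With m=1 the RS-RAID value equals the RAID 4/5 value.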
In view of the foregoing, it would be desirable to provide a file system using a RAID configuration with large numbers of disks (for storage efficiency) while writing whole stripes (to avoid the parity bottleneck) and which can adapt to the addition of disks to the end of the stripe (for easy expansion). The file system would be able to provide the following features: very high write speeds; very high parallel read speeds; selectably high reliability; easy expansion (one disk at a time if desired); high capacity (lots of disks add up quickly); and excellent storage utilization.
File System Operations
File systems provide an important abstraction layer. They convert raw sectors into files and directories (or “folders”). The functionality, performance and limitations of a given file system are the product of the underlying design of the file system.
1. Traditional Block Oriented File Systems
Early file systems were designed to run on relatively small machines, often with as little as 4K of memory. Their file services were necessarily limited and the file system designs placed simplicity and reliability at a premium. Furthermore, early disk drives were typically only a handful of megabytes so scalability was often unimportant.
One of the early simplifying concepts was the use of blocks of storage instead of sectors. A block is the smallest unit of storage managed by the file system. In some cases a block is a sector but in most cases a block is a power of 2 sectors. Some file systems use blocks as large as 128 sectors (64K). Almost no file system uses blocks smaller than a sector due to the complexity of blocking/deblocking contents into sectors. The most common block size is 8K with 4K and 16K being less popular. Typically, file systems would implement an internal abstraction of a volume as a collection of blocks numbered from 0 to m−1 covering the entire volume.
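The relationship between blocks and sectors can be sketched as follows, assuming 512-byte sectors and a fixed number of sectors per block. The constant and function names are hypothetical.

```python
SECTOR_BYTES = 512  # assumed sector size


def block_size_bytes(sectors_per_block):
    """Size of one block in bytes."""
    return sectors_per_block * SECTOR_BYTES


def block_to_sectors(block_number, sectors_per_block):
    """Return the range of sector numbers that make up a block."""
    first = block_number * sectors_per_block
    return range(first, first + sectors_per_block)
```

A 128-sector block is 65,536 bytes (64K), and with 16-sector (8K) blocks, block 2 covers sectors 32 through 47.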
2. Journaling File Systems
Journaling is actually a very simple concept. As file system modifications are fed into the buffer cache, the file system builds a journal of the changes. This journal is effectively a recipe for changing the file system from its current state to the proper state with the changes made. As the system has time and available disk bandwidth, it can execute the journal keeping the disk more-or-less up to date. If the write load becomes too heavy, the journal grows faster than it can be retired. During relative lulls in activity, the journal shrinks until it is empty.
A number of optimizations are possible in the journaling file system design. It is possible to optimize a journal by suppressing redundant writes—only the last write to a given location need be executed. It is possible to order writes such that a volume is up to date after a single pass through the disk—dramatically decreasing seek times. Some journaling implementations only journal metadata changes, while others journal everything.
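The first two optimizations can be sketched together. This is a minimal illustration, assuming the journal is a list of (block number, data) pairs in arrival order; the function name and the journal representation are hypothetical.

```python
def optimize_journal(journal):
    """Collapse and order pending writes before they are executed.

    `journal` is a list of (block_number, data) pairs in arrival order.
    Only the last write to each block is kept, and the result is sorted
    by block number so the disk can be updated in a single pass.
    """
    latest = {}
    for block, data in journal:
        latest[block] = data  # a later write replaces an earlier one
    return sorted(latest.items())
```

Given writes to blocks 7, 2, 7 and 5, the result contains three writes ordered as blocks 2, 5 and 7, with block 7 carrying the data from its second write.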
3. Transaction Logging File Systems
Transaction logging file systems (TLFS) are based upon a different approach to file management. However, for motivation, a TLFS can be viewed as a journaling file system with a huge journal which never gets around to updating the block file system. The classic TLFS is LFS in the Sprite operating system.
It would be desirable to provide a TLFS that has the following features:
Dynamic expansion: the ability to add storage to the file system at any time without complex preparation or even bringing the file system off line.
High speed writes: the ability to optimize writes to be 100% sequential and stripe-sized so as to tap the full write bandwidth of an RS-RAID array.
Undeletion or versioning of files: the ability to “go back in time” to a previous state in the file.
Self-healing: the ability to isolate failed disks and recover to the degree that little performance is lost and that additional disk failures can be endured under similar conditions.
The present invention provides such a file system by use of generalized object storage technology.