This invention is related to non-volatile, mass electronic storage systems, also known as secondary storage systems.
Secondary storage systems are widely used with different types of client applications including on-line transaction processing and multimedia storage. Transaction processing includes, for instance, credit card processing, which is characterized by the client requesting a relatively large number of small data transactions. In contrast, multimedia storage such as video and music file access generally requires significantly larger transactions. In a typical operation, the client application sends a high level request for reading or writing a file to a file system. The file system maintains a file system name space, and maps file reads/writes to lower level block accesses. These block accesses are then fed to a storage engine that is typically part of a secondary storage system that includes a rotating magnetic disk drive. The storage engine typically has no knowledge of whether the file system is using a particular block in the disk drive, and as such can be described as being independent of the file system.
A currently popular technique for implementing a high performance, large capacity, and low cost secondary storage system is the Redundant Array of Inexpensive Disks (RAID). In a RAID, a set of rotating magnetic disk drives (referred to here as simply "disks") are organized into a single, large, logical disk. Each disk in the set typically has the same number of platters and the same number of tracks on each platter where the data is actually stored. The data is "striped" across multiple disks to improve read/write speeds, and redundant information is stored on the disks to improve the availability of data (reliability) in the event of catastrophic disk failures. The RAID secondary storage system can typically rebuild the failed data disk, without involving the file system, by regenerating each bit of data in each track and platter of the failed disk (using its knowledge of the redundant information), and then storing each such bit in corresponding locations of a new, replacement disk.
Several RAID architectures (known in the industry as "levels") have been developed to provide various combinations of cost, performance, and reliability. For instance, in a Level I RAID, a block of data received from the file system by an input/output (I/O) controller of the RAID is stored entirely in one disk and replicated in a check disk. Level I thus uses twice as many disks as a nonredundant disk array, but provides speedy access (either disk by itself may be used to retrieve the block) and high availability. Accordingly, Level I is frequently used with applications such as on-line transaction processing where availability and transaction rate are more important than storage capacity efficiency.
In contrast, in a Level III RAID architecture, the block of data is spread bit-wise over a number of data disks, and a single parity disk is added to tolerate any single disk failure. The use of parity rather than a replicate of the data lowers the availability in comparison with a Level I architecture. However, storage efficiency is greatly improved as only one parity disk is used for several data disks. Thus, Level III may particularly suit applications that require the storage of large amounts of data and high throughput where data is accessed sequentially most of the time, as in digital video file storage.
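The parity mechanism described above can be illustrated with a minimal sketch. It assumes XOR parity computed byte-wise over equally sized blocks, whereas the Level III architecture described here spreads data bit-wise; the function names `xor_blocks`, `make_parity`, and `rebuild_disk` are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_disks):
    """Compute the contents of the single parity disk from all data disks."""
    return xor_blocks(data_disks)


def rebuild_disk(surviving_disks, parity):
    """Regenerate one failed data disk from the survivors plus parity."""
    return xor_blocks(surviving_disks + [parity])


# Three data disks, each holding a 3-byte block (illustrative values only).
data = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xff\x00\x0f"]
parity = make_parity(data)

# Simulate losing the second disk and regenerating it.
recovered = rebuild_disk([data[0], data[2]], parity)
assert recovered == data[1]
```

In this sketch three data disks need only one additional parity disk, whereas mirroring the same data in the Level I manner would need three additional disks.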
There are two problems with the above described RAID architectures. First, as the storage capacity of the typical disk drive steadily increases, disk rebuild times also increase. While a failed disk is being rebuilt, the storage system is unprotected, i.e. a second disk failure implies the total failure of the storage system, so longer rebuild times can become a serious problem. This problem becomes even greater as the likelihood of failure increases with larger RAID sets having greater numbers of disk drives.
Another problem is reduced read/write performance with applications such as television production, where large transactions involving requests to retrieve or store media files, such as television program and commercial files, are combined with smaller transactions that access text files or "metadata" files which describe the commercial or program contained in a particular media file. Although RAID Level I provides speedy access for both large and small transactions, duplicating the large media files makes inefficient use of the total storage space as compared to that which can be obtained using Level III. Performance using RAID Level III, however, suffers for small, randomly addressed transactions due to the need to access the data in a random fashion over several disks rather than just one.
An embodiment of the invention described below benefits from the concept of closely coupling a fault tolerant, mass storage engine (such as a RAID engine) with a file system to achieve greater overall throughput in storage/database applications having a mix of large, sequential access transactions and small, random access transactions. In addition, disk rebuild time may be greatly reduced using such an embodiment.
A method according to an embodiment of the invention includes dividing a logical storage space representing a storage area in a set of non-volatile storage devices into nonoverlapping storage allocation units (SAUs), each SAU to overlay all devices in the set. Different fault tolerant storage methodologies (FTSMs) are assigned to access (i.e. read/write) data in the different SAUs, respectively. Access to the data is done based on the particular FTSM for the SAU that is being accessed.
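The division into SAUs and the per-SAU assignment of an FTSM may be pictured with the following sketch. It is an illustration under stated assumptions, not the claimed implementation: the logical space is modeled as a range of rows that each span every device in the set, and the names `FTSM`, `SAU`, `divide_into_saus`, and `ftsm_for_row`, together with the sizes used, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class FTSM(Enum):
    """Hypothetical labels for two fault tolerant storage methodologies."""
    MIRROR = "mirror"   # Level I style replication
    PARITY = "parity"   # Level III style parity


@dataclass
class SAU:
    index: int
    first_row: int              # first row covered, on every device in the set
    num_rows: int               # number of rows covered
    ftsm: Optional[FTSM] = None # None until a methodology is assigned


def divide_into_saus(rows_per_device: int, rows_per_sau: int) -> List[SAU]:
    """Split the row range into nonoverlapping SAUs that span all devices."""
    saus = []
    for first_row in range(0, rows_per_device, rows_per_sau):
        num_rows = min(rows_per_sau, rows_per_device - first_row)
        saus.append(SAU(len(saus), first_row, num_rows))
    return saus


def ftsm_for_row(saus: List[SAU], rows_per_sau: int, row: int) -> FTSM:
    """Look up which FTSM governs access to the SAU containing a given row."""
    sau = saus[row // rows_per_sau]
    if sau.ftsm is None:
        raise ValueError(f"SAU {sau.index} has no FTSM assigned")
    return sau.ftsm


saus = divide_into_saus(rows_per_device=1000, rows_per_sau=256)
saus[0].ftsm = FTSM.MIRROR   # e.g. small, random access data
saus[1].ftsm = FTSM.PARITY   # e.g. large, sequential media data
```

An access to row 300 in this sketch falls in the second SAU and would therefore be handled with the parity methodology, while an access to row 10 would be handled with mirroring.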
In a particular embodiment, an allocation table can be shared by both the file system and the RAID engine by virtue of the table being made public to the RAID engine. This allows the file system to choose the optimal fault tolerant storage methodology for storing a particular data stream or data access pattern, while simultaneously allowing the RAID engine to properly recover the data should a disk fail (by referring to the allocation table to determine which fault tolerant storage methodology was used to store the file). Also, when the RAID engine rebuilds a failed disk, only the SAUs that are indicated in the allocation table as being used by the file system are rebuilt, thus saving rebuild time, which is particularly advantageous when large capacity individual disk drives are being used.
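The rebuild saving may be illustrated as follows. The sketch assumes, purely for illustration, that the shared allocation table is a mapping from SAU index to the name of the FTSM in use, with `None` marking an SAU not used by the file system; the table layout and the names `saus_to_rebuild` and `rebuild_fraction` are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical allocation table: SAU index -> FTSM name, or None if unused.
AllocationTable = Dict[int, Optional[str]]


def saus_to_rebuild(table: AllocationTable) -> List[int]:
    """Return, in index order, only the SAUs the file system is using."""
    return [index for index, ftsm in sorted(table.items()) if ftsm is not None]


def rebuild_fraction(table: AllocationTable) -> float:
    """Fraction of all SAUs that must be regenerated after a disk failure."""
    return len(saus_to_rebuild(table)) / len(table)


allocation_table: AllocationTable = {
    0: "mirror",   # in use, stored with mirroring
    1: None,       # free, skipped during rebuild
    2: "parity",   # in use, stored with parity
    3: None,       # free, skipped during rebuild
}

for index in saus_to_rebuild(allocation_table):
    # The engine would regenerate this SAU using the FTSM recorded for it.
    print(f"rebuild SAU {index} using {allocation_table[index]}")
```

With half of the SAUs unused in this example, only half of the failed disk would need to be regenerated, in contrast to an engine that has no knowledge of file system usage and must regenerate every block.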