1. Field of the Invention
The present invention relates generally to data storage systems, and more particularly to power-efficient, high-capacity data storage systems that are scalable and reliable.
2. Related Art
The demand for large amounts of data storage motivates the building of large-scale, high-capacity storage systems. While one option for building scalable systems is to connect and centrally manage multiple storage systems across a network, such as a storage area network (SAN), increasing the inherent capacity of a single system is still highly desirable for two reasons: first, increasing the total storage capacity of a single system in effect provides a multiplier for the total storage across a SAN; and second, for many uses, a single device that manages a larger storage capacity is more cost-effective to test, integrate and deploy.
Traditionally, tape drives, automated tape libraries or other removable media storage devices have been used to deliver large capacity storage in a single system. This is due in large part to the lower cost and footprint of these types of systems when compared to media such as disk drives. Recent advances in disk technology, however, have caused designers to revisit the design of large-scale storage systems using disk drives. There are two primary reasons for this. First, the cost differential between disk and tape devices per unit of storage is decreasing rapidly due to the higher capacity of disk drives available at effectively lower cost. Second, the performance of disk systems with respect to access times and throughput is far greater than that of tape systems.
Despite the falling cost of disk drives and their performance in throughput and access times, some tape drives still have the advantage of being able to support large numbers (e.g., ten or more) of removable cartridges in a single automated library. Because a single tape drive can access multiple tape volumes, equivalent storage on multiple disk drives will consume more (e.g., ten times more) power than the equivalent tape drive system, even with a comparable footprint. Furthermore, for a disk-based storage system that has the same number of powered drives as the number of passive cartridges in a tape system, the probability of failures increases in the disk storage system. It would therefore be desirable to provide a single high-capacity disk-based storage system that is as cost-effective as tertiary tape storage systems but with high reliability and greater performance.
Traditional RAID and Data Protection Schemes Issues
The dominant approach to building large storage systems is to use a redundant array of inexpensive (independent) disks (RAID). RAID systems are described, for example, in David A. Patterson, G. Gibson, and Randy H. Katz, “A Case for Redundant Arrays of Inexpensive Disks (RAID),” International Conference on Management of Data (SIGMOD), p. 109–116, June 1988. The primary goal for RAID is to provide data protection or fault tolerance in access to data in the case of failures, especially disk failures. A secondary benefit is increasing I/O performance by spreading data over multiple disk spindles and performing operations in parallel, which allows multiple drives to be working on a single transfer request.
There are six commonly known RAID “levels” or standard geometries that are generally used for conventional RAID storage systems. The simplest array that provides a form of redundancy, a RAID level 1 system, comprises one or more disks for storing data and an equal number of additional mirror disks for storing copies of the information written to the data disks. The remaining RAID levels, identified as RAID level 2–6 systems, segment the data into portions for storage across several data disks. One or more additional disks are utilized to store error check or parity information.
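By way of illustration only, the following sketch shows how a single parity block computed by exclusive-OR (XOR) over the data blocks of a stripe allows one lost block to be reconstructed. It is a minimal example under stated assumptions, not the method of any particular RAID implementation: the function names xor_parity and rebuild_missing are hypothetical, and the sketch assumes one parity block per stripe and blocks of equal size.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of a list of equally sized byte blocks.

    Hypothetical helper for illustration; assumes one parity block per stripe.
    """
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Reconstruct the single missing data block of a stripe.

    XOR of the parity with every surviving data block cancels the surviving
    data and leaves the missing block.
    """
    return xor_parity(list(surviving_blocks) + [parity])


# Example stripe of three data blocks (values chosen arbitrarily).
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_parity(stripe)
recovered = rebuild_missing([stripe[0], stripe[2]], parity)
```

In this example the second data block is treated as lost and is recovered from the other two data blocks and the parity block.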
RAID storage subsystems typically utilize a control module that shields the user or host system from the details of managing the redundant array. The controller makes the subsystem appear to the host computer as a single, highly reliable, high capacity disk drive even though a RAID controller may distribute the data across many smaller drives. Frequently, RAID subsystems provide large cache memory structures to further improve the performance of the subsystem. The host system simply requests blocks of data to be read or written and the RAID controller manipulates the disk array and cache memory as required.
The various RAID levels are distinguished by their relative performance capabilities as well as their overhead storage requirements. For example, a RAID level 1 “mirrored” storage system requires more overhead storage than RAID levels 2–5 that utilize XOR parity to provide requisite redundancy. RAID level 1 requires 100% overhead since it duplicates all data, while RAID level 5 requires overhead equal to 1/N of the storage capacity used for storing data, where N is the number of data disk drives used in the RAID set.
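The overhead comparison above can be sketched numerically as follows. This is an illustrative calculation only; the function names are hypothetical, overhead is expressed as a fraction of the capacity used for storing data, and N counts data drives only, as in the text.

```python
def mirror_overhead():
    """RAID level 1: every data disk has a mirror, so overhead is 100%."""
    return 1.0


def single_parity_overhead(n_data_drives):
    """RAID level 5: one drive's worth of parity per N data drives (1/N)."""
    if n_data_drives < 1:
        raise ValueError("n_data_drives must be at least 1")
    return 1.0 / n_data_drives


# Overhead, as a fraction of data capacity, for a few RAID set sizes.
overheads = {n: single_parity_overhead(n) for n in (2, 4, 8)}
```

For a RAID set with four data drives, for instance, the parity overhead is 25% of the data capacity, compared with 100% for mirroring.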
Traditional Power Consumption Issues
There have been a few recent efforts at power cycling computing resources at a data center. This is done for a variety of different reasons, such as energy cost and reliability. For example, a data storage system may be scaled upward to incorporate a very large number of disk drives. As the number of disk drives in the system increases, it is apparent that the amount of energy required to operate the system increases. It may be somewhat less apparent that the reliability of the system is likely to decrease because of the increased heat generated by the disk drives in the system. While prior art systems use various approaches to address these problems, they typically involve opportunistically powering down all of the drives in the system, as demonstrated by the following examples.
To reduce energy costs in a data center, one approach employs energy-conscious provisioning of servers by concentrating request loads to a minimal active set of servers for the current aggregate load level (see Jeffrey S. Chase, Darrell C. Anderson, Parchi N. Thakar, Amin M. Vahdat, and Ronald P. Doyle. Managing energy and server resources in hosting centers. In Proceedings of the 18th ACM Symposium on Operating Systems Principles, pages 103–116, October 2001). Active servers always run near a configured utilization threshold, while the excess servers transition to low-power idle states to reduce the energy cost of maintaining surplus capacity during periods of light load. The focus is on power cycling servers and not on storage. Chase, et al. mention that power cycling may reduce the life of the disks, but current disks have a start/stop limit that will likely not be exceeded.
Another approach uses a large-capacity storage system which is referred to as a massive array of idle disks, or MAID (see Dennis Colarelli, Dirk Grunwald and Michael Neufeld, The Case for Massive Arrays of Idle Disks (MAID), Usenix Conference on File and Storage Technologies (FAST), January 2002, Monterey Calif.). In this approach, a block level storage system uses a front-end cache and controller that allow access to the full array of drives. The full array can be powered off opportunistically to extend the life of IDE or ATA drives. The power off schedule is based on a heuristic, such as a least-recently-used or least expected to be used model, i.e., the array of drives is turned off when no data access is expected on any of the drives in the array. Another approach uses archival storage systems where ATA drives are also powered off (as in the case of MAID) based on algorithms similar to the LRU policy (see Kai Li and Howard Lee, Archival data storage system and method, U.S. Patent Application # 2002-0144057, Oct. 3, 2002). In some systems, the array of drives comprises a RAID set. In these systems, the entire RAID set is opportunistically powered on or off (see, e.g., Firefly™ Digital Virtual Library, http://www.asaca.com/DVL/DM—200.htm). These systems can power down a RAID set that has been in an extended state of inactivity, or power up a RAID set for which I/O requests are pending.
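The opportunistic power-down behavior described above can be sketched as a simple idle-timeout policy applied to whole RAID sets. The sketch is an assumption made for illustration and is not taken from any of the cited systems: the class name RaidSetPowerPolicy, its methods, and the fixed idle timeout are all hypothetical, and time is passed in explicitly as a number of seconds.

```python
class RaidSetPowerPolicy:
    """Hypothetical idle-timeout policy operating on whole RAID sets."""

    def __init__(self, idle_timeout_s):
        self.idle_timeout_s = idle_timeout_s
        self.last_access = {}  # RAID set name -> time of last I/O request
        self.powered_on = {}   # RAID set name -> current power state

    def on_io_request(self, raid_set, now):
        """A pending I/O request powers the RAID set up and resets its idle clock."""
        self.last_access[raid_set] = now
        self.powered_on[raid_set] = True

    def tick(self, now):
        """Power down every RAID set idle for at least the timeout.

        Returns the names of the RAID sets powered down by this call.
        """
        powered_down = []
        for raid_set, is_on in self.powered_on.items():
            idle_for = now - self.last_access[raid_set]
            if is_on and idle_for >= self.idle_timeout_s:
                self.powered_on[raid_set] = False
                powered_down.append(raid_set)
        return powered_down


policy = RaidSetPowerPolicy(idle_timeout_s=600)
policy.on_io_request("set_a", now=0)
policy.on_io_request("set_b", now=500)
idle_sets = policy.tick(now=700)  # only set_a has been idle for 600 s or more
```

Note that the unit of power management in this sketch is the entire RAID set, which mirrors the limitation of the prior approaches discussed here.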
Systems with Very Large Numbers of Drives
One of the challenges that exists in the current data storage environment is to build a storage controller that can handle hundreds of drives for providing large-scale storage capacity, while maintaining performance and reliability. This challenge encompasses several different aspects of the system design: the system reliability; the interconnection and switching scheme for control of the drives; the performance in terms of disk I/O; and the cost of the system. Each of these aspects is addressed briefly below.
System Reliability.
As the number of operational drives increases in the system, especially if many drives are seeking data concurrently, the probability of a drive failure increases almost linearly with the number of drives, thereby decreasing overall reliability of the system. For example, if a typical disk drive can be characterized as having a mean time to failure (MTTF) of 500,000 hours, a system with 1000 of these drives will be expected to have its first disk failure in approximately 500 hours (500,000 hours divided by 1000 drives), or about 21 days.
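The arithmetic behind this example can be sketched as follows, under the simplifying assumption that drives fail independently at a constant rate, so that the expected time to the first failure among n drives is the single-drive MTTF divided by n. The function name is hypothetical and is used only for illustration.

```python
def expected_hours_to_first_failure(mttf_hours, n_drives):
    """Expected time until the first of n_drives fails.

    Assumes independent drives with constant failure rates, so the
    failure rates add and the expected time is mttf_hours / n_drives.
    """
    if n_drives < 1:
        raise ValueError("n_drives must be at least 1")
    return mttf_hours / n_drives


hours = expected_hours_to_first_failure(500_000, 1000)
days = hours / 24  # a little under 21 days
```

With an MTTF of 500,000 hours and 1000 drives, the result is roughly 500 hours, or about 21 days.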
Interconnection and Switching Scheme for Control of Drives.
As the number of drives increases, an efficient interconnect scheme is required to move both data and control commands between the controller and all of the drives. As used here, control of the drives refers to both controlling access to drives for I/O operations, and providing data protection, such as by using RAID parity schemes. There are two obvious challenges that arise in relation to the interconnection mechanism: the cost of the interconnection and the related complexity of fanout from the controller to the drives.
Performance for Disk I/O.
Since the controller will read and write data to and from all of the drives, the bandwidth required between the controller and the drives will scale with the number of active drives. In addition, there is the difficulty of RAIDing across a very large set, since the complexity, the extent of processing logic and the delay of the parity computation will grow with the number of drives in the RAID set.
Cost.
All of the above design issues must be addressed while ensuring that the cost of the overall disk system remains competitive with typically lower-cost tertiary tape storage devices.