1. Field of the Invention
The present invention relates to a method of conserving power consumed by data storage devices.
2. Description of the Related Art
Computer systems today are designed to optimize application performance and to conserve power when the application load is reduced. Disk drives often form an integral portion of the server subsystem and are a key element in meeting the performance needs of an application. On the other hand, disk drives also consume a significant portion of the power in a server. There is therefore a need to conserve disk power without compromising the performance of the storage subsystem.
A hard disk includes platters, read/write heads, a spindle and spindle motor, an actuator and actuator assembly, cache, and a controller. The platters are round, flat disks coated on both sides with a magnetic material designed to store data. The data on a platter is organized and stored in the form of concentric circles called “tracks” and each track is divided into “sectors.” A read/write head is typically associated with each side of each platter. During read operations, the heads convert the data stored in the magnetic material of the platters into electrical signals. They perform the reverse operation during data writes. Only one head can read from or write to the hard disk at a given time.
The platters are mounted onto a spindle and rotated by a spindle motor, such as a servo-controlled DC spindle motor. Increasing the speed of the spindle motor allows the platters to spin faster and causes more data to be read by the heads in a shorter time. The read/write heads that access the platters are mounted on an actuator assembly and the assembly is connected to an actuator. The actuator holds the heads in an exact position relative to the platter surfaces that are being read and also moves the heads from track to track to allow access to the entire surface of the disk. The actuator in a typical modern hard disk uses a voice coil to move the head arms in and out over the surface of the platters, and a servo system to dynamically position the heads directly over the desired data tracks.
Modern hard disks contain an integrated cache that serves as a buffer between the fast requesting application and the slow hard disk. For hard disks, the cache is used to hold the results of recent reads from the disk, and also to “pre-fetch” information that is likely to be requested in the near future. The hard disk's cache is important because of the difference in the speeds of the hard disk and the hard disk interface. On a typical IDE/ATA hard disk, transferring a block of data from the disk's internal cache is over 100 times faster than actually finding it and reading it from the platters.
Reading data from the hard disk is generally done in blocks of various sizes, not just one 512-byte sector at a time. The cache is broken into “segments”, or pieces, each of which can contain one block of data. When a request is made for data from the hard disk, the cache circuitry is first queried to see if the data is present in any of the segments of the cache. If it is present, it is supplied to the logic board without access to the hard disk's platters being necessary. If the data is not in the cache, it is read from the hard disk, supplied to the controller, and then placed into the cache in the event that the data is needed again soon. Since the cache is limited in size, there are only so many pieces of data that can be held before the segments must be recycled. Typically the oldest piece of data is replaced with the newest one. While obviously improving performance, the cache does not help for random read operations.
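The segment lookup and recycling behavior described above can be sketched as follows. This is a minimal illustration only: the class name `SegmentCache`, the `platters` stand-in function, and the two-segment size are assumptions chosen for the example, and the oldest-first replacement shown is the "typical" policy mentioned above rather than the policy of any particular drive.

```python
from collections import OrderedDict


class SegmentCache:
    """Toy read cache: a fixed number of segments, oldest entry recycled first."""

    def __init__(self, num_segments, read_from_platters):
        self.num_segments = num_segments
        self.read_from_platters = read_from_platters  # callable: block number -> data
        self.segments = OrderedDict()  # block number -> data, kept in insertion order
        self.hits = 0
        self.misses = 0

    def read(self, block):
        # Query the cache first; a hit avoids any access to the platters.
        if block in self.segments:
            self.hits += 1
            return self.segments[block]
        # Miss: read from the platters, then keep the data in case it is needed again.
        self.misses += 1
        data = self.read_from_platters(block)
        if len(self.segments) >= self.num_segments:
            self.segments.popitem(last=False)  # recycle the oldest segment
        self.segments[block] = data
        return data


def platters(block):
    # Hypothetical stand-in for the slow mechanical read.
    return f"data-{block}"


cache = SegmentCache(num_segments=2, read_from_platters=platters)
for block in (7, 7, 8, 9, 7):
    cache.read(block)
print(cache.hits, cache.misses)
```

In this run only the immediate repeat of block 7 is served from the cache; by the time block 7 is requested again its segment has been recycled, which is why a small cache gives little benefit for scattered (random) reads.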
With no write caching, every write to the hard disk involves a performance hit while the system waits for the hard disk to access the correct location on the platters and write the data. This mode of operation is called write-through caching. If write caching is enabled, when the system sends a write to the hard disk, the logic circuit records the write in its much faster cache and immediately sends a completion notification back to the system. This is called write-back caching, because the data is stored in the cache and only “written back” to the platters later on. Write-back functionality improves performance but at the expense of reliability. Since the cache is not backed up by a battery, a power failure will lead to loss of the data in the cache. Due to this risk, in some situations write caching is not used at all.
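The reliability trade-off between the two write modes can be illustrated with a toy model. The `Drive` class and its method names are hypothetical and exist only for this sketch; a real drive's firmware is far more involved.

```python
class Drive:
    """Toy drive with an optional write-back cache (all names are illustrative)."""

    def __init__(self, write_back):
        self.write_back = write_back
        self.platters = {}  # persistent storage: survives a power failure
        self.cache = {}     # volatile cache: lost on a power failure

    def write(self, block, data):
        if self.write_back:
            # Write-back: completion is reported before the data reaches the platters.
            self.cache[block] = data
        else:
            # Write-through: completion is reported only after the platter write.
            self.platters[block] = data
        return "complete"

    def flush(self):
        # Later on, cached writes are "written back" to the platters.
        self.platters.update(self.cache)
        self.cache.clear()

    def power_failure(self):
        # The cache is not battery backed, so its contents are lost.
        self.cache.clear()


write_through = Drive(write_back=False)
write_back = Drive(write_back=True)
for drive in (write_through, write_back):
    drive.write(1, "new data")
    drive.power_failure()

print(write_through.platters)  # the write survived
print(write_back.platters)     # the acknowledged write was lost
```

Both drives report the write as complete, but only the write-through drive still holds the data after the simulated power failure.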
The disk drive's controller includes a microprocessor and internal memory, and other structures and circuits that control what happens inside the drive. The controller is responsible for controlling the spindle motor speed, controlling the actuator movement to various tracks, managing all read/write operations, managing the internal write back cache and prefetch features, and implementing any advanced performance and reliability features. In operation, a computer's operating system requests a particular block of data from a disk driver. The driver then sends the request to the hard disk controller. If the requested data is already in the cache, then the controller provides the data without accessing the disk. However, if the requested data is not in the cache, then the controller identifies where the requested data resides on the disk drive, normally expressed as a head, track and sector, and instructs the actuator to move the read/write heads to the appropriate track. Finally, the controller causes the head to read the identified sector(s) and coordinates the flow of information from the hard disk over the hard disk interface usually to the system memory to satisfy the system's request for data.
A desktop or laptop computer may conserve power by causing its disk drives to “hibernate” or turn off whenever an end user application is not requesting data. However, it is critical for many servers to provide continuous and responsive access to data stored on disks.
In fact, the need for data reliability and input/output performance has led to the development of a series of data storage schemes that divide and replicate data among multiple data storage devices. The storage schemes are commonly referred to as Redundant Arrays of Independent Disks (RAID). RAID combines physical data storage devices, such as hard disk drives, into a single logical unit by using either special hardware or software. Hardware solutions often are designed to present themselves to the attached system as a single device or drive, and the operating system is unaware of the technical workings of the underlying array. Alternative software solutions are typically implemented in the operating system, and would similarly present the RAID drive as a single device or drive to applications. The minimum number of drives and the level of data reliability depend on the type of RAID scheme that is implemented.
Originally there were five RAID levels, but variations, nested levels and nonstandard levels have evolved. Different RAID levels use one or more techniques referred to as mirroring, striping and error correction. Mirroring involves the copying of data to more than one disk, striping involves the splitting of data across more than one disk, and error correction involves storing redundant data to allow problems to be detected and possibly fixed.
For example, a RAID-5 array uses block-level striping with parity data distributed across all member disks. RAID 5 has achieved popularity due to its low cost of redundancy. Generally, RAID 5 is implemented with hardware support for parity calculations. A minimum of 3 disks is generally required for a complete RAID 5 configuration. RAID-5 offers an optimal balance between price and performance for most commercial server workloads. RAID-5 provides single-drive fault tolerance by implementing a technique called single equation single unknown.
A stripe comprises adjacent blocks of data stored on different devices or drives comprising the array. In Table 1, below, blocks 1, 2, 3 and P1 make up the 1st stripe; blocks 4, 5, P2 and 6 make up the 2nd stripe; and so on. Every stripe has an associated parity P which is computed based on the blocks within the same stripe. The RAID-5 controller calculates a checksum (parity) using a logic function known as an exclusive-or (XOR) operation. The checksum is the XOR of all of the other data elements in a row. The XOR operation can be performed quickly by the RAID controller hardware, and its result is used to solve for the unknown data element. Since Block P1 = Block 1 XOR Block 2 XOR Block 3, any one of the four blocks in a stripe can be retrieved if it is missing. For example, Block 1 can be computed using blocks 2, 3 and P1.
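The parity relation and the recovery of a missing block can be checked with a short sketch. The helper name `xor_blocks` and the four-byte block contents are arbitrary choices for illustration; real blocks are much larger and the XOR is normally performed by controller hardware.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Bytewise XOR of any number of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks of one stripe (arbitrary example contents).
block1 = bytes([0x11, 0x22, 0x33, 0x44])
block2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
block3 = bytes([0x0F, 0x0E, 0x0D, 0x0C])

# Block P1 = Block 1 XOR Block 2 XOR Block 3
p1 = xor_blocks(block1, block2, block3)

# If Block 1 is missing, it is the single unknown in that equation.
recovered = xor_blocks(block2, block3, p1)
assert recovered == block1
```

The same call recovers any one of the four blocks from the other three, which is the "single equation, single unknown" property.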
TABLE 1. Example 4-drive RAID 5 array
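A block layout consistent with the two stripes described above (blocks 1, 2, 3 and P1, then blocks 4, 5, P2 and 6) can be generated as follows. The function name `raid5_layout` is hypothetical, and the rule that parity moves one drive to the left on each successive stripe is an assumption extrapolated from those two stripes; actual controllers may rotate parity differently.

```python
def raid5_layout(num_drives, num_stripes):
    """Return one row per stripe; each row lists what each drive holds."""
    rows = []
    next_block = 1
    for stripe in range(num_stripes):
        # Assumed rotation: parity starts on the last drive and shifts left per stripe.
        parity_drive = (num_drives - 1 - stripe) % num_drives
        row = []
        for drive in range(num_drives):
            if drive == parity_drive:
                row.append(f"P{stripe + 1}")
            else:
                row.append(str(next_block))
                next_block += 1
        rows.append(row)
    return rows


for row in raid5_layout(num_drives=4, num_stripes=4):
    print("  ".join(f"{cell:>3}" for cell in row))
```

Because the parity position rotates, parity updates are spread over all drives instead of concentrating on one.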
For performance reasons, the computation of parity is optimized by the following two methods. In accordance with a Read/Modify Write (RMW) operation, if an application intends to modify block 1, then only block 1 and parity P1 are read. The new parity P1′ is computed and the new block 1′ and new parity P1′ are written back to the drives. This results in two read operations and two write operations, and the number of operations is independent of the number of drives in the array. In accordance with a Full Parity or Full Checksum (FC) operation, if an application intends to modify block 1, then block 2 and block 3 are read. The new parity P1′ is computed and the new block 1′ and new parity P1′ are written back to the drives. The number of write operations is two and is independent of the number of drives. The number of read operations depends on the number of drives in the array (number of drives minus 2).
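The two parity update methods can be compared in a small sketch that computes the new parity both ways and counts the reads and writes. The function names `rmw_update` and `fc_update` and the two-byte block contents are illustrative assumptions, not part of any controller interface.

```python
def xor(a, b):
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def rmw_update(data_blocks, parity, index, new_block):
    """Read/Modify Write: read only the old block and the old parity."""
    reads = 2
    # Remove the old block's contribution from the parity, then add the new one.
    new_parity = xor(xor(parity, data_blocks[index]), new_block)
    writes = 2  # new block and new parity
    return new_parity, reads, writes


def fc_update(data_blocks, index, new_block):
    """Full Checksum: read the other data blocks and recompute the parity."""
    others = [b for i, b in enumerate(data_blocks) if i != index]
    reads = len(others)  # number of drives minus 2
    new_parity = new_block
    for block in others:
        new_parity = xor(new_parity, block)
    writes = 2  # new block and new parity
    return new_parity, reads, writes


# One stripe of a 4-drive array: three data blocks plus their parity.
stripe = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor(xor(stripe[0], stripe[1]), stripe[2])
new_block1 = b"\xff\x00"

rmw_parity, rmw_reads, rmw_writes = rmw_update(stripe, parity, 0, new_block1)
fc_parity, fc_reads, fc_writes = fc_update(stripe, 0, new_block1)
assert rmw_parity == fc_parity
print(rmw_reads, rmw_writes, fc_reads, fc_writes)
```

For a 4-drive array both methods happen to need two reads; as drives are added, the RMW read count stays at two while the FC read count grows with the width of the stripe.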
A significant benefit of RAID-5 is the low cost of implementation, especially for configurations requiring a large number of disk drives. To achieve fault tolerance, only one additional disk is required. The checksum information is evenly distributed over all drives, and checksum update operations are evenly balanced within the array.
During read operations, the parity blocks are not read since this would be unnecessary overhead and would diminish performance. The parity blocks are read, however, when a read of a data sector results in a CRC error (each sector is protected by a CRC as well). In this case, the sectors in the same relative position within each of the remaining data blocks in the stripe and within the parity block in the stripe are used to reconstruct the errant sector. The CRC error is thus hidden from the main computer. Likewise, should a disk fail in the array, the parity blocks from the surviving disks are combined mathematically with the data blocks from the surviving disks to reconstruct the data on the failed drive “on the fly”. However, in RAID 5, where there is a single parity block per stripe, the failure of a second drive results in total data loss.
RAID 5E extends RAID 5 capability by integrating a hot spare drive into the array. A hot spare drive is a backup drive that takes the place of a failed drive in an array. RAID 5E does not treat the hot spare drive as just a backup drive; instead, it pulls the hot spare drive into the active array for performance. It “stripes” the spare drive across the whole array. This technique enhances performance, as more drives equate to better performance. It also allows two drives to fail without loss of data.
TABLE 2. Example 4-drive RAID 5E array
Note that RAID 5E requires a portion of each drive to be dedicated to the spare even when all drives are operating normally. In the case that one of the drives fails, the whole array is rebuilt. For example, Table 3 illustrates data on the array at the time Drive 3 fails, and Table 4 shows the rebuilt array.
TABLE 3. Example 4-drive RAID 5E array with a failed drive
TABLE 4. Example of a rebuilt RAID 5E array
RAID 5EE is similar to RAID 5E in that it uses a hot spare and uses the hot spare for performance. But unlike RAID 5E, it stripes the hot spare across the data arrays just like parity. This allows quicker rebuild times than RAID 5E.
TABLE 5. Example of a 4-drive RAID 5EE array
However, despite the data reliability and performance brought about by the use of composite arrays of data storage devices, such as the foregoing RAID systems, there remains a need to conserve the amount of power consumed by these systems. In particular, the benefits of operating these systems come at the expense of operating an extra data storage device. It would be desirable to conserve power in the operation of a composite array while retaining the beneficial attributes of data reliability and performance.