1. Field of the Invention
The present invention relates generally to Redundant Arrays of Inexpensive Disks (RAID) and, more particularly, to technology for increasing the performance of these disk arrays.
2. Description of the Related Art
Disk arrays provide vast amounts of storage as well as flexibility in speed and reliability. These arrays are often configured to operate as Redundant Arrays of Inexpensive Disks, otherwise known as RAID arrays, to provide added speed and reliability. Existing RAID arrays, however, suffer from various deficiencies.
In order to assure reliability in the event of unexpected power failures, high-performance RAID arrays generally require a battery backup or uninterruptible power supply (UPS). Existing systems are typically configured such that the host is informed that a write has been completed once the write data has been written to a write cache of a disk drive or an array controller. The danger is that power can be lost after the data has been written to the cache but before the data has been written to disk. The battery backup allows cached write data to be maintained until power is restored so that the cached write can be completed. A UPS allows extra time for a write to be completed before power is finally lost. A UPS or a battery backup, however, adds substantially to the initial cost of a RAID array.
As an alternative to using a battery backup, some systems disable write caching. With write caching disabled, the host is informed that a write has been completed only after the write data is actually written to the disk media. This approach is safe, but has poor performance. With the write cache off, when a write is sent to the drive, the drive must wait for the correct location to be under the write head, then complete the write operation and return the completion interrupt. When writes are sequential, the next write arrives just after the correct sector has passed under the head, and an entire disk revolution can be lost. Empirical measurements of a Quantum LM drive have shown that the streaming write performance (with 64 KB writes) drops to 5.9 MB/s with write cache disabled compared to 25.6 MB/s with write cache enabled. Even with a battery-backed external write cache, there is no way to recover the lost performance because there is no queuing of writes at the drive.
The consistency of a RAID array that lacks a UPS or battery backup may also be compromised if a power failure occurs during a write operation. The consistency problem occurs when only a portion of a write operation is completed before a power failure, which may leave portions of an array out of synchronization. This consistency problem occurs even if all write caching is permanently disabled. Existing RAID controllers handle this situation in different ways. Some low-cost storage systems simply ignore the problem, in which case a UPS may be required in order to ensure system reliability. Other systems detect an "unclean shutdown" by setting a bit in nonvolatile storage (e.g., on a disk) during normal operation and clearing the bit when the operating system shuts down normally. If power is lost, the unclean-shutdown bit remains set and can be used to initiate a rebuild of the entire RAID array. The rebuild of the array restores consistency to the array by recreating a mirror disk (RAID 1) or a parity disk (RAID 5), for example. A rebuild of the array, however, is typically a time-consuming operation. For example, a full rebuild of a RAID 10 array with eight 75 GB drives can take on the order of 2.7 hours at 30 MB/s. Furthermore, during the rebuild, if one of the drives of the array fails, data can be permanently lost.
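The order of magnitude of the rebuild time quoted above can be checked with a short calculation. The sketch below is illustrative only and rests on assumptions not stated in the text: decimal units (1 GB = 1000 MB), a rebuild that recopies one drive of each mirrored pair, and an aggregate transfer rate of 30 MB/s. The function name is hypothetical.

```python
def rebuild_hours(drives, drive_gb, rate_mb_s):
    """Estimate a full RAID 10 rebuild time in hours (assumptions in the lead-in)."""
    mirrored_pairs = drives // 2               # RAID 10: drives are paired as mirrors
    data_mb = mirrored_pairs * drive_gb * 1000  # one drive's worth of data per pair
    seconds = data_mb / rate_mb_s
    return seconds / 3600


# Eight 75 GB drives at 30 MB/s, as in the example above.
estimate = rebuild_hours(drives=8, drive_gb=75, rate_mb_s=30)
print(f"{estimate:.2f} hours")
```

Under these assumptions the estimate is a little under three hours, which is consistent with the figure given above.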
The disk drives used in RAID arrays, like all disk drives, are also susceptible to failure. In the case of a drive failure, previous RAID controllers have been configured to rebuild the entire failed drive based upon the data in the remaining drives. This rebuild process is generally time consuming, and the array is susceptible to data loss if another drive fails during the rebuild process.
When initializing or creating a RAID array unit, all of the disks of the array are typically zeroed before the array can be used. Zeroing involves writing zeros to all of the storage locations on the array. The zeroing process is performed in order to create an array that is compliant with the RAID standard under which the array is operating. Depending on the particular RAID array, this zeroing process during unit creation can sometimes take several hours.
The present invention seeks to address these problems among others.
In one aspect of the invention, write performance is increased over systems that disable the write cache. In order to improve write performance, the write cache is enabled, but completion interrupts are deferred and queued in a Pending Completion Write Queue until a flush of the write cache is completed. After a flush has been completed, it can be assured that any cached data has been written to disk. Because a completion interrupt is not sent to the host system before the write command is actually completed, falsely transmitted completion interrupts just prior to power failures are avoided. The drive caches are therefore safely used to coalesce writes and increase performance.
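The deferred-completion behavior can be illustrated with a minimal model. The sketch below is not the disclosed implementation; the class, its attributes, and the in-memory stand-ins for the drive cache and disk media are all hypothetical and serve only to show the ordering of events.

```python
from collections import deque


class DeferredCompletionController:
    """Hypothetical model: completions are held until a cache flush finishes."""

    def __init__(self):
        self.pending_completions = deque()  # stands in for the Pending Completion Write Queue
        self.drive_cache = []               # writes accepted into the drive's write cache
        self.media = {}                     # data known to be on the disk media
        self.completed = []                 # completions actually reported to the host

    def write(self, request_id, address, data):
        # The drive caches the write, but the host is not yet told it completed.
        self.drive_cache.append((address, data))
        self.pending_completions.append(request_id)

    def flush(self):
        # Release only the completions that were queued before this flush began.
        to_release = len(self.pending_completions)
        for address, data in self.drive_cache:
            self.media[address] = data
        self.drive_cache.clear()
        for _ in range(to_release):
            self.completed.append(self.pending_completions.popleft())


controller = DeferredCompletionController()
controller.write(1, 0, b"a")
controller.write(2, 8, b"b")
print(controller.completed)  # nothing reported before the flush
controller.flush()
print(controller.completed)  # both completions reported after the flush
```

In this model a power loss before `flush()` leaves both requests unacknowledged, so the host never holds a completion for data that was only in the cache.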
In another aspect, rebuild times are improved. The address range of the array is divided up into a number of activity bins where each activity bin represents a subset of the address range of the disk drive. For each activity bin, activity data descriptive of disk activity targeted to the corresponding activity bin is preferably stored in a corresponding memory element in a binmap, which is maintained in a nonvolatile RAM (NOVRAM). Alternatively, the binmap can be maintained on one or more of the disk drives. A relatively small number of activity bins (e.g., 64 bins representing a drive's entire range of addresses) can be used to log enough disk activity information to achieve substantial increases in performance during rebuilds.
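One way to map disk addresses to activity bins is sketched below. The text does not specify how the address range is partitioned; this sketch assumes equally sized, contiguous bins addressed by logical block address, and the function names are hypothetical.

```python
NUM_BINS = 64  # example bin count taken from the text


def bin_index(lba, total_sectors, num_bins=NUM_BINS):
    """Return the activity bin containing a logical block address (equal-size bins assumed)."""
    if not 0 <= lba < total_sectors:
        raise ValueError("address out of range")
    bin_size = -(-total_sectors // num_bins)  # ceiling division so every address gets a bin
    return lba // bin_size


def bins_for_range(start_lba, sector_count, total_sectors, num_bins=NUM_BINS):
    """Return the bins touched by an operation covering sector_count >= 1 sectors."""
    first = bin_index(start_lba, total_sectors, num_bins)
    last = bin_index(start_lba + sector_count - 1, total_sectors, num_bins)
    return range(first, last + 1)


# A 10-sector write that straddles the boundary between the first two bins.
print(list(bins_for_range(95, 10, total_sectors=6400)))
```

An operation that crosses a bin boundary touches more than one bin, so each affected bin would need its activity data updated.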
In one embodiment, each activity bin can take on one of two states where the state is maintained by a bit in the binmap. An activity bin is set to a Changing state if at least one of the addresses in the activity bin is the target of a write operation that has been initiated but not completed. The bin is in a Stable state if no addresses are the target of an uncompleted write operation. After a power failure, activity bins in a Changing state are rebuilt and activity bins in a Stable state can be skipped. Rebuild times can be reduced drastically after a power failure if few bins are in the Changing state.
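A two-state binmap with one bit per bin can be sketched as follows. This is a minimal illustration, assuming a bit value of 1 for Changing and 0 for Stable; the class and method names are hypothetical, and an ordinary integer stands in for the nonvolatile memory.

```python
class Binmap:
    """One bit per activity bin: 1 = Changing, 0 = Stable (encoding assumed)."""

    def __init__(self, num_bins=64):
        self.num_bins = num_bins
        self.bits = 0  # all bins start Stable

    def mark_changing(self, index):
        self.bits |= 1 << index

    def mark_stable(self, index):
        self.bits &= ~(1 << index)

    def is_changing(self, index):
        return bool((self.bits >> index) & 1)

    def bins_to_rebuild(self):
        # After a power failure only Changing bins need to be rebuilt.
        return [i for i in range(self.num_bins) if self.is_changing(i)]


binmap = Binmap()
binmap.mark_changing(3)
binmap.mark_changing(40)
binmap.mark_stable(3)
print(binmap.bins_to_rebuild())
```

With 64 bins and a single bin left in the Changing state, a post-failure rebuild in this model would cover one bin rather than the whole address range.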
In one configuration, two binmaps are used to store state information for activity bins. Each time a write operation is received, the corresponding memory elements in each binmap are set to represent the Changing state. Periodically, after an amount of time that is longer than the longest time needed to complete a write operation, an alternate one of the binmaps is reset such that all of the bins are in the Stable state. The binmaps are therefore reset in alternating ping-pong fashion. Accordingly, at least one of the binmaps always contains at least a period's worth of activity data that can be used to effect a rebuild of any Changing bins.
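The alternating reset can be sketched as follows. The class and method names are hypothetical, and the timer that would call `periodic_reset` once per period is omitted. Taking the union of the two binmaps at rebuild time is a conservative choice made for this sketch, not something stated in the text.

```python
class PingPongBinmaps:
    """Two binmaps reset alternately; a set bit means the bin is Changing."""

    def __init__(self, num_bins=64):
        self.num_bins = num_bins
        self.maps = [0, 0]
        self.next_to_reset = 0

    def record_write(self, index):
        # Every write marks the bin as Changing in both binmaps.
        mask = 1 << index
        self.maps[0] |= mask
        self.maps[1] |= mask

    def periodic_reset(self):
        # Called once per period; the period is assumed to exceed the longest write time.
        self.maps[self.next_to_reset] = 0
        self.next_to_reset ^= 1

    def bins_to_rebuild(self):
        # The binmap reset less recently still holds at least one period of history.
        combined = self.maps[0] | self.maps[1]
        return [i for i in range(self.num_bins) if (combined >> i) & 1]


maps = PingPongBinmaps()
maps.record_write(5)
maps.periodic_reset()          # clears the first binmap; the second still records bin 5
print(maps.bins_to_rebuild())
```

Because only one binmap is cleared per period, a write recorded shortly before a reset is still visible in the other binmap for one more period.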
In another configuration, a single binmap is used. The binmap is cleared each time a cache flush is performed. A cache flush assures that all pending writes have been completed and therefore any Changing bins can be reset to Stable.
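The single-binmap variant can be sketched in the same style. The names are hypothetical, and the sketch assumes that no new writes are accepted while a flush is in progress; how writes arriving during a flush would be handled is not addressed here.

```python
class FlushClearedBinmap:
    """Single binmap whose Changing bits are cleared when a cache flush completes."""

    def __init__(self, num_bins=64):
        self.num_bins = num_bins
        self.bits = 0  # a set bit means the bin is Changing

    def record_write(self, index):
        self.bits |= 1 << index

    def on_cache_flush_complete(self):
        # All pending writes are on disk, so every bin can return to Stable.
        self.bits = 0

    def changing_bins(self):
        return [i for i in range(self.num_bins) if (self.bits >> i) & 1]


binmap = FlushClearedBinmap()
binmap.record_write(2)
binmap.record_write(17)
print(binmap.changing_bins())
binmap.on_cache_flush_complete()
print(binmap.changing_bins())
```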
In one embodiment, two additional states are used to reduce array unit creation times. Upon creation of a unit, all of the activity bins are set to an Empty state, at which time the unit is brought on-line. Accordingly, an array unit can be brought on-line nearly instantaneously. In contrast to prior techniques, the zeroing of the whole array, which may take up to several hours, need not be performed before the array is brought on-line. Instead, activity bins can be zeroed on demand, before the first write to each bin, and in the background. As each bin is zeroed, it is set to the Zero state. Activity bins are set to Changing and Stable states in accordance with the above-mentioned embodiments.
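The four-state scheme with on-demand and background zeroing can be sketched as follows. The class, the tiny bin sizes, and the in-memory byte arrays standing in for disk media are hypothetical. The `mark_stable` method stands in for whichever reset mechanism is used to return Changing bins to Stable.

```python
from enum import Enum


class BinState(Enum):
    EMPTY = "Empty"
    ZERO = "Zero"
    CHANGING = "Changing"
    STABLE = "Stable"


class Unit:
    """Hypothetical array unit that is usable immediately after creation."""

    def __init__(self, num_bins=4, bin_size=8):
        self.bin_size = bin_size
        self.states = [BinState.EMPTY] * num_bins
        # 0xFF bytes model stale, not-yet-zeroed media contents.
        self.media = [bytearray(b"\xff" * bin_size) for _ in range(num_bins)]

    def zero_bin(self, index):
        if self.states[index] is BinState.EMPTY:
            self.media[index][:] = bytes(self.bin_size)
            self.states[index] = BinState.ZERO

    def write(self, index, offset, data):
        # Zero on demand before the first write; data is assumed to fit in the bin.
        self.zero_bin(index)
        self.states[index] = BinState.CHANGING
        self.media[index][offset:offset + len(data)] = data

    def mark_stable(self, index):
        if self.states[index] is BinState.CHANGING:
            self.states[index] = BinState.STABLE

    def background_zero_step(self):
        # Zero one Empty bin per call; return its index, or None when none remain.
        for i, state in enumerate(self.states):
            if state is BinState.EMPTY:
                self.zero_bin(i)
                return i
        return None


unit = Unit()
unit.write(1, 2, b"ab")
unit.background_zero_step()
print([state.value for state in unit.states])
```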
Array rebuild times after a drive failure can also be reduced. In accordance with one embodiment, only bins in Changing or Stable states are rebuilt after a drive failure. Any bins in Empty or Zero states can be skipped during the rebuild process since these bins contain no valid data. It is often the case that a disk drive will contain valid data on a small portion of its complete address range. Accordingly, when a disk is only partially full, substantial reductions in rebuild times after drive failures can be achieved.
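The selection of bins for a rebuild after a drive failure can be sketched as a simple filter over the bin states. The state names follow the text; the function name and the use of plain strings for states are assumptions made for illustration.

```python
def bins_to_rebuild_after_drive_failure(states):
    """Return indices of bins that hold valid data (Changing or Stable)."""
    return [i for i, state in enumerate(states) if state in ("Changing", "Stable")]


# A mostly unused drive: only two of eight bins hold valid data.
states = ["Stable", "Changing", "Zero", "Zero", "Empty", "Empty", "Empty", "Empty"]
selected = bins_to_rebuild_after_drive_failure(states)
print(selected, f"{len(selected) / len(states):.0%} of bins rebuilt")
```

In this example three quarters of the address range is skipped, which reflects the reduction described above for a partially full disk.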
Activity bins can also be used to increase the performance of certain read operations. When an activity bin is in the Zero state, it is known that all of the data in the bin is equal to zero. Accordingly, zeros can be returned without performing a read of the disk media.
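The read shortcut for zeroed bins can be sketched as follows. The function signature and the `read_media` callback are hypothetical; the callback stands in for an actual read of the disk media.

```python
def read_bin(states, index, length, read_media):
    """Return data for a read, skipping the media access when the bin is in the Zero state."""
    if states[index] == "Zero":
        return bytes(length)  # the bin is known to contain only zeros
    return read_media(index, length)


media_reads = []


def fake_read_media(index, length):
    # Records each media access so the shortcut can be observed.
    media_reads.append(index)
    return b"\x01" * length


states = ["Zero", "Stable"]
print(read_bin(states, 0, 4, fake_read_media), media_reads)
print(read_bin(states, 1, 4, fake_read_media), media_reads)
```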
In one embodiment, activity bins are associated with display elements on a display configured to spatially indicate the relative locations of disk operations on disk drives.
In one embodiment, a RAID array includes two or more disk drives, a binmap, and a processor. The binmap includes two or more storage elements maintained in a nonvolatile memory and each storage element is associated with an activity bin. The processor is configured to operate the disk drives in a RAID configuration, map disk addresses to activity bins, and store disk activity data in the binmap.