1. Field of the Invention
This invention relates in general to systems and methods for controlling an array of disk drives in a computer system, and more particularly to a method and apparatus for providing battery-backed immediate write back cache for an array of disk drives in a computer system.
2. Description of Related Art
Modern mass storage subsystems are continuing to provide increasing storage capacities to fulfill user demands from host computer system applications. Due to this critical reliance on large capacity mass storage, demands for enhanced reliability are also high. Various storage device configurations and geometries are commonly applied to meet the demands for higher storage capacity while maintaining or enhancing reliability of the mass storage subsystems.
A popular solution to these mass storage demands for increased capacity and reliability is the use of multiple smaller storage modules configured in geometries that permit redundancy of stored data to assure data integrity in case of various failures. In many such redundant subsystems, recovery from many common failures can be automated within the storage subsystem itself due to the use of data redundancy, error codes, and so-called "hot spares" (extra storage modules which may be activated to replace a failed, previously active storage module). These subsystems are typically referred to as redundant arrays of inexpensive (or independent) disks (or more commonly by the acronym RAID). A number of reference articles that describe the design and characteristics of disk array subsystems have been published, including the articles: "Introduction to Redundant Arrays of Inexpensive Disks (RAID)" by D. Patterson, P. Chen, G. Gibson and R. Katz, IEEE, 1989; "Coding Techniques for Handling Failures in Large Disk Arrays" by G. Gibson, L. Hellerstein, R. Karp, R. Katz and D. Patterson, Report No. UCB/CSD 88/477, December 1988, Computer Science Division, University of California Berkeley; and "A Case Study for Redundant Arrays of Inexpensive Disks (RAID)" by D. Patterson, G. Gibson, and R. Katz, presented at the June 1988 ACM SIGMOD Conference in Chicago, Ill.
Generally speaking, a disk array subsystem includes an array of standard disk drives, referred to collectively as a "composite" drive, coupled in parallel. The disk array subsystem further includes a drive array controller for interfacing the composite drive to a computer system. The drive array controller, which is generally installable on an expansion bus of the computer system, converts input-output ("I/O") read and write requests into a sequence of seeks, delays and other disk commands to read data from or write data to the composite drive.
A drive array controller differs from a conventional disk drive controller (i.e., a single disk controller) in that, with respect to the drive array controller, the set of disk drives coupled thereto emulate a single disk drive having a greater capacity and a higher performance than any individual disk drive included as a portion thereof. To perform an access to a virtual composite drive location within the composite drive, the drive array controller must be cognizant of both the position of the particular disk drive to be accessed as well as the physical sector location within that disk drive which corresponds to the virtual composite drive location for which access is sought. Various hardware and software implementations are well-known for performing these functions.
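By way of illustration only, the translation from a virtual composite drive location to a particular disk drive and physical sector can be sketched as follows. The sketch assumes a simple striped layout without redundancy; the function name `map_virtual_sector`, the `PhysicalLocation` type and the `sectors_per_stripe_unit` parameter are hypothetical and are not taken from any particular controller implementation.

```python
from typing import NamedTuple


class PhysicalLocation(NamedTuple):
    disk_index: int  # position of the drive within the array
    sector: int      # physical sector on that drive


def map_virtual_sector(virtual_sector: int, num_disks: int,
                       sectors_per_stripe_unit: int) -> PhysicalLocation:
    """Map a composite-drive sector to (disk, sector) for plain striping.

    Assumption: stripe units are laid out round-robin across the disks,
    with no parity or mirroring.
    """
    if num_disks <= 0 or sectors_per_stripe_unit <= 0:
        raise ValueError("array geometry must be positive")
    if virtual_sector < 0:
        raise ValueError("virtual_sector must be non-negative")
    # Which stripe unit holds the sector, and where inside that unit.
    stripe_unit, offset = divmod(virtual_sector, sectors_per_stripe_unit)
    # Stripe units rotate across the disks; each full rotation is one row.
    row, disk_index = divmod(stripe_unit, num_disks)
    return PhysicalLocation(disk_index, row * sectors_per_stripe_unit + offset)
```

With four disks and eight sectors per stripe unit, virtual sector 35 falls in the fifth stripe unit, which wraps back to the first disk on the second row.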
A significant concern relating to the mass storage of data within disk array subsystems is the possibility of data loss or corruption due to drive failure. A variety of data redundancy and recovery techniques have therefore been proposed to allow restoration of data in the event of a drive failure.
There are several "levels" of standard geometries defined in the Patterson publication. For example, a RAID level 1 system comprises one or more disks for storing data and an equal number of additional "mirror" disks for storing copies of the information written to the data disks. Additional RAID levels, such as RAID level 2, 3, 4 and 5 systems, segment the data into portions for storage across several data disks. One or more additional disks are utilized to store error check or parity information.
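As a hedged illustration of the parity information mentioned above, the following sketch computes a byte-wise exclusive-OR parity block over equally sized data blocks and uses it to rebuild one missing block. Byte-wise XOR is the parity commonly associated with RAID levels 3, 4 and 5; the function names `xor_parity` and `rebuild_missing` are hypothetical.

```python
from functools import reduce
from typing import Sequence


def xor_parity(blocks: Sequence[bytes]) -> bytes:
    """Return the byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving: Sequence[bytes], parity: bytes) -> bytes:
    """Recover the single missing data block from the survivors and parity."""
    return xor_parity(list(surviving) + [parity])
```

Because XOR is its own inverse, combining the parity block with all surviving data blocks yields the contents of the one block that was lost.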
RAID storage subsystems typically utilize a control module that shields the user or host system from the details of managing the redundant array. The controller makes the subsystem appear to the host computer as a single, highly reliable, high capacity disk drive. In fact, the RAID controller may distribute the host computer system supplied data across a plurality of the small independent drives with redundancy and error checking information so as to improve subsystem reliability. Frequently RAID subsystems provide large cache memory structures to further improve the performance of the RAID subsystem. The cache memory is associated with the control module such that the storage blocks on the disk array are mapped to blocks in the cache. This mapping is also transparent to the host system. The host system simply requests blocks of data to be read or written and the RAID controller manipulates the disk array and cache memory as required.
To further improve reliability, it is known in the art to provide redundant control modules to reduce the failure rate of the subsystem due to control electronics failures. In some redundant architectures, pairs of control modules are configured such that they control the same physical array of disk drives. A cache memory module is associated with each of the redundant pair of control modules.
The redundant control modules communicate with one another to assure that the cache modules are synchronized. Typically, the redundant pair of control modules communicate at their power-on initialization (or after a reset operation). While the redundant control modules complete their communications to assure synchronization of the cache modules, the RAID storage subsystem is unavailable with respect to completing host computer requests. If the cache modules are "out of sync", the time required to restore synchronization could be significant. In addition, a failure of one of the redundant pair of control modules would further extend the time during which the RAID storage subsystem would be unavailable. Manual (operator) intervention could be required to replace a defective redundant control module in order for the RAID subsystem to begin processing of host computer requests.
During normal operation, dual controllers operate in a write back mode. Write back mode refers to the process of writing data in a receiving controller's cache and then writing the data in the other controller's cache before returning a completion status to the host. For dual controller systems with battery-backed data cache, when a controller fails, a replacement controller is installed. The replacement controller needs to recondition its battery before the write back cache mode is reinitiated, because until then its cache is not protected from a power failure. During the reconditioning period, the controller operates in a write through cache mode, so the data is committed to the storage media before a completion status is returned to the host. Once the battery is reconditioned, the controller can change back to the write back cache mode. However, traditional replacement controllers need to condition the batteries attached to them for many hours before write back cache operations can take place. Further, the write back cache mode returns a completion status to the host much faster than the write through cache mode. Thus, the write through cache mode causes a decrease in performance for the many hours it takes to recondition the battery.
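The difference between the two cache modes can be sketched as follows. This is a minimal model under stated assumptions, not a controller implementation: the `Controller` class, its `partner` attribute, the `dirty` set and the use of a dictionary to stand in for the storage media are all hypothetical.

```python
from enum import Enum


class CacheMode(Enum):
    WRITE_BACK = "write back"
    WRITE_THROUGH = "write through"


class Controller:
    def __init__(self, name, disk):
        self.name = name
        self.disk = disk      # dict standing in for the shared storage media
        self.cache = {}       # block number -> data held in cache memory
        self.dirty = set()    # cached blocks not yet committed to the media
        self.partner = None   # the other controller of the redundant pair
        self.mode = CacheMode.WRITE_BACK

    def write(self, block, data):
        """Handle a host write and return a completion status."""
        self.cache[block] = data
        if self.mode is CacheMode.WRITE_BACK and self.partner is not None:
            # Write back: mirror into the partner's cache, then report
            # completion without waiting for the storage media.
            self.partner.cache[block] = data
            self.dirty.add(block)
            return "complete"
        # Write through: commit to the media before reporting completion.
        self.disk[block] = data
        return "complete"

    def flush(self):
        """Commit all dirty cached blocks to the storage media."""
        for block in sorted(self.dirty):
            self.disk[block] = self.cache[block]
        self.dirty.clear()
```

In this model a write back request completes as soon as both caches hold the data, whereas a write through request completes only after the media has been updated, which is the source of the performance difference described above.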
Reconditioning is required because a newly installed controller does not know the state of its battery. Reconditioning involves draining the battery down to zero charge, charging it fully so that the controller knows how long it takes to charge the battery, draining the battery down to zero charge again so that the controller knows how long the battery can hold the charge for the cache memory, and fully recharging the battery once more. Once these values are known, the controller allows data to be stored live in the cache, and the conservative write through cache mode is changed back to the write back cache mode, which provides better performance. The battery reconditioning process generally takes 12-16 hours. During this time, although the system has two controllers, their performance is severely degraded because the write through cache mode is used, so that all of the data is written to the physical media on the disk before a successful status is returned to the host.
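The four-step reconditioning sequence can be sketched as follows. The `SimulatedBattery` class, its linear charge and discharge model, and the hour figures used with it are assumptions made purely for illustration; a real controller would measure elapsed time against actual battery hardware.

```python
from dataclasses import dataclass


@dataclass
class SimulatedBattery:
    charge_hours: float  # hours from empty to full (hypothetical figure)
    hold_hours: float    # hours a full charge sustains the cache (hypothetical)
    level: float = 0.5   # actual charge fraction, unknown to a new controller

    def drain(self) -> float:
        """Drain to zero charge; return the hours it took."""
        elapsed = self.level * self.hold_hours
        self.level = 0.0
        return elapsed

    def charge(self) -> float:
        """Charge to full; return the hours it took."""
        elapsed = (1.0 - self.level) * self.charge_hours
        self.level = 1.0
        return elapsed


def recondition(battery: SimulatedBattery) -> dict:
    """Run the drain, charge, drain, recharge sequence and report timings."""
    total = battery.drain()             # 1. drain to zero from an unknown state
    measured_charge = battery.charge()  # 2. full charge, timed
    measured_hold = battery.drain()     # 3. full drain, timed
    total += measured_charge + measured_hold
    total += battery.charge()           # 4. final full recharge
    return {
        "charge_hours": measured_charge,
        "hold_hours": measured_hold,
        "total_hours": total,
    }
```

Only the second and third steps yield measurements; the first and fourth exist to bring the battery to a known state before and after measuring, which is why the whole procedure takes so long.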
It can be seen that there is a need for a method and apparatus for providing battery-backed immediate write back cache for an array of disk drives in a computer system.
To overcome the limitations in the prior art described above, and to overcome other limitations that will become apparent upon reading and understanding the present specification, the present invention discloses a method and apparatus for providing battery-backed immediate write back cache for an array of disk drives in a computer system.
The present invention solves the above-described problems by enabling cooperation between a new replacement controller and the survivor controller so that write back cache operation starts immediately and is not dependent on the battery condition in the replacement controller. Protection of the data through a single point of failure is maintained. Thus, there is no waiting period and the higher speed write back cache method can be used as soon as the controller is ready to accept commands.
A method in accordance with the principles of the present invention includes: a) operating a first and a second controller in a write back cache mode when the first and second controller are in a normal state and each controller comprises a processor, a battery backup and a cache memory, b) switching the second controller to a write through cache mode when the first controller fails, c) replacing the first controller that failed with a replacement controller, d) exchanging state information regarding the battery backup for the replacement controller and the second controller, e) determining whether either battery backup of the replacement controller and the second controller meets a predetermined threshold and f) running the replacement controller and the second controller in a write back cache mode when the battery backup for the replacement controller or the second controller meets the predetermined threshold.
Other embodiments of a method in accordance with the principles of the invention may include alternative or optional additional aspects. One such aspect of the present invention is that the method further includes: g) running the replacement controller and the second controller in a write through cache mode when the battery backups for both the replacement controller and the second controller fail to meet the predetermined threshold.
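The mode decision of steps e), f) and g) can be sketched as follows. The specification refers only to a "predetermined threshold"; representing battery state as a charge fraction, the default value of 0.8, and the names `BatteryState`, `meets_threshold` and `select_cache_mode` are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CacheMode(Enum):
    WRITE_BACK = "write back"
    WRITE_THROUGH = "write through"


@dataclass(frozen=True)
class BatteryState:
    # Charge as a fraction of capacity; None when the controller does not
    # yet know the state of its battery (for example, before reconditioning).
    charge_fraction: Optional[float]


def meets_threshold(state: BatteryState, threshold: float) -> bool:
    """An unknown battery state never meets the threshold."""
    return state.charge_fraction is not None and state.charge_fraction >= threshold


def select_cache_mode(replacement: BatteryState, survivor: BatteryState,
                      threshold: float = 0.8) -> CacheMode:
    """Write back if EITHER battery meets the threshold, else write through."""
    if meets_threshold(replacement, threshold) or meets_threshold(survivor, threshold):
        return CacheMode.WRITE_BACK
    return CacheMode.WRITE_THROUGH
```

Because the test is a logical OR over the two batteries, a replacement controller whose own battery state is still unknown can run in write back mode immediately, provided the surviving controller's battery is adequate.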
Another aspect of the present invention is that the method further includes: h) determining whether the battery backup for the replacement controller needs reconditioning and i) reconditioning the battery backup for the replacement controller when the battery backup for the replacement controller needs reconditioning.
Another aspect of the present invention is that the method further includes switching the operation of the second controller to a write through cache mode when the replacement controller fails before the reconditioning for the battery backup for the replacement controller is completed.
Another aspect of the present invention is that the method further includes reinstalling the failed replacement controller and repeating steps a) through i).
Another aspect of the present invention is that the method further includes switching the operation of the replacement controller to a write through cache mode when the second controller fails before the reconditioning for the battery backup for the replacement controller is completed.
Another aspect of the present invention is that the method further includes installing a second replacement controller for the failed second controller and repeating steps a) through i).
In another embodiment of the present invention, a disk array system is provided. The disk array system includes a first and a second controller for receiving data write operations from a host, the first and second controller each further comprising a processor, a battery backup and a cache memory, the cache memory being maintained by the battery backup and an array of disks coupled to the first and second controller, wherein the processors for the first and second controller are each configured with a state machine for cooperatively sharing a state for their associated battery backup and for storing a state of the battery backup for the other controller so that the state machine for each controller knows the state of its own battery backup and the state of the battery backup of the other controller, wherein the processor operates the first and the second controller in a write back cache mode when the first and second controller are in a normal state, the processor for the second controller switching to a write through cache mode when the first controller fails and a replacement controller is installed for the failed first controller, wherein the replacement controller and the second controller share state information, and wherein the processors for the second and replacement controller determine whether either battery backup of the replacement controller and the second controller meets a predetermined threshold and run the replacement controller and the second controller in a write back cache mode when the battery backup for the replacement controller or the second controller meets the predetermined threshold.
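One possible shape for the cooperating state machines is sketched below. It is a simplified model under stated assumptions: battery state is reduced to an optional charge fraction, the threshold value is arbitrary, and the class and method names (`ControllerStateMachine`, `exchange_state`, `on_peer_failed`) are hypothetical rather than drawn from the disclosed apparatus.

```python
from enum import Enum
from typing import Optional


class CacheMode(Enum):
    WRITE_BACK = "write back"
    WRITE_THROUGH = "write through"


class ControllerStateMachine:
    def __init__(self, name: str, own_charge: Optional[float],
                 threshold: float = 0.8):
        self.name = name
        self.own_charge = own_charge  # None means battery state unknown
        self.peer_charge: Optional[float] = None
        self.peer_present = False
        self.threshold = threshold
        # Without a known peer, stay in the conservative mode.
        self.mode = CacheMode.WRITE_THROUGH

    def exchange_state(self, peer: "ControllerStateMachine") -> None:
        """Share battery state so each side stores its own and the peer's."""
        self.peer_charge, peer.peer_charge = peer.own_charge, self.own_charge
        self.peer_present = peer.peer_present = True
        self._update_mode()
        peer._update_mode()

    def on_peer_failed(self) -> None:
        """Fall back to write through when the other controller fails."""
        self.peer_present = False
        self.peer_charge = None
        self.mode = CacheMode.WRITE_THROUGH

    def _update_mode(self) -> None:
        charges = (self.own_charge, self.peer_charge)
        adequate = any(c is not None and c >= self.threshold for c in charges)
        if self.peer_present and adequate:
            self.mode = CacheMode.WRITE_BACK
        else:
            self.mode = CacheMode.WRITE_THROUGH
```

In this sketch both controllers enter write back mode as soon as state has been exchanged and either battery is adequate, and the survivor of any later failure reverts to write through mode.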
Another aspect of the present invention is that the processors for the replacement controller and the second controller run the replacement controller and the second controller in a write through cache mode when the battery backups for both the replacement controller and the second controller fail to meet the predetermined threshold.
Another aspect of the present invention is that the processor for the replacement controller determines whether the battery backup for the replacement controller needs reconditioning and begins reconditioning the battery backup for the replacement controller when the battery backup for the replacement controller needs reconditioning.
Another aspect of the present invention is that the processor for the second controller switches the operation of the second controller to a write through cache mode when the replacement controller fails before the reconditioning for the battery backup for the replacement controller is completed.
Another aspect of the present invention is that the processor for the replacement controller switches the operation of the replacement controller to a write through cache mode when the second controller fails before the reconditioning for the battery backup for the replacement controller is completed.
In another embodiment of the present invention, an article of manufacture is provided. The article of manufacture includes a program storage medium readable by a computer, the medium tangibly embodying one or more programs of instructions executable by the computer to perform a method for operating a first and a second controller within a disk array system, wherein each controller comprises a processor, a battery backup and a cache memory, the method includes operating a first and a second controller in a write back cache mode when the first and second controller are in a normal state and each controller comprises a processor, a battery backup and a cache memory, switching the second controller to a write through cache mode when the first controller fails, replacing the first controller that failed with a replacement controller, exchanging state information regarding the battery backup for the replacement controller and the second controller, determining whether either battery backup of the replacement controller and the second controller meets a predetermined threshold and running the replacement controller and the second controller in a write back cache mode when the battery backup for the replacement controller or the second controller meets the predetermined threshold.
Another aspect of the present invention is that the method further includes running the replacement controller and the second controller in a write through cache mode when the battery backups for both the replacement controller and the second controller fail to meet the predetermined threshold.
Another aspect of the present invention is that the method further includes determining whether the battery backup for the replacement controller needs reconditioning and reconditioning the battery backup for the replacement controller when the battery backup for the replacement controller needs reconditioning.
Another aspect of the present invention is that the method further includes switching the operation of the second controller to a write through cache mode when the replacement controller fails before the reconditioning for the battery backup for the replacement controller is completed.
Another aspect of the present invention is that the method further includes reinstalling the failed replacement controller and repeating steps a) through i).
Another aspect of the present invention is that the method further includes switching the operation of the replacement controller to a write through cache mode when the second controller fails before the reconditioning for the battery backup for the replacement controller is completed.
Another aspect of the present invention is that the method further includes installing a second replacement controller for the failed second controller and repeating steps a) through i).
These and various other advantages and features of novelty which characterize the invention are pointed out with particularity in the claims annexed hereto and form a part hereof. However, for a better understanding of the invention, its advantages, and the objects obtained by its use, reference should be made to the drawings which form a further part hereof, and to accompanying descriptive matter, in which there are illustrated and described specific examples of an apparatus in accordance with the invention.