1. Field of the Invention
The disclosed invention relates to RAID array controllers, and more particularly to a method and computer program product for backing up and restoring online system information.
2. Background Art
There are many applications, particularly in a business environment, where there are storage needs beyond those that can be fulfilled by a single hard disk, regardless of its size, performance, or quality level. Many businesses cannot afford to have their systems go down for even an hour in the event of a disk failure. They need large storage subsystems with capacities in the terabytes. In addition, they want to be able to insulate themselves from hardware and software failures to any extent possible. Some people working with multimedia files need fast data transfer exceeding what current drives can deliver, without spending a fortune on specialty drives. These situations require that the traditional “one hard disk per system” model be set aside and a new system employed. This technique is called Redundant Arrays of Inexpensive Disks or RAID. (“Inexpensive” is sometimes replaced with “Independent,” but the former term is the one that was used when the term “RAID” was first coined by the researchers at the University of California at Berkeley, who first investigated the use of multiple-drive arrays in 1987. See D. Patterson, G. Gibson, and R. Katz, “A Case for Redundant Arrays of Inexpensive Disks (RAID),” Proceedings of ACM SIGMOD '88, pages 109-116, June 1988.)
The fundamental structure of RAID is the array. An array is a collection of drives that is configured, formatted, and managed in a particular way. The number of drives in the array, and the way that data is split between them, is what determines the RAID level, the capacity of the array, and its overall performance and data protection characteristics.
It should be understood that RAID arrays are typically formed by partitioning one or more disks. Each partitioned area on the disk may be referred to as a storage device. One or more storage devices constitute the RAID array. Mirrored arrays are typically formed on two or more separate and distinct disks.
An array appears to the operating system to be a single logical hard disk. RAID employs the technique of striping, which involves partitioning each drive's storage space into units ranging from a sector (512 bytes) up to several megabytes. The stripes of all the disks are interleaved and addressed in order.
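By way of illustration only, the interleaved addressing of stripes may be sketched in Python as follows. The function name `locate_block` and its parameters are hypothetical and do not come from any particular controller. The sketch assumes that all drives are the same size and that stripe units rotate across the drives in simple round-robin order.

```python
from typing import Tuple


def locate_block(logical_block: int, num_drives: int,
                 blocks_per_stripe_unit: int) -> Tuple[int, int]:
    """Map a logical block number to (drive index, block offset on that drive).

    Hypothetical helper: assumes round-robin placement of stripe units.
    """
    stripe_unit = logical_block // blocks_per_stripe_unit    # stripe unit counted across the whole array
    offset_in_unit = logical_block % blocks_per_stripe_unit  # position inside that unit
    drive = stripe_unit % num_drives                         # units rotate across the drives
    row = stripe_unit // num_drives                          # how many units precede it on that drive
    return drive, row * blocks_per_stripe_unit + offset_in_unit
```

With three drives and four blocks per stripe unit, logical blocks 0 through 3 fall on the first drive, 4 through 7 on the second, 8 through 11 on the third, and block 12 returns to the first drive at offset 4.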
Most modern, mid-range to high-end disk storage systems are arranged as RAID configurations. A number of RAID levels are known. JBOD stands for Just a Bunch of Drives. The controller treats one or more disks or unused space on a disk as a single array. JBOD provides the ability to concatenate storage from various drives regardless of the size of the space on those drives. JBOD is useful in scavenging space on drives unused by other arrays. JBOD does not provide any performance or data redundancy benefits.
RAID 0, or striping, provides the highest performance but no data redundancy. Data in the array is striped (i.e. distributed) across several physical drives. RAID 0 arrays are useful for holding information such as the operating system paging file where performance is extremely important but redundancy is not.
RAID 1, or mirroring, mirrors the data stored in one physical drive to another. RAID 1 is useful when there are only a small number of drives available and data integrity is more important than storage capacity.
RAID 1n, or n-way mirroring, mirrors the data stored in one hard drive to several hard drives. This array type provides superior data redundancy because there are three or more copies of the data, and it is useful when creating backup copies of an array. It is, however, expensive in both performance and the amount of disk space required to create the array.
RAID 10 is also known as RAID (0+1) or striped mirror sets. This array type combines mirrors and stripe sets. RAID 10 allows multiple drive failures, up to 1 failure in each mirror that has been striped. This array type offers better performance than a simple mirror because of the extra drives. RAID 10 requires twice the disk space of RAID 0 in order to offer redundancy.
RAID 10n stripes multiple n-way mirror sets. RAID 10n allows multiple drive failures per mirror set, up to n−1 failures in each mirror set that has been striped, where n is the number of drives in each mirror set. This array type is useful in creating exact copies of an array's data using the split command. This array type offers better random read performance than a RAID 10 array, but uses more disk space.
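The failure tolerance stated above for RAID 10 and RAID 10n can be expressed as a simple check: the array remains readable so long as every mirror set retains at least one working drive. The following Python sketch is illustrative only. The function name `array_survives` and the drive identifiers are hypothetical.

```python
def array_survives(mirror_sets, failed_drives):
    """Return True if every mirror set still has at least one working drive.

    mirror_sets   -- list of lists of drive identifiers, one list per mirror set
    failed_drives -- set of identifiers of drives that have failed
    """
    return all(
        any(drive not in failed_drives for drive in mirror_set)
        for mirror_set in mirror_sets
    )
```

For two 3-way mirror sets, losing two drives in one set and one drive in the other leaves the array intact, whereas losing all three drives of a single set does not.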
RAID 5, also known as a stripe with parity, stripes data as well as parity across all drives in the array. Parity information is interspersed across the drive array. In the event of a failure, the controller can rebuild the lost data of the failed drive from the other surviving drives. This array type offers exceptional read performance as well as redundancy. In general, write performance is not an issue due to the tendency of operating systems to perform many more reads than writes. This array type requires only one extra disk to offer redundancy. For most systems with four or more disks, this is a desirable array type.
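The rebuild of a failed drive from the surviving drives may be illustrated with a short Python sketch. The text above does not name the parity function; the sketch assumes byte-wise exclusive-OR (XOR), which is the function conventionally used for RAID 5 parity. The function name `xor_blocks` is hypothetical, and the rotation of parity among the drives is omitted for brevity.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte strings."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# Three data blocks of one stripe, plus the parity computed over them.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# Suppose the drive holding data[1] fails: XOR of the survivors and the
# parity block reproduces the lost block.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Because XOR is its own inverse, the same routine serves both to compute parity when writing and to reconstruct any single missing block when reading in a degraded state.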
RAID 50 is also known as striped RAID 5 sets. Parity information is interspersed across each RAID 5 set in the array. This array type offers good read performance as well as redundancy. A 6-drive array will provide the user with 2 striped 3-drive RAID 5 sets. Generally, RAID 50 is useful in very large arrays, such as arrays with 10 or more drives.
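The disk space cost of the redundant array types described above can be compared with a small calculation. The following Python sketch is illustrative only and assumes that all drives are the same size. The function name `usable_drives` and the `group_size` parameter (the mirror width for RAID 10 and RAID 10n, or the RAID 5 set size for RAID 50) are hypothetical.

```python
def usable_drives(level, total_drives, group_size=None):
    """Return how many drives' worth of capacity holds user data."""
    if level == "0":
        return total_drives                 # striping only, no redundancy
    if level in ("1", "1n"):
        return 1                            # every drive holds the same data
    if level == "5":
        return total_drives - 1             # one drive's worth of parity
    if level in ("10", "10n", "50"):
        if not group_size or total_drives % group_size:
            raise ValueError("total_drives must be a multiple of group_size")
        groups = total_drives // group_size
        if level == "50":
            return groups * (group_size - 1)  # one parity drive's worth per RAID 5 set
        return groups                         # one drive's worth per mirror set
    raise ValueError("unsupported level: " + level)
```

For the 6-drive RAID 50 example given above, `usable_drives("50", 6, 3)` yields 4, since each of the two 3-drive RAID 5 sets gives up one drive's worth of space to parity.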
Thus RAID or Redundant Array of Independent Disks is simply several disks that are grouped together in various organizations to improve either the performance or the reliability of a computer's storage system. These disks are grouped and organized by a RAID controller.
Each conventional RAID controller has a unique way to lay out the disks and store the configuration information. On the other hand, a system controlled by a common Operating System (OS) has a known format. When users try to add a RAID controller to their system, the most important task is to migrate the existing data disks to a RAID controlled system. The common operating system configuration format to control and communicate with a disk in the system is referred to as “metadata.” The OS metadata is different from the RAID controller's unique configuration format which is also referred to as “metadata.”
In the early days of RAID, fault tolerance was provided through redundancy. However, problems occurred when a drive failed in a system that ran 24 hours a day, 7 days a week, or in a system that ran 12 hours a day but had a drive go bad first thing in the morning. The redundancy would let the array continue to function, but in a degraded state. The hard disks were typically installed deep inside the server case. This required the case to be opened to access the failed drive and replace it. In order to change out the failed drive, the other drives in the array would have to be powered off, interrupting all users of the system.
If a drive fails in a RAID array that includes redundancy, it is desirable to replace the drive immediately so the array can be returned to normal operation. There are two reasons for this: fault tolerance and performance. If the array is running in a degraded mode due to a drive failure, then until the drive is replaced, most RAID levels will be running with no fault protection at all. At the same time, the performance of the array will most likely be reduced, sometimes substantially.
An important feature that allows availability to remain high when hardware fails and must be replaced is drive swapping. Strictly speaking, the term “drive swapping” simply refers to changing one drive for another. There are several types of drive swapping available.
“Hot Swap”: A true hot swap is defined as one where the failed drive can be replaced while the rest of the system remains completely uninterrupted. This means the system carries on functioning, the bus keeps transferring data, and the hardware change is completely transparent.
“Warm Swap”: In a so-called warm swap, the power remains on to the hardware and the operating system continues to function, but all activity must be stopped on the bus to which the device is connected.
“Cold Swap”: With a cold swap, the system must be powered off before swapping out the disk drive.
Another approach to dealing with a bad drive is through the use of “hot spares.” One or more additional drives are attached to the controller and are not used by I/O operations to the array. If a failure occurs, the controller can use the spare drive as a replacement for the bad drive.
The main advantage that hot sparing has over hot swapping is that with a controller that supports hot sparing, the rebuild will be automatic. The controller detects that a drive has failed, disables the failed drive, and immediately rebuilds the data onto the hot spare. This is an advantage for anyone managing many arrays, or for systems that run unattended.
Hot sparing and hot swapping are independent but not mutually exclusive. They will work together, and often are used in that way. However, sparing is particularly important if the system does not have hot swap (or warm swap) capability. The reason is that sparing will allow the array to get back into normal operating mode quickly, reducing the time that the array must operate while it is vulnerable to a disk failure. At any time either during rebuild to the hot spare or after rebuild, the failed drive can be swapped with a new drive. Following the replacement, the new drive is usually assigned to the original array as a new hot spare.
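The sequence just described (detect the failure, disable the failed drive, rebuild onto the hot spare, and later assign the replacement drive as a new hot spare) may be sketched as follows. This Python model is illustrative only. The class name `Array` and its method names are hypothetical, and the sketch tracks only which drive occupies which role; the copying of data during the rebuild is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Array:
    members: list                               # drives serving I/O, by slot
    spares: list = field(default_factory=list)  # idle hot spares
    failed: list = field(default_factory=list)  # disabled drives awaiting a swap

    def on_drive_failure(self, drive):
        """Disable the failed drive; rebuild onto a hot spare if one exists."""
        slot = self.members.index(drive)
        self.failed.append(drive)
        if self.spares:
            self.members[slot] = self.spares.pop(0)  # spare takes over the slot
            return True                              # automatic rebuild started
        self.members[slot] = None                    # degraded until a drive is swapped in
        return False

    def on_drive_replaced(self, old_drive, new_drive):
        """Swap out a failed drive; the new drive fills a gap or becomes a spare."""
        self.failed.remove(old_drive)
        if None in self.members:
            self.members[self.members.index(None)] = new_drive
        else:
            self.spares.append(new_drive)
```

In this model, an array with a spare returns to a full set of members without operator action, and the drive that is later swapped in simply replenishes the pool of spares.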
When a RAID array disk drive goes bad, the system must make changes to the configuration settings to prevent further writes and reads to and from the bad drive. Whenever a configuration change happens, the configuration changes have to be written out to all of the disks in the RAID array.
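The propagation of a configuration change may be illustrated with the following Python sketch. It is a simplified model and not the layout used by any particular controller. The dictionary fields `generation` and `drive_state`, and the function name `mark_drive_failed`, are hypothetical. The sketch further assumes that the updated configuration is written only to the drives that remain in service, since writes to the bad drive are to be prevented.

```python
import copy


def mark_drive_failed(config, disks, bad_drive):
    """Record a drive failure and write the new configuration to each working disk.

    config -- dict with a "generation" counter and a "drive_state" mapping
    disks  -- dict mapping drive id to that drive's metadata area (a dict)
    """
    new_config = copy.deepcopy(config)
    new_config["generation"] += 1                  # lets a reader pick the newest copy
    new_config["drive_state"][bad_drive] = "failed"
    for drive_id, metadata_area in disks.items():
        if new_config["drive_state"][drive_id] != "failed":
            metadata_area["config"] = copy.deepcopy(new_config)
    return new_config
```

Keeping a counter with each copy is one possible way to tell a current copy from a stale one, such as the copy left behind on the failed drive.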
When the operating system or an application wants to access data on a hard disk before it has loaded native drivers for disk access, it traditionally employs BIOS services to do this. BIOS is the abbreviation for Basic Input/Output System. The BIOS provides basic input and output routines for communicating between the software and the peripherals such as the keyboard, screen, and the disk drive. The BIOS is built-in software that determines what a computer can do without accessing programs from a disk. The BIOS generally contains all the code required to control the keyboard, display screen, disk drives, serial communications, and a number of miscellaneous functions. While the access is not necessarily optimal, it is done through an easy-to-use interface so that minimal code can access these devices until the more optimal drivers take over.
The BIOS is typically placed on a ROM (Read Only Memory) chip that comes with the computer (it is often called a ROM BIOS). This ensures that the BIOS will always be available and will not be damaged by disk failures. It also makes it possible for a computer to boot itself.
When users perform complex tasks, they sometimes make mistakes that result in missing RAID arrays or lost data. It is very difficult to find out what happened and recover the missing arrays and data. This can be devastating to a business that has large numbers of records stored in the arrays. It is imperative that there be some way to recover the missing or lost data. Therefore, what is needed is a method and system to easily recover missing arrays and data.