Business enterprises rely increasingly on computer systems that allow the sharing of data across a business enterprise. The data storage systems that have evolved to store large amounts of data typically are critically important to an enterprise. As a result, the disruption or failure of the data storage system can cripple operation of the entire enterprise.
Data used by applications running on computer systems are typically stored on primary storage devices (e.g., disks) and secondary storage devices (e.g., tape and cheaper disk drives) for protection. As these applications run, the data changes as a result of business operations. Information technology departments typically deal with a number of problems concerning data storage systems. Generally, however, these fall into two broad categories: hardware failure and data corruption.
The business significance of data storage systems and the importance of the integrity of the data that they store and maintain have generated a correspondingly high interest in systems that provide data protection and data recovery. At present, mirroring and snapshot technology are the two primary approaches available to enterprises interested in data recovery. In the event of a system failure, data recovery allows an enterprise to recover data from a prior point in time and to resume operations with uncorrupted data. Once the timing of the hardware failure or corrupting event, or events, is identified, recovery may be achieved by going back to a point in time when the stored data is known to be uncorrupted.
Typically, data storage devices include individual units of storage, such as cells, blocks, sectors, etc. Read commands generated by a host system (used generally to mean one or more host systems) direct the information system to provide the host with the data specified in the request. Traditionally, the information is specified based on its location within the data storage device, e.g., one or more specific blocks. Write commands are executed in a similar fashion. For example, data is written to a specific unit of storage in response to an I/O request generated by a host system. A location identifier provides direct association between the data and the unit of storage in which it is stored. Thereafter, the location identifier is employed to read and update the data.
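The location-based addressing described above can be illustrated with a minimal sketch. The class name `BlockDevice`, its methods, and the four-byte block size are hypothetical and chosen only for illustration; they are not taken from any particular storage system.

```python
class BlockDevice:
    """Toy storage device in which data is addressed only by block number."""

    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size
        # Every unit of storage starts out zero-filled.
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]

    def write(self, block_number, data):
        # The location identifier (block_number) selects the unit of storage.
        if len(data) != self.block_size:
            raise ValueError("data must fill exactly one block")
        self.blocks[block_number] = bytes(data)

    def read(self, block_number):
        # The same location identifier is later used to read the data back.
        return self.blocks[block_number]


device = BlockDevice(num_blocks=8)
device.write(3, b"AAAA")
device.write(3, b"BBBB")   # update in place; the earlier contents are lost
current = device.read(3)
```

Note that in this sketch the block holds only its latest contents; nothing records what was stored at the same location at an earlier time.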
On the hardware failure side of the data protection problem, vendors provide a few different mechanisms to help prevent hardware failure from affecting application availability and performance, for example, disk mirroring. This is a mechanism where multiple disks are grouped together to store the same information, allowing a disk to fail without preventing the application from retrieving the data. In a typical setup, the user will allocate 1–4 mirror disks for each application data disk. Each write request that is sent to the application primary disk is also sent to the mirror copies, so that the user actually has N (where N is between 2 and 5 typically) disks with the exact same data on them. As a result, the mirroring approach provides at least one complete backup of the then current data. Thus, if a disk failure occurs, the user still has application data residing on the other mirror disks. A redundant array of independent disks (“RAID”) provides one example of a mirroring system.
However, mirroring is ineffective when data corruption occurs. Data corruption comes in many forms, but it generally is recognized when the user's application stops functioning properly as a result of data being written to the disk. There are many possible sources of data corruption such as a failed attempt to upgrade the application, a user accidentally deleting key information, a rogue user purposely damaging the application data, computer viruses, and the like. Regardless of the cause, mirroring actually works against the user who has experienced data corruption because mirroring replicates the bad data to all the mirrors simultaneously. Thus, all copies of the data are corrupted.
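The behavior described in the two preceding paragraphs, in which mirroring survives a disk failure but replicates corruption, can be sketched as follows. The class `MirroredVolume` and its parameters are hypothetical names used only for this illustration, and disks are modeled as simple in-memory lists.

```python
class MirroredVolume:
    """Toy mirror set: one primary disk plus num_mirrors identical copies."""

    def __init__(self, num_mirrors, num_blocks):
        self.disks = [[None] * num_blocks for _ in range(1 + num_mirrors)]

    def write(self, block_number, data):
        # Every write sent to the primary is also sent to each mirror.
        for disk in self.disks:
            disk[block_number] = data

    def read(self, block_number, failed=frozenset()):
        # Serve the read from the first disk that has not failed.
        for index, disk in enumerate(self.disks):
            if index not in failed:
                return disk[block_number]
        raise OSError("all disks in the mirror set have failed")


volume = MirroredVolume(num_mirrors=2, num_blocks=4)
volume.write(0, "ledger v1")

# Hardware failure of the primary: a mirror still supplies the data.
after_failure = volume.read(0, failed={0})

# Data corruption: the bad write reaches every copy at once.
volume.write(0, "corrupted")
copies = [disk[0] for disk in volume.disks]
```

After the second write, every entry in `copies` holds the corrupted value, so no disk in the set retains the earlier good data.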
Additionally, because the disks are continuously updated, a backup of historical data, i.e., a snapshot of the data present in the data storage device at a past time T, can only be created if the system is instructed to save the backup at or prior to time T. Thus, at time T+1 the system is unable to provide a backup of the data current at time T. Further, each unit of storage is saved regardless of whether the data stored in it is unchanged since the time that the previous backup was made. Such an approach is inefficient and costly because it increases the storage capacity required to back up the data storage device at multiple points in time. Also, the mirroring approach becomes less efficient and more error-prone when employed with larger data storage systems because large systems span hundreds of disks and the systems cannot assure that each disk is backed up at the same point in time. Consequently, complex and error-prone processes are employed in an attempt to create a concurrent backup for the entire data storage system.
As described above, snapshots, also referred to as single point in time images, are frequently created in conjunction with a mirroring system. Alternatively, a snapshot approach may be employed as an independent data storage and recovery method. In the snapshot approach, the user selects periodic points in time when the current contents of the disk will be copied and written to either a different storage device or an allocated set of storage units within the same storage device. This approach suffers, however, from the same shortcomings as mirroring; that is, all snapshots are created at the then current point in time, either in conjunction with the user's request or as a result of a previously scheduled instruction to create a snapshot of the stored data. Whether alone or in combination, neither data mirrors nor data snapshots allow the user to employ hindsight to recreate a data set that was current at some past time. Because the data stored in each of the storage units is not associated with an individual time identifier, a user is unable to go back to view data from a particular point in time unless coincidentally a historical backup was previously created for that time. There is no way to restore the data at an intermediate time, for example time (T−1), between the current time (T) and the time that the last backup disk was saved (for example T−2). Also, generation of single point in time images generally is a lengthy process. Image generation time has become even more significant as the storage capacity and data set sizes have increased.
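The limitation described above, that data can be restored only to a time at which a snapshot happened to be taken, can be sketched as follows. The class `SnapshotStore` and the integer times 1, 2, and 3 (standing for T−2, T−1, and T) are assumptions made only for this illustration.

```python
class SnapshotStore:
    """Toy store that keeps full copies of its data at requested times only."""

    def __init__(self):
        self.data = {}
        self.snapshots = {}   # snapshot time -> full copy of the data

    def write(self, key, value):
        # Writes carry no time identifier; only the latest value is kept.
        self.data[key] = value

    def take_snapshot(self, time):
        # A snapshot captures the contents current at the moment it is taken.
        self.snapshots[time] = dict(self.data)

    def restore(self, time):
        if time not in self.snapshots:
            raise KeyError(f"no snapshot was taken at time {time}")
        return dict(self.snapshots[time])


store = SnapshotStore()
store.write("balance", 100)
store.take_snapshot(time=1)     # scheduled backup at T-2
store.write("balance", 250)     # valid update at T-1, no snapshot taken
store.write("balance", -999)    # corrupting write at T

try:
    store.restore(time=2)       # attempt to recover the state at T-1
    intermediate_available = True
except KeyError:
    intermediate_available = False

recovered = store.restore(time=1)
```

In this sketch the only recoverable state is the one saved at T−2, so the valid update made at T−1 is lost along with the corruption.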
The storage industry, as a result, has focused on providing both faster and more frequent image generation. Suppliers of data recovery systems that employ tapes have attempted to provide larger, more scalable tape libraries by increasing system capacities and the quantity of tape heads in order to allow parallel operation. Suppliers of disk-based systems have focused on how to use disk drives to provide more single point in time images with improved response times. In one approach, one of a quantity N mirror disks is brought offline at a specified time in order to create a single point in time image at that time. The approach may allow for an increased number of images provided that the quantity of mirror disks is increased sufficiently. However, this approach significantly increases the required storage capacity with each point in time. For example, for a 5 terabyte application, 30 terabytes of storage are required to support 2 standard mirror disks and 4 point in time images. Because these solutions are only attempts at fixing existing approaches, they do not provide a solution that is workable as the capacity of data storage systems continues to increase.
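The capacity figure in the example above can be reproduced with a short calculation. The sketch assumes that each standard mirror disk and each point in time image holds a full copy of the application data, which is the reading under which the stated total of 30 terabytes results; the function name `required_storage_tb` is hypothetical.

```python
def required_storage_tb(application_tb, mirror_disks, point_in_time_images):
    """Total capacity when every mirror and every image is a full copy."""
    full_copies = mirror_disks + point_in_time_images
    return application_tb * full_copies


# 5 TB application, 2 standard mirror disks, 4 point in time images.
total_tb = required_storage_tb(5, mirror_disks=2, point_in_time_images=4)

# Each additional point in time image adds another full copy.
with_one_more_image_tb = required_storage_tb(5, mirror_disks=2, point_in_time_images=5)
```

Under this assumption the required capacity grows linearly with the number of retained images, regardless of how little data changed between them.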