Data is one of the most important assets of an organization, especially as organizations rely more and more on data processing systems for their daily operations. Any loss of data, or even loss of access to the data, is therefore potentially very costly. For example, an hour of downtime for a system handling brokerage operations has been estimated to cost eight million dollars.
Current methods for preventing data loss include using RAID (redundant arrays of inexpensive disks). Using RAID protection alone is, however, neither sufficient nor cost-effective. Furthermore, industry and technology trends (e.g., building cost-effective storage systems with low-end SATA disks) are such that increasingly higher degrees of redundancy are needed, which is costly in terms of both dollars and performance. RAID protection is therefore typically augmented by periodically copying the data onto a secondary system, such as a tape library, in a process referred to as backup. When the primary system fails, the data can be retrieved from the secondary system through a process called restore. If the data is copied to a system that is geographically separated from the primary system, the data will be available to allow the organization to continue its business even after a disaster at the primary site. This is usually referred to as remote copying or mirroring.
A straightforward approach for backup and restore is to blindly (i.e., without considering the block contents) perform a block-by-block copy from the primary system to the secondary system and vice versa. This, however, results in a great deal of unnecessary data copying, which wastes processing and network bandwidth. In addition, backup and restore have to be performed on entire volumes of data. As both the retention period and the amount of data stored grow, such an approach becomes increasingly impractical.
An alternative is to perform the backup at the file level, in which case the system knows when a file was last updated and can choose to back up only those files that have been updated since the last backup. Backing up only the updated files is called incremental backup. File-level backup also makes it possible to selectively back up and restore files. Backing up only the updated files does not, however, work well for important applications such as databases that store data in very large files. This is because an entire file is transferred to the secondary system even when only a single byte of that file has changed. Ideally, we want the system to perform "true" incremental backup, copying to the secondary system only the portions of the data that have actually changed. Detecting the changed portions is, however, difficult and requires substantial processing and I/O. For example, the system would have to keep previous versions of the data, or summaries (e.g., hash values) of previous versions, and perform comparisons.
Besides the problems outlined above, current approaches for data protection (e.g., backup and remote copy) work independently of one another, often performing a lot of redundant processing. More importantly, they do not offer a holistic or integrated way to manage and reduce data loss. For instance, RAID protects all the data in an array to the same extent even though some data is more important (e.g., more recently written) and some has already been backed up. This is clearly not optimal from the overall perspective of reducing data loss.
There remains, therefore, a great need for a holistic approach to reliably store data and to efficiently perform true incremental backup and remote copy. The present invention satisfies this need by preferentially handling data that has yet to be copied to a secondary system.
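For illustration only, the summary-comparison approach to detecting changed portions mentioned above might be sketched as follows. This is a minimal sketch of the conventional technique the background describes, not the method of the present invention. The block size, the function names, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4096  # hypothetical block size in bytes (an assumption)


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one SHA-256 digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(previous_hashes: List[str], data: bytes,
                   block_size: int = BLOCK_SIZE) -> Dict[int, bytes]:
    """Return {block_index: block_bytes} for blocks that differ from the summary.

    A block counts as changed if it has no stored digest (new data) or its
    digest no longer matches. Truncation of the data is not handled here.
    """
    changed = {}
    for index, digest in enumerate(block_hashes(data, block_size)):
        if index >= len(previous_hashes) or previous_hashes[index] != digest:
            start = index * block_size
            changed[index] = data[start:start + block_size]
    return changed


# Usage: three zero-filled blocks, then a single byte modified in the second.
version_1 = bytes(3 * BLOCK_SIZE)
summary = block_hashes(version_1)      # kept from the previous backup
version_2 = bytearray(version_1)
version_2[5000] = 1                    # byte 5000 lies in block index 1
delta = changed_blocks(summary, bytes(version_2))
print(sorted(delta))                   # indices of blocks to copy
```

Note that even in this sketch every block must be read and hashed on each backup in order to find the single block that changed, which is the substantial processing and I/O cost referred to above.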