The invention relates generally to data recovery in enterprise computer systems and more particularly to determining when a database object has changed, determining if the change was authorized or inadvertent, and identifying a non-corrupt point-in-time copy of data to recover.
Data corruption is an important problem for customers of enterprise computer systems. Causes of data corruption may include one or more of application software bugs, human configuration errors, and physical media errors caused by disks, caching, controller firmware bugs and/or multi-path drivers. Another category of errors is inadvertent writes to a device, which may result from Storage Area Network (SAN) configuration errors, unintentional writes by a privileged user, silent disk errors, or damage caused by virus attacks and malicious worms. This category of errors will be referred to as “inadvertent writes” hereinafter.
Errors caused by software bugs, human configuration errors, and physical media errors (collectively referred to as “logical errors”) can be detected and corrected by referring to the logs available at the application tier, the middleware tier, and the storage tier. However, inadvertent writes cannot be detected and corrected in the same way using available logs. Storage subsystem hardware and storage management software have long provided the ability to make “Point-in-Time” copies of data managed by the systems. To recover from logical errors at a production server using Point-in-Time copy techniques, stored volume copies must be accessed, and the data as it existed before the error occurred must be used to restore the production system to its state before the error.
The primary methods of creating a Point-in-Time copy are the Split Mirror method and the Copy-on-Write (COW) method. In the Split Mirror method, a traditional RAID-1 mirror is removed from the configuration of the production system and used for some other purpose. In the Copy-on-Write (COW) method, at the time the Point-in-Time copy is initiated, a bitmap is created to manage regions of the storage (sectors, tracks, or blocks, for example). If a region of the original volume is to be modified, the original contents of the region are first copied to the storage region allocated to the Point-in-Time copy. There are also variations of the Copy-on-Write technique that modify the original volume in order to reduce the number of I/O operations required to preserve the integrity of the Point-in-Time copy, such that only changed blocks are copied.
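By way of illustration only, the Copy-on-Write method described above may be sketched as follows. The sketch assumes a volume modeled as an in-memory list of regions; the class name CowSnapshot and its methods are hypothetical names chosen for this example and do not correspond to any particular storage product.

```python
class CowSnapshot:
    """Illustrative Copy-on-Write Point-in-Time copy of a volume.

    Assumption: the volume is a Python list in which each element
    stands for one region (sector, track, or block).
    """

    def __init__(self, volume):
        self.volume = volume                 # live production volume
        self.copied = [False] * len(volume)  # bitmap: True once a region is preserved
        self.saved = {}                      # storage allocated to the Point-in-Time copy

    def write(self, region, data):
        # Before the first modification of a region, copy its original
        # contents to the storage allocated to the Point-in-Time copy.
        if not self.copied[region]:
            self.saved[region] = self.volume[region]
            self.copied[region] = True
        self.volume[region] = data

    def read_snapshot(self, region):
        # Preserved regions come from the copy; unmodified regions are
        # still read from the original volume.
        if self.copied[region]:
            return self.saved[region]
        return self.volume[region]


volume = ["a", "b", "c"]
snapshot = CowSnapshot(volume)
snapshot.write(1, "B")
snapshot.write(1, "BB")  # second write to the same region triggers no further copy
point_in_time_view = [snapshot.read_snapshot(i) for i in range(len(volume))]
```

In this sketch only the first write to a region incurs the extra copy, so the Point-in-Time view continues to show the contents as they existed when the copy was initiated while the production volume reflects the latest writes.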
A primary purpose of the various Point-in-Time copy technologies is to provide a means to quickly recover from logical errors caused by software bugs or human error. One of the most common uses of Point-in-Time copies is to make backups of enterprise databases. The controller periodically takes a snapshot of the available volumes. At the time of data corruption, the database administrator needs to decide on the “optimal” stored snapshot to use for data recovery. Determining which snapshot is optimal is typically a trade-off between the amount of data lost (a Recovery Point Objective or “RPO”) and the time required to bring the production system back on-line after a failure (a Recovery Time Objective or “RTO”).
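As a simplified illustration of the snapshot selection described above, the following sketch picks the most recent snapshot taken before the corruption occurred, which minimizes data loss. It assumes that the time of corruption is already known and that every earlier snapshot is non-corrupt; neither assumption necessarily holds in practice, and the sketch does not model recovery time. The function name latest_snapshot_before is hypothetical.

```python
from datetime import datetime


def latest_snapshot_before(snapshot_times, corruption_time):
    """Return the newest snapshot time strictly earlier than the corruption.

    Returns None when no snapshot predates the corruption.
    Assumption: snapshots taken before corruption_time are non-corrupt.
    """
    candidates = [t for t in snapshot_times if t < corruption_time]
    return max(candidates) if candidates else None


# Hypothetical hourly snapshots and a corruption detected at 10:30.
snapshot_times = [datetime(2024, 1, 1, hour) for hour in (8, 9, 10, 11)]
corruption_time = datetime(2024, 1, 1, 10, 30)

chosen = latest_snapshot_before(snapshot_times, corruption_time)
data_loss_window = corruption_time - chosen  # updates made in this window are lost
```

With these example values the 10:00 snapshot is chosen and the data loss window is thirty minutes.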
As noted above, errors caused by software bugs, human configuration errors, and physical media errors can be detected and corrected by referring to the logs available at the application, middleware, and storage tiers using any of the existing Point-in-Time copy techniques.