Embodiments of the invention relate to the field of data storage, and in particular, to selecting a data restore point with an optimal recovery time and recovery point.
Business-critical enterprise applications suffer data loss and downtime from event failures encountered by a system associated with such applications. Data corruption is a common cause of application data loss and downtime. Data corruption may result from a data variable's value(s) becoming incorrect, deleted, or unreadable. Inconsistent value(s) may be caused by human configuration errors, physical media errors, storage controller failures, firmware errors, logical software bugs, virus attacks, or malicious worms.
A point-in-time copy of data is a copy of the state of a storage device at a given point in time. For example, storage systems take periodic (e.g., every ½ hour) snapshots or point-in-time copies of data stored on the storage system. Point-in-time copies of data are used to restore data when a primary copy of data on the storage device is lost or corrupted. A point-in-time copy of a data volume may be a logical copy of the data volume, also referred to as a snapshot, when only the changed data blocks are maintained. A point-in-time copy of a data volume can also be a physical copy of the data volume, also referred to as a clone, when a complete copy of the data volume is created on the same or a different set of physical disks.
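The distinction between a logical copy (snapshot) and a physical copy (clone) can be illustrated with a minimal Python sketch. It assumes a copy-on-write scheme, which is one common way of maintaining only the changed data blocks and is not stated in the description above. The names Volume, Snapshot, preserve, and read are hypothetical and chosen for illustration only.

```python
class Snapshot:
    """Logical copy: keeps only the original content of blocks changed later."""

    def __init__(self, volume):
        self._volume = volume
        self.saved = {}  # block index -> content at snapshot time

    def preserve(self, index, old_content):
        # Keep the first (oldest) version only; later overwrites are ignored.
        self.saved.setdefault(index, old_content)

    def read(self, index):
        # Changed blocks come from the snapshot, unchanged ones from the volume.
        if index in self.saved:
            return self.saved[index]
        return self._volume.blocks[index]


class Volume:
    """Hypothetical in-memory data volume made of fixed-size blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self._snapshots = []

    def snapshot(self):
        snap = Snapshot(self)
        self._snapshots.append(snap)
        return snap

    def clone(self):
        # Physical copy: every block is duplicated immediately.
        return list(self.blocks)

    def write(self, index, content):
        # Copy-on-write (an assumption): save the old block before overwriting.
        for snap in self._snapshots:
            snap.preserve(index, self.blocks[index])
        self.blocks[index] = content


volume = Volume(["a", "b", "c"])
snap = volume.snapshot()
clone = volume.clone()
volume.write(1, "X")
```

In this sketch the snapshot stores a single block after the write, whereas the clone holds a full copy of all three blocks from the moment it is created.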
Point-in-time copies of data are used to back up high-availability systems and enable efficient system and data recovery. A point-in-time copy of data may be used to revert to data in a previous satisfactory state to resolve a data error in the primary copy of data. System administrators currently try the most recent point-in-time copies of data for a data restore, manually one by one, until a consistent point-in-time copy of data is found. System administrators start with the latest point-in-time copy and continue to earlier point-in-time copies of data until a non-corrupt version of the data is found. Each point-in-time copy of data is tested for consistency to determine whether the point-in-time copy of data is corrupt. As a result, data restore requires repeated manual mounting and testing of each point-in-time copy until a valid point-in-time copy of data is found.
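The manual procedure described above amounts to a linear search backward in time. The following Python sketch illustrates it under stated assumptions: the PointInTimeCopy record, the find_latest_consistent_copy function, and the is_consistent callback (standing in for mounting and testing a copy) are hypothetical names, and the sample timestamps are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class PointInTimeCopy:
    copy_id: str
    created_at: int  # seconds since an arbitrary reference time


def find_latest_consistent_copy(
    copies: Iterable[PointInTimeCopy],
    is_consistent: Callable[[PointInTimeCopy], bool],
) -> Tuple[Optional[PointInTimeCopy], int]:
    """Test copies from newest to oldest; return the first consistent one
    together with the number of copies that had to be tested."""
    tested = 0
    for copy in sorted(copies, key=lambda c: c.created_at, reverse=True):
        tested += 1  # each test stands for one manual mount-and-check cycle
        if is_consistent(copy):
            return copy, tested
    return None, tested


# Hypothetical example: copies every 30 minutes, corruption introduced at t=3000.
copies = [PointInTimeCopy(f"copy-{i}", i * 1800) for i in range(4)]
corruption_time = 3000
found, tested = find_latest_consistent_copy(
    copies, lambda c: c.created_at < corruption_time
)
```

In this example the two most recent copies are tested and rejected before the copy taken at t=1800 is accepted, so three mount-and-test cycles are needed. The number of cycles grows with the time elapsed between the corrupting event and its detection.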
System administrators may also manually review event logs to determine a root cause of data corruption and manually select a point-in-time copy for recovery based on the root cause. For example, various components (e.g., a storage controller, a server's operating system) in an end-to-end system associated with a point-in-time copy of data log events in event logs. Manual examination of event logs typically requires reviewing a large number of event logs because of the number of components in an end-to-end system and the amount of time that could have elapsed since the event causing the corruption. In addition, manual examination of event logs requires domain knowledge of complex enterprise systems.