1. Field of Technology
This disclosure relates to computer storage systems, and more particularly to live data restoration in methods and systems that unify primary storage, data protection, and data analytics functions.
2. Background
Data storage solutions are big business and in high demand among enterprises. Storage solutions are often designed for specific purposes, and companies often use separate systems as data silos dedicated to those purposes, such as primary storage (block and file), backup storage, and storage for analytics. These three copies of the data are generally kept on different devices and managed separately. Moving data between the three silos can be difficult because it takes time to determine what has changed between the primary silo and the backup or analytics silo. This leads to complex backup strategies that attempt to compensate for the time required to move the data to the backup and analytics silos. That time covers both determining what has changed since the data was last captured and moving the data to the other silos, typically over a network of some type. The process is usually resource intensive on the primary storage system, consuming critical primary storage resources such as processor cycles, memory, disk operations, and network bandwidth. For this reason, the data move to backup and analytics is often scheduled for off hours and carefully managed so that it does not interfere with daily operations. In addition to the processing and timing complications of moving data to backup and analytics systems, the restore operations required after failure or loss of primary data can also be time consuming. Further, while a restore operation is in progress, primary data is generally not accessible.
In addition to the above timing and computation issues, analytics systems today, such as those using Hadoop, are independent of the primary storage system in terms of security and user account context. This complicates protecting access to the data, and generally loses the context of when changes occurred and who made them. Many systems also require multiple layers of additional third-party software to extract any information from the data.
Backup systems traditionally focus on recovery point objective (RPO) and recovery time objective (RTO). RPO represents the maximum time period of acceptable risk of data loss. For example, an RPO of 24 hours means that on failure of primary storage, up to 24 hours of data might be lost and unrecoverable. RTO represents the maximum acceptable time for recovery after a failure before operation can resume. For example, an RTO of 24 hours means that on failure of primary storage, restoration from backup will take up to 24 hours before the primary system is restored and can resume normal operation.
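The two objectives can be illustrated with a short calculation. The following is a minimal sketch only; the function name `check_objectives` and its parameters are hypothetical and chosen for illustration, not taken from any particular backup product.

```python
from datetime import datetime, timedelta


def check_objectives(last_backup, failure, restored, rpo, rto):
    """Return (data_loss, downtime, rpo_met, rto_met) for one failure event.

    All names here are illustrative assumptions.
    """
    # Everything written after the last backup is unrecoverable.
    data_loss = failure - last_backup
    # Time between the failure and the resumption of normal operation.
    downtime = restored - failure
    return data_loss, downtime, data_loss <= rpo, downtime <= rto
```

For instance, with a backup taken at midnight, a failure at 18:00 the same day, and restoration completing at 20:00 the following day, the data loss window is 18 hours (within a 24-hour RPO) while the downtime is 26 hours (exceeding a 24-hour RTO).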
Recovering or restoring from a backup system is generally a difficult and time-consuming process. Recovering from backup generally requires identifying a file (or set of files) and a timestamp (date). If the date or file is unknown, the already time-intensive restoration process becomes considerably more complex. Searching data within a backup system to identify a desired restoration without knowledge of the file and date is generally a trial-and-error process, such as picking a date, restoring the backup from that date, searching the restored data to determine whether it includes the desired item, and repeating the process until the desired item is found.
Once a desired file is identified, a restore process starts. Access to the file is typically not granted until the entire restore process is completed. This might result in many minutes or even hours of wait time before users can start using the restored data. This time can be significantly extended by storage optimization techniques used when storing backup data. For example, to maximize backup capacity, backups may be compressed, requiring intensive (and often complete-site) restoration to recover a single file.
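The trial-and-error search described above can be sketched as a simple loop. This is an illustrative sketch under stated assumptions: `restore` stands in for a hypothetical, expensive full-restore operation that returns the set of file names in the backup for a given date, and the counter simply shows how many full restores the search costs.

```python
from typing import Callable, Iterable, Optional, Set, Tuple


def find_by_trial_and_error(
    candidate_dates: Iterable[str],
    restore: Callable[[str], Set[str]],
    wanted: str,
) -> Tuple[Optional[str], int]:
    """Restore backups one date at a time until `wanted` is found.

    Returns the matching date (or None) and the number of restores performed.
    """
    restores = 0
    for date in candidate_dates:
        restored_files = restore(date)  # hypothetical full restore for this date
        restores += 1
        if wanted in restored_files:    # search the restored data
            return date, restores
    return None, restores
```

Each pass through the loop pays the full restore cost, which is why an unknown date or file name multiplies the recovery time.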
There is some movement to merge backup and analytics systems into a single system which uses the backup data for analytics. This has encountered additional problems, as typically backup systems do not keep data in the same format as primary storage. Even if the format is not a problem, issues remain with moving the data and breaking the connection between the primary storage and change insights. Additionally, applying analytics to backup data has not overcome the problems around determining time and authorship of changes.
3. Description of Prior Art
U.S. Pat. No. 7,412,577 “SHARED DATA MIRRORING APPARATUS, METHOD, AND SYSTEM” (Boyd et al., Aug. 12, 2008) discloses, in the Abstract, “A network component is useful in tracking write activity by writing logs containing write address information is described. The tracking component may be used in networked systems employing data mirroring to record data block addresses written to a primary storage volume during the time a data mirror is unavailable . . . . At the time a data mirror is reconstructed, the log written may be used to construct a list of block addresses pointing to locations on a primary storage volume wherein data differs from a secondary storage volume member of the mirror.” This solution improves data mirroring within a storage network.
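The general idea of logging written block addresses while a mirror is unavailable can be sketched as follows. This is a loose illustration of the concept in the quoted abstract, not the patented implementation; the class `WriteTracker` and its members are hypothetical names.

```python
class WriteTracker:
    """Records block addresses written while the mirror is unavailable."""

    def __init__(self):
        self.mirror_available = True
        self.log = []  # block addresses written during the outage

    def record_write(self, block_address: int) -> None:
        # While the mirror is reachable, writes reach both volumes,
        # so only writes made during an outage need to be logged.
        if not self.mirror_available:
            self.log.append(block_address)

    def blocks_to_resync(self) -> list:
        # Unique addresses where the primary differs from the secondary.
        return sorted(set(self.log))
```

When the mirror is reconstructed, only the addresses returned by `blocks_to_resync` need to be copied from the primary volume, rather than the whole volume.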
U.S. Pat. No. 7,756,837 “METHODS AND APPARATUS FOR SEARCHING BACKUP DATA BASED ON CONTENT AND ATTRIBUTES” (Williams et al., Jul. 13, 2010) discloses, in the Abstract, “Methods and apparatus are disclosed that permit the transparent bridging of a broad range of backup storage devices, such that backup software will identify an intermediate device as a one of those storage devices and will transparently send their backup data-stream thereto as part of the existing standard backup process. Upon receipt of a backup data-stream from the backup software, the methods and apparatus provide for analysis of the data elements in the data-stream, collection of management information about those data elements, and storage of the management information in an easily accessible format for subsequent review and query by users and administrators of the original data.” This solution provides indexing and search capabilities to backup data.
U.S. Pat. No. 7,937,365 “METHOD AND SYSTEM FOR SEARCHING STORED DATA” (Prahlad et al., May 3, 2011) discloses, in the Abstract, “Systems and methods for managing data associated with a data storage component coupled to multiple computers over a network are further disclosed. Additionally, systems and methods for accessing documents available through a network, wherein the documents are stored on one or more data storage devices coupled to the network, are disclosed.” This solution provides indexing, search, and access to data across multiple repositories including secondary storage.
United States Patent Application Publication 2009/0083336 “SEARCH BASED DATA MANAGEMENT” (Srinivasan, Mar. 26, 2009) discloses, in the Abstract, “The invention includes a system including one or more storage devices including the data items a metadata tagging component for associating metadata to each data item, a policy component defining one or more data management policies as a function of the metadata, a search engine for generating a list of data items satisfying the data management policy, and a data management application for applying the data management policy to each data item in the list of data items generated by the search engine.” This solution creates metadata for “a priority . . . , a owner . . . , a group . . . , a last accessed time . . . , a last modified time . . . , a created time . . . , an archival time . . . , a logical location . . . , and a physical location of the data item.” A search is performed of the metadata, and backup, retention, and archiving rules are applied to the search results.
U.S. Pat. No. 8,055,745 “METHODS AND APPARATUS FOR ACCESSING DATA FROM A PRIMARY DATA STORAGE SYSTEM FOR SECONDARY STORAGE” (Atluri, Nov. 8, 2011) discloses, in the Abstract, “A system for providing secondary data storage and recovery services for one or more networked host nodes includes a server application for facilitating data backup and recovery services; a first data storage medium accessible to the server application, a second data storage medium accessible to the server application; at least one client application for mapping write locations allocated by the first data storage medium to write locations representing a logical view of the first data storage medium; and at least one machine instruction enabling direct read capability of the first data storage medium by the server application for purposes of subsequent time-based storage of the read data into the secondary data medium.” This solution splits (mirrors) data between primary and backup storage, providing continuous backup rather than discrete (backup-window) backups. Metadata including “source address, destination address, LUN, frame sequence number, offset location, length of payload, and time received” specific to every data frame is tracked, details of which are used in verification and compression.
European Patent Publication EP0410630B1 according to the Abstract discloses an apparatus and method for scheduling the storage backup of data sets in either an application or system-managed storage context using an algorithm in which less data and a smaller backup interval (window) are involved other than that used with prior art full, incremental or combination backup policies. An incremental backup policy is sensitive to a pair of adjustable parameters relating to the last backup, last update, and current date affecting each data set and its storage group.
United States Patent Publication 2006/0117048 according to the Abstract discloses a method and system for updating a filter's data after the filter's metadata file is restored. The filter maintains an open handle to the metadata until the filter receives a request to have the metadata restored. The filter then closes the open handle and allows the metadata to be restored. After the metadata is restored, data associated with the filter is rebuilt based on the restored metadata.
United States Patent Publication 2013/0054523 according to the Abstract discloses data objects replicated from a source storage managed by a source server to a target storage managed by a target server. A source list is built of objects at the source server to replicate to the target server. The target server is queried to obtain a target list of objects at the target server. A replication list is built indicating objects on the source list not included on the target list to transfer to the target server. For each object in the replication list, data for the object not already at the target storage is sent to the target server and metadata on the object is sent to the target server to cause the target server to include the metadata in an entry for the object in a target server replication database. An entry for the object is added to a source server replication database.
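The list comparison summarized above amounts to a set difference followed by a per-object transfer. The sketch below is a simplified illustration, not the method of the cited publication: the in-memory dictionaries stand in for the storage systems and replication databases, all names are hypothetical, and whole objects are sent rather than only the data not already at the target.

```python
from typing import Dict, List


def build_replication_list(source_list: List[str], target_list: List[str]) -> List[str]:
    """Objects on the source list that are not on the target list."""
    on_target = set(target_list)
    return [name for name in source_list if name not in on_target]


def replicate(
    source: Dict[str, dict],
    target: Dict[str, bytes],
    target_db: Dict[str, dict],
    source_db: Dict[str, dict],
) -> List[str]:
    """Copy missing objects to the target and record them in both databases."""
    replication_list = build_replication_list(list(source), list(target))
    for name in replication_list:
        obj = source[name]
        target[name] = obj["data"]            # send the object data
        target_db[name] = obj["metadata"]     # target-side database entry
        source_db[name] = {"replicated": True}  # source-side database entry
    return replication_list
```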
U.S. Pat. No. 7,376,895 according to the Abstract discloses an integrated multi-application data processing system for generating, storing, and retrieving data files, each data file having a multi-dimensional array of data cells, and a program framework providing a common user interface for at least one application program for user interaction with one or more of the data files. Each of the data cells, which can contain a single data object that includes an object type code and object content, has a unique multi-dimensional cell address with respect to all cells in data files generated by the system. The object content can be self-contained and/or defined in terms of object content of other data objects.
U.S. Pat. No. 7,552,358 according to the Abstract discloses a method for efficient backup and restore using metadata mapping which comprises maintaining a first backup aggregation associated with a primary data object of a primary host at a secondary host, wherein the first backup aggregation includes a first backup version of the primary data object stored within a secondary data object at the secondary host. The method further comprises generating a second backup aggregation, wherein the second backup aggregation includes a second backup version of the primary data object and a backup metadata object corresponding to the secondary data object. The backup metadata object includes a pointer to the second backup version. The method may further comprise restoring the secondary data object, wherein said restoring comprises using the pointer to access the second backup version of the primary data object to restore at least a portion of the secondary data object.
U.S. Pat. No. 8,032,707 according to the Abstract discloses techniques for managing cache metadata providing a mapping between addresses on a storage medium (e.g., disk storage) and corresponding addresses on a cache device at which data items are stored. In some embodiments, cache metadata may be stored in a hierarchical data structure comprising a plurality of hierarchy levels. Only a subset of the plurality of hierarchy levels may be loaded to memory, thereby reducing the memory “footprint” of cache metadata and expediting the process of restoring the cache metadata during startup operations. Startup may be further expedited by using cache metadata to perform operations associated with reboot. Thereafter, as requests to read data items on the storage medium are processed using cache metadata to identify addresses at which the data items are stored in cache, the identified addresses may be stored in memory.
U.S. Pat. No. 8,140,573 according to the Abstract discloses that a metadata file can be automatically generated based on a database instance and a user defined maximum depth. The relationships between data objects that constitute a business object may be visualized in a tree. The maximum depth limits the number of levels in the tree to traverse. A metadata file describes the structure of a business object and relationships between sets of data objects that constitute the business object. The structure defined in the metadata file can be used to export instances of the business object from the database. The exported business object instances can be imported to another database.