1. Field of Technology
This disclosure relates to computer storage systems, and more particularly to methods and systems unifying primary storage, data protection, and data analytics.
2. Background
Data storage solutions are a large business and in high demand among enterprises. Storage solutions are often designed for specific purposes, and companies often run separate systems as data silos dedicated to those purposes, such as primary storage (block and file), backup storage, and storage for analytics. These three copies of the data are generally kept on different devices and managed separately. Moving data between the three silos can be difficult because it takes time to determine what has changed between the primary silo and the backup or analytics silo. This leads to complex backup strategies that attempt to compensate for the length of time required to move the data to the backup and analytics silos. The time involved covers both determining what has changed since the data was last captured and moving the data to the other silos, typically over a network of some type. This process is usually resource intensive on the primary storage system, consuming critical primary storage resources such as processor cycles, memory, disk operations, and network bandwidth. For this reason, the data move to backup and analytics is often scheduled for off hours and carefully managed so as not to interfere with daily operations. In addition to the processing and timing complications in moving data to backup and analytics systems, restore operations required after failure or loss of primary data can also be time consuming. Further, while the restore operation is in progress, primary data is generally not accessible.
In addition to the above timing and computation issues, analytics systems today, such as those using Hadoop, are independent of the primary storage system in terms of security and user account context. This complicates the protection of data access and generally loses the context of when changes occurred and who made them. Many systems also require multiple layers of additional third-party software to extract any information from the data.
Backup systems traditionally focus on recovery point objective (RPO) and recovery time objective (RTO). RPO represents the maximum time period of acceptable risk of data loss; for example, an RPO of 24 hours means that on failure of primary storage, up to 24 hours of data might be lost and unrecoverable. RTO represents the maximum acceptable time for recovery after a failure before operation can resume; for example, an RTO of 24 hours means that on failure of primary storage, restoration from backup may take up to 24 hours before the primary system is restored and can resume normal operation.
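As a minimal illustration of these two objectives, the following Python sketch checks a backup schedule against an RPO and a measured restore duration against an RTO. The function names, the nightly 01:00 schedule, and the failure time are hypothetical values chosen for the example and do not come from any particular backup product.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, failure_time):
    """Time between the most recent backup and the failure.

    Everything written during this window is lost if primary storage fails.
    """
    earlier = [t for t in backup_times if t <= failure_time]
    if not earlier:
        raise ValueError("no backup precedes the failure")
    return failure_time - max(earlier)


def meets_rpo(backup_times, failure_time, rpo):
    """True if the data lost at failure_time stays within the RPO."""
    return data_loss_window(backup_times, failure_time) <= rpo


def meets_rto(restore_duration, rto):
    """True if the restore finishes within the RTO."""
    return restore_duration <= rto


# Hypothetical schedule: one backup per night at 01:00 on 1-3 January.
backups = [datetime(2024, 1, day, 1, 0) for day in range(1, 4)]
# Hypothetical failure late in the evening of 3 January.
failure = datetime(2024, 1, 3, 23, 30)

loss = data_loss_window(backups, failure)                   # 22 h 30 min
ok_24h = meets_rpo(backups, failure, timedelta(hours=24))   # True
ok_12h = meets_rpo(backups, failure, timedelta(hours=12))   # False
```

In this example a nightly schedule satisfies a 24-hour RPO but not a 12-hour one, which shows that a fixed-time RPO is determined by the backup interval rather than by what actually changed in the data.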
Recovering or restoring from a backup system is generally a difficult and time-consuming process. Recovering from backup generally requires identifying a file (or set of files) and a timestamp (date). If the date or file is unknown, the already time-intensive restoration process becomes considerably more complex. Searching data within a backup system to identify a desired restoration without knowledge of the file and date is generally a trial-and-error process, such as picking a date, restoring the backup from that date, searching the restored data to determine whether it includes the desired item, and repeating the process until the desired item is found.
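The trial-and-error process described above can be sketched as a simple loop. This is an illustrative sketch only: the `restore` and `contains_item` callables, the newest-first search order, and the in-memory catalog standing in for a backup system are all assumptions made for the example.

```python
def find_restore_point(backup_dates, restore, contains_item):
    """Restore backups one at a time until one contains the desired item.

    Returns (date, attempts); date is None if no backup contains the item.
    Each call to restore() stands for a full, expensive restore operation.
    """
    attempts = 0
    # Assumption: try the newest backup first and work backwards.
    for date in sorted(backup_dates, reverse=True):
        attempts += 1
        restored = restore(date)
        if contains_item(restored):
            return date, attempts
    return None, attempts


# Hypothetical catalog: the set of file names held by each dated backup.
catalog = {
    "2024-01-01": {"a.txt", "report.doc"},
    "2024-01-02": {"a.txt", "report.doc"},
    "2024-01-03": {"a.txt"},  # report.doc was deleted before this backup
}

found, tries = find_restore_point(
    catalog.keys(),
    restore=lambda date: catalog[date],
    contains_item=lambda files: "report.doc" in files,
)
# found == "2024-01-02", tries == 2
```

The cost of the search grows with the number of full restores performed, which is why not knowing the file or date makes recovery so much slower.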
Once a desired file is identified, a restore process starts. Access to the file is typically not granted until the entire restore process is completed. This might result in many minutes or even hours of wait time before users can start using the restored data. This time can be significantly extended by the storage optimization techniques used when storing backup data. For example, to maximize backup capacity, backups may be compressed, requiring intensive (and often complete-site) restoration to recover a single file.
There is some movement toward merging backup and analytics systems into a single system that uses the backup data for analytics. This approach has encountered additional problems, because backup systems typically do not keep data in the same format as primary storage. Even if the format is not a problem, issues remain with moving the data and breaking the connection between the primary storage and change insights. Additionally, applying analytics to backup data has not overcome the problems around determining the time and authorship of changes.
3. Description of Prior Art
U.S. Pat. No. 7,412,577 “SHARED DATA MIRRORING APPARATUS, METHOD, AND SYSTEM” (Boyd et al., Aug. 12, 2008) discloses, in the Abstract, “A network component is useful in tracking write activity by writing logs containing write address information is described. The tracking component may be used in networked systems employing data mirroring to record data block addresses written to a primary storage volume during the time a data mirror is unavailable . . . . At the time a data mirror is reconstructed, the log written may be used to construct a list of block addresses pointing to locations on a primary storage volume wherein data differs from a secondary storage volume member of the mirror.” This solution improves data mirroring within a storage network.
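The general idea of logging write addresses while a mirror is unavailable can be illustrated with a short Python sketch. This is not the implementation disclosed in the cited patent; the class name, method names, and the in-memory set used as the log are hypothetical and serve only to show how such a log yields the list of blocks that differ.

```python
class MirrorWriteLog:
    """Records block addresses written to primary while the mirror is down."""

    def __init__(self):
        self.mirror_available = True
        self._dirty_blocks = set()

    def record_write(self, block_address):
        # While the mirror is available the write reaches both copies,
        # so nothing needs to be logged.
        if not self.mirror_available:
            self._dirty_blocks.add(block_address)

    def blocks_to_resync(self):
        """Block addresses where primary and secondary now differ."""
        return sorted(self._dirty_blocks)

    def mirror_rebuilt(self):
        """Call after the listed blocks have been copied to the mirror."""
        self._dirty_blocks.clear()
        self.mirror_available = True


log = MirrorWriteLog()
log.record_write(10)            # mirror up: not logged
log.mirror_available = False    # mirror goes offline
for address in (42, 7, 42):     # repeated writes to a block are logged once
    log.record_write(address)
pending = log.blocks_to_resync()
# pending == [7, 42]
```

Only the logged blocks need to be copied when the mirror is reconstructed, rather than the whole volume.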
U.S. Pat. No. 7,756,837 “METHODS AND APPARATUS FOR SEARCHING BACKUP DATA BASED ON CONTENT AND ATTRIBUTES” (Williams et al., Jul. 13, 2010) discloses, in the Abstract, “Methods and apparatus are disclosed that permit the transparent bridging of a broad range of backup storage devices, such that backup software will identify an intermediate device as a one of those storage devices and will transparently send their backup data-stream thereto as part of the existing standard backup process. Upon receipt of a backup data-stream from the backup software, the methods and apparatus provide for analysis of the data elements in the data-stream, collection of management information about those data elements, and storage of the management information in an easily accessible format for subsequent review and query by users and administrators of the original data.” This solution provides indexing and search capabilities to backup data.
U.S. Pat. No. 7,937,365 “METHOD AND SYSTEM FOR SEARCHING STORED DATA” (Prahlad et al., May 3, 2011) discloses, in the Abstract, “Systems and methods for managing data associated with a data storage component coupled to multiple computers over a network are further disclosed. Additionally, systems and methods for accessing documents available through a network, wherein the documents are stored on one or more data storage devices coupled to the network, are disclosed.” This solution provides indexing, search, and access to data across multiple repositories including secondary storage.
United States Patent Application Publication 2009/0083336 “SEARCH BASED DATA MANAGEMENT” (Srinivasan, Mar. 26, 2009) discloses, in the Abstract, “The invention includes a system including one or more storage devices including the data items a metadata tagging component for associating metadata to each data item, a policy component defining one or more data management policies as a function of the metadata, a search engine for generating a list of data items satisfying the data management policy, and a data management application for applying the data management policy to each data item in the list of data items generated by the search engine.” This solution creates metadata for “a priority . . . , a owner . . . , a group . . . , a last accessed time . . . , a last modified time . . . , a created time . . . , an archival time . . . , a logical location . . . , and a physical location of the data item.” A search is performed of the metadata, and backup, retention, and archiving rules are applied to the search results.
U.S. Pat. No. 8,055,745 “METHODS AND APPARATUS FOR ACCESSING DATA FROM A PRIMARY DATA STORAGE SYSTEM FOR SECONDARY STORAGE” (Atluri, Nov. 8, 2011) discloses, in the Abstract, “A system for providing secondary data storage and recovery services for one or more networked host nodes includes a server application for facilitating data backup and recovery services; a first data storage medium accessible to the server application, a second data storage medium accessible to the server application; at least one client application for mapping write locations allocated by the first data storage medium to write locations representing a logical view of the first data storage medium; and at least one machine instruction enabling direct read capability of the first data storage medium by the server application for purposes of subsequent time-based storage of the read data into the secondary data medium.” This solution splits (mirrors) data between primary and backup storage, providing continuous backup rather than discrete (backup-window) backups. Metadata including “source address, destination address, LUN, frame sequence number, offset location, length of payload, and time received” specific to every data frame is tracked, details of which are used in verification and compression.
None of the above provides a storage solution with 1) integrated primary storage, data protection, and data analytics; 2) in-line data analytics tracking data access and data modifications; 3) RPO based on data analytics rather than fixed time; 4) extendible metadata generation including content analytics; and 5) RTO minimized to restoration of metadata rather than complete site restoration, all without requiring separate backup data streams, or additional servers and software to coordinate operations between multiple systems. What is needed, therefore, is a solution that overcomes the above-mentioned limitations and that includes the features enumerated above.