Field of Technology
This disclosure relates to computer storage systems, and more particularly to methods and systems unifying primary storage, data protection, and data analytics.
Background
Data storage solutions are big business and in high demand among enterprises. Storage solutions are often designed for specific purposes, and companies often utilize separate systems as data silos dedicated to such purposes, such as primary storage (block and file), backup storage, and storage for analytics. These three copies of the data are generally kept on different devices and managed separately. Moving data between these three silos can be difficult because time is required to determine what changed between the primary silo and the backup or analytics silo. This leads to complex backup strategies that attempt to compensate for the length of time required to move the data to the backup and analytics silos. The time involved covers both determining what has changed since the data was last captured and moving the data to the other silos, typically over a network of some type. This process is usually resource intensive on the primary storage system, consuming critical primary storage resources such as processor cycles, memory, disk operations, and network bandwidth. For this reason, the data move to backup and analytics is often scheduled for off hours and carefully managed so as not to interfere with daily operations. In addition to processing and timing complications in moving data to backup and analytics systems, restore operations required in the case of failure or loss of primary data can also be time consuming. Further, while the restore operation is occurring, primary data is generally not accessible.
In addition to the above timing and computation issues, analytics systems today, such as those using Hadoop, are independent of the primary storage system in terms of security and user account context. This complicates protecting access to the data, and the context of when changes occurred and who made them is generally lost. Many systems also require multiple layers of additional third-party software to extract any information from the data.
Backup systems traditionally focus on recovery point objective (RPO) and recovery time objective (RTO). RPO represents the maximum time period of acceptable risk of data loss. For example, an RPO of 24 hours means that on failure of primary storage, up to 24 hours of data might be lost and unrecoverable. RTO represents the maximum acceptable time for recovery after a failure before operation can resume. For example, an RTO of 24 hours means that on failure of primary storage, restoration from backup may take up to 24 hours before the primary system is restored and can resume normal operation.
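As a purely illustrative sketch of the two objectives defined above, the following Python fragment checks a single failure event against an RPO and an RTO. The function names `meets_rpo` and `meets_rto` and the timestamps passed to them are hypothetical and are not part of any described system.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, failure: datetime, rpo: timedelta) -> bool:
    """Return True if the window of data at risk is within the RPO.

    Data written between the last backup and the failure is assumed lost,
    so that interval is compared against the RPO.
    """
    return failure - last_backup <= rpo


def meets_rto(failure: datetime, restored: datetime, rto: timedelta) -> bool:
    """Return True if the time from failure to resumed operation is within the RTO."""
    return restored - failure <= rto
```

Under these assumptions, a failure occurring 30 hours after the last backup would violate a 24-hour RPO, while a restoration completing exactly 24 hours after the failure would still satisfy a 24-hour RTO.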
Recovering or restoring from a backup system is generally a difficult and time-consuming process. Recovering from backup generally requires identifying a file (or set of files) and a timestamp (date). If the date or file is unknown, the already time-intensive restoration process becomes considerably more complex. Searching data within a backup system to identify a desired restoration without knowledge of the file and date is generally a trial-and-error process, such as picking a date, restoring the backup from that date, searching the restored data to determine whether it includes the desired item, and repeating the process until the desired item is found.
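The trial-and-error process described above can be sketched as a simple loop. This is a minimal illustration only: `find_backup_containing`, the `restore` callback, and the representation of a restored backup as a set of item names are all assumptions made for the example and do not correspond to any particular backup product.

```python
from typing import Callable, Hashable, Iterable, Optional, Set, Tuple


def find_backup_containing(
    dates: Iterable[Hashable],
    restore: Callable[[Hashable], Set[str]],
    wanted: str,
) -> Tuple[Optional[Hashable], int]:
    """Restore backups one date at a time until the wanted item is found.

    `restore` stands in for a full (and typically slow) restore of the backup
    taken on the given date; here it simply returns the set of item names in
    that backup. Returns the first matching date (or None if no backup
    contains the item) together with the number of restores performed.
    """
    attempts = 0
    for date in dates:
        attempts += 1
        restored_items = restore(date)  # each iteration pays the full restore cost
        if wanted in restored_items:
            return date, attempts
    return None, attempts
```

The sketch makes the cost visible: in the worst case every candidate date requires a complete restore before the search can conclude that the item is absent.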
Once a desired file is identified, a restore process starts. Access to the file is typically not granted until the entire restore process is completed. This might result in many minutes or even hours of wait time before users can start using the restored data. This time can be significantly extended due to storage optimization techniques used when storing backup data. For example, to maximize backup capacity, backups may be compressed, requiring intensive (and often complete-site) restoration to recover a single file.
There is some movement to merge backup and analytics systems into a single system which uses the backup data for analytics. This approach has encountered additional problems, as backup systems typically do not keep data in the same format as primary storage. Even if the format is not a problem, issues remain with moving the data and with breaking the connection between the primary storage and insight into changes. Additionally, applying analytics to backup data has not overcome the problems around determining time and authorship of changes.