With the advent of emerging data platforms (EDP), information technology (IT) backup and recovery functions are changing in ways that are not currently well understood. EDP technologies achieve read and write scale by leveraging a hash function that uniformly distributes a set of data across a number of server nodes. Replication, which is currently built into most EDP technologies, partially addresses backup requirements by protecting against node failures. Replication alone, however, does not render backup moot. Specifically, the built-in replication within the EDP fails to address the persistence use-case for backup. Persistence and persisting refer to the availability of historic data rather than the fundamental reliability of data storage required by data storage systems. Datasets in the cloud environment are often extremely large, so regular and incremental backups require more storage and compute resources and more process coordination. Organizations' use of information technology (IT) and infrastructure computing resources is moving away from a static environment to a more dynamic and fluid computing environment. Traditionally, organizations' computing resources existed on fixed infrastructure owned and controlled directly by the organization. However, with the virtualization of computing resources and shared computing environments (e.g., cloud computing), a computing resource consumer's application and computing service requests may reside on and use a variety of dynamic virtual systems and resources, and use any number of service providers to meet the user's service-level agreements (SLAs).
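The hash-based distribution and built-in replication described above can be illustrated with a minimal sketch. It assumes a simple modulo placement with replicas on successor nodes; the function name `replica_nodes`, the node names, and the replication factor of 3 are hypothetical and do not correspond to any particular EDP product.

```python
import hashlib
from collections import Counter
from typing import List


def replica_nodes(key: str, nodes: List[str], replication_factor: int = 3) -> List[str]:
    """Return the nodes holding `key`: a hash-selected primary plus its successors."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest onto a node index (assumed placement rule).
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    count = min(replication_factor, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


nodes = [f"node-{i}" for i in range(5)]
keys = [f"user:{i}" for i in range(10_000)]

# Count how many keys each node serves as primary; a uniform hash spreads them evenly.
primary_load = Counter(replica_nodes(k, nodes)[0] for k in keys)

# Simulate one node failing: every key should still have a live replica elsewhere.
failed = "node-2"
all_keys_survive = all(
    any(n != failed for n in replica_nodes(k, nodes)) for k in keys
)
```

In this sketch a single node failure leaves every key readable from another replica, which is the sense in which replication partially addresses backup. Nothing in the placement logic retains earlier values of a key, so it offers no access to historic data.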
Backups are performed for at least two purposes: 1) availability, so that users or systems can access current “live” data, and 2) persistence, so that data can be accessed as it existed at a past point in time. Distributed architectures perform well regarding availability, in that at any time the data set, with a replication factor, is hosted across a number of servers, and the user or another system may perform a read or write at any given time. Distributed architectures also handle node failures well, such that when a server goes down the distributed architecture and replication factor recover the data for that one server. Similarly, for zone failures (e.g., a data center goes down), a cluster may be arranged in a configuration distributed across multiple geographic zones to limit the risk from server outages. Even so, backup for persisting data is not addressed by the replication built into distributed architectures. Instead, backup for persistence may be achieved by copying a snapshot of live data to on-site or off-site disk at regular intervals (e.g., every 24 hours or weekly). Replication to persisted snapshots (e.g., via SAN) may reduce the storage needed in the EDP, but requires restoring or synchronizing systems. Current snapshot mechanisms protect the dataset in its entirety, but fail to protect data subsets, which may include users, projects, and/or specific files or objects.
Distributed architectures do not account for persistence, in the sense that the user cannot roll back to a particular point in time and selectively recover data without also recovering the entire data set. For example, the user may wish to roll back the environment to see what the user's data looked like a week ago (e.g., a user profile may have been updated and the user desires to return to a previous known profile configuration). Known distributed architectures perform a data restore of an entire EDP system in order to recover particular data (e.g., an individual user's profile data such as a favorites list) from a full data set.
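The limitation described above can be made concrete with a toy model. The sketch below assumes a last-write-wins store with three replicas and a whole-dataset snapshot; the class `ReplicatedStore`, its methods, and the keys are hypothetical illustrations, not the interface of any real platform.

```python
import copy

REPLICATION_FACTOR = 3  # assumed value for illustration


class ReplicatedStore:
    """Toy last-write-wins store: every replica holds only the current value."""

    def __init__(self):
        self.replicas = [dict() for _ in range(REPLICATION_FACTOR)]

    def put(self, key, value):
        # The write is applied to every replica, overwriting any earlier value.
        for replica in self.replicas:
            replica[key] = value

    def get(self, key):
        return self.replicas[0][key]

    def snapshot(self):
        # A snapshot captures the entire data set at one point in time.
        return copy.deepcopy(self.replicas[0])

    def restore_full(self, snap):
        # Restoring replaces the entire data set on every replica.
        self.replicas = [copy.deepcopy(snap) for _ in range(REPLICATION_FACTOR)]


store = ReplicatedStore()
store.put("alice/favorites", ["a", "b"])
store.put("bob/favorites", ["x"])
weekly_snapshot = store.snapshot()

store.put("alice/favorites", ["c"])     # Alice overwrites her favorites list
store.put("bob/favorites", ["x", "y"])  # Bob makes an unrelated update

# Replication does not help: all replicas hold only Alice's new value.
replica_values = [r["alice/favorites"] for r in store.replicas]

# The only way back is a full restore, which also discards Bob's update.
store.restore_full(weekly_snapshot)
alice_after = store.get("alice/favorites")
bob_after = store.get("bob/favorites")
```

In this model, recovering Alice's earlier list requires rolling back the whole data set, so Bob's later update is lost as a side effect. That is the selective-recovery gap the paragraph above describes.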
Availability guarantees data access in the event of equipment failure. Indeed, NoSQL replicates data across nodes, which protects against server failure. However, out-of-the-box NoSQL does not account for site failures (e.g., when the NoSQL cluster is hosted in a single region such as Amazon Web Services' US East region). NoSQL maintains only the current state of data. Also, although the NoSQL platform otherwise uses a traditional backup snapshot for backup and recovery, NoSQL does not take into account persistence of user data: when data is updated (e.g., overwritten) by a user, the historical information does not persist.