As businesses increasingly depend on data, and as data sizes continue to grow, the importance of data backup and recovery likewise increases.
Further, data processing has moved beyond the world of monolithic data centers housing large mainframe computers with locally stored data repositories, which are easily managed and protected. Instead, today's data processing is typically spread across numerous, geographically disparate computing systems communicating across multiple networks.
One well-known example of a distributed database is a NoSQL (Not Only Structured Query Language) database called Cassandra, which is designed to handle big data workloads across multiple nodes with no single point of failure. Its architecture is based on the understanding that system and hardware failures can and do occur. In one sense, Cassandra addresses the problem of failures by employing a peer-to-peer distributed system across homogeneous nodes, where data is periodically distributed amongst all the nodes in a cluster. Referring now to FIG. 1, a simplified example of the Cassandra architecture can be seen. While oftentimes referred to as a ring architecture, fundamentally it comprises a cluster of nodes (e.g., Node 1, Node 2 and Node 3, each of which is typically running on a physically separate server computing system) communicating with each other across a network (e.g., Network 110) such as a local area network, a wide area network or the internet.
Within each node, referring now to FIG. 2, a sequentially written disk-based commit log 209 captures write activity by that node to ensure data durability. Data is then indexed and written to an in-memory (i.e., working memory 205) structure, called a memory table or a memtable 203, which resembles a write-back cache. Once the memory structure is full, in what is called a flush operation, the data is written from the memtable 203 in working memory 205 to long-term storage (denoted “disk 207” although it may be a solid state device such as flash memory) in what is known as a Sorted String Table (SSTable) type data file 211. Once the data has been written to a data file 211 on disk 207, the commit log 209 is deleted from the disk 207. As is known in the art, these SSTable data files 211 are immutable, in that updates and changes are made via new memtable entries, which create new SSTable data files rather than overwriting already existing SSTable data files. A process called compaction periodically consolidates SSTables to discard old and obsolete data.
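By way of illustration only, the write path described above (commit log, memtable, flush to an immutable data file, and compaction) can be sketched as a small toy model. This is not Cassandra's actual implementation or on-disk format: the class name `ToyNode`, the entry-count threshold `memtable_limit`, the file names, and the use of JSON files are all assumptions made for the sketch.

```python
import json
import os
import tempfile


class ToyNode:
    """Toy model of one node's write path; not Cassandra's real file format."""

    def __init__(self, data_dir, memtable_limit=3):
        self.data_dir = data_dir
        self.memtable_limit = memtable_limit  # hypothetical "full" threshold (entry count)
        self.memtable = {}                    # in-memory structure (memtable)
        self.sstables = []                    # data file paths, oldest first
        self.next_id = 0                      # ensures data file names are never reused
        self.commit_log = os.path.join(data_dir, "commit.log")

    def write(self, key, value):
        # 1. Append the write to the on-disk commit log for durability.
        with open(self.commit_log, "a", encoding="utf-8") as log:
            log.write(json.dumps([key, value]) + "\n")
        # 2. Apply the write to the memtable.
        self.memtable[key] = value
        # 3. Flush once the memtable is "full".
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def _write_sstable(self, rows):
        # Write rows sorted by key to a brand-new file; existing files are never modified.
        path = os.path.join(self.data_dir, f"sstable-{self.next_id:04d}.json")
        self.next_id += 1
        with open(path, "w", encoding="utf-8") as f:
            json.dump(sorted(rows.items()), f)
        return path

    def flush(self):
        if not self.memtable:
            return
        self.sstables.append(self._write_sstable(self.memtable))
        self.memtable = {}
        # The flushed data is now on disk, so the commit log is deleted.
        if os.path.exists(self.commit_log):
            os.remove(self.commit_log)

    def compact(self):
        # Consolidate all data files into one, keeping only the newest value per key.
        if len(self.sstables) < 2:
            return
        merged = {}
        for path in self.sstables:  # oldest first, so newer values overwrite older ones
            with open(path, encoding="utf-8") as f:
                merged.update(dict(json.load(f)))
        new_path = self._write_sstable(merged)
        for path in self.sstables:
            os.remove(path)
        self.sstables = [new_path]

    def read(self, key):
        # Check the memtable first, then data files from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for path in reversed(self.sstables):
            with open(path, encoding="utf-8") as f:
                rows = dict(json.load(f))
            if key in rows:
                return rows[key]
        return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        node = ToyNode(data_dir, memtable_limit=2)
        for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
            node.write(key, value)
        print(len(node.sstables), node.read("a"))
        node.compact()
        print(len(node.sstables), node.read("a"))
```

In this sketch, an update to an existing key never rewrites an earlier data file; the newer value simply lands in a later file, and the obsolete value is discarded only when compaction merges the files.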
Of course, having data be created and stored locally on various nodes geographically spread across numerous locations compounds existing data backup challenges. It is therefore desirable to find a solution that addresses these various challenges.