This disclosure relates generally to data storage systems and, in particular, to cloud-based scalable storage systems used for data backup by heterogeneous clients in a network.
As computers, smart phones, tablets, laptops, servers, and other electronic devices increase in performance from year to year, the volume of data they generate also increases. Individuals and enterprises have in the past managed their own data backup systems, but as data volumes grow, doing so has become impractical for many individuals and organizations.
However, commercial providers of data backup services face many challenges related to the management of vast quantities of data from multiple clients. When data volumes grow into the range of hundreds of terabytes or even petabytes, many conventional data management techniques fail to scale economically and efficiently. Servicing hundreds or even thousands of simultaneous data requests from remote clients may also be a challenge for many off-the-shelf database systems such as MYSQL or SQL SERVER.
While there are other structured storage systems that offer much better scalability and provide for parallel access by hundreds of clients, these structured storage systems do not usually provide the transactional reliability, i.e., atomicity, consistency, isolation, and durability (ACID compliance), provided by traditional relational database systems. Without ACID compliance, the reliability and internal consistency of customer data are difficult to guarantee, especially when data volumes and client numbers soar. This problem is made more severe when the storage systems attempt to deduplicate client data. Deduplication allows duplicate data (including both files and sub-file structures) to be stored only once but accessed by multiple clients. Deduplication can significantly reduce the storage requirements of an enterprise or individual. However, deduplication results in multiple references to the same stored data. When multiple clients hold references to the same data and are able to access that data concurrently, the lack of atomicity and isolation in database transactions can lead to fatal consistency problems and data loss. Using conventional parallel processing techniques such as access locks on shared data is impractical when client numbers grow into the hundreds, because such locks stall concurrent access and degrade client performance to an unacceptable degree.
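By way of illustration only, the relationship between deduplication and shared references may be sketched as follows. The sketch is a minimal single-process model and does not represent any particular storage system. The class name `DedupStore`, its methods, and the use of SHA-256 content hashes as block identifiers are all assumptions introduced for this example.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory store that keeps each distinct block once."""

    def __init__(self):
        self._blocks = {}     # content digest -> stored bytes
        self._refcount = {}   # content digest -> number of client references

    def put(self, data: bytes) -> str:
        """Store data if it is new, then record one more reference to it."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._blocks:
            self._blocks[digest] = data
        # Read-modify-write of the shared counter. In this single-threaded
        # sketch it is safe; across concurrent clients without atomicity
        # and isolation, two updates could interleave and lose a count.
        self._refcount[digest] = self._refcount.get(digest, 0) + 1
        return digest

    def get(self, digest: str) -> bytes:
        """Return the stored bytes for a digest."""
        return self._blocks[digest]

    def release(self, digest: str) -> int:
        """Drop one reference; delete the block when none remain."""
        remaining = self._refcount[digest] - 1
        if remaining == 0:
            # An undercounted reference would trigger this deletion while
            # another client still depends on the block: data loss.
            del self._refcount[digest]
            del self._blocks[digest]
        else:
            self._refcount[digest] = remaining
        return remaining

    def block_count(self) -> int:
        """Number of distinct blocks physically stored."""
        return len(self._blocks)


if __name__ == "__main__":
    store = DedupStore()
    ref_a = store.put(b"quarterly report")   # client A backs up a file
    ref_b = store.put(b"quarterly report")   # client B backs up the same file
    print(ref_a == ref_b, store.block_count())
    store.release(ref_a)                     # client A deletes its backup
    print(store.get(ref_b))                  # client B can still read the data
```

In this model, two clients storing identical data share a single block, and the block survives until the last reference is released. The reference count is the shared state that makes a loss of atomicity or isolation dangerous: a lost increment leads to premature deletion of data that another client still references.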