Generally, in the field of computing, the term “The Cloud” refers to the Internet. Similarly, the term “Cloud Computing” generally refers to distributed computing over a communication network such as the Internet, an intranet, a local area network (LAN) or wide area network (WAN).
The term “Cloud Storage” generally refers to the storing of data in the form of electronic documents on multiple servers or virtual servers, which may be hosted by third parties, in order that the data may be accessed by one or more users via the network in question.
Increasing amounts of data are stored (or backed up) using Cloud Storage, either in addition to or instead of being stored (or backed up) on personal hard drives or other media. There are various “storage-as-a-service” products available, including Google's “Google Drive”, Microsoft's “OneDrive” (previously referred to as “SkyDrive”), Amazon's “Cloud Drive” and Apple's “iCloud”.
In this context, protecting data against failure of storage devices (a statistically rare event that becomes a fact of life when dealing with the thousands of hard drives that may make up a cloud storage facility) is a major concern. In order to avoid data loss, service-providers generally need to maintain multiple copies of each file so as to minimise the risk that information is irrecoverably lost as a result of any isolated drive failure or failures.
An issue of great importance to cloud-storage service-providers is “How many copies is enough?”. The risk of total loss drops asymptotically towards zero as the number of replicas stored increases. For any finite number of copies, there remains a chance of irrecoverable loss. However, provided that any failed storage device is replaced with another, which is then populated with the same content, within a period short enough that the failure of any individual storage device during that period is a statistically rare event, the chance of total loss of any document quickly becomes vanishingly small as the number of copies increases.
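The relationship between the number of copies and the risk of total loss can be illustrated with a minimal sketch. It assumes that storage devices fail independently of one another and that each has the same probability of failing within one replacement period. These assumptions, the function name `total_loss_probability` and the example value of 0.01 are illustrative only and are not taken from any of the systems discussed here.

```python
def total_loss_probability(p_fail: float, copies: int) -> float:
    """Chance that every copy of a document is lost within one replacement period.

    Assumes independent, identically likely device failures (an illustrative
    simplification): all `copies` devices must fail in the same period.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be a probability between 0 and 1")
    if copies < 1:
        raise ValueError("at least one copy is required")
    return p_fail ** copies


if __name__ == "__main__":
    # Hypothetical per-period failure probability of 1% for a single device.
    for n in range(1, 6):
        print(n, total_loss_probability(0.01, n))
```

Under these assumptions the risk falls geometrically with each additional copy but never reaches zero for any finite number of copies, which is why the choice of replica count is a trade-off rather than a fixed answer.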
For example, the Google File System (GFS) reportedly maintains three copies of every file, which of course means that the total storage capacity needs to be at least three times the size of the actual amount of unique data. This, combined with the “80/20 rule” that characterises file popularity (i.e. approximately 80% of all input/output (I/O) events are accounted for by the most popular 20% of files), means that the vast majority of content is rarely (if ever) accessed, and that a very large fraction of the total storage capacity is wasted on back-up copies of files that are never (or almost never) used.
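The storage overhead described above can be put into rough numbers with a short sketch. It assumes that every file is kept in the same number of copies and, for the second function, that all files are of equal size, so that the less popular 80% of files account for 80% of the unique data. Both assumptions and the function names are illustrative only.

```python
def redundant_fraction(copies: int) -> float:
    """Fraction of raw capacity occupied by copies beyond the first."""
    if copies < 1:
        raise ValueError("at least one copy is required")
    return (copies - 1) / copies


def cold_redundant_fraction(copies: int, cold_share: float = 0.8) -> float:
    """Fraction of raw capacity holding extra copies of rarely accessed files.

    `cold_share` is the share of unique data that is rarely accessed; the
    default of 0.8 assumes equal file sizes under the "80/20 rule".
    """
    if not 0.0 <= cold_share <= 1.0:
        raise ValueError("cold_share must be between 0 and 1")
    return cold_share * redundant_fraction(copies)


if __name__ == "__main__":
    print(redundant_fraction(3))
    print(cold_redundant_fraction(3))
```

With three copies, two thirds of the raw capacity holds redundant copies, and under the equal-size assumption roughly 53% of the raw capacity is taken up by extra copies of the less popular 80% of files.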
This very safe but costly approach to “storage-as-a-service” becomes less and less sustainable as the volume of stored data grows exponentially and users rely more and more on Cloud Storage to store, access and back up content. As a result, finding more efficient ways of maintaining the substantial infrastructure is becoming increasingly important.
Referring to prior art disclosures and techniques, a paper by Lingwei Zhang and Yuhui Deng entitled “Designing a Power-aware Replication Strategy for Storage Clusters” (Proceedings of the 2013 IEEE International Conference on Green Computing, pages 224-231) provides an introduction to the field of Cloud Storage and the associated reliability and availability issues. It focuses on the matter of how to segregate popular and unpopular files between drives so as to be able to power down part of the storage facility without compromising response time.
In a paper by Ouri Wolfson, Sushil Jajodia and Yixiu Huang entitled “An Adaptive Data Replication Algorithm” (ACM Transactions on Database Systems (TODS), Volume 22, Issue 2, June 1997, pages 255-314), the authors propose the use of dynamic read-write patterns as input to a decision function that determines when and where to replicate an object in a distributed database (which can be regarded as a forerunner to Cloud Storage) so as to maximise overall performance. In order to achieve faster performance, a dynamic (rather than static) replication strategy is proposed.
A paper by Edith Cohen and Scott Shenker entitled “Replication Strategies in Unstructured Peer-to-Peer Networks” (Proceedings of SIGCOMM '02, Aug. 19-23, 2002, Pittsburgh, Pa., US) presents purported advantages of a replication strategy that sits between making the number of copies of a file uniform and making it proportional to file popularity.
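The difference between a uniform allocation, a popularity-proportional allocation and one lying between the two can be sketched as follows. This is a generic illustration rather than the method of the paper: the function `allocate_replicas`, its `exponent` parameter and the choice of 0.5 as an intermediate value are assumptions made here for illustration. An exponent of 0 gives every file the same share of copies, an exponent of 1 gives shares proportional to popularity, and values in between give intermediate allocations.

```python
from typing import List, Sequence


def allocate_replicas(
    popularity: Sequence[float], total_copies: float, exponent: float
) -> List[float]:
    """Split `total_copies` across files in proportion to popularity ** exponent.

    Returns fractional shares (not rounded to whole copies).
    """
    if not popularity:
        raise ValueError("popularity must not be empty")
    if any(q < 0 for q in popularity):
        raise ValueError("popularity values must be non-negative")
    weights = [q ** exponent for q in popularity]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one file must have non-zero weight")
    return [total_copies * w / total_weight for w in weights]


if __name__ == "__main__":
    # Hypothetical two-file example: one file receives 80% of requests.
    popularity = [0.8, 0.2]
    for exponent in (0.0, 0.5, 1.0):
        print(exponent, allocate_replicas(popularity, 10, exponent))
```

In the two-file example, the share given to the more popular file under the intermediate exponent lies strictly between its uniform share and its proportional share.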
A paper by David Bindel et al. entitled “OceanStore: An Extremely Wide-Area Storage System” (Report No. UCB/CSD-00-1102 of the Computer Science Division (EECS), University of California, Berkeley, Calif. 94720, March 1999) describes “OceanStore” as “a utility infrastructure designed to span the globe and provide continuous access to persistent information”. It proposes an infrastructure composed of untrusted servers in which data is protected through redundancy and cryptographic techniques. Data may be cached anywhere, at any time, and monitoring of usage patterns is said to allow for adaptation to regional outages and denial-of-service attacks, and to enhance performance through pro-active movement of data.
A paper by Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung entitled “The Google File System” (Proceedings of the 19th ACM Symposium on Operating Systems Principles (2003), pages 29-43) discusses the Google File System (GFS) and the “three copies” concept referred to above.