This application builds upon the inventions by Applicant disclosed in the following patents and applications:
U.S. Pat. No. 9,344,287, which was filed on Dec. 3, 2013 and titled “SCALABLE TRANSPORT SYSTEM FOR MULTICAST REPLICATION”; U.S. patent application Ser. No. 14/095,839, which was filed on Dec. 3, 2013 and titled “SCALABLE TRANSPORT METHOD FOR MULTICAST REPLICATION”; and U.S. patent application Ser. No. 14/095,848, which was filed on Dec. 3, 2013 and titled “SCALABLE TRANSPORT WITH CLIENT-CONSENSUS RENDEZVOUS” (together, the “Replicast Applications”);
U.S. patent application Ser. No. 14/312,282, which was filed on Jun. 23, 2014 and titled “KEY/VALUE STORAGE DEVICE AND METHOD” (“KVT Application”); and
U.S. Utility patent application Ser. No. 15/137,920, which was filed on Apr. 25, 2016 and titled “PARITY PROTECTION FOR DATA CHUNKS IN AN OBJECT STORAGE SYSTEM” (“Manifest Striping Application”).
The Replicast Applications, KVT Application, and Manifest Striping Application are each incorporated by reference herein and referred to collectively as the “Incorporated References.”
a. A Replicast Storage System
With reference now to existing relevant art developed by Applicant, FIG. 1 depicts storage system 100 described in the Incorporated References. Storage system 100 comprises clients 110a, 110b, . . . 110i (where i is any integer value), which access initiator/application layer gateway 130 over client access network 120. It will be understood by one of ordinary skill in the art that there can be multiple gateways and client access networks, and that gateway 130 and client access network 120 are merely exemplary. Gateway 130 in turn accesses replicast network 140, which in turn accesses storage servers 150a, 150b, 150c, 150d, . . . 150k (where k is any integer value). Each of the storage servers 150a, 150b, 150c, 150d, . . . 150k is coupled to a plurality of storage devices 160a, 160b, . . . 160k, respectively.
In this patent application the terms “initiator”, “application layer gateway”, or simply “gateway” refer to the same type of devices and are used interchangeably.
FIG. 2 depicts a typical put transaction in storage system 100 to store chunk 220. As discussed in the Incorporated References, groups of storage servers are maintained, which are referred to as “negotiating groups.” Here, exemplary negotiating group 210a is depicted, which comprises ten storage servers, specifically, storage servers 150a-150j. When a put command is received, gateway 130 assigns the put transaction to a negotiating group. In this example, the put chunk 220 transaction is assigned to negotiating group 210a. It will be understood by one of ordinary skill in the art that there can be multiple negotiating groups on storage system 100, that negotiating group 210a is merely exemplary, and that each negotiating group can consist of any number of storage servers, the use of ten storage servers being merely exemplary.
Gateway 130 then engages in a protocol with each storage server in negotiating group 210a to determine which three storage servers should handle the put request. The three storage servers that are selected are referred to as a “rendezvous group.” As discussed in the Incorporated References, the rendezvous group comprises three storage servers so that the data stored by each put transaction is replicated and stored in three separate locations, where each instance of data storage is referred to as a replica. Applicant has concluded that three storage servers provide an optimal degree of replication for this purpose, but any other number of servers could be used instead.
In varying embodiments, the rendezvous group may be addressed by different methods, all of which achieve the result of limiting the entities addressed to the subset of the negotiating group identified as belonging to the rendezvous group. These methods include:
Selecting a matching group from a pool of pre-configured multicast groups each holding a different subset combination of members from the negotiating group;
Using a protocol that allows each UDP message to be addressed to an enumerated subset of the total group. An example of such a protocol would be the BIER protocol currently under development by the IETF; and
Using a custom control protocol which allows the sender to explicitly specify the membership of a target multicast group as being a specific subset of an existing multicast group. Such a control protocol was proposed in an Internet Draft submitted to the IETF titled “Creation of Transactional Multicast Groups” and dated Mar. 23, 2015, a copy of which is being submitted with this application and is incorporated herein by reference.
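The first of these methods can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the pool covers every possible three-member subset of the ten-member negotiating group from the example above. The function names and the server labels are hypothetical and do not come from the Incorporated References.

```python
from itertools import combinations

def preconfigured_groups(negotiating_group, rendezvous_size):
    """Enumerate every subset of the negotiating group of the given size.

    Assumption: the pool holds all possible subsets. A real pool might
    hold only some of them.
    """
    return [frozenset(c) for c in combinations(negotiating_group, rendezvous_size)]

def find_group(groups, selected_servers):
    """Return the index of the pre-configured group matching the selection."""
    return groups.index(frozenset(selected_servers))

# Ten storage servers, labeled after servers 150a-150j (labels are illustrative).
negotiating_group = ["150" + suffix for suffix in "abcdefghij"]
groups = preconfigured_groups(negotiating_group, 3)

# Look up the group for the servers chosen as the rendezvous group in FIG. 3.
group_index = find_group(groups, ["150b", "150e", "150g"])
```

Under this assumption the pool size grows combinatorially with the size of the negotiating group, which is one reason an implementation might prefer one of the other two methods.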
In FIG. 3, gateway 130 has selected storage servers 150b, 150e, and 150g as rendezvous group 310a to store chunk 220.
In FIG. 4, gateway 130 transmits the put command for chunk 220 to rendezvous group 310a. This is a multicast operation. In this example, three replicas of chunk 220 will be stored (labeled as replicas 401a, 401b, and 401c).
b. Mechanisms to Recover Data when Disk Drives Fail
In a well-known aspect of the prior art, storage servers such as storage servers 150a . . . 150k often utilize physical disk drives. However, disk drives are unreliable. They break. The connections to them break. The servers that access them break. For a storage cluster containing a significant number of disk drives, drive failures are predictable routine events, not exceptional errors. Having a single persistently stored copy of some data does not mean that the data is saved persistently. It is only safe until something loses or blocks access to that replica.
There are several prior art strategies to ensure that data is truly saved persistently. These include creating multiple whole replicas of the data, RAID encoding, and Erasure Coding. Each of these strategies increases the probability of successfully retaining data compared to a system that retains only a single replica or slice.
All of these data protection methods can be characterized by the number of slices or chunks being protected (N) and the number of additional slices or chunks that protect the data (M). The total size written is N+M, and the original data can be recovered from any N of the N+M slices. The different methods vary in how much overhead is required (the ratio of M to N) and in the complexity of creating and using the parity protection data.
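The N and M characterization can be made concrete with a short Python sketch. The function name and the 4:1 and 4:2 example values are illustrative assumptions; only the 1:2 case (three whole replicas) is taken from this description.

```python
from fractions import Fraction

def protection_summary(n, m):
    """Summarize a scheme with n protected slices and m protecting slices."""
    return {
        "total_written": n + m,          # N + M slices are written in total
        "overhead": Fraction(m, n),      # ratio of M to N
        "losses_tolerated": m,           # any N of the N + M slices suffice
    }

three_replicas = protection_summary(1, 2)   # three whole replicas (1:2)
single_parity = protection_summary(4, 1)    # illustrative single-parity layout
double_parity = protection_summary(4, 2)    # illustrative double-parity layout
```

In this sketch the 1:2 and 4:2 schemes both tolerate the loss of two slices, but the overhead ratio is 2 in the first case and 1/2 in the second.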
c. Replica System
An example of a prior art replica system 500 is shown in FIG. 5. Replica system 500 comprises drive array 510. In this example, drive array 510 comprises three drives (Drive 1, Drive 2, and Drive 3). Each data block that is written as part of a put command is stored once in each drive. Thus, when block A1 is stored, it is stored three times, once in each drive. Creating three whole replicas is a 1:2 scheme. There are three total chunks (1+2), any one of which can recover the original (since each drive stored an exact copy of the original).
d. Parity Protection Systems
Protecting data from the loss of storage devices without fully replicating content has long been a feature of storage systems. Techniques include RAID-5, RAID-6, software RAID and Erasure Coding.
These techniques can be characterized as N:M schemes, where N payload slices are protected by adding M parity slices. Depending on the encoding algorithm used, the N payload chunks may be unaltered while the parity protection is encoded in M additional chunks, or the payload and parity protection may be spread over all N+M chunks. An N:M encoding allows recovery of the original data after the loss of up to M slices.
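The simplest instance of such a scheme is single XOR parity (M = 1) with unaltered payload chunks, as used in RAID-5 style encodings. The following Python sketch shows that case only. It is not the encoding of the present invention, and the chunk contents and function name are hypothetical.

```python
def xor_chunks(chunks):
    """Return the byte-wise XOR of a list of equally sized chunks."""
    result = bytearray(len(chunks[0]))
    for chunk in chunks:
        if len(chunk) != len(result):
            raise ValueError("all chunks must have the same length")
        for i, byte in enumerate(chunk):
            result[i] ^= byte
    return bytes(result)

# N = 3 unaltered payload chunks protected by M = 1 parity chunk.
payload = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_chunks(payload)

# Simulate the loss of the second payload chunk and rebuild it from
# the surviving payload chunks plus the parity chunk.
survivors = [payload[0], payload[2], parity]
recovered = xor_chunks(survivors)
```

Because XOR is its own inverse, combining the parity chunk with the surviving payload chunks yields the missing chunk. Tolerating the loss of more than one chunk requires a stronger code, such as the RAID-6 check codes or erasure codes discussed below.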
The Manifest Striping Application details a method for efficiently and safely converting an object from whole replica protection to parity protection. One of the motivations for delayed conversion was the assumption that writing the payload chunks and parity protection sets at ingest would consume more network bandwidth than simply multicasting the payload alone.
As explained in the Manifest Striping Application, ingesting new content with whole replica protection is desirable because whole replicas provide the best latency on probable retrievals and because only a single copy of the new content has to be multicast to create enough copies to provide the desired level of data protection (typically against the loss of two drives or servers). It was only later, after the probability of read access to the content was low, that it was worthwhile to convert to a parity protection scheme.
The whole replica protection strategy is desirable when the extra whole replicas will optimize likely retrieval of the just-put object version. It is of less value when the same bandwidth can create a single replica and two parity protectors where the parity protectors can restore the protected chunk. Depending on the precise parity protection scheme, the parity protectors may be parity slices protecting payload slices, parity chunks protecting payload chunks, or, for the present invention, a “parity protector” which contains both a manifest of the protected chunks and the product payload. The parity protection slices or chunks contain just the product payload and are described elsewhere.
All of these schemes protect against the concurrent loss of two servers or chunks while using the same storage to protect N payload chunks, greatly reducing the total storage required.
Additional detail regarding the embodiments of the Manifest Striping Application is shown in FIGS. 6A, 6B, and 6C.
FIG. 6A depicts a replica technique for various chunks. Manifest 610 (labeled as Manifest A) refers to payload chunks 601, 603, and 605 (labeled Payload Chunks C, D, and E), and manifest 620 (labeled as Manifest B) refers to payload chunks 601, 603, and 607.
It is common for different manifests to refer to some of the same payload chunks when the underlying objects are related, as might be the case when they are portions of two versions of the same file. In this particular example, perhaps manifest 610 is associated with a first draft of a word processing document, and manifest 620 is associated with a second draft of the same word processing document, and payload chunks 601 and 603 are the portions of the document that have not changed from one version to the next.
In this example, manifest 610 has three replicas (represented by the two additional boxes underneath the box for manifest 610). Payload chunks 601, 603 and 605 also have three replicas each (represented by the boxes underneath each payload chunk). The relationships between manifests and referenced chunks are between the conceptual chunks, not between the specific replicas. The second replica of Manifest 610 has chunk references to payload chunks 601, 603 and 605. These same references are in the first and third replica of Manifest 610. The chunk references specify the chunk IDs of payload chunks 601, 603 and 605. The reference does not specify a specific replica or any specific location.
There are back-reference lists associated with each of the payload chunks. These back-references are to the manifest chunk by its chunk ID. They do not reference a specific replica.
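The relationship between manifests, chunk references, and back-references can be sketched as a small Python data structure. The class and function names are hypothetical. The rule that a chunk becomes eligible for expunging once its back-reference list is empty is an assumption made for illustration, and the figure reference numerals stand in for chunk IDs.

```python
from dataclasses import dataclass, field

@dataclass
class PayloadChunk:
    chunk_id: str
    # Back-references hold manifest chunk IDs, never a specific replica.
    back_refs: set = field(default_factory=set)

@dataclass
class Manifest:
    chunk_id: str
    # Chunk references hold payload chunk IDs only, with no replica or location.
    chunk_refs: list

def put_manifest(manifest, chunks):
    """Record a back-reference on every chunk the manifest refers to."""
    for ref in manifest.chunk_refs:
        chunks[ref].back_refs.add(manifest.chunk_id)

def remove_manifest(manifest, chunks):
    """Drop the manifest's back-reference from every chunk it refers to."""
    for ref in manifest.chunk_refs:
        chunks[ref].back_refs.discard(manifest.chunk_id)

def is_expungeable(chunk):
    """Assumed rule: a chunk may be expunged only when nothing references it."""
    return not chunk.back_refs

# Chunks and manifests of FIG. 6A.
chunks = {cid: PayloadChunk(cid) for cid in ("601", "603", "605", "607")}
manifest_a = Manifest("610", ["601", "603", "605"])
manifest_b = Manifest("620", ["601", "603", "607"])
put_manifest(manifest_a, chunks)
put_manifest(manifest_b, chunks)

# Removing Manifest A leaves the shared chunks referenced by Manifest B.
remove_manifest(manifest_a, chunks)
```

After Manifest A is removed in this sketch, chunks 601 and 603 remain referenced by Manifest B, while chunk 605 has no remaining back-references.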
With reference to FIG. 6B, when it is desirable to switch from a replica system to a parity system for this particular data set (such as for the reasons described with respect to FIG. 12, below), the effective replication count for manifests is not altered. Therefore, there will still be three replicas of each of the manifest chunks. There will also be whole replica protection for the parity protection content manifests. A back-reference from each created parity protection chunk references the chunk ID of the parity protection content manifest. This prevents the parity protection chunk from being expunged while it is referenced in a parity protection content manifest.
With reference to FIG. 6C, when it is desirable to switch from a parity system to a replica system for this particular data set (such as for the reasons described with respect to FIG. 12, below), the effective replication count from the manifest to the referenced payload chunks will be restored to the number of whole replicas desired. This will cause the storage servers to begin replicating the whole referenced payload chunks until there are the desired number of whole replicas. Concurrently, the parity protection content manifest may be expunged after the parity protection chunks are no longer required to protect the object version's payload from the designated number of target losses. Alternately, an implementation may elect to retain the parity protection even while carrying full replica protection if return to parity protection is anticipated to occur relatively soon.
Protecting stored data with error correction codes or parity has been well-known art in data storage since before the 1990s. This art has extended from purely hardware solutions to more sophisticated parity algorithms.
U.S. Pat. No. 5,499,253 A “System and method for calculating RAID 6 check codes” (Lary) discloses a method for calculating multiple checksums from the same set of protected data stripes. RAID-6 enables protection from the loss of two drives, in contrast to RAID-5, which only protects from the loss of a single drive.
Sun Microsystems' RAID-Z, as disclosed in “RAID-Z” in “Jeff Bonwick's Blog” on Nov. 17, 2005, uses an encoding equivalent to RAID-5 under software control where the data is striped over drives that no longer have any mandated fixed physical relationship to each other. RAID-Z was subsequently extended to RAID-Zn to provide for protection against the loss of more than one drive concurrently.
U.S. Pat. No. 8,316,260, “Method and System for Multi-Dimensional RAID” (Bonwick), describes a method for a RAID controller to assign blocks to a data grid where different rows and columns are used to identify multiple non-overlapping ‘parity groups’. The present invention uses a different technique to assign non-overlapping parity protection groups. The present invention has different steps and avoids centralizing assignment of blocks to parity groups or sets.
U.S. Patent Application No. 2004/0160975, “Multicast communications protocols, systems and methods” (Frank), discloses an application of multicast updating of a RAID stripe where multicast communications are used to allow the delta to the parity stripe to be updated without requiring the entire payload to be read. This relates to optimal updating of a volatile RAID encoding where each write updates the existing data.
Multicast communications are also used in various schemes where RAID encoding is used to enable error recovery at the receiving end for long haul video-on-demand systems. RAID encoding is bandwidth inefficient compared to forward-error-correction (FEC) techniques. Use of RAID algorithms is mostly described for older solutions where there were concerns about the CPU requirements for FEC error correction. Erasure coding and/or network coding are now favored as solutions for reliable multicast delivery over drop-prone networks where explicit per receiver acknowledgement is undesirable or infeasible. RFC 3453 (“The Use of Forward Error Correction (FEC) in Reliable Multicast”), dated December 2002, describes both simple FEC and erasure coding as techniques to detect and correct transmission errors for multicast transmission. These approaches are not relevant to multicast delivery within a data center network where transmission errors are exceedingly rare.
What the above-described systems lack is the ability to perform a put operation on a new data chunk with parity protection while using only the data bandwidth required for a single multicast transmission of the new content. The present invention seeks to retain the benefits of multicast chunk distribution while efficiently creating parity protected data. This would be useful, for example, when the system knows that the data to be saved is likely to be “cold” from the outset, as might be the case for email saved in a SPAM folder, an archive created by a backup utility, or a draft document.