This application builds upon the inventions by Applicant disclosed in the following patents and applications: U.S. patent application Ser. No. 14/095,839, which was filed on Dec. 3, 2013 and titled “SCALABLE TRANSPORT SYSTEM FOR MULTICAST REPLICATION”; U.S. patent application Ser. No. 14/095,843, which was filed on Dec. 3, 2013 and titled “SCALABLE TRANSPORT SYSTEM FOR MULTICAST REPLICATION”; U.S. patent application Ser. No. 14/095,848, which was filed on Dec. 3, 2013 and titled “SCALABLE TRANSPORT WITH CLIENT-CONSENSUS RENDEZVOUS”; U.S. patent application Ser. No. 14/095,855, which was filed on Dec. 3, 2013 and titled “SCALABLE TRANSPORT WITH CLUSTER-CONSENSUS RENDEZVOUS”; U.S. patent application Ser. No. 14/312,282, which was filed on Jun. 23, 2014 and titled “Key/value storage device and method” (the “KVM Encoding Patent Application”); and U.S. patent application Ser. No. 14/820,471, which was filed on Aug. 6, 2015 and titled “Object Storage System with Local Transaction Logs, A Distributed Namespace, and Optimized Support for User Directories” (the “Local Transaction Log Patent Application”). All of the above-listed applications and patents are incorporated by reference herein and referred to collectively as the “Incorporated References.”
a. A Replicast Storage System
With reference now to existing relevant art developed by Applicant, FIG. 1 depicts storage system 100 described in the Incorporated References. Storage system 100 comprises clients 110a, 110b, . . . 110i (where i is any integer value), which access initiator/application layer gateway 130 over client access network 120. It will be understood by one of ordinary skill in the art that there can be multiple gateways and client access networks, and that gateway 130 and client access network 120 are merely exemplary. Gateway 130 in turn accesses replicast network 140, which in turn accesses storage servers 150a, 150b, 150c, 150d, . . . 150k (where k is any integer value). Each of the storage servers 150a, 150b, 150c, 150d, . . . , 150k is coupled to a plurality of storage devices 160a, 160b, . . . 160k, respectively.
In this patent application the terms “initiator”, “application layer gateway”, or simply “gateway” refer to the same type of devices and are used interchangeably.
FIG. 2 depicts a typical put transaction in storage system 100 to store chunk 220. As discussed in the Incorporated References, groups of storage servers are maintained, which are referred to as “negotiating groups.” Here, exemplary negotiating group 210a is depicted, which comprises ten storage servers, specifically, storage servers 150a-150j. When a put command is received, gateway 130 assigns the put transaction to a negotiating group. In this example, the put chunk 220 transaction is assigned to negotiating group 210a. It will be understood by one of ordinary skill in the art that there can be multiple negotiating groups on storage system 100 and that negotiating group 210a is merely exemplary. Each negotiating group can consist of any number of storage servers, and the use of ten storage servers is likewise merely exemplary.
Gateway 130 then engages in a protocol with each storage server in negotiating group 210a to determine which three storage servers should handle the put request. The three storage servers that are selected are referred to as a “rendezvous group.” As discussed in the Incorporated References, the rendezvous group comprises three storage servers so that the data stored by each put transaction is replicated and stored in three separate locations, where each instance of data storage is referred to as a replica. Applicant has concluded that three storage servers provide an optimal degree of replication for this purpose, but any other number of servers could be used instead.
In varying embodiments, the rendezvous group may be addressed by different methods, all of which limit the entities addressed to the subset of the negotiating group identified as belonging to the rendezvous group. These methods include:
    Selecting a matching group from a pool of pre-configured multicast groups, each holding a different subset combination of members from the negotiating group;
    Using a protocol that allows each UDP message to be addressed to an enumerated subset of the total group. An example of such a protocol would be the BIER protocol currently under development by the IETF; and
    Using a custom control protocol which allows the sender to explicitly specify the membership of a target multicast group as being a specific subset of an existing multicast group. Such a control protocol was proposed in an Internet Draft submitted to the IETF titled “Creation of Transactional Multicast Groups” and dated Mar. 23, 2015, a copy of which is being submitted with this application and is incorporated herein by reference.
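The first of the addressing methods listed above can be illustrated with a minimal sketch. It assumes a ten-member negotiating group and a three-member rendezvous group, as in the example of FIG. 2 and FIG. 3. The function name build_group_pool and the integer group identifiers are hypothetical stand-ins for pre-configured multicast groups and are not taken from the Incorporated References.

```python
from itertools import combinations

def build_group_pool(negotiating_group, rendezvous_size=3):
    """Pre-configure one (hypothetical) group id per subset of the given size."""
    pool = {}
    subsets = combinations(sorted(negotiating_group), rendezvous_size)
    for group_id, members in enumerate(subsets):
        # frozenset makes the lookup independent of member ordering
        pool[frozenset(members)] = group_id
    return pool

# Ten storage servers 150a-150j, as in exemplary negotiating group 210a
negotiating_group = ["150" + suffix for suffix in "abcdefghij"]
pool = build_group_pool(negotiating_group)

# Rendezvous group 310a from FIG. 3
rendezvous_group = frozenset({"150b", "150e", "150g"})
selected_group_id = pool[rendezvous_group]
```

Under these assumptions the pool holds one entry for every three-member combination of the ten servers (120 in total), which suggests why the pool size grows quickly as the negotiating group becomes larger.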
In FIG. 3, gateway 130 has selected storage servers 150b, 150e, and 150g as rendezvous group 310a to store chunk 220.
In FIG. 4, gateway 130 transmits the put command for chunk 220 to rendezvous group 310a. This is a multicast operation. In this example, three replicas of chunk 220 will be stored (labeled as replicas 401a, 401b, and 401c).
b. Mechanisms to Recover Data When Disk Drives Fail
In a well-known aspect of the prior art, storage servers such as storage servers 150a . . . 150k often utilize physical disk drives. However, disk drives are unreliable. They break. The connections to them break. The servers that access them break. For a storage cluster containing a significant number of disk drives, drive failures are predictable routine events, not exceptional errors. Having a single persistently stored copy of some data does not mean that the data is saved persistently. It is only safe until something loses or blocks access to that replica.
There are several prior art strategies to ensure that data is truly saved persistently. These include creating multiple whole replicas of the data, RAID encoding, and Erasure Coding. Each of these strategies increases the probability of successfully retaining data compared to a system that retains only a single replica or slice.
All of these data protection methods can be characterized by the number of slices or chunks being protected (N) and the number of additional slices or chunks that protect the data (M). The total size written is N+M, and the data can be recovered from any N of the slices. The different methods vary in how much overhead is required (the ratio of M to N) and the complexity of creating and using the parity protection data.
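The N and M characterization can be made concrete with a short sketch. The function name describe_scheme and the sample values of N and M are illustrative assumptions only. The sketch takes the payload as being divided into N equal slices, so the storage consumed per unit of payload is (N+M)/N.

```python
from fractions import Fraction

def describe_scheme(n, m):
    """Summarize an N:M protection scheme (n protected slices, m protecting slices)."""
    return {
        "total_slices_written": n + m,
        "overhead_ratio": Fraction(m, n),            # the ratio of M to N
        "stored_per_unit_payload": Fraction(n + m, n),
        "losses_tolerated": m,                       # any N of the N+M slices suffice
    }

# Illustrative values: whole replicas (1:2) and single or double parity over 4 slices
replicas_1_2 = describe_scheme(1, 2)
parity_4_1 = describe_scheme(4, 1)
parity_4_2 = describe_scheme(4, 2)
```

With these example values, the 1:2 scheme stores three times the payload, while the 4:1 scheme stores 5/4 of the payload and the 4:2 scheme stores 3/2 of the payload.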
c. Replica System
An example of a prior art replica system 500 is shown in FIG. 5. Replica system 500 comprises drive array 510. In this example, drive array 510 comprises three drives (Drive 1, Drive 2, and Drive 3). Each data block that is written as part of a put command is stored once in each drive. Thus, when block A1 is stored, it is stored three times, once in each drive. Creating three whole replicas is a 1:2 scheme. There are three total chunks (1+2), any one of which can recover the original (since each drive stored an exact copy of the original).
d. RAID System
Parity-based protection was introduced in the late 1980s to early 1990s with the invention of RAID (redundant array of inexpensive disks). An example of one type of prior art RAID system is shown in FIG. 6. Here, RAID-4 system 600 comprises drive array 610. In this example, drive array 610 comprises N drives (Drive 1, Drive 2, . . . Drive N) that store data and one drive (Drive P) that stores parity. Here, data is written in stripes to drive array 610. One example of a stripe is stripe 601. The data is written into blocks A1 on Drive 1, A2 on Drive 2, . . . and AN on Drive N. From these blocks, a parity block, AP, is calculated and stored on Drive P. Numerous methods are known in the prior art for calculating parity. The simplest method is to perform an “XOR” operation on the data to be protected, and to store the result as the parity bit. In the example of FIG. 6, if the XOR method is used, the first bit in each of A1 . . . AN would be XOR'd, and the result would be stored in the first bit location of block AP. The same action would be performed on all remaining bits in the blocks. Additional parity drives (Drive P+1, etc.) can be used if it is desired to make RAID-4 system 600 even more robust against drive failures. Other RAID schemes, such as RAID-5 and RAID-6, are well known.
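The XOR method described above can be sketched as follows. The function names xor_blocks and rebuild_missing_block and the sample block contents are hypothetical. The sketch works on whole bytes rather than individual bits, which is equivalent because XOR is applied to each bit position independently.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length blocks."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("all blocks in a stripe must have the same length")
    # zip(*blocks) yields, for each byte position, one byte from every block
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

def rebuild_missing_block(surviving_blocks, parity_block):
    """Recover the single missing data block of a stripe from the survivors and parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])

# A stripe with N = 3 data blocks (A1, A2, A3) and its parity block AP
stripe = [b"A1-data!", b"A2-data!", b"A3-data!"]
parity_block = xor_blocks(stripe)

# Simulate the loss of the drive holding A2, then rebuild it
recovered_block = rebuild_missing_block([stripe[0], stripe[2]], parity_block)
```

Because each data block appears exactly once in the parity calculation, XORing the surviving blocks with the parity block cancels them out and leaves the missing block. The same property explains why a single XOR parity block cannot recover two lost blocks from the same stripe.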
RAID was introduced as a hardware concept, but has been extended to software RAID solutions such as RAID-Z used in the ZFS storage system developed by Sun Microsystems. Simple RAID-5 or any of its software equivalents like RAID-Z is an N:1 scheme where N data slices are protected by a single parity slice. RAID-6 is an N:2 scheme.
Protection from the loss of more than a single drive is provided in RAID-Z2 and RAID-Z3 through the addition of up to two extra parity calculations (Galois transformations dubbed “q” and “r” supplement the simple XOR algorithm dubbed “p”). These extra algorithms can recover 2 or 3 lost chunks, respectively. Simple XOR parity as in the example described above can only recover from a single loss (i.e., the failure of one drive in the stripe group).
U.S. Pat. No. 8,316,260 (Bonwick) discloses multidimensional RAID, which combines additional parity calculations (as in RAID-Zn) with the inclusion of the same chunk in multiple parity calculations to protect against the loss of multiple drives. Each row or column in an array can provide RAID-Zn protection, allowing protection from many lost drives.
RAID techniques that rely on simple XOR calculations for parity can use parallel calculations, as described in the 1989 TickerTAIP paper, or fully distributed algorithms, as described in RADD (Redundant Array of Distributed Drives).
U.S. Pat. No. 6,289,415 (Johnson) discloses asynchronous generation of RAID-style parity protection, but does not combine this with any alternate form of protection before parity generation is completed. The market targeted for this technique was creation of tape archives. The goal of asynchronous parity generation was to avoid the need for synchronized tape drives rather than to complete transactions without waiting for parity generation.
e. Erasure Coding
Erasure coding schemes offer fully configurable amounts of protection (M can be larger than 2), but require more sophisticated algorithms than simple XOR. This results in a system that costs more than would be required for other techniques.
Most existing solutions use erasure coding systems when protection against more than 3 simultaneous failures is needed. Erasure coding techniques use more complex algorithms such as Reed-Solomon or Cauchy derivatives to generate M checksum slices based upon N slices of data.
f. Parity Protection Costs and Trade-Offs Analysis
Additional processing power is required for any parity protection mechanism, but modern processing speeds minimize this cost.
There is a trade-off between transactional latency and the storage overhead required to achieve a given level of data protection. A transactional latency penalty results from network transmission times and from dependence on the worst-case disk seek time across a larger number of drives.
With parity protection, slices with 1/Nth of the payload must be written to N+M storage servers. With replica protection, a whole replica must be written to each of 1+M storage servers. However, if multicast delivery is used, the whole replicas can be written in parallel with only a single network delivery, thus minimizing the transmission latency.
A complete transaction requires transmitting the data to each target server, having the target server seek to the write location, and then writing the payload.
The probable time to put N+M slices under a parity approach versus 1+M whole replicas under a replica approach compares as follows:
Component | Writing N+M Slices | Writing 1+M Replicas
Network Transmission time | (N+M)/N | 1+M (unicast); 1 (multicast)
Disk Queuing time for target availability and disk seek | Worst of N+M | Worst of 1+M
Actual Disk Write each target | 1/N | 1
With the exception of the actual write times, creating whole replicas in a multicast system is faster. The maximum latency for N+M slices will never be less than for 1+M replicas. Multicast delivery only requires sending the payload once, as opposed to the overhead of sending an additional M/Nth of the payload with erasure coding.
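The comparison above can be expressed as a small model. This is a sketch only: the function name put_costs is hypothetical, all times are expressed relative to the time needed to transmit or write one whole payload, and the disk queuing row is represented simply as the number of targets whose worst-case delay must be waited on.

```python
from fractions import Fraction

def put_costs(n, m, multicast=True):
    """Return relative put costs for N+M slices and for 1+M whole replicas."""
    slices = {
        "network_transmission": Fraction(n + m, n),  # N+M slices of 1/N payload each
        "targets_waited_on": n + m,                  # worst case over N+M targets
        "disk_write_per_target": Fraction(1, n),
    }
    replicas = {
        # One delivery with multicast, 1+M separate deliveries with unicast
        "network_transmission": Fraction(1) if multicast else Fraction(1 + m),
        "targets_waited_on": 1 + m,                  # worst case over 1+M targets
        "disk_write_per_target": Fraction(1),
    }
    return slices, replicas

# Illustrative values N = 4, M = 2
slices, replicas_multicast = put_costs(4, 2, multicast=True)
_, replicas_unicast = put_costs(4, 2, multicast=False)
```

For these illustrative values, the slices require 3/2 of a payload transmission, the multicast replicas require a single payload transmission, and the unicast replicas require three. Only the per-target disk write favors the slices (1/4 of a payload versus a whole payload).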
While it would be possible to multicast a payload chunk and have some targets use the received payload to create parity protection chunks, there are no obvious methods to plan or track which parity chunks were protecting which payload chunks. Therefore, it is advantageous for a storage cluster using multicast replication to put new content as whole replicas.
Newly created content is also more likely to be retrieved, and its retrieval also benefits from using whole replicas. Having extra copies reduces the queuing delay to retrieve one of the copies. Further, only a single storage server needs to be scheduled to retrieve any given chunk.
However, eventually the relative benefits of whole replicas fade and are outweighed by the space savings of parity protection. That is, a replica scheme generally requires greater storage space than a parity scheme.
What is lacking in the prior art is the ability to utilize a replica scheme when data is first stored and while the data remains “hot” (frequently accessed) but to switch to a parity scheme when the data is no longer needed as frequently and has become “cold” (infrequently accessed), thus increasing the amount of available storage by freeing up the space previously occupied by replicas that are no longer needed. The ability to switch back from “cold” status to “hot” status is also needed, for instance, if the frequency of access to the data increases. Preferably, a solution would retain relevant portions of the prior encoding scheme to minimize the total amount of disk writes required for either transition.