Large scale storage systems distribute stored data over multiple storage elements, such as hard disks, or over multiple components such as storage nodes comprising a plurality of such storage elements. However, as the number of storage elements in such a distributed object storage system increases, so does the probability that one or more of these storage elements will fail. To cope with such failures, a certain level of redundancy must be introduced into the distributed object storage system, so that the system can survive a failure of one or more storage elements without irrecoverable data loss. In its simplest form, redundancy can be achieved by replication, which means storing multiple copies of data on multiple storage elements of the distributed storage system. When one of the storage elements storing a copy of a data object fails, the data object can still be recovered from another storage element holding another copy. Several schemes for replication are known in the art. In general, however, replication is costly with regard to storage capacity. To survive two concurrent storage element failures, at least two replica copies of each data object are required in addition to the original, which results in a storage capacity overhead of 200%: storing 1 GB of data objects requires a storage capacity of 3 GB. Another well-known scheme used for distributed storage systems is RAID, some implementations of which are more efficient than replication with respect to storage capacity overhead. However, RAID systems often require a form of synchronisation between the different storage elements and require them to be of the same type.
In the case of a failure of one of the storage elements, RAID systems often require immediate replacement, followed by a costly and time-consuming rebuild process to restore the failed storage element completely on the replacement storage element. Known systems based on replication and known RAID systems are therefore generally not configured to survive more than two concurrent storage element failures and/or require complex synchronisation between the storage elements and critical rebuild operations in case of a drive failure.
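The replication overhead quoted above can be checked with a short calculation. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes plain replication, in which tolerating a given number of concurrent storage element failures requires that many replica copies in addition to the original.

```python
def replication_overhead(tolerated_failures: int) -> float:
    """Overhead as a fraction of the original data size (2.0 means 200%)."""
    if tolerated_failures < 0:
        raise ValueError("tolerated_failures must be non-negative")
    # Assumption: one extra full copy per failure that must be survived.
    return float(tolerated_failures)


def replication_capacity_gb(data_gb: float, tolerated_failures: int) -> float:
    """Raw capacity needed to store data_gb of data objects with replication."""
    return data_gb * (1.0 + replication_overhead(tolerated_failures))


if __name__ == "__main__":
    # Two concurrent failures, 1 GB of data objects, as in the text above.
    print(f"overhead: {replication_overhead(2):.0%}")
    print(f"capacity: {replication_capacity_gb(1.0, 2)} GB")
```

Under this assumption the overhead grows linearly with the number of failures to be tolerated, which is why replication becomes costly at higher redundancy levels.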
It has therefore been proposed to use distributed object storage systems based on erasure encoding, such as described in WO2009135630, EP2469411, EP2469413, EP2793130, EP2659369, EP2659372, EP2672387, EP2725491, etc. Such a distributed object storage system stores the data object in fragments that are spread amongst the storage elements in such a way that, for example, a concurrent failure of six storage elements out of a minimum of sixteen storage elements can be tolerated with a corresponding storage overhead of 60%, which means that 1 GB of data objects requires a storage capacity of only 1.6 GB. It should be clear that the distributed object storage systems based on erasure encoding referred to above differ considerably from, for example, parity based RAID 3, 4, 5 or RAID 6 like systems, which can also make use of Reed-Solomon codes for dual check data computations. Such RAID like systems can tolerate at most one or two concurrent failures, and involve block-level, byte-level or bit-level striping of the data and subsequent synchronisation between all storage elements storing such stripes of a data object or a file. The erasure encoding based distributed storage system described above generates, for storage of a data object, a large number of fragments, for example hundreds or thousands, which is far greater than the number of storage elements, for example ten or twenty, among which they need to be distributed. A share of this large number of fragments that suffices for the recovery of the data object, for example 8000 fragments, is distributed among a plurality of storage elements, for example ten storage elements, each of which then stores 800 of these fragments. Redundancy levels can now be flexibly chosen to be greater than two, for example three, four, five, six, etc., by additionally storing 800 of these fragments on each of three, four, five, six, etc. additional storage elements.
This can be done without any need for synchronisation between the storage elements, and upon failure of a storage element there is no need for full recovery of the failed storage element onto a replacement storage element. The fragments of a particular data object that the failed storage element stored can simply be replaced by storing a corresponding number of fragments, for example 800, on any other suitable storage element not yet storing any fragments of this data object. Fragments of different data objects of a failed storage element can be added to different other storage elements, as long as these do not yet comprise fragments of the respective data object.
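The figures in the example above (sixteen storage elements, six tolerated failures, 8000 fragments sufficient for recovery) can be reproduced with the following Python sketch. It is a simplified illustration, not the encoding scheme of the cited publications: all function and parameter names are hypothetical, every storage element is assumed to hold the same number of fragments, and the repair helper only applies the rule that a replacement element must not yet hold fragments of the same data object.

```python
from typing import Iterable, List, Set, Tuple


def erasure_overhead(total_elements: int, tolerated_failures: int) -> float:
    """Storage overhead as a fraction of the original data size."""
    required = total_elements - tolerated_failures
    if required <= 0:
        raise ValueError("at least one storage element must be required")
    # Redundant elements relative to the elements needed for recovery.
    return tolerated_failures / required


def fragments_stored(fragments_needed: int, required_elements: int,
                     tolerated_failures: int) -> Tuple[int, int]:
    """Return (fragments per storage element, total fragments stored)."""
    per_element, remainder = divmod(fragments_needed, required_elements)
    if remainder:
        raise ValueError("this sketch assumes an even spread of fragments")
    total = per_element * (required_elements + tolerated_failures)
    return per_element, total


def repair_targets(live_elements: Iterable[str],
                   holders: Set[str]) -> List[str]:
    """Live storage elements not yet holding fragments of the data object."""
    return [e for e in live_elements if e not in holders]


if __name__ == "__main__":
    print(f"overhead: {erasure_overhead(16, 6):.0%}")
    print("per element, total:", fragments_stored(8000, 10, 6))
    # Hypothetical element names; "se3" and "se4" are free for this object.
    print(repair_targets(["se1", "se2", "se3", "se4"], {"se1", "se2"}))
```

Because any eligible storage element can receive the replacement fragments, repair in this model is a per-object placement decision rather than a rebuild of the whole failed storage element.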
Additionally, in large scale distributed storage systems it is advantageous to make use of distributed object storage systems, which store data objects referenced by an object identifier, as opposed to file systems, such as for example described in US2002/0078244, which store files referenced by an inode, or block based systems, which store data in the form of data blocks referenced by a block address, both of which have well known limitations in terms of scalability and flexibility. Distributed object storage systems are in this way able to surpass the maximum storage capacity limits of file systems, etc. in a flexible way, such that for example storage capacity can be added or removed as needed without degrading performance as the system grows. This makes such object storage systems excellent candidates for large scale storage systems.
Current erasure encoding based distributed storage systems for large scale data storage are well equipped to store and retrieve data efficiently; however, the high number of fragments spread amongst a high number of storage elements leads to a relatively high number of input/output operations at the level of the storage elements, which can become a bottleneck, especially when a high number of relatively small data objects needs to be stored or retrieved. Replication based systems, on the other hand, cause a large storage overhead, especially when it is desired to implement a large scale distributed storage system which can tolerate a concurrent failure of more than two storage elements.
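The trade-off described above can be made concrete with a rough count of storage element operations per data object. The Python sketch below rests on a simplifying assumption that is not stated in the text, namely that each storage element touched costs at least one input/output operation per data object regardless of how many fragments it holds; the function names are hypothetical and the numbers reuse the earlier examples.

```python
from typing import Dict


def min_io_replication(tolerated_failures: int) -> Dict[str, int]:
    """Lower bound on storage element operations per data object."""
    # Store: the original plus one replica per tolerated failure.
    # Retrieve: any single copy suffices.
    return {"store": tolerated_failures + 1, "retrieve": 1}


def min_io_erasure(total_elements: int,
                   tolerated_failures: int) -> Dict[str, int]:
    """Lower bound when fragments are spread over total_elements."""
    # Store: every storage element receives fragments.
    # Retrieve: fragments from the non-redundant share of elements are needed.
    return {"store": total_elements,
            "retrieve": total_elements - tolerated_failures}


if __name__ == "__main__":
    print("replication, 2 failures:", min_io_replication(2))
    print("erasure, 16 elements, 6 failures:", min_io_erasure(16, 6))
```

For small data objects these per-element operations dominate the cost, which is how the fragment spread can become a bottleneck even though its storage overhead is lower.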
Therefore there still exists a need for an improved distributed object storage system that overcomes the abovementioned drawbacks, that provides an efficient storage overhead when coping with a desired concurrent failure tolerance of more than two storage elements, and that optimizes the number of input and output operations at the level of the storage elements.